2008-03-19 22:32:32

by J. Bruce Fields

Subject: asynchronous destroy messages

When an rpc client is shut down, gss destroy messages are sent out
asynchronously, and nobody waits for them.

If those rpc messages arrive after the client is completely shut down, I
assume they just get dropped by the networking code? Is it possible for
them to arrive while we're still in the process of shutting down, and if
so, what makes this safe?

Olga's seeing some odd oopses on shutdown after testing our gss callback
code. And of course it's probably our callback patches at fault, but I
was starting to wonder if there was a problem with those destroy
messages arriving at the wrong moment. Any pointers welcomed.

--b.


2008-03-19 19:35:29

by Myklebust, Trond

Subject: Re: asynchronous destroy messages


On Tue, 2008-03-18 at 18:32 -0400, J. Bruce Fields wrote:
> On Tue, Mar 18, 2008 at 06:15:15PM -0400, bfields wrote:
> > When an rpc client is shut down, gss destroy messages are sent out
> > asynchronously, and nobody waits for them.
> >
> > If those rpc messages arrive after the client is completely shut down, I
> > assume they just get dropped by the networking code? Is it possible for
> > them to arrive while we're still in the process of shutting down, and if
> > so, what makes this safe?
>
> In other words--when does the task that's created to send the destroy
> message get freed, and if that doesn't happen till after the rpc client
> and associated objects are freed, is there a risk of someone trying to
> access those objects through fields in that task?
>
> --b.

Are you talking about the rpc_task? That gets destroyed in the normal
fashion, and since the collection of all rpc_tasks should hold a
reference to the rpc_client, there shouldn't normally be any ordering
problems there.

Note however that we do some magic in 'rpc_free_auth' to extend the
natural lifetime of the rpc_client beyond that of the NFS 'shutdown'.
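
To make that concrete, here's a minimal sketch of the lifetime rule
(plain C, 'toy_' names made up for illustration; this is not the real
sunrpc code):

/*
 * Illustration only: every outstanding task takes a reference on the
 * client, so dropping the caller's reference at shutdown does not free
 * the client until the last async task (for example the GSS
 * destroy-context call) has released its own reference.
 */
#include <stdlib.h>

struct toy_clnt {
        int refcount;                   /* stand-in for the client's refcount */
};

struct toy_task {
        struct toy_clnt *clnt;          /* reference held for the task's lifetime */
};

static void toy_clnt_put(struct toy_clnt *clnt)
{
        if (--clnt->refcount == 0)
                free(clnt);             /* freed only after the last reference */
}

static struct toy_task *toy_task_setup(struct toy_clnt *clnt)
{
        struct toy_task *task = calloc(1, sizeof(*task));

        if (!task)
                return NULL;
        clnt->refcount++;               /* the task pins the client */
        task->clnt = clnt;
        return task;
}

static void toy_task_release(struct toy_task *task)
{
        toy_clnt_put(task->clnt);       /* client may be freed here, not before */
        free(task);
}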

> > Olga's seeing some odd oopses on shutdown after testing our gss callback
> > code. And of course it's probably our callback patches at fault, but I
> > was starting to wonder if there was a problem with those destroy
> > messages arriving at the wrong moment. Any pointers welcomed.
> >
> > --b.

Is the new code touching the destroy path? If so, how, and what are the
new assumptions that are being made?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2008-03-19 19:35:30

by Myklebust, Trond

Subject: Re: asynchronous destroy messages


On Wed, 2008-03-19 at 10:16 -0400, Olga Kornievskaia wrote:
>
> J. Bruce Fields wrote:
> > When an rpc client is shut down, gss destroy messages are sent out
> > asynchronously, and nobody waits for them.
> >
> > If those rpc messages arrive after the client is completely shut down, I
> > assume they just get dropped by the networking code? Is it possible for
> > them to arrive while we're still in the process of shutting down, and if
> > so, what makes this safe?
> >
> > Olga's seeing some odd oopses on shutdown after testing our gss callback
> > code. And of course it's probably our callback patches at fault, but I
> > was starting to wonder if there was a problem with those destroy
> > messages arriving at the wrong moment. Any pointers welcomed.
> >
> >
> What I'm seeing is that the nfs4_client structure goes away while an
> rpc_client is still active. nfs4_client and rpc_client share a pointer
> to the rpc_stat structure, so when the nfs4_client memory goes away, the
> rpc_client oopses trying to dereference something within cl_stats.
>
> put_nfs4_client() calls rpc_shutdown_client(), which sends an async
> destroy-context message. That message shares the rpc_stat memory with
> the nfs4_client that is currently being released. Since the task is
> asynchronous, put_nfs4_client() completes and the memory goes away. The
> task handling the destroy-context message then wakes up (usually in
> call_timeout or call_refresh) and tries to dereference cl_stats.

clnt->cl_stats is supposed to point to a statically allocated structure
(in this case 'nfs_rpcstat'). While that can, in theory, disappear if
the user removes the NFS module, in practice that is very unlikely. I
therefore think you are rather seeing some sort of memory corruption
issue that is affecting the rpc_client.
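
Roughly, the two patterns are (the first declaration is what the NFS
client does; 'cb_state' is purely hypothetical, standing in for whatever
your callback patches allocate):

#include <linux/sunrpc/stats.h>         /* struct rpc_stat */

/* Safe: the stats object lives for the lifetime of the module, so
 * clnt->cl_stats cannot dangle (short of the module being unloaded). */
struct rpc_stat nfs_rpcstat;            /* what the NFS client does */

/* Risky: if the rpc_stat is embedded in (or allocated alongside) an
 * object that is freed while asynchronous tasks still reference the
 * rpc_client, cl_stats is left pointing at freed memory. */
struct cb_state {                       /* hypothetical */
        struct rpc_stat cb_stat;        /* goes away with the containing object */
        /* ... */
};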

Are you seeing this with the 'devel' branch from my git tree, or does it
only affect your patched kernel? If the latter, may we see your patches?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2008-03-19 20:14:57

by Olga Kornievskaia

Subject: Re: asynchronous destroy messages



J. Bruce Fields wrote:
> When an rpc client is shut down, gss destroy messages are sent out
> asynchronously, and nobody waits for them.
>
> If those rpc messages arrive after the client is completely shut down, I
> assume they just get dropped by the networking code? Is it possible for
> them to arrive while we're still in the process of shutting down, and if
> so, what makes this safe?
>
> Olga's seeing some odd oopses on shutdown after testing our gss callback
> code. And of course it's probably our callback patches at fault, but I
> was starting to wonder if there was a problem with those destroy
> messages arriving at the wrong moment. Any pointers welcomed.
>
>
What I'm seeing is that the nfs4_client structure goes away while an
rpc_client is still active. nfs4_client and rpc_client share a pointer
to the rpc_stat structure, so when the nfs4_client memory goes away, the
rpc_client oopses trying to dereference something within cl_stats.

put_nfs4_client() calls rpc_shutdown_client(), which sends an async
destroy-context message. That message shares the rpc_stat memory with
the nfs4_client that is currently being released. Since the task is
asynchronous, put_nfs4_client() completes and the memory goes away. The
task handling the destroy-context message then wakes up (usually in
call_timeout or call_refresh) and tries to dereference cl_stats.
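
To spell out the aliasing, here is a simplified sketch (the 'toy_'
structures are made up; only the function names in the comments are
real):

/*
 * Illustration of the oops: the stats the callback rpc_client uses
 * live inside the nfs4_client allocation, which is freed while the
 * asynchronous destroy-context task is still pending.
 */
#include <stdlib.h>

struct toy_stat {
        unsigned long rpccnt;
};

struct toy_nfs4_client {
        struct toy_stat stats;          /* shared with the rpc_client below */
};

struct toy_rpc_clnt {
        struct toy_stat *cl_stats;      /* borrowed pointer into toy_nfs4_client */
};

/*
 * Corresponds to put_nfs4_client(): rpc_shutdown_client() only queues
 * the async GSS destroy-context task, then the nfs4_client (and the
 * stats embedded in it) is freed.
 */
static void toy_put_nfs4_client(struct toy_nfs4_client *clp)
{
        /* ... rpc_shutdown_client() queues the destroy task ... */
        free(clp);                      /* cl_stats now points at freed memory */
}

/*
 * When rpciod later runs the destroy task, call_refresh()/call_timeout()
 * bump counters through tk_client->cl_stats and oops on the freed memory.
 */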



2008-03-19 22:32:31

by J. Bruce Fields

Subject: Re: asynchronous destroy messages

On Tue, Mar 18, 2008 at 06:15:15PM -0400, bfields wrote:
> When an rpc client is shut down, gss destroy messages are sent out
> asynchronously, and nobody waits for them.
>
> If those rpc messages arrive after the client is completely shut down, I
> assume they just get dropped by the networking code? Is it possible for
> them to arrive while we're still in the process of shutting down, and if
> so, what makes this safe?

In other words--when does the task that's created to send the destroy
message get freed, and if that doesn't happen till after the rpc client
and associated objects are freed, is there a risk of someone trying to
access those objects through fields in that task?

--b.

>
> Olga's seeing some odd oopses on shutdown after testing our gss callback
> code. And of course it's probably our callback patches at fault, but I
> was starting to wonder if there was a problem with those destroy
> messages arriving at the wrong moment. Any pointers welcomed.
>
> --b.

2008-03-19 22:32:30

by J. Bruce Fields

Subject: Re: asynchronous destroy messages

On Tue, Mar 18, 2008 at 07:27:34PM -0400, Trond Myklebust wrote:
>
> On Tue, 2008-03-18 at 18:32 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 18, 2008 at 06:15:15PM -0400, bfields wrote:
> > > When an rpc client is shut down, gss destroy messages are sent out
> > > asynchronously, and nobody waits for them.
> > >
> > > If those rpc messages arrive after the client is completely shut down, I
> > > assume they just get dropped by the networking code? Is it possible for
> > > them to arrive while we're still in the process of shutting down, and if
> > > so, what makes this safe?
> >
> > In other words--when does the task that's created to send the destroy
> > message get freed, and if that doesn't happen till after the rpc client
> > and associated objects are freed, is there a risk of someone trying to
> > access those objects through fields in that task?
> >
> > --b.
>
> Are you talking about the rpc_task? That gets destroyed in the normal
> fashion, and since the collection of all rpc_tasks should hold a
> reference to the rpc_client, there shouldn't normally be any ordering
> problems there.

Oh, OK, got it. Like Olga says, we were getting oopses due to
references to an rpc_stats structure that was freed on the assumption it
was safe to do so after rpc_shutdown_client() returned.

(In the case of the nfs client, the rpc_stats structure is declared
statically. But isn't that memory freed when the module is removed?)

--b.

>
> Note however that we do some magic in 'rpc_free_auth' to extend the
> natural lifetime of the rpc_client beyond that of the NFS 'shutdown'.
>
> > > Olga's seeing some odd oopses on shutdown after testing our gss callback
> > > code. And of course it's probably our callback patches at fault, but I
> > > was starting to wonder if there was a problem with those destroy
> > > messages arriving at the wrong moment. Any pointers welcomed.
> > >
> > > --b.
>
> Is the new code touching the destroy path? If so, how, and what are the
> new assumptions that are being made?
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com