2008-01-28 19:29:17

by Jeff Layton

Subject: [PATCH 0/2] NLM: fix use after free in lockd

The first patch in this set replaces the patch that I had originally
proposed for fixing the use after free problem. The earlier patch seemed
to work in some situations, but when the timing was a bit different it
was still possible for lockd to go down before the RPC callback
occurred. The fix is to just tear down the rpc client altogether.

The second patch in the set prevents a problem whereby lockd could stay
up indefinitely looping through nlmsvc_retry_blocked.

This set depends on the patchset I proposed earlier to convert lockd to
use the kthread API.

Comments and/or suggestions appreciated.

Signed-off-by: Jeff Layton <[email protected]>


2008-01-29 00:04:28

by Jeff Layton

Subject: Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts

On Mon, 28 Jan 2008 18:18:26 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2008 at 02:29:10PM -0500, Jeff Layton wrote:
> > It's possible for an RPC to outlive the lockd daemon that created
> > it, so we need to make sure that all RPC's are killed when lockd is
> > coming down. When nlm_shutdown_hosts is called, kill off all RPC
> > tasks associated with the host. Since we need to wait until they
> > have all gone away, we might as well just shut down the RPC client
> > altogether.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/lockd/host.c | 7 ++++++-
> > 1 files changed, 6 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/lockd/host.c b/fs/lockd/host.c
> > index ebec009..ca6b16f 100644
> > --- a/fs/lockd/host.c
> > +++ b/fs/lockd/host.c
> > @@ -379,8 +379,13 @@ nlm_shutdown_hosts(void)
> > /* First, make all hosts eligible for gc */
> > dprintk("lockd: nuking all hosts...\n");
> > for (chain = nlm_hosts; chain < nlm_hosts +
> > NLM_HOST_NRHASH; ++chain) {
> > - hlist_for_each_entry(host, pos, chain, h_hash)
> > + hlist_for_each_entry(host, pos, chain, h_hash) {
> > host->h_expires = jiffies - 1;
> > + if (host->h_rpcclnt) {
>
> So the only difference from the previous patch is this test and the
> following assignment? OK.
>
> --b.
>

No, the old patch used rpc_killall_tasks and this one uses
rpc_shutdown_client. I don't believe rpc_killall_tasks gives any
guarantee about when the tasks actually complete, and we need to know
that they're gone before we allow lockd to exit.
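
Roughly, the difference looks like this (illustrative fragment only, not
part of the patch; host->h_rpcclnt is the existing per-host RPC client
pointer):

	/* Old approach: ask the outstanding tasks on this client to die.
	 * rpc_killall_tasks() returns right away and makes no guarantee
	 * about when the tasks actually finish, so a callback can still
	 * fire after lockd has exited. */
	rpc_killall_tasks(host->h_rpcclnt);

	/* New approach (patch 1/2): shut the client down instead. That
	 * kills the outstanding tasks and waits for them to go away
	 * before the client is released, so nothing is left to call
	 * back into lockd afterward. */
	rpc_shutdown_client(host->h_rpcclnt);
	host->h_rpcclnt = NULL;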

> > + rpc_shutdown_client(host->h_rpcclnt);
> > + host->h_rpcclnt = NULL;
> > + }
> > + }
> > }
> >
> > /* Then, perform a garbage collection pass */
> > --
> > 1.5.3.7
> >


--
Jeff Layton <[email protected]>

2008-01-29 00:05:21

by J. Bruce Fields

Subject: Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down

On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> It's possible to end up continually looping in nlmsvc_retry_blocked()
> even when the lockd kthread has been requested to come down. I've been
> able to reproduce this fairly easily by having an RPC grant callback
> queued up to an RPC client that's not bound. If the other host is
> unresponsive, then the block never gets taken off the list, and lockd
> continually respawns new RPC's.
>
> If lockd is going down anyway, we don't want to keep retrying the
> blocks, so allow it to break out of this loop. Additionally, in that
> situation kthread_stop will have already woken lockd, so when
> nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> that it doesn't end up calling svc_recv and waiting for a long time.

Stupid question: how is this different from the kthread_stop() coming
just after this kthread_should_stop() call but before we call
svc_recv()? What keeps us from waiting in svc_recv() indefinitely after
kthread_stop()?
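
Concretely, the window looks like this in the patched main loop (a
paraphrased sketch, not verbatim kernel code):

	while (!kthread_should_stop()) {
		long timeout = MAX_SCHEDULE_TIMEOUT;

		/* ... grace period handling ... */
		timeout = nlmsvc_retry_blocked();	/* can block for a while */

		/* the check added by patch 2/2 */
		if (kthread_should_stop())
			break;

		/*
		 * If kthread_stop() runs right here -- after the check
		 * above but before lockd has gone to sleep below -- then
		 * nothing forces svc_recv() to return, and lockd can keep
		 * waiting there.
		 */
		err = svc_recv(rqstp, timeout);

		/* ... dispatch the request ... */
	}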

> This patch prevents this continual looping and allows lockd to
> eventually come down, but this can take quite some time since lockd
> won't check the condition in the nlmsvc_retry_blocked while loop
> until nlmsvc_grant_blocked returns. That won't happen until
> all of the portmap requests time out.

So, shutdown problems aside, does that mean an unresponsive client can
> cause lockd to stop serving lock requests in normal operation?

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/lockd/svc.c | 4 ++++
> fs/lockd/svclock.c | 3 ++-
> 2 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index 5752e1b..a0e5318 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -166,6 +166,10 @@ lockd(void *vrqstp)
> } else if (time_before(grace_period_expire, jiffies))
> clear_grace_period();
>
> + /* nlmsvc_retry_blocked can block, so check for kthread_stop */
> + if (kthread_should_stop())
> + break;
> +
> /*
> * Find a socket with data available and call its
> * recvfrom routine.
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index 2f4d8fa..d8324b6 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -29,6 +29,7 @@
> #include <linux/sunrpc/svc.h>
> #include <linux/lockd/nlm.h>
> #include <linux/lockd/lockd.h>
> +#include <linux/kthread.h>
>
> #define NLMDBG_FACILITY NLMDBG_SVCLOCK
>
> @@ -867,7 +868,7 @@ nlmsvc_retry_blocked(void)
> unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
> struct nlm_block *block;
>
> - while (!list_empty(&nlm_blocked)) {
> + while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
> block = list_entry(nlm_blocked.next, struct nlm_block, b_list);
>
> if (block->b_when == NLM_NEVER)
> --
> 1.5.3.7
>

2008-01-29 11:23:18

by Jeff Layton

Subject: Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down

On Mon, 28 Jan 2008 22:02:49 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote:
> > On Mon, 28 Jan 2008 20:23:51 -0500
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> > > > On Mon, 28 Jan 2008 19:05:17 -0500
> > > > "J. Bruce Fields" <[email protected]> wrote:
> > > >
> > > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > > > > It's possible to end up continually looping in
> > > > > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > > > > requested to come down. I've been able to reproduce this
> > > > > > fairly easily by having an RPC grant callback queued up to
> > > > > > an RPC client that's not bound. If the other host is
> > > > > > unresponsive, then the block never gets taken off the list,
> > > > > > and lockd continually respawns new RPC's.
> > > > > >
> > > > > > If lockd is going down anyway, we don't want to keep
> > > > > > retrying the blocks, so allow it to break out of this loop.
> > > > > > Additionally, in that situation kthread_stop will have
> > > > > > already woken lockd, so when nlmsvc_retry_blocked returns,
> > > > > > have lockd check for kthread_stop() so that it doesn't end
> > > > > > up calling svc_recv and waiting for a long time.
> > > > >
> > > > > Stupid question: how is this different from the
> > > > > kthread_stop() coming just after this kthread_should_stop()
> > > > > call but before we call svc_recv()? What keeps us from
> > > > > waiting in svc_recv() indefinitely after kthread_stop()?
> > > > >
> > > >
> > > > Nothing at all, unfortunately :-(
> > > >
> > > > Since we've taken signalling out of the lockd_down codepath,
> > > > this gets a lot trickier than it used to be. I'm starting to
> > > > wonder if we should go back to sending a signal on lockd_down.
> > >
> > > OK, thanks. So do I keep these patches, or not? This sounds
> > > like a regression (if perhaps a very minor one--I'm not quite
> > > clear). Help!
> > >
> >
> > Well, if we reintroduce signalling in lockd_down then this
> > particular problem goes away. It may be reasonable to add that back
> > into the mix for now.
> >
> > As to the more general question of whether to keep these patches...
> >
> > Most of them are pretty innocuous, but the patch to convert to
> > kthread API is the biggest change. 2.6.25 might be a bit aggressive
> > for that. It wouldn't hurt to let the conversion to kthread API
> > brew until 2.6.26.
> >
> > lockd still does have a use after free problem though, so patch 1/2
> > here might be worth considering even without changing things over to
> > kthreads. I've not tested it on a kernel without the kthread
> > patches, however. I can do that if you think that's the right
> > approach.
>
> Pfft, I was hoping you'd tell me what to do.
>
> But, yes, it sounds like dropping the kthread conversion and sending
> in that one bugfix is probably wisest. (But I assume we should keep
> the first of those three patches, "SUNRPC: spin svc_rqst
> initialization to its own function".)
>

Yes. The first patch fixes a possible NULL pointer dereference so I
think we want that one here. This patch:

Subject: [PATCH 2/4] SUNRPC: export svc_sock_update_bufs

...should be safe too, though it's not strictly needed until we convert
lockd to the kthread API.

So at this point we have 2 questions:

1) is there a way to make sure we don't block in svc_recv() without
resorting to signals, and can we reasonably mix signaling +
kthread_stop? There may be a chicken-and-egg problem involved here,
though I haven't thought it through yet...

2) can an unresponsive client make lockd loop continuously in
nlmsvc_retry_blocked() and prevent it from serving new lock requests?

I'll try to get answers to both of these...

--
Jeff Layton <[email protected]>

2008-01-28 19:29:17

by Jeff Layton

Subject: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down

It's possible to end up continually looping in nlmsvc_retry_blocked()
even when the lockd kthread has been requested to come down. I've been
able to reproduce this fairly easily by having an RPC grant callback
queued up to an RPC client that's not bound. If the other host is
unresponsive, then the block never gets taken off the list, and lockd
continually respawns new RPC's.

If lockd is going down anyway, we don't want to keep retrying the
blocks, so allow it to break out of this loop. Additionally, in that
situation kthread_stop will have already woken lockd, so when
nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
that it doesn't end up calling svc_recv and waiting for a long time.

This patch prevents this continual looping and allows lockd to
eventually come down, but this can take quite some time since lockd
won't check the condition in the nlmsvc_retry_blocked while loop
until nlmsvc_grant_blocked returns. That won't happen until
all of the portmap requests time out.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 4 ++++
fs/lockd/svclock.c | 3 ++-
2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 5752e1b..a0e5318 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -166,6 +166,10 @@ lockd(void *vrqstp)
 		} else if (time_before(grace_period_expire, jiffies))
 			clear_grace_period();
 
+		/* nlmsvc_retry_blocked can block, so check for kthread_stop */
+		if (kthread_should_stop())
+			break;
+
 		/*
 		 * Find a socket with data available and call its
 		 * recvfrom routine.
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index 2f4d8fa..d8324b6 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -29,6 +29,7 @@
 #include <linux/sunrpc/svc.h>
 #include <linux/lockd/nlm.h>
 #include <linux/lockd/lockd.h>
+#include <linux/kthread.h>
 
 #define NLMDBG_FACILITY NLMDBG_SVCLOCK
 
@@ -867,7 +868,7 @@ nlmsvc_retry_blocked(void)
 	unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
 	struct nlm_block *block;
 
-	while (!list_empty(&nlm_blocked)) {
+	while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
 		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);
 
 		if (block->b_when == NLM_NEVER)
--
1.5.3.7


2008-01-28 19:29:17

by Jeff Layton

Subject: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts

It's possible for an RPC to outlive the lockd daemon that created it, so
we need to make sure that all RPC's are killed when lockd is coming
down. When nlm_shutdown_hosts is called, kill off all RPC tasks
associated with the host. Since we need to wait until they have all gone
away, we might as well just shut down the RPC client altogether.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/host.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index ebec009..ca6b16f 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -379,8 +379,13 @@ nlm_shutdown_hosts(void)
 	/* First, make all hosts eligible for gc */
 	dprintk("lockd: nuking all hosts...\n");
 	for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) {
-		hlist_for_each_entry(host, pos, chain, h_hash)
+		hlist_for_each_entry(host, pos, chain, h_hash) {
 			host->h_expires = jiffies - 1;
+			if (host->h_rpcclnt) {
+				rpc_shutdown_client(host->h_rpcclnt);
+				host->h_rpcclnt = NULL;
+			}
+		}
 	}
 
 	/* Then, perform a garbage collection pass */
--
1.5.3.7


2008-01-28 23:18:32

by J. Bruce Fields

Subject: Re: [PATCH 1/2] NLM: tear down RPC clients in nlm_shutdown_hosts

On Mon, Jan 28, 2008 at 02:29:10PM -0500, Jeff Layton wrote:
> It's possible for an RPC to outlive the lockd daemon that created it, so
> we need to make sure that all RPC's are killed when lockd is coming
> down. When nlm_shutdown_hosts is called, kill off all RPC tasks
> associated with the host. Since we need to wait until they have all gone
> away, we might as well just shut down the RPC client altogether.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/lockd/host.c | 7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/fs/lockd/host.c b/fs/lockd/host.c
> index ebec009..ca6b16f 100644
> --- a/fs/lockd/host.c
> +++ b/fs/lockd/host.c
> @@ -379,8 +379,13 @@ nlm_shutdown_hosts(void)
> /* First, make all hosts eligible for gc */
> dprintk("lockd: nuking all hosts...\n");
> for (chain = nlm_hosts; chain < nlm_hosts + NLM_HOST_NRHASH; ++chain) {
> - hlist_for_each_entry(host, pos, chain, h_hash)
> + hlist_for_each_entry(host, pos, chain, h_hash) {
> host->h_expires = jiffies - 1;
> + if (host->h_rpcclnt) {

So the only difference from the previous patch is this test and the
following assignment? OK.

--b.

> + rpc_shutdown_client(host->h_rpcclnt);
> + host->h_rpcclnt = NULL;
> + }
> + }
> }
>
> /* Then, perform a garbage collection pass */
> --
> 1.5.3.7
>

2008-01-29 01:12:14

by Jeff Layton

Subject: Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down

On Mon, 28 Jan 2008 19:05:17 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > It's possible to end up continually looping in
> > nlmsvc_retry_blocked() even when the lockd kthread has been
> > requested to come down. I've been able to reproduce this fairly
> > easily by having an RPC grant callback queued up to an RPC client
> > that's not bound. If the other host is unresponsive, then the block
> > never gets taken off the list, and lockd continually respawns new
> > RPC's.
> >
> > If lockd is going down anyway, we don't want to keep retrying the
> > blocks, so allow it to break out of this loop. Additionally, in that
> > situation kthread_stop will have already woken lockd, so when
> > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > that it doesn't end up calling svc_recv and waiting for a long time.
>
> Stupid question: how is this different from the kthread_stop() coming
> just after this kthread_should_stop() call but before we call
> svc_recv()? What keeps us from waiting in svc_recv() indefinitely
> after kthread_stop()?
>

Nothing at all, unfortunately :-(

Since we've taken signalling out of the lockd_down codepath, this gets
a lot trickier than it used to be. I'm starting to wonder if we should
go back to sending a signal on lockd_down.
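
Something like this, roughly (hypothetical sketch only, not a tested
patch; "nlmsvc_task" here stands for the task pointer the kthread
conversion keeps for lockd):

	static void lockd_down_sketch(void)
	{
		/* pop lockd out of svc_recv(), as the old signal-based
		 * shutdown did (lockd would need allow_signal(SIGKILL)
		 * again for this to work) */
		send_sig(SIGKILL, nlmsvc_task, 1);

		/* then wait for the kthread to actually exit; for this
		 * ordering to be safe, lockd must not exit on the signal
		 * alone but only once kthread_should_stop() is true */
		kthread_stop(nlmsvc_task);
	}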

> > This patch prevents this continual looping and allows lockd to
> > eventually come down, but this can take quite some time since lockd
> > won't check the condition in the nlmsvc_retry_blocked while loop
> > until nlmsvc_grant_blocked returns. That won't happen until
> > all of the portmap requests time out.
>
> So, shutdown problems aside, does that mean an unresponsive client can
> cause lockd to stop serving lock requests in normal operation?
>

Possibly.

I think that in general what happens is that blocks eventually get
reinserted into the list with an expire time that is well after the
current jiffies and we eventually hit this condition:

if (time_after(block->b_when,jiffies)) {
	timeout = block->b_when - jiffies;
	break;
}

...but I was seeing lockd stay in this loop over several iterations
without returning back up to the main lockd loop. I think what was
happening was that we were reinserting the block with a later jiffies
value in nlmsvc_grant_blocked, but the rpcbind attempts would end up
taking longer than that timeout. So when we returned from
nlmsvc_grant_blocked(), that condition would no longer fire, and we'd
end up retrying the grant callback again.
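
In loop form, the sequence I think I was seeing looks like this
(annotated paraphrase of nlmsvc_retry_blocked(), not verbatim code):

	while (!list_empty(&nlm_blocked)) {
		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);

		if (block->b_when == NLM_NEVER)
			break;
		if (time_after(block->b_when,jiffies)) {
			/* only reached if we get back here before the
			 * block's new b_when has already passed */
			timeout = block->b_when - jiffies;
			break;
		}

		/* requeues the block with a b_when a little in the future
		 * and kicks off the GRANTED callback; if the rpcbind
		 * attempts take longer than that, b_when has already
		 * expired by the time we come back around, so the break
		 * above never triggers */
		nlmsvc_grant_blocked(block);
	}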

As to why we don't see this more often, I'm not sure. I probably need
to experiment a bit more with it to make sure I'm understanding this
correctly.

Let me do that and get back to you...

Thanks,
--
Jeff Layton <[email protected]>

2008-01-29 01:23:55

by J. Bruce Fields

Subject: Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down

On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> On Mon, 28 Jan 2008 19:05:17 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > It's possible to end up continually looping in
> > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > requested to come down. I've been able to reproduce this fairly
> > > easily by having an RPC grant callback queued up to an RPC client
> > > that's not bound. If the other host is unresponsive, then the block
> > > never gets taken off the list, and lockd continually respawns new
> > > RPC's.
> > >
> > > If lockd is going down anyway, we don't want to keep retrying the
> > > blocks, so allow it to break out of this loop. Additionally, in that
> > > situation kthread_stop will have already woken lockd, so when
> > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > > that it doesn't end up calling svc_recv and waiting for a long time.
> >
> > Stupid question: how is this different from the kthread_stop() coming
> > just after this kthread_should_stop() call but before we call
> > svc_recv()? What keeps us from waiting in svc_recv() indefinitely
> > after kthread_stop()?
> >
>
> Nothing at all, unfortunately :-(
>
> Since we've taken signalling out of the lockd_down codepath, this gets
> a lot trickier than it used to be. I'm starting to wonder if we should
> go back to sending a signal on lockd_down.

OK, thanks. So do I keep these patches, or not? This sounds like a
regression (if perhaps a very minor one--I'm not quite clear). Help!

--b.

>
> > > This patch prevents this continual looping and allows lockd to
> > > eventually come down, but this can take quite some time since lockd
> > > won't check the condition in the nlmsvc_retry_blocked while loop
> > > until nlmsvc_grant_blocked returns. That won't happen until
> > > all of the portmap requests time out.
> >
> > So, shutdown problems aside, does that mean an unresponsive client can
> > cause lockd to stop serving lock requests in normal operation?
> >
>
> Possibly.
>
> I think that in general what happens is that blocks eventually get
> reinserted into the list with an expire time that is well after the
> current jiffies and we eventually hit this condition:
>
> if (time_after(block->b_when,jiffies)) {
> timeout = block->b_when - jiffies;
> break;
> }
>
> ...but I was seeing lockd loop in this loop over several iterations
> without returning back up to the main lockd loop. I think what was
> happening was that we were inserting the block with a later jiffies in
> nlmsvc_grant_blocked, but the rpcbind attempts would end up taking
> longer than that timeout. So when we returned from
> nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again.
>
> As to why we don't see this more often, I'm not sure. I probably need
> to experiment a bit more with it to make sure I'm understanding this
> correctly.
>
> Let me do that and get back to you...

2008-01-29 02:30:49

by Jeff Layton

Subject: Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down

On Mon, 28 Jan 2008 20:23:51 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> > On Mon, 28 Jan 2008 19:05:17 -0500
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > > It's possible to end up continually looping in
> > > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > > requested to come down. I've been able to reproduce this fairly
> > > > easily by having an RPC grant callback queued up to an RPC client
> > > > that's not bound. If the other host is unresponsive, then the block
> > > > never gets taken off the list, and lockd continually respawns new
> > > > RPC's.
> > > >
> > > > If lockd is going down anyway, we don't want to keep retrying the
> > > > blocks, so allow it to break out of this loop. Additionally, in that
> > > > situation kthread_stop will have already woken lockd, so when
> > > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > > > that it doesn't end up calling svc_recv and waiting for a long time.
> > >
> > > Stupid question: how is this different from the kthread_stop() coming
> > > just after this kthread_should_stop() call but before we call
> > > svc_recv()? What keeps us from waiting in svc_recv() indefinitely
> > > after kthread_stop()?
> > >
> >
> > Nothing at all, unfortunately :-(
> >
> > Since we've taken signalling out of the lockd_down codepath, this gets
> > a lot trickier than it used to be. I'm starting to wonder if we should
> > go back to sending a signal on lockd_down.
>
> OK, thanks. So do I keep these patches, or not? This sounds like a
> regression (if perhaps a very minor one--I'm not quite clear). Help!
>

Well, if we reintroduce signalling in lockd_down then this particular
problem goes away. It may be reasonable to add that back into the mix
for now.

As to the more general question of whether to keep these patches...

Most of them are pretty innocuous, but the patch to convert to kthread
API is the biggest change. 2.6.25 might be a bit aggressive for that. It
wouldn't hurt to let the conversion to kthread API brew until 2.6.26.

lockd still does have a use after free problem though, so patch 1/2
here might be worth considering even without changing things over to
kthreads. I've not tested it on a kernel without the kthread patches,
however. I can do that if you think that's the right approach.

> >
> > > > This patch prevents this continual looping and allows lockd to
> > > > eventually come down, but this can take quite some time since lockd
> > > > won't check the condition in the nlmsvc_retry_blocked while loop
> > > > until nlmsvc_grant_blocked returns. That won't happen until
> > > > all of the portmap requests time out.
> > >
> > > So, shutdown problems aside, does that mean an unresponsive client can
> > > cause lockd to stop serving lock requests in normal operation?
> > >
> >
> > Possibly.
> >
> > I think that in general what happens is that blocks eventually get
> > reinserted into the list with an expire time that is well after the
> > current jiffies and we eventually hit this condition:
> >
> > if (time_after(block->b_when,jiffies)) {
> > timeout = block->b_when - jiffies;
> > break;
> > }
> >
> > ...but I was seeing lockd loop in this loop over several iterations
> > without returning back up to the main lockd loop. I think what was
> > happening was that we were inserting the block with a later jiffies in
> > nlmsvc_grant_blocked, but the rpcbind attempts would end up taking
> > longer than that timeout. So when we returned from
> > nlmsvc_grant_blocked() that condition would no longer fire, and we'd end up retrying the grant callback again.
> >
> > As to why we don't see this more often, I'm not sure. I probably need
> > to experiment a bit more with it to make sure I'm understanding this
> > correctly.
> >
> > Let me do that and get back to you...


--
Jeff Layton <[email protected]>

2008-01-29 03:02:57

by J. Bruce Fields

Subject: Re: [PATCH 2/2] NLM: break out of nlmsvc_retry_blocked if lockd is coming down

On Mon, Jan 28, 2008 at 09:29:42PM -0500, Jeff Layton wrote:
> On Mon, 28 Jan 2008 20:23:51 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Jan 28, 2008 at 08:12:10PM -0500, Jeff Layton wrote:
> > > On Mon, 28 Jan 2008 19:05:17 -0500
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > On Mon, Jan 28, 2008 at 02:29:11PM -0500, Jeff Layton wrote:
> > > > > It's possible to end up continually looping in
> > > > > nlmsvc_retry_blocked() even when the lockd kthread has been
> > > > > requested to come down. I've been able to reproduce this fairly
> > > > > easily by having an RPC grant callback queued up to an RPC client
> > > > > that's not bound. If the other host is unresponsive, then the block
> > > > > never gets taken off the list, and lockd continually respawns new
> > > > > RPC's.
> > > > >
> > > > > If lockd is going down anyway, we don't want to keep retrying the
> > > > > blocks, so allow it to break out of this loop. Additionally, in that
> > > > > situation kthread_stop will have already woken lockd, so when
> > > > > nlmsvc_retry_blocked returns, have lockd check for kthread_stop() so
> > > > > that it doesn't end up calling svc_recv and waiting for a long time.
> > > >
> > > > Stupid question: how is this different from the kthread_stop() coming
> > > > just after this kthread_should_stop() call but before we call
> > > > svc_recv()? What keeps us from waiting in svc_recv() indefinitely
> > > > after kthread_stop()?
> > > >
> > >
> > > Nothing at all, unfortunately :-(
> > >
> > > Since we've taken signalling out of the lockd_down codepath, this gets
> > > a lot trickier than it used to be. I'm starting to wonder if we should
> > > go back to sending a signal on lockd_down.
> >
> > OK, thanks. So do I keep these patches, or not? This sounds like a
> > regression (if perhaps a very minor one--I'm not quite clear). Help!
> >
>
> Well, if we reintroduce signalling in lockd_down then this particular
> problem goes away. It may be reasonable to add that back into the mix
> for now.
>
> As to the more general question of whether to keep these patches...
>
> Most of them are pretty innocuous, but the patch to convert to kthread
> API is the biggest change. 2.6.25 might be a bit aggressive for that. It
> wouldn't hurt to let the conversion to kthread API brew until 2.6.26.
>
> lockd still does have a use after free problem though, so patch 1/2
> here might be worth considering even without changing things over to
> kthreads. I've not tested it on a kernel without the kthread patches,
> however. I can do that if you think that's the right approach.

Pfft, I was hoping you'd tell me what to do.

But, yes, it sounds like dropping the kthread conversion and sending in
that one bugfix is probably wisest. (But I assume we should keep the
first of those three patches, "SUNRPC: spin svc_rqst initialization to
its own function".)

--b.