2007-12-21 15:28:20

by Jeff Layton

[permalink] [raw]
Subject: [PATCH 0/6] Intro: convert lockd to kthread and fix use-after-free (try #4)

This is the fourth patchset to fix the use-after-free problem in lockd,
which we originally discussed back in October. The main problem is
detailed in the last patch of the series. Along the way, Christoph
Hellwig mentioned that it would be advantageous to convert lockd to use
the kthread API. This patchset first makes that change and then builds
on it to fix the use-after-free problem. It also fixes a couple of
minor bugs in the current lockd implementation.

The main change from the last patchset is that I've dropped the first
patch that changed svc_pool_map_set_cpumask, since it's no longer
strictly needed. I've also done some style cleanups recommended by
checkpatch.pl.

I've done some basic smoke testing and everything seems to work as
expected. I've also tested this against the reproducer that I have for
the use-after-free problem and this does fix it. I've tried to make
this cleanly bisectable, but have only really tested the final result.

I'd like to see this soak in the nfs-2.6 git tree for a bit and then be
considered for 2.6.25. Thoughts?

Thanks,

Signed-off-by: Jeff Layton <[email protected]>



2007-12-21 20:47:14

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Fri, 21 Dec 2007 15:25:29 -0500
Chuck Lever <[email protected]> wrote:

> On Dec 21, 2007, at 2:54 PM, Jeff Layton wrote:
> > On Fri, 21 Dec 2007 12:51:25 -0500
> > Chuck Lever <[email protected]> wrote:
> >
> >> You could easily post a message to the kernel log that says "lockd
> >> was signaled to stop, but is waiting for outstanding requests to
> >> complete."
> >
> > Here's a respun patch 6 that contains the warning message suggested
> > by Chuck.
> >
> > Thoughts?
> >
> > ------------[snip]--------------
> >
> > NLM: Add reference counting to lockd
> >
> > ...and only have lockd exit when the last reference is dropped.
> >
> > The problem is this:
> >
> > When a lock that a client is blocking on comes free, lockd does
> > this in
> > nlmsvc_grant_blocked():
> >
> > nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
> > &nlmsvc_grant_ops);
> >
> > the callback from this call is nlmsvc_grant_callback(). That
> > function does this at the end to wake up lockd:
> >
> > svc_wake_up(block->b_daemon);
> >
> > However there is no guarantee that lockd will be up when this
> > happens. If someone shuts down or restarts lockd before the async
> > call completes,
> > then the b_daemon pointer will point to freed memory and the
> > kernel may
> > oops.
> >
> > I first noticed this on older kernels and had mistakenly thought
> > that newer kernels weren't susceptible, but that's not correct.
> > There's a bit
> > of a race to make sure that the nlm_host is bound when the async
> > call is
> > done, but I can now reproduce this at will on current kernels.
> >
> > This patch is based on Trond's suggestion to add a new reference
> > counter
> > to lockd, and only allows lockd to go down when it reaches 0. With
> > this
> > change we can't use kthread_stop here. nlmsvc_unlink_block is
> > called by
> > lockd and a kthread can't call kthread_stop on itself. So the patch
> > changes lockd to check the refcount itself and to return if it goes
> > to 0. We do the checking and exit while holding the nlmsvc_mutex to
> > make sure that a new lockd is not started until the old one is down.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/lockd/svc.c | 52 +++++++++++++++++++++++++++++++++---------
> > fs/lockd/svclock.c | 5 ++++
> > include/linux/lockd/lockd.h | 1 +
> > 3 files changed, 47 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> > index d7209ea..71a4f65 100644
> > --- a/fs/lockd/svc.c
> > +++ b/fs/lockd/svc.c
> > @@ -51,6 +51,7 @@ static DEFINE_MUTEX(nlmsvc_mutex);
> > static unsigned int nlmsvc_users;
> > static struct task_struct *nlmsvc_task;
> > static struct svc_serv *nlmsvc_serv;
> > +atomic_t nlmsvc_ref = ATOMIC_INIT(0);
> > int nlmsvc_grace_period;
> > unsigned long nlmsvc_timeout;
> >
> > @@ -134,7 +135,10 @@ lockd(void *vrqstp)
> >
> > set_freezable();
> >
> > - /* Process request with signals blocked, but allow
> > SIGKILL. */
> > + /*
> > + * Process request with signals blocked, but allow SIGKILL
> > which
> > + * signifies that lockd should drop all of its locks.
> > + */
> > allow_signal(SIGKILL);
> >
> > dprintk("NFS locking service started (ver " LOCKD_VERSION
> > ").\n"); @@ -147,15 +151,19 @@ lockd(void *vrqstp)
> >
> > /*
> > * The main request loop. We don't terminate until the last
> > - * NFS mount or NFS daemon has gone away, and we've been
> > sent a
> > - * signal, or else another process has taken over our job.
> > + * NFS mount or NFS daemon has gone away, and the
> > nlm_blocked
> > + * list is empty. The nlmsvc_mutex ensures that we prevent
> > a
> > + * new lockd from being started before the old one is down.
> > */
> > - while (!kthread_should_stop()) {
> > + mutex_lock(&nlmsvc_mutex);
> > + while (atomic_read(&nlmsvc_ref) != 0) {
> > long timeout = MAX_SCHEDULE_TIMEOUT;
> > char buf[RPC_MAX_ADDRBUFLEN];
> >
> > + mutex_unlock(&nlmsvc_mutex);
> > +
> > if (try_to_freeze())
> > - continue;
> > + goto again;
> >
> > if (signalled()) {
> > flush_signals(current);
> > @@ -182,11 +190,12 @@ lockd(void *vrqstp)
> > */
> > err = svc_recv(rqstp, timeout);
> > if (err == -EAGAIN || err == -EINTR)
> > - continue;
> > + goto again;
> > if (err < 0) {
> > printk(KERN_WARNING
> > "lockd: terminating on error %d\n",
> > -err);
> > + mutex_lock(&nlmsvc_mutex);
> > break;
> > }
> >
> > @@ -194,19 +203,22 @@ lockd(void *vrqstp)
> > svc_print_addr(rqstp, buf,
> > sizeof(buf)));
> >
> > svc_process(rqstp);
> > +again:
> > + mutex_lock(&nlmsvc_mutex);
> > }
> >
> > - flush_signals(current);
> > -
> > /*
> > - * Check whether there's a new lockd process before
> > - * shutting down the hosts and clearing the slot.
> > + * at this point lockd is committed to going down. We hold
> > the
> > + * nlmsvc_mutex until just before exit to prevent a new one
> > + * from starting before it's down.
> > */
> > + flush_signals(current);
> > if (nlmsvc_ops)
> > nlmsvc_invalidate_all();
> > nlm_shutdown_hosts();
> > nlmsvc_task = NULL;
> > nlmsvc_serv = NULL;
> > + mutex_unlock(&nlmsvc_mutex);
> >
> > /* Exit the RPC thread */
> > svc_exit_thread(rqstp);
> > @@ -269,6 +281,10 @@ lockd_up(int proto) /* Maybe add a 'family'
> > option when IPv6 is supported ?? */
> > int error = 0;
> >
> > mutex_lock(&nlmsvc_mutex);
> > +
> > + if (!nlmsvc_users)
> > + atomic_inc(&nlmsvc_ref);
> > +
> > /*
> > * Check whether we're already up and running.
> > */
> > @@ -328,6 +344,8 @@ lockd_up(int proto) /* Maybe add a 'family'
> > option when IPv6 is supported ?? */
> > destroy_and_out:
> > svc_destroy(serv);
> > out:
> > + if (!nlmsvc_users && error)
> > + atomic_dec(&nlmsvc_ref);
> > if (!error)
> > nlmsvc_users++;
> > mutex_unlock(&nlmsvc_mutex);
> > @@ -357,7 +375,19 @@ lockd_down(void)
> > goto out;
> > }
> > warned = 0;
> > - kthread_stop(nlmsvc_task);
> > + if (atomic_sub_return(1, &nlmsvc_ref) != 0)
> > + printk(KERN_WARNING "lockd_down: lockd signalled
> > to go down, "
> > + "but is waiting for outstanding requests
> > to "
> > + "complete.\n");
>
> We could quibble about the proper spelling of "signaled".
>
> "lockd_down: lockd is waiting for outstanding requests to complete
> before exiting."
>
> might be less awkward.
>
> Otherwise, I think this is helpful.
>

m-w.com says either form is correct, though you're probably right that
your latter suggestion is less awkward (and also more correct, since
lockd technically hasn't been sent a signal at that point).

I'll plan to let this sit for a day or two and see if anyone else has
comments on the set. If not, then I'll respin with a clarified warning
message.

Thanks for the review,
--
Jeff Layton <[email protected]>

2007-12-21 15:28:21

by Jeff Layton

[permalink] [raw]
Subject: [PATCH 5/6] NLM: Convert lockd to use kthreads

Have lockd_up start lockd using kthread_run. With this change,
lockd_down now blocks until lockd actually exits, so there's no longer
any need for the waitqueue code at the end of lockd_down. This also
means that only one lockd can be running at a time, which simplifies
the code within lockd's main loop.
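
For background, here is a minimal sketch (not taken from the patch) of
the kthread start/stop contract that this conversion relies on:

static int worker(void *data)
{
        while (!kthread_should_stop()) {
                /* do work, then sleep until woken or told to stop */
        }
        return 0;       /* this becomes kthread_stop()'s return value */
}

/* start: the thread exists and has been scheduled by the time this returns */
task = kthread_run(worker, NULL, "worker");

/* stop: set the should-stop flag, wake the thread, and block until
 * worker() returns -- which is why lockd_down no longer needs its own
 * waitqueue to wait for lockd to exit */
kthread_stop(task);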

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 79 ++++++++++++++++++++++++++-----------------------------
1 files changed, 37 insertions(+), 42 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 03a83a0..d7209ea 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -25,6 +25,7 @@
#include <linux/smp.h>
#include <linux/smp_lock.h>
#include <linux/mutex.h>
+#include <linux/kthread.h>
#include <linux/freezer.h>

#include <linux/sunrpc/types.h>
@@ -48,13 +49,12 @@ EXPORT_SYMBOL(nlmsvc_ops);

static DEFINE_MUTEX(nlmsvc_mutex);
static unsigned int nlmsvc_users;
-static pid_t nlmsvc_pid;
+static struct task_struct *nlmsvc_task;
static struct svc_serv *nlmsvc_serv;
int nlmsvc_grace_period;
unsigned long nlmsvc_timeout;

static DECLARE_COMPLETION(lockd_start_done);
-static DECLARE_WAIT_QUEUE_HEAD(lockd_exit);

/*
* These can be set at insmod time (useful for NFS as root filesystem),
@@ -111,10 +111,11 @@ static inline void clear_grace_period(void)
/*
* This is the lockd kernel thread
*/
-static void
-lockd(struct svc_rqst *rqstp)
+static int
+lockd(void *vrqstp)
{
int err = 0;
+ struct svc_rqst *rqstp = vrqstp;
unsigned long grace_period_expire;

/* Lock module and set up kernel thread */
@@ -128,11 +129,9 @@ lockd(struct svc_rqst *rqstp)
/*
* Let our maker know we're running.
*/
- nlmsvc_pid = current->pid;
nlmsvc_serv = rqstp->rq_server;
complete(&lockd_start_done);

- daemonize("lockd");
set_freezable();

/* Process request with signals blocked, but allow SIGKILL. */
@@ -151,7 +150,7 @@ lockd(struct svc_rqst *rqstp)
* NFS mount or NFS daemon has gone away, and we've been sent a
* signal, or else another process has taken over our job.
*/
- while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) {
+ while (!kthread_should_stop()) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];

@@ -203,23 +202,19 @@ lockd(struct svc_rqst *rqstp)
* Check whether there's a new lockd process before
* shutting down the hosts and clearing the slot.
*/
- if (!nlmsvc_pid || current->pid == nlmsvc_pid) {
- if (nlmsvc_ops)
- nlmsvc_invalidate_all();
- nlm_shutdown_hosts();
- nlmsvc_pid = 0;
- nlmsvc_serv = NULL;
- } else
- printk(KERN_DEBUG
- "lockd: new process, skipping host shutdown\n");
- wake_up(&lockd_exit);
+ if (nlmsvc_ops)
+ nlmsvc_invalidate_all();
+ nlm_shutdown_hosts();
+ nlmsvc_task = NULL;
+ nlmsvc_serv = NULL;

/* Exit the RPC thread */
svc_exit_thread(rqstp);

/* Release module */
unlock_kernel();
- module_put_and_exit(0);
+ module_put(THIS_MODULE);
+ return 0;
}


@@ -269,14 +264,15 @@ static int make_socks(struct svc_serv *serv, int proto)
int
lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
{
- struct svc_serv * serv;
- int error = 0;
+ struct svc_serv *serv;
+ struct svc_rqst *rqstp;
+ int error = 0;

mutex_lock(&nlmsvc_mutex);
/*
* Check whether we're already up and running.
*/
- if (nlmsvc_pid) {
+ if (nlmsvc_task) {
if (proto)
error = make_socks(nlmsvc_serv, proto);
goto out;
@@ -303,11 +299,24 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
/*
* Create the kernel thread and wait for it to start.
*/
+ rqstp = svc_prepare_thread(serv, &serv->sv_pools[0]);
+ if (IS_ERR(rqstp)) {
+ error = PTR_ERR(rqstp);
+ printk(KERN_WARNING
+ "lockd_up: svc_rqst allocation failed, error=%d\n",
+ error);
+ goto destroy_and_out;
+ }
+
+ svc_sock_update_bufs(serv);
init_completion(&lockd_start_done);
- error = svc_create_thread(lockd, serv);
- if (error) {
+ nlmsvc_task = kthread_run(lockd, rqstp, serv->sv_name);
+ if (IS_ERR(nlmsvc_task)) {
+ error = PTR_ERR(nlmsvc_task);
+ nlmsvc_task = NULL;
printk(KERN_WARNING
- "lockd_up: create thread failed, error=%d\n", error);
+ "lockd_up: kthread_run failed, error=%d\n", error);
+ svc_exit_thread(rqstp);
goto destroy_and_out;
}
wait_for_completion(&lockd_start_done);
@@ -339,30 +348,16 @@ lockd_down(void)
if (--nlmsvc_users)
goto out;
} else
- printk(KERN_WARNING "lockd_down: no users! pid=%d\n", nlmsvc_pid);
+ printk(KERN_WARNING "lockd_down: no users! task=%p\n",
+ nlmsvc_task);

- if (!nlmsvc_pid) {
+ if (!nlmsvc_task) {
if (warned++ == 0)
printk(KERN_WARNING "lockd_down: no lockd running.\n");
goto out;
}
warned = 0;
-
- kill_proc(nlmsvc_pid, SIGKILL, 1);
- /*
- * Wait for the lockd process to exit, but since we're holding
- * the lockd semaphore, we can't wait around forever ...
- */
- clear_thread_flag(TIF_SIGPENDING);
- interruptible_sleep_on_timeout(&lockd_exit, HZ);
- if (nlmsvc_pid) {
- printk(KERN_WARNING
- "lockd_down: lockd failed to exit, clearing pid\n");
- nlmsvc_pid = 0;
- }
- spin_lock_irq(&current->sighand->siglock);
- recalc_sigpending();
- spin_unlock_irq(&current->sighand->siglock);
+ kthread_stop(nlmsvc_task);
out:
mutex_unlock(&nlmsvc_mutex);
}
--
1.5.3.6


2007-12-21 15:28:23

by Jeff Layton

[permalink] [raw]
Subject: [PATCH 4/6] NLM: Have lockd call try_to_freeze

lockd makes itself freezable, but never calls try_to_freeze(). Have it
call try_to_freeze() within the main loop.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 0f4148a..03a83a0 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -155,6 +155,9 @@ lockd(struct svc_rqst *rqstp)
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];

+ if (try_to_freeze())
+ continue;
+
if (signalled()) {
flush_signals(current);
if (nlmsvc_ops) {
--
1.5.3.6


2007-12-21 15:28:21

by Jeff Layton

[permalink] [raw]
Subject: [PATCH 1/6] SUNRPC: spin svc_rqst initialization to its own function

Move the initialization in __svc_create_thread that happens prior to
thread creation into a new function. Export the function to allow
services to have better control over the svc_rqst structs.
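
For illustration, a minimal sketch of the intended usage pattern (this
is essentially what patch 5 does for lockd; my_service here stands in
for a service's thread function):

rqstp = svc_prepare_thread(serv, &serv->sv_pools[0]);
if (IS_ERR(rqstp))
        return PTR_ERR(rqstp);

task = kthread_run(my_service, rqstp, serv->sv_name);
if (IS_ERR(task)) {
        svc_exit_thread(rqstp); /* undo svc_prepare_thread */
        return PTR_ERR(task);
}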

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 2 ++
net/sunrpc/svc.c | 43 +++++++++++++++++++++++++++++++------------
2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 8531a70..5f07300 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -382,6 +382,8 @@ struct svc_procedure {
*/
struct svc_serv * svc_create(struct svc_program *, unsigned int,
void (*shutdown)(struct svc_serv*));
+struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
+ struct svc_pool *pool);
int svc_create_thread(svc_thread_fn, struct svc_serv *);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index a4a6bf7..cde480e 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -535,23 +535,14 @@ svc_release_buffer(struct svc_rqst *rqstp)
put_page(rqstp->rq_pages[i]);
}

-/*
- * Create a thread in the given pool. Caller must hold BKL.
- * On a NUMA or SMP machine, with a multi-pool serv, the thread
- * will be restricted to run on the cpus belonging to the pool.
- */
-static int
-__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
- struct svc_pool *pool)
+struct svc_rqst *
+svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool)
{
struct svc_rqst *rqstp;
- int error = -ENOMEM;
- int have_oldmask = 0;
- cpumask_t oldmask;

rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
if (!rqstp)
- goto out;
+ goto out_enomem;

init_waitqueue_head(&rqstp->rq_wait);

@@ -567,6 +558,34 @@ __svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
+ return rqstp;
+
+out_thread:
+ svc_exit_thread(rqstp);
+out_enomem:
+ return ERR_PTR(-ENOMEM);
+}
+EXPORT_SYMBOL(svc_prepare_thread);
+
+/*
+ * Create a thread in the given pool. Caller must hold BKL.
+ * On a NUMA or SMP machine, with a multi-pool serv, the thread
+ * will be restricted to run on the cpus belonging to the pool.
+ */
+static int
+__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
+ struct svc_pool *pool)
+{
+ struct svc_rqst *rqstp;
+ int error = -ENOMEM;
+ int have_oldmask = 0;
+ cpumask_t oldmask;
+
+ rqstp = svc_prepare_thread(serv, pool);
+ if (IS_ERR(rqstp)) {
+ error = PTR_ERR(rqstp);
+ goto out;
+ }

if (serv->sv_nrpools > 1)
have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
--
1.5.3.6


2007-12-21 15:28:19

by Jeff Layton

[permalink] [raw]
Subject: [PATCH 6/6] NLM: Add reference counting to lockd

...and only have lockd exit when the last reference is dropped.

The problem is this:

When a lock that a client is blocking on comes free, lockd does this in
nlmsvc_grant_blocked():

nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);

the callback from this call is nlmsvc_grant_callback(). That function
does this at the end to wake up lockd:

svc_wake_up(block->b_daemon);

However there is no guarantee that lockd will be up when this happens.
If someone shuts down or restarts lockd before the async call completes,
then the b_daemon pointer will point to freed memory and the kernel may
oops.

I first noticed this on older kernels and had mistakenly thought that
newer kernels weren't susceptible, but that's not correct. There's a bit
of a race to make sure that the nlm_host is bound when the async call is
done, but I can now reproduce this at will on current kernels.
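
To make the lifetime problem concrete, here is a sketch of the two
sides of the race (reconstructed from the description above, not code
taken from the patch):

/* lockd, in nlmsvc_grant_blocked(): fire off an async GRANTED call */
nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);

/* rpciod, possibly much later, in nlmsvc_grant_callback(): if lockd
 * was shut down or restarted in the meantime, b_daemon points into
 * freed memory and this may oops */
svc_wake_up(block->b_daemon);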

This patch is based on Trond's suggestion to add a new reference counter
to lockd, and only allows lockd to go down when it reaches 0. With this
change we can't use kthread_stop here. nlmsvc_unlink_block is called by
lockd and a kthread can't call kthread_stop on itself. So the patch
changes lockd to check the refcount itself and to return if it goes to
0. We do the checking and exit while holding the nlmsvc_mutex to make
sure that a new lockd is not started until the old one is down.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 49 +++++++++++++++++++++++++++++++++---------
fs/lockd/svclock.c | 5 ++++
include/linux/lockd/lockd.h | 1 +
3 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index d7209ea..216f9be 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -51,6 +51,7 @@ static DEFINE_MUTEX(nlmsvc_mutex);
static unsigned int nlmsvc_users;
static struct task_struct *nlmsvc_task;
static struct svc_serv *nlmsvc_serv;
+atomic_t nlmsvc_ref = ATOMIC_INIT(0);
int nlmsvc_grace_period;
unsigned long nlmsvc_timeout;

@@ -134,7 +135,10 @@ lockd(void *vrqstp)

set_freezable();

- /* Process request with signals blocked, but allow SIGKILL. */
+ /*
+ * Process request with signals blocked, but allow SIGKILL which
+ * signifies that lockd should drop all of its locks.
+ */
allow_signal(SIGKILL);

dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");
@@ -147,15 +151,19 @@ lockd(void *vrqstp)

/*
* The main request loop. We don't terminate until the last
- * NFS mount or NFS daemon has gone away, and we've been sent a
- * signal, or else another process has taken over our job.
+ * NFS mount or NFS daemon has gone away, and the nlm_blocked
+ * list is empty. The nlmsvc_mutex ensures that we prevent a
+ * new lockd from being started before the old one is down.
*/
- while (!kthread_should_stop()) {
+ mutex_lock(&nlmsvc_mutex);
+ while (atomic_read(&nlmsvc_ref) != 0) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];

+ mutex_unlock(&nlmsvc_mutex);
+
if (try_to_freeze())
- continue;
+ goto again;

if (signalled()) {
flush_signals(current);
@@ -182,11 +190,12 @@ lockd(void *vrqstp)
*/
err = svc_recv(rqstp, timeout);
if (err == -EAGAIN || err == -EINTR)
- continue;
+ goto again;
if (err < 0) {
printk(KERN_WARNING
"lockd: terminating on error %d\n",
-err);
+ mutex_lock(&nlmsvc_mutex);
break;
}

@@ -194,19 +203,22 @@ lockd(void *vrqstp)
svc_print_addr(rqstp, buf, sizeof(buf)));

svc_process(rqstp);
+again:
+ mutex_lock(&nlmsvc_mutex);
}

- flush_signals(current);
-
/*
- * Check whether there's a new lockd process before
- * shutting down the hosts and clearing the slot.
+ * at this point lockd is committed to going down. We hold the
+ * nlmsvc_mutex until just before exit to prevent a new one
+ * from starting before it's down.
*/
+ flush_signals(current);
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
nlmsvc_task = NULL;
nlmsvc_serv = NULL;
+ mutex_unlock(&nlmsvc_mutex);

/* Exit the RPC thread */
svc_exit_thread(rqstp);
@@ -269,6 +281,10 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
int error = 0;

mutex_lock(&nlmsvc_mutex);
+
+ if (!nlmsvc_users)
+ atomic_inc(&nlmsvc_ref);
+
/*
* Check whether we're already up and running.
*/
@@ -328,6 +344,8 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
destroy_and_out:
svc_destroy(serv);
out:
+ if (!nlmsvc_users && error)
+ atomic_dec(&nlmsvc_ref);
if (!error)
nlmsvc_users++;
mutex_unlock(&nlmsvc_mutex);
@@ -357,7 +375,16 @@ lockd_down(void)
goto out;
}
warned = 0;
- kthread_stop(nlmsvc_task);
+ atomic_dec(&nlmsvc_ref);
+
+ /*
+ * Sending a signal is necessary here. If we get to this point and
+ * nlm_blocked isn't empty then lockd may be held hostage by clients
+ * that are still blocking. Sending the signal makes sure that lockd
+ * invalidates all of its locks so that it's just waiting on RPC
+ * callbacks to complete
+ */
+ kill_proc(nlmsvc_task->pid, SIGKILL, 1);
out:
mutex_unlock(&nlmsvc_mutex);
}
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index d120ec3..b8fbda3 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -61,6 +61,9 @@ nlmsvc_insert_block(struct nlm_block *block, unsigned long when)
struct list_head *pos;

dprintk("lockd: nlmsvc_insert_block(%p, %ld)\n", block, when);
+ if (list_empty(&nlm_blocked))
+ atomic_inc(&nlmsvc_ref);
+
if (list_empty(&block->b_list)) {
kref_get(&block->b_count);
} else {
@@ -239,6 +242,8 @@ static int nlmsvc_unlink_block(struct nlm_block *block)
/* Remove block from list */
status = posix_unblock_lock(block->b_file->f_file, &block->b_call->a_args.lock.fl);
nlmsvc_remove_block(block);
+ if (list_empty(&nlm_blocked))
+ atomic_dec(&nlmsvc_ref);
return status;
}

diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
index e2d1ce3..7389553 100644
--- a/include/linux/lockd/lockd.h
+++ b/include/linux/lockd/lockd.h
@@ -154,6 +154,7 @@ extern struct svc_procedure nlmsvc_procedures4[];
extern int nlmsvc_grace_period;
extern unsigned long nlmsvc_timeout;
extern int nsm_use_hostnames;
+extern atomic_t nlmsvc_ref;

/*
* Lockd client functions
--
1.5.3.6


2007-12-21 15:28:20

by Jeff Layton

[permalink] [raw]
Subject: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

lockd_start_done is a global var that can be reused if lockd is
restarted, but it's never reinitialized. On all but the first use,
wait_for_completion isn't actually waiting on it since it has
already completed once.
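
For context, here is a sketch of the completion semantics involved
(standard <linux/completion.h> behavior, not code from the patch):

static DECLARE_COMPLETION(start_done);  /* internal count starts at 0 */

complete(&start_done);                  /* count -> 1 */
wait_for_completion(&start_done);       /* count 1 -> 0, returns at once */

/* Any count left over from a previous lockd instance makes the next
 * wait_for_completion() return without actually waiting. Reinitializing
 * before each use, as this patch does in lockd_up, resets the count: */
init_completion(&start_done);           /* count forced back to 0 */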

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 82e2192..0f4148a 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -300,6 +300,7 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
/*
* Create the kernel thread and wait for it to start.
*/
+ init_completion(&lockd_start_done);
error = svc_create_thread(lockd, serv);
if (error) {
printk(KERN_WARNING
--
1.5.3.6


2007-12-21 15:28:20

by Jeff Layton

[permalink] [raw]
Subject: [PATCH 2/6] SUNRPC: export svc_sock_update_bufs

This is needed since the plan is to remove the svc_create_thread helper
and to have its current users call kthread_run directly.

Signed-off-by: Jeff Layton <[email protected]>
---
net/sunrpc/svcsock.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index c75bffe..f34af48 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1405,6 +1405,7 @@ svc_sock_update_bufs(struct svc_serv *serv)
}
spin_unlock_bh(&serv->sv_lock);
}
+EXPORT_SYMBOL(svc_sock_update_bufs);

/*
* Receive the next request on any socket. This code is carefully
--
1.5.3.6


2007-12-21 16:44:47

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

Hi Jeff-

On Dec 21, 2007, at 10:28 AM, Jeff Layton wrote:
> ...and only have lockd exit when the last reference is dropped.
>
> The problem is this:
>
> When a lock that a client is blocking on comes free, lockd does
> this in
> nlmsvc_grant_blocked():
>
> nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
> &nlmsvc_grant_ops);
>
> the callback from this call is nlmsvc_grant_callback(). That function
> does this at the end to wake up lockd:
>
> svc_wake_up(block->b_daemon);
>
> However there is no guarantee that lockd will be up when this happens.
> If someone shuts down or restarts lockd before the async call
> completes,
> then the b_daemon pointer will point to freed memory and the kernel
> may
> oops.

Here is perhaps a naive question.

If there is a network partition between client and server while one
of these async requests is outstanding, and someone does a lockd_down()
during the partition, will lockd hang until the network is
restored? If it does, is there any indication to administrators or
users what may be causing the hang?

Current behavior is either a crash or a clean lockd restart, yes?
Would the new behavior be a hang?

> I first noticed this on older kernels and had mistakenly thought that
> newer kernels weren't susceptible, but that's not correct. There's
> a bit
> of a race to make sure that the nlm_host is bound when the async
> call is
> done, but I can now reproduce this at will on current kernels.
>
> This patch is based on Trond's suggestion to add a new reference
> counter
> to lockd, and only allows lockd to go down when it reaches 0. With
> this
> change we can't use kthread_stop here. nlmsvc_unlink_block is
> called by
> lockd and a kthread can't call kthread_stop on itself. So the patch
> changes lockd to check the refcount itself and to return if it goes to
> 0. We do the checking and exit while holding the nlmsvc_mutex to make
> sure that a new lockd is not started until the old one is down.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/lockd/svc.c | 49 +++++++++++++++++++++++++++++++++---------
> fs/lockd/svclock.c | 5 ++++
> include/linux/lockd/lockd.h | 1 +
> 3 files changed, 44 insertions(+), 11 deletions(-)
>
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index d7209ea..216f9be 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -51,6 +51,7 @@ static DEFINE_MUTEX(nlmsvc_mutex);
> static unsigned int nlmsvc_users;
> static struct task_struct *nlmsvc_task;
> static struct svc_serv *nlmsvc_serv;
> +atomic_t nlmsvc_ref = ATOMIC_INIT(0);
> int nlmsvc_grace_period;
> unsigned long nlmsvc_timeout;
>
> @@ -134,7 +135,10 @@ lockd(void *vrqstp)
>
> set_freezable();
>
> - /* Process request with signals blocked, but allow SIGKILL. */
> + /*
> + * Process request with signals blocked, but allow SIGKILL which
> + * signifies that lockd should drop all of its locks.
> + */
> allow_signal(SIGKILL);
>
> dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");
> @@ -147,15 +151,19 @@ lockd(void *vrqstp)
>
> /*
> * The main request loop. We don't terminate until the last
> - * NFS mount or NFS daemon has gone away, and we've been sent a
> - * signal, or else another process has taken over our job.
> + * NFS mount or NFS daemon has gone away, and the nlm_blocked
> + * list is empty. The nlmsvc_mutex ensures that we prevent a
> + * new lockd from being started before the old one is down.
> */
> - while (!kthread_should_stop()) {
> + mutex_lock(&nlmsvc_mutex);
> + while (atomic_read(&nlmsvc_ref) != 0) {
> long timeout = MAX_SCHEDULE_TIMEOUT;
> char buf[RPC_MAX_ADDRBUFLEN];
>
> + mutex_unlock(&nlmsvc_mutex);
> +
> if (try_to_freeze())
> - continue;
> + goto again;
>
> if (signalled()) {
> flush_signals(current);
> @@ -182,11 +190,12 @@ lockd(void *vrqstp)
> */
> err = svc_recv(rqstp, timeout);
> if (err == -EAGAIN || err == -EINTR)
> - continue;
> + goto again;
> if (err < 0) {
> printk(KERN_WARNING
> "lockd: terminating on error %d\n",
> -err);
> + mutex_lock(&nlmsvc_mutex);
> break;
> }
>
> @@ -194,19 +203,22 @@ lockd(void *vrqstp)
> svc_print_addr(rqstp, buf, sizeof(buf)));
>
> svc_process(rqstp);
> +again:
> + mutex_lock(&nlmsvc_mutex);
> }
>
> - flush_signals(current);
> -
> /*
> - * Check whether there's a new lockd process before
> - * shutting down the hosts and clearing the slot.
> + * at this point lockd is committed to going down. We hold the
> + * nlmsvc_mutex until just before exit to prevent a new one
> + * from starting before it's down.
> */
> + flush_signals(current);
> if (nlmsvc_ops)
> nlmsvc_invalidate_all();
> nlm_shutdown_hosts();
> nlmsvc_task = NULL;
> nlmsvc_serv = NULL;
> + mutex_unlock(&nlmsvc_mutex);
>
> /* Exit the RPC thread */
> svc_exit_thread(rqstp);
> @@ -269,6 +281,10 @@ lockd_up(int proto) /* Maybe add a 'family'
> option when IPv6 is supported ?? */
> int error = 0;
>
> mutex_lock(&nlmsvc_mutex);
> +
> + if (!nlmsvc_users)
> + atomic_inc(&nlmsvc_ref);
> +
> /*
> * Check whether we're already up and running.
> */
> @@ -328,6 +344,8 @@ lockd_up(int proto) /* Maybe add a 'family'
> option when IPv6 is supported ?? */
> destroy_and_out:
> svc_destroy(serv);
> out:
> + if (!nlmsvc_users && error)
> + atomic_dec(&nlmsvc_ref);
> if (!error)
> nlmsvc_users++;
> mutex_unlock(&nlmsvc_mutex);
> @@ -357,7 +375,16 @@ lockd_down(void)
> goto out;
> }
> warned = 0;
> - kthread_stop(nlmsvc_task);
> + atomic_dec(&nlmsvc_ref);
> +
> + /*
> + * Sending a signal is necessary here. If we get to this point and
> + * nlm_blocked isn't empty then lockd may be held hostage by clients
> + * that are still blocking. Sending the signal makes sure that lockd
> + * invalidates all of its locks so that it's just waiting on RPC
> + * callbacks to complete
> + */
> + kill_proc(nlmsvc_task->pid, SIGKILL, 1);
> out:
> mutex_unlock(&nlmsvc_mutex);
> }
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index d120ec3..b8fbda3 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -61,6 +61,9 @@ nlmsvc_insert_block(struct nlm_block *block,
> unsigned long when)
> struct list_head *pos;
>
> dprintk("lockd: nlmsvc_insert_block(%p, %ld)\n", block, when);
> + if (list_empty(&nlm_blocked))
> + atomic_inc(&nlmsvc_ref);
> +
> if (list_empty(&block->b_list)) {
> kref_get(&block->b_count);
> } else {
> @@ -239,6 +242,8 @@ static int nlmsvc_unlink_block(struct nlm_block
> *block)
> /* Remove block from list */
> status = posix_unblock_lock(block->b_file->f_file, &block->b_call-
> >a_args.lock.fl);
> nlmsvc_remove_block(block);
> + if (list_empty(&nlm_blocked))
> + atomic_dec(&nlmsvc_ref);
> return status;
> }
>
> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index e2d1ce3..7389553 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -154,6 +154,7 @@ extern struct svc_procedure nlmsvc_procedures4[];
> extern int nlmsvc_grace_period;
> extern unsigned long nlmsvc_timeout;
> extern int nsm_use_hostnames;
> +extern atomic_t nlmsvc_ref;
>
> /*
> * Lockd client functions

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2007-12-21 17:02:44

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Fri, 21 Dec 2007 11:43:20 -0500
Chuck Lever <[email protected]> wrote:

> Hi Jeff-
>
> On Dec 21, 2007, at 10:28 AM, Jeff Layton wrote:
> > ...and only have lockd exit when the last reference is dropped.
> >
> > The problem is this:
> >
> > When a lock that a client is blocking on comes free, lockd does
> > this in
> > nlmsvc_grant_blocked():
> >
> > nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
> > &nlmsvc_grant_ops);
> >
> > the callback from this call is nlmsvc_grant_callback(). That
> > function does this at the end to wake up lockd:
> >
> > svc_wake_up(block->b_daemon);
> >
> > However there is no guarantee that lockd will be up when this
> > happens. If someone shuts down or restarts lockd before the async
> > call completes,
> > then the b_daemon pointer will point to freed memory and the
> > kernel may
> > oops.
>
> Here is perhaps a naive question.
>
> If there is a network partition between client and server while one
> of these async requests is outstanding, and someone does a lockd_down()
> during the partition, will lockd hang until the network is
> restored?

Yes. lockd stays up indefinitely.

> If it does, is there any indication to administrators or
> users what may be causing the hang?
>

No, though that's a good point. Perhaps there should be. I'll think
about how we could add a notification.

> Current behavior is either a crash or a clean lockd restart, yes?
> Would the new behavior be a hang?
>

Current behavior is a crash, period (well, a crash if memory poisoning
is on, otherwise YMMV). If lockd is restarted, the svc_wake_up still
wakes up a non-existent b_daemon.

The new behavior is not a "hang" per se. lockd will just stay up and
keep processing until nlmsvc_ref goes to 0, which will never happen if
there's a network partition. If the partition is removed, the callback
will complete and lockd will go down. If a lockd_up is done before the
callback completes, then the existing lockd will just stay up, though it
will have been signalled and will have dropped all of its existing
locks.

My original patch for this changed the svc_wake_up() call to a call to
wake up whatever lockd happened to be up at the time. Trond seemed to
think that lockd should just stay up in this situation, and this
patchset is an attempt to implement that.

Thanks,
--
Jeff Layton <[email protected]>

2007-12-21 17:53:00

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Dec 21, 2007, at 12:02 PM, Jeff Layton wrote:
> On Fri, 21 Dec 2007 11:43:20 -0500
> Chuck Lever <[email protected]> wrote:
>
>> Hi Jeff-
>>
>> On Dec 21, 2007, at 10:28 AM, Jeff Layton wrote:
>>> ...and only have lockd exit when the last reference is dropped.
>>>
>>> The problem is this:
>>>
>>> When a lock that a client is blocking on comes free, lockd does
>>> this in
>>> nlmsvc_grant_blocked():
>>>
>>> nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
>>> &nlmsvc_grant_ops);
>>>
>>> the callback from this call is nlmsvc_grant_callback(). That
>>> function does this at the end to wake up lockd:
>>>
>>> svc_wake_up(block->b_daemon);
>>>
>>> However there is no guarantee that lockd will be up when this
>>> happens. If someone shuts down or restarts lockd before the async
>>> call completes,
>>> then the b_daemon pointer will point to freed memory and the
>>> kernel may
>>> oops.
>>
>> Here is perhaps a naive question.
>>
>> If there is a network partition between client and server while one
>> of these async requests is outstanding, and someone does a lockd_down()
>> during the partition, will lockd hang until the network is
>> restored?
>
> Yes. lockd stays up indefinitely.

But it's still responsive, yes? So this isn't a livelock hang in
that situation.

>> If it does, is there any indication to administrators or
>> users what may be causing the hang?
>>
>
> No, though that's a good point. Perhaps there should be. I'll think
> about how we could add a notification.

You could easily post a message to the kernel log that says "lockd
was signaled to stop, but is waiting for outstanding requests to
complete."

>> Current behavior is either a crash or a clean lockd restart, yes?
>> Would the new behavior be a hang?
>
> Current behavior is a crash, period (well, a crash if memory poisoning
> is on, otherwise YMMV). If lockd is restarted, the svc_wake_up still
> wakes up a non-existent b_daemon.
>
> The new behavior is not a "hang" per se. lockd will just stay up and
> keep processing until nlmsvc_ref goes to 0, which will never happen if
> there's a network partition. If the partition is removed, the callback
> will complete and lockd will go down. If a lockd_up is done before the
> callback completes, then the existing lockd will just stay up,
> though it
> will have been signalled and will have dropped all of its existing
> locks.
>
> My original patch for this changed the svc_wake_up() call to a call to
> wake up whatever lockd happened to be up at the time. Trond seemed to
> think that lockd should just stay up in this situation, and this
> patchset is an attempt to implement that.

Appropriate shutdown processing and not dropping requests in progress
may have conflicting purposes here. :-)

Continuing to operate makes sense if there's another lockd waiting to
start up; but during system shutdown processing, we may just want to
pull the rug out.

Would it make sense to time out the requests so lockd could shut down
in an orderly fashion? Or perhaps somehow teach lockd to distinguish
between a daemon restart and a system shutdown? Maybe it's not worth
the effort.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2007-12-21 18:25:54

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Fri, 21 Dec 2007 12:51:25 -0500
Chuck Lever <[email protected]> wrote:

> On Dec 21, 2007, at 12:02 PM, Jeff Layton wrote:
> > On Fri, 21 Dec 2007 11:43:20 -0500
> > Chuck Lever <[email protected]> wrote:
> >
> >> Hi Jeff-
> >>
> >> On Dec 21, 2007, at 10:28 AM, Jeff Layton wrote:
> >>> ...and only have lockd exit when the last reference is dropped.
> >>>
> >>> The problem is this:
> >>>
> >>> When a lock that a client is blocking on comes free, lockd does
> >>> this in
> >>> nlmsvc_grant_blocked():
> >>>
> >>> nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
> >>> &nlmsvc_grant_ops);
> >>>
> >>> the callback from this call is nlmsvc_grant_callback(). That
> >>> function does this at the end to wake up lockd:
> >>>
> >>> svc_wake_up(block->b_daemon);
> >>>
> >>> However there is no guarantee that lockd will be up when this
> >>> happens. If someone shuts down or restarts lockd before the async
> >>> call completes,
> >>> then the b_daemon pointer will point to freed memory and the
> >>> kernel may
> >>> oops.
> >>
> >> Here is perhaps a naive question.
> >>
> >> If there is a network partition between client and server while one
> >> of these async requests is outstanding, and someone does a
> >> lockd_down() during the partition, will lockd hang until the
> >> network is restored?
> >
> > Yes. lockd stays up indefinitely.
>
> But it's still responsive, yes? So this isn't a livelock hang in
> that situation.
>

Yes, it's still responsive. It's essentially just looping as it
normally does until the nlmsvc_ref hits 0.

> >> If it does, is there any indication to administrators or
> >> users what may be causing the hang?
> >>
> >
> > No, though that's a good point. Perhaps there should be. I'll think
> > about how we could add a notification.
>
> You could easily post a message to the kernel log that says "lockd
> was signaled to stop, but is waiting for outstanding requests to
> complete."
>

Yep. I just have to figure out the best way to detect this, but I don't
think it'll be too tough.

> >> Current behavior is either a crash or a clean lockd restart, yes?
> >> Would the new behavior be a hang?
> >
> > Current behavior is a crash, period (well, a crash if memory
> > poisoning is on, otherwise YMMV). If lockd is restarted, the
> > svc_wake_up still wakes up a non-existent b_daemon.
> >
> > The new behavior is not a "hang" per se. lockd will just stay up and
> > keep processing until nlmsvc_ref goes to 0, which will never happen if
> > there's a network partition. If the partition is removed, the
> > callback will complete and lockd will go down. If a lockd_up is
> > done before the callback completes, then the existing lockd will
> > just stay up, though it
> > will have been signalled and will have dropped all of its existing
> > locks.
> >
> > My original patch for this changed the svc_wake_up() call to a call
> > to wake up whatever lockd happened to be up at the time. Trond
> > seemed to think that lockd should just stay up in this situation,
> > and this patchset is an attempt to implement that.
>
> Appropriate shutdown processing and not dropping requests in
> progress may have conflicting purposes here. :-)
>
> Continuing to operate makes sense if there's another lockd waiting
> to start up; but during system shutdown processing, we may just want
> to pull the rug out.
>
> Would it make sense to time out the requests so lockd could shut down
> in an orderly fashion? Or perhaps somehow teach lockd to
> distinguish between a daemon restart and a system shutdown? Maybe
> it's not worth the effort.
>

I'm not sure it's worth the effort. I haven't seen any issues shutting
down the machine when lockd is still up and trying to complete the
callback while the network is partitioned.

The last patch makes lockd_down always return before lockd comes down,
so shutdown scripts, etc. shouldn't hang.

--
Jeff Layton <[email protected]>

2007-12-21 19:55:18

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Fri, 21 Dec 2007 12:51:25 -0500
Chuck Lever <[email protected]> wrote:

> You could easily post a message to the kernel log that says "lockd
> was signaled to stop, but is waiting for outstanding requests to
> complete."

Here's a respun patch 6 that contains the warning message suggested by
Chuck.

Thoughts?

------------[snip]--------------

NLM: Add reference counting to lockd

...and only have lockd exit when the last reference is dropped.

The problem is this:

When a lock that a client is blocking on comes free, lockd does this in
nlmsvc_grant_blocked():

nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);

the callback from this call is nlmsvc_grant_callback(). That function
does this at the end to wake up lockd:

svc_wake_up(block->b_daemon);

However there is no guarantee that lockd will be up when this happens.
If someone shuts down or restarts lockd before the async call completes,
then the b_daemon pointer will point to freed memory and the kernel may
oops.

I first noticed this on older kernels and had mistakenly thought that
newer kernels weren't susceptible, but that's not correct. There's a bit
of a race to make sure that the nlm_host is bound when the async call is
done, but I can now reproduce this at will on current kernels.

This patch is based on Trond's suggestion to add a new reference counter
to lockd, and only allows lockd to go down when it reaches 0. With this
change we can't use kthread_stop here. nlmsvc_unlink_block is called by
lockd and a kthread can't call kthread_stop on itself. So the patch
changes lockd to check the refcount itself and to return if it goes to
0. We do the checking and exit while holding the nlmsvc_mutex to make
sure that a new lockd is not started until the old one is down.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 52 +++++++++++++++++++++++++++++++++---------
fs/lockd/svclock.c | 5 ++++
include/linux/lockd/lockd.h | 1 +
3 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index d7209ea..71a4f65 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -51,6 +51,7 @@ static DEFINE_MUTEX(nlmsvc_mutex);
static unsigned int nlmsvc_users;
static struct task_struct *nlmsvc_task;
static struct svc_serv *nlmsvc_serv;
+atomic_t nlmsvc_ref = ATOMIC_INIT(0);
int nlmsvc_grace_period;
unsigned long nlmsvc_timeout;

@@ -134,7 +135,10 @@ lockd(void *vrqstp)

set_freezable();

- /* Process request with signals blocked, but allow SIGKILL. */
+ /*
+ * Process request with signals blocked, but allow SIGKILL which
+ * signifies that lockd should drop all of its locks.
+ */
allow_signal(SIGKILL);

dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");
@@ -147,15 +151,19 @@ lockd(void *vrqstp)

/*
* The main request loop. We don't terminate until the last
- * NFS mount or NFS daemon has gone away, and we've been sent a
- * signal, or else another process has taken over our job.
+ * NFS mount or NFS daemon has gone away, and the nlm_blocked
+ * list is empty. The nlmsvc_mutex ensures that we prevent a
+ * new lockd from being started before the old one is down.
*/
- while (!kthread_should_stop()) {
+ mutex_lock(&nlmsvc_mutex);
+ while (atomic_read(&nlmsvc_ref) != 0) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];

+ mutex_unlock(&nlmsvc_mutex);
+
if (try_to_freeze())
- continue;
+ goto again;

if (signalled()) {
flush_signals(current);
@@ -182,11 +190,12 @@ lockd(void *vrqstp)
*/
err = svc_recv(rqstp, timeout);
if (err == -EAGAIN || err == -EINTR)
- continue;
+ goto again;
if (err < 0) {
printk(KERN_WARNING
"lockd: terminating on error %d\n",
-err);
+ mutex_lock(&nlmsvc_mutex);
break;
}

@@ -194,19 +203,22 @@ lockd(void *vrqstp)
svc_print_addr(rqstp, buf, sizeof(buf)));

svc_process(rqstp);
+again:
+ mutex_lock(&nlmsvc_mutex);
}

- flush_signals(current);
-
/*
- * Check whether there's a new lockd process before
- * shutting down the hosts and clearing the slot.
+ * at this point lockd is committed to going down. We hold the
+ * nlmsvc_mutex until just before exit to prevent a new one
+ * from starting before it's down.
*/
+ flush_signals(current);
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
nlmsvc_task = NULL;
nlmsvc_serv = NULL;
+ mutex_unlock(&nlmsvc_mutex);

/* Exit the RPC thread */
svc_exit_thread(rqstp);
@@ -269,6 +281,10 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
int error = 0;

mutex_lock(&nlmsvc_mutex);
+
+ if (!nlmsvc_users)
+ atomic_inc(&nlmsvc_ref);
+
/*
* Check whether we're already up and running.
*/
@@ -328,6 +344,8 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
destroy_and_out:
svc_destroy(serv);
out:
+ if (!nlmsvc_users && error)
+ atomic_dec(&nlmsvc_ref);
if (!error)
nlmsvc_users++;
mutex_unlock(&nlmsvc_mutex);
@@ -357,7 +375,19 @@ lockd_down(void)
goto out;
}
warned = 0;
- kthread_stop(nlmsvc_task);
+ if (atomic_sub_return(1, &nlmsvc_ref) != 0)
+ printk(KERN_WARNING "lockd_down: lockd signalled to go down, "
+ "but is waiting for outstanding requests to "
+ "complete.\n");
+
+ /*
+ * Sending a signal is necessary here. If we get to this point and
+ * nlm_blocked isn't empty then lockd may be held hostage by clients
+ * that are still blocking. Sending the signal makes sure that lockd
+ * invalidates all of its locks so that it's just waiting on RPC
+ * callbacks to complete
+ */
+ kill_proc(nlmsvc_task->pid, SIGKILL, 1);
out:
mutex_unlock(&nlmsvc_mutex);
}
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index d120ec3..b8fbda3 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -61,6 +61,9 @@ nlmsvc_insert_block(struct nlm_block *block, unsigned long when)
struct list_head *pos;

dprintk("lockd: nlmsvc_insert_block(%p, %ld)\n", block, when);
+ if (list_empty(&nlm_blocked))
+ atomic_inc(&nlmsvc_ref);
+
if (list_empty(&block->b_list)) {
kref_get(&block->b_count);
} else {
@@ -239,6 +242,8 @@ static int nlmsvc_unlink_block(struct nlm_block *block)
/* Remove block from list */
status = posix_unblock_lock(block->b_file->f_file, &block->b_call->a_args.lock.fl);
nlmsvc_remove_block(block);
+ if (list_empty(&nlm_blocked))
+ atomic_dec(&nlmsvc_ref);
return status;
}

diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
index e2d1ce3..7389553 100644
--- a/include/linux/lockd/lockd.h
+++ b/include/linux/lockd/lockd.h
@@ -154,6 +154,7 @@ extern struct svc_procedure nlmsvc_procedures4[];
extern int nlmsvc_grace_period;
extern unsigned long nlmsvc_timeout;
extern int nsm_use_hostnames;
+extern atomic_t nlmsvc_ref;

/*
* Lockd client functions
--
1.5.3.6


2007-12-21 20:27:38

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Dec 21, 2007, at 2:54 PM, Jeff Layton wrote:
> On Fri, 21 Dec 2007 12:51:25 -0500
> Chuck Lever <[email protected]> wrote:
>
>> You could easily post a message to the kernel log that says "lockd
>> was signaled to stop, but is waiting for outstanding requests to
>> complete."
>
> Here's a respun patch 6 that contains the warning message suggested by
> Chuck.
>
> Thoughts?
>
> ------------[snip]--------------
>
> NLM: Add reference counting to lockd
>
> ...and only have lockd exit when the last reference is dropped.
>
> The problem is this:
>
> When a lock that a client is blocking on comes free, lockd does
> this in
> nlmsvc_grant_blocked():
>
> nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
> &nlmsvc_grant_ops);
>
> the callback from this call is nlmsvc_grant_callback(). That function
> does this at the end to wake up lockd:
>
> svc_wake_up(block->b_daemon);
>
> However there is no guarantee that lockd will be up when this happens.
> If someone shuts down or restarts lockd before the async call
> completes,
> then the b_daemon pointer will point to freed memory and the kernel
> may
> oops.
>
> I first noticed this on older kernels and had mistakenly thought that
> newer kernels weren't susceptible, but that's not correct. There's
> a bit
> of a race to make sure that the nlm_host is bound when the async
> call is
> done, but I can now reproduce this at will on current kernels.
>
> This patch is based on Trond's suggestion to add a new reference
> counter
> to lockd, and only allows lockd to go down when it reaches 0. With
> this
> change we can't use kthread_stop here. nlmsvc_unlink_block is
> called by
> lockd and a kthread can't call kthread_stop on itself. So the patch
> changes lockd to check the refcount itself and to return if it goes to
> 0. We do the checking and exit while holding the nlmsvc_mutex to make
> sure that a new lockd is not started until the old one is down.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/lockd/svc.c | 52 +++++++++++++++++++++++++++++++++---------
> fs/lockd/svclock.c | 5 ++++
> include/linux/lockd/lockd.h | 1 +
> 3 files changed, 47 insertions(+), 11 deletions(-)
>
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index d7209ea..71a4f65 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -51,6 +51,7 @@ static DEFINE_MUTEX(nlmsvc_mutex);
> static unsigned int nlmsvc_users;
> static struct task_struct *nlmsvc_task;
> static struct svc_serv *nlmsvc_serv;
> +atomic_t nlmsvc_ref = ATOMIC_INIT(0);
> int nlmsvc_grace_period;
> unsigned long nlmsvc_timeout;
>
> @@ -134,7 +135,10 @@ lockd(void *vrqstp)
>
> set_freezable();
>
> - /* Process request with signals blocked, but allow SIGKILL. */
> + /*
> + * Process request with signals blocked, but allow SIGKILL which
> + * signifies that lockd should drop all of its locks.
> + */
> allow_signal(SIGKILL);
>
> dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");
> @@ -147,15 +151,19 @@ lockd(void *vrqstp)
>
> /*
> * The main request loop. We don't terminate until the last
> - * NFS mount or NFS daemon has gone away, and we've been sent a
> - * signal, or else another process has taken over our job.
> + * NFS mount or NFS daemon has gone away, and the nlm_blocked
> + * list is empty. The nlmsvc_mutex ensures that we prevent a
> + * new lockd from being started before the old one is down.
> */
> - while (!kthread_should_stop()) {
> + mutex_lock(&nlmsvc_mutex);
> + while (atomic_read(&nlmsvc_ref) != 0) {
> long timeout = MAX_SCHEDULE_TIMEOUT;
> char buf[RPC_MAX_ADDRBUFLEN];
>
> + mutex_unlock(&nlmsvc_mutex);
> +
> if (try_to_freeze())
> - continue;
> + goto again;
>
> if (signalled()) {
> flush_signals(current);
> @@ -182,11 +190,12 @@ lockd(void *vrqstp)
> */
> err = svc_recv(rqstp, timeout);
> if (err == -EAGAIN || err == -EINTR)
> - continue;
> + goto again;
> if (err < 0) {
> printk(KERN_WARNING
> "lockd: terminating on error %d\n",
> -err);
> + mutex_lock(&nlmsvc_mutex);
> break;
> }
>
> @@ -194,19 +203,22 @@ lockd(void *vrqstp)
> svc_print_addr(rqstp, buf, sizeof(buf)));
>
> svc_process(rqstp);
> +again:
> + mutex_lock(&nlmsvc_mutex);
> }
>
> - flush_signals(current);
> -
> /*
> - * Check whether there's a new lockd process before
> - * shutting down the hosts and clearing the slot.
> + * at this point lockd is committed to going down. We hold the
> + * nlmsvc_mutex until just before exit to prevent a new one
> + * from starting before it's down.
> */
> + flush_signals(current);
> if (nlmsvc_ops)
> nlmsvc_invalidate_all();
> nlm_shutdown_hosts();
> nlmsvc_task = NULL;
> nlmsvc_serv = NULL;
> + mutex_unlock(&nlmsvc_mutex);
>
> /* Exit the RPC thread */
> svc_exit_thread(rqstp);
> @@ -269,6 +281,10 @@ lockd_up(int proto) /* Maybe add a 'family'
> option when IPv6 is supported ?? */
> int error = 0;
>
> mutex_lock(&nlmsvc_mutex);
> +
> + if (!nlmsvc_users)
> + atomic_inc(&nlmsvc_ref);
> +
> /*
> * Check whether we're already up and running.
> */
> @@ -328,6 +344,8 @@ lockd_up(int proto) /* Maybe add a 'family'
> option when IPv6 is supported ?? */
> destroy_and_out:
> svc_destroy(serv);
> out:
> + if (!nlmsvc_users && error)
> + atomic_dec(&nlmsvc_ref);
> if (!error)
> nlmsvc_users++;
> mutex_unlock(&nlmsvc_mutex);
> @@ -357,7 +375,19 @@ lockd_down(void)
> goto out;
> }
> warned = 0;
> - kthread_stop(nlmsvc_task);
> + if (atomic_sub_return(1, &nlmsvc_ref) != 0)
> + printk(KERN_WARNING "lockd_down: lockd signalled to go down, "
> + "but is waiting for outstanding requests to "
> + "complete.\n");

We could quibble about the proper spelling of "signaled".

"lockd_down: lockd is waiting for outstanding requests to complete
before exiting."

might be less awkward.

Otherwise, I think this is helpful.

> +
> + /*
> + * Sending a signal is necessary here. If we get to this point and
> + * nlm_blocked isn't empty then lockd may be held hostage by clients
> + * that are still blocking. Sending the signal makes sure that lockd
> + * invalidates all of its locks so that it's just waiting on RPC
> + * callbacks to complete
> + */
> + kill_proc(nlmsvc_task->pid, SIGKILL, 1);
> out:
> mutex_unlock(&nlmsvc_mutex);
> }
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index d120ec3..b8fbda3 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -61,6 +61,9 @@ nlmsvc_insert_block(struct nlm_block *block,
> unsigned long when)
> struct list_head *pos;
>
> dprintk("lockd: nlmsvc_insert_block(%p, %ld)\n", block, when);
> + if (list_empty(&nlm_blocked))
> + atomic_inc(&nlmsvc_ref);
> +
> if (list_empty(&block->b_list)) {
> kref_get(&block->b_count);
> } else {
> @@ -239,6 +242,8 @@ static int nlmsvc_unlink_block(struct nlm_block
> *block)
> /* Remove block from list */
> status = posix_unblock_lock(block->b_file->f_file, &block->b_call-
> >a_args.lock.fl);
> nlmsvc_remove_block(block);
> + if (list_empty(&nlm_blocked))
> + atomic_dec(&nlmsvc_ref);
> return status;
> }
>
> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index e2d1ce3..7389553 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -154,6 +154,7 @@ extern struct svc_procedure nlmsvc_procedures4[];
> extern int nlmsvc_grace_period;
> extern unsigned long nlmsvc_timeout;
> extern int nsm_use_hostnames;
> +extern atomic_t nlmsvc_ref;
>
> /*
> * Lockd client functions
> --
> 1.5.3.6
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2008-01-09 17:36:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Tue, Jan 08, 2008 at 02:33:15PM -0500, Jeff Layton wrote:
> lockd_start_done is a global var that can be reused if lockd is
> restarted, but it's never reinitialized. On all but the first use,
> wait_for_completion isn't actually waiting on it since it has
> already completed once.

I don't think we'll need lockd_start_done anymore after the kthread
conversion. When kthread_run returns, the thread it created is
guaranteed to have run until it scheduled away.
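
Concretely, lockd_up could then set up everything the thread needs
before creating it and skip the wait entirely; a hypothetical sketch
using the names from patch 5:

nlmsvc_serv = serv;     /* visible to the new thread before it runs */
nlmsvc_task = kthread_run(lockd, rqstp, serv->sv_name);
if (IS_ERR(nlmsvc_task)) {
        error = PTR_ERR(nlmsvc_task);
        nlmsvc_task = NULL;
        nlmsvc_serv = NULL;
        svc_exit_thread(rqstp);
        goto destroy_and_out;
}
/* no wait_for_completion(): by the time kthread_run() returns, the new
 * thread has run at least until it scheduled away */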


2008-01-09 17:45:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/6] NLM: Convert lockd to use kthreads

On Tue, Jan 08, 2008 at 02:33:17PM -0500, Jeff Layton wrote:
> - struct svc_serv * serv;
> - int error = 0;
> + struct svc_serv *serv;
> + struct svc_rqst *rqstp;
> + int error = 0;
>
> mutex_lock(&nlmsvc_mutex);
> /*
> * Check whether we're already up and running.
> */
> - if (nlmsvc_pid) {
> + if (nlmsvc_task) {
> if (proto)
> error = make_socks(nlmsvc_serv, proto);

While equivalent, I think it would be cleaner to check for nlmsvc_serv
above, as that's what we're passing to make_socks. The whole of
lockd_up could use a little makeover, but that's for later.

> void
> lockd_down(void)
> {
> mutex_lock(&nlmsvc_mutex);
> if (nlmsvc_users) {
> if (--nlmsvc_users)
> goto out;
> + } else {
> + printk(KERN_ERR "lockd_down: no users! task=%p\n",
> + nlmsvc_task);
> + BUG();
> }
> + if (!nlmsvc_task) {
> + printk(KERN_ERR "lockd_down: no lockd running.\n");
> + BUG();
> }
> + kthread_stop(nlmsvc_task);

I think all this user/foo checking here should be BUG_ONs, as these
are quite fatal errors.

e.g.

void
lockd_down(void)
{
	mutex_lock(&nlmsvc_mutex);

	BUG_ON(!nlmsvc_task);
	BUG_ON(!nlmsvc_users);

	if (!--nlmsvc_users)
		kthread_stop(nlmsvc_task);
	mutex_unlock(&nlmsvc_mutex);
}


The same applies for similar checks in lockd_up as well.


2008-01-09 17:47:12

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Tue, Jan 08, 2008 at 02:33:18PM -0500, Jeff Layton wrote:
> ...and only have lockd exit when the last reference is dropped.
>
> The problem is this:
>
> When a lock that a client is blocking on comes free, lockd does this in
> nlmsvc_grant_blocked():
>
> nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);
>
> the callback from this call is nlmsvc_grant_callback(). That function
> does this at the end to wake up lockd:
>
> svc_wake_up(block->b_daemon);
>
> However there is no guarantee that lockd will be up when this happens.
> If someone shuts down or restarts lockd before the async call completes,
> then the b_daemon pointer will point to freed memory and the kernel may
> oops.
>
> I first noticed this on older kernels and had mistakenly thought that
> newer kernels weren't susceptible, but that's not correct. There's a bit
> of a race to make sure that the nlm_host is bound when the async call is
> done, but I can now reproduce this at will on current kernels.
>
> This patch is based on Trond's suggestion to add a new reference counter
> to lockd, and only allows lockd to go down when it reaches 0. With this
> change we can't use kthread_stop here. nlmsvc_unlink_block is called by
> lockd and a kthread can't call kthread_stop on itself. So the patch
> changes lockd to check the refcount itself and to return if it goes to
> 0. We do the checking and exit while holding the nlmsvc_mutex to make
> sure that a new lockd is not started until the old one is down.

I don't like this signals/kthread mixture at all. Why can't we simply
call kthread_stop when the refcount hits zero and keep all the nice
kthread helpers?


2008-01-09 18:06:18

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Wed, 9 Jan 2008 17:35:42 +0000
Christoph Hellwig <[email protected]> wrote:

> I don't think we'll need lockd_start_done anymore after the kthread
> conversion. When kthread_run returns the thread it created is
> guaranteed to have run until it scheduled away.
>

Makes sense. My only concern is that we make sure this is behavior we
can count on in the future and not just an artifact of the current
kthread implementation. If that's the case, then I'll plan to remove it
on the next respin.

--
Jeff Layton <[email protected]>

2008-01-09 18:08:24

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 5/6] NLM: Convert lockd to use kthreads

On Wed, 9 Jan 2008 17:45:06 +0000
Christoph Hellwig <[email protected]> wrote:

> On Tue, Jan 08, 2008 at 02:33:17PM -0500, Jeff Layton wrote:
> > - struct svc_serv * serv;
> > - int error = 0;
> > + struct svc_serv *serv;
> > + struct svc_rqst *rqstp;
> > + int error = 0;
> >
> > mutex_lock(&nlmsvc_mutex);
> > /*
> > * Check whether we're already up and running.
> > */
> > - if (nlmsvc_pid) {
> > + if (nlmsvc_task) {
> > if (proto)
> > error = make_socks(nlmsvc_serv, proto);
>
> While equivalent, I think it would be cleaner to check for nlmsvc_serv
> above, as that's what we're passing to make_socks. The whole of
> lockd_up could use a little makeover, but that's for later.
>

Probably so. If I respin, I'll plan to fix that too.

> > void
> > lockd_down(void)
> > {
> > mutex_lock(&nlmsvc_mutex);
> > if (nlmsvc_users) {
> > if (--nlmsvc_users)
> > goto out;
> > + } else {
> > + printk(KERN_ERR "lockd_down: no users! task=%p\n",
> > + nlmsvc_task);
> > + BUG();
> > }
> > + if (!nlmsvc_task) {
> > + printk(KERN_ERR "lockd_down: no lockd running.\n");
> > + BUG();
> > }
> > + kthread_stop(nlmsvc_task);
>
> I think all this user/foo checking here should be BUG_ONs, as these
> are quite fatal errors.
>
> e.g.
>
> void
> lockd_down(void)
> {
> 	mutex_lock(&nlmsvc_mutex);
>
> 	BUG_ON(!nlmsvc_task);
> 	BUG_ON(!nlmsvc_users);
>
> 	if (!--nlmsvc_users)
> 		kthread_stop(nlmsvc_task);
> 	mutex_unlock(&nlmsvc_mutex);
> }
>
>
> The same applies for similar checks in lockd_up as well.
>

With this patch, the lockd_down checks are now BUGs. I decided not to
do that in lockd_up: if there's an error within the main lockd loop,
lockd can exit without having been asked to, and if someone then calls
lockd_up, the counts will be off and the check will fire.

It seems that if we're going to make the check in lockd_up a BUG, then
we should also BUG rather than let lockd exit prematurely.
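
A minimal sketch of the distinction (hypothetical code, not lifted
from the posted patch):

	/* in lockd_down(): the counts can only be wrong if a caller
	 * is buggy, so a BUG() is appropriate */
	BUG_ON(!nlmsvc_users);

	/* in lockd_up(): lockd may have exited on its own after an
	 * internal error, so only warn when the counts look off */
	if (nlmsvc_users && !nlmsvc_task)
		printk(KERN_WARNING "lockd_up: users nonzero but no "
				"lockd running\n");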

--
Jeff Layton <[email protected]>

2008-01-09 18:15:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Wed, Jan 09, 2008 at 01:05:54PM -0500, Jeff Layton wrote:
> Makes sense. My only concern is that we make sure this is behavior we
> can count on in the future and not just an artifact of the current
> kthread implementation. If that's the case, then I'll plan to remove it
> on the next respin.

It's absolutely intentional and one of the reasons why the kthread
infrastructure is so much nicer than plain kernel_thread :)


2008-01-09 18:36:42

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Wed, 9 Jan 2008 17:47:07 +0000
Christoph Hellwig <[email protected]> wrote:

> On Tue, Jan 08, 2008 at 02:33:18PM -0500, Jeff Layton wrote:
> > ...and only have lockd exit when the last reference is dropped.
> >
> > The problem is this:
> >
> > When a lock that a client is blocking on comes free, lockd does
> > this in nlmsvc_grant_blocked():
> >
> > nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
> > &nlmsvc_grant_ops);
> >
> > the callback from this call is nlmsvc_grant_callback(). That
> > function does this at the end to wake up lockd:
> >
> > svc_wake_up(block->b_daemon);
> >
> > However there is no guarantee that lockd will be up when this
> > happens. If someone shuts down or restarts lockd before the async
> > call completes, then the b_daemon pointer will point to freed
> > memory and the kernel may oops.
> >
> > I first noticed this on older kernels and had mistakenly thought
> > that newer kernels weren't susceptible, but that's not correct.
> > There's a bit of a race to make sure that the nlm_host is bound
> > when the async call is done, but I can now reproduce this at will
> > on current kernels.
> >
> > This patch is based on Trond's suggestion to add a new reference
> > counter to lockd, and only allows lockd to go down when it reaches
> > 0. With this change we can't use kthread_stop here.
> > nlmsvc_unlink_block is called by lockd and a kthread can't call
> > kthread_stop on itself. So the patch changes lockd to check the
> > refcount itself and to return if it goes to 0. We do the checking
> > and exit while holding the nlmsvc_mutex to make sure that a new
> > lockd is not started until the old one is down.
>
> I don't like this signals/kthread mixture at all. Why can't we simply
> call kthread_stop when the refcount hits zero and keep all the nice
> kthread helpers?
>

As I stated in an earlier email, I'm not fond of this either :-)

I don't see a good alternative though. We need to be able to drop and
check the refcount in nlmsvc_unlink_block. That function is called
from lockd, and we can't have lockd call kthread_stop on itself.

If you see a better way to do this, I'm certainly open to suggestions.

I'll note that my first stab at fixing this problem was to change the
svc_wake_up() call in the rpc callback to a routine to wake up any
lockd on the box that happened to be up. That sidesteps this entire
problem of having to make sure lockd stays up. If we decided that was
the right approach we could dump the last patch in this series
altogether.

That said there could be other use after free bugs lurking in the lockd
code so maybe keeping lockd up until nlm_blocked is empty is the right
thing to do.

--
Jeff Layton <[email protected]>

2008-01-09 18:48:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Wed, Jan 09, 2008 at 01:36:21PM -0500, Jeff Layton wrote:
> I don't see a good alternative though. We need to be able to drop
> and check the refcount in nlmsvc_unlink_block. That function is called
> from lockd, and we can't have lockd call kthread_stop on itself.
>
> If you see a better way to do this, I'm certainly open to suggestions.
>
> I'll note that my first stab at fixing this problem was to change the
> svc_wake_up() call in the rpc callback to a routine to wake up any
> lockd on the box that happened to be up. That sidesteps this entire
> problem of having to make sure lockd stays up. If we decided that was
> the right approach we could dump the last patch in this series
> altogether.
>
> That said there could be other use after free bugs lurking in the lockd
> code so maybe keeping lockd up until nlm_blocked is empty is the right
> thing to do.

What about just not exiting from lockd as long as nlm_blocked is not
empty? lockd_down still simply calls kthread_stop, but lockd only
honours it when nlm_blocked is empty?
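
Concretely, that would amount to something like this sketch of lockd's
main loop (illustrative only; the real loop does considerably more):

	/* honour kthread_stop() only once no blocked locks remain, so
	 * pending GRANTED_MSG callbacks can still find the thread */
	while (!kthread_should_stop() || !list_empty(&nlm_blocked)) {
		/* receive and dispatch NLM requests, retry blocked
		 * locks, and so on */
	}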

2008-01-09 18:59:35

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Wed, 9 Jan 2008 18:48:14 +0000
Christoph Hellwig <[email protected]> wrote:

> On Wed, Jan 09, 2008 at 01:36:21PM -0500, Jeff Layton wrote:
> > I don't see a good alternative though. We need to be able to drop
> > and check the refcount in nlmsvc_unlink_block. That function is
> > called from lockd, and we can't have lockd call kthread_stop on
> > itself.
> >
> > If you see a better way to do this, I'm certainly open to
> > suggestions.
> >
> > I'll note that my first stab at fixing this problem was to change
> > the svc_wake_up() call in the rpc callback to a routine to wake up
> > any lockd on the box that happened to be up. That sidesteps this
> > entire problem of having to make sure lockd stays up. If we decided
> > that was the right approach we could dump the last patch in this
> > series altogether.
> >
> > That said there could be other use after free bugs lurking in the
> > lockd code so maybe keeping lockd up until nlm_blocked is empty is
> > the right thing to do.
>
> What about just not exiting from lockd as long as nlm_blocked is not
> empty? lockd_down still simply calls kthread_stop, but lockd only
> honours it when nlm_blocked is empty?

lockd can basically block forever in this situation if the client
goes away for good. With the current kthread implementation,
kthread_stop calls are serialized, and I don't think we want to
monopolize the kthread_stop queue.

If kthread_stops could occur in parallel, that would be a different
situation :-)

--
Jeff Layton <[email protected]>

2008-01-10 03:29:30

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Tuesday January 8, [email protected] wrote:
> ...and only have lockd exit when the last reference is dropped.
>
> The problem is this:
>
> When a lock that a client is blocking on comes free, lockd does this in
> nlmsvc_grant_blocked():
>
> nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);
>
> the callback from this call is nlmsvc_grant_callback(). That function
> does this at the end to wake up lockd:
>
> svc_wake_up(block->b_daemon);

Uhmmm... Maybe there is an easier way.

block->b_daemon will always be nlmsvc_serv, so can we simply make this

	svc_wake_up(nlmsvc_serv);

with a little locking to make sure nlmsvc_serv is valid?

Actually svc_wake_up is only called from lockd and goes through
various hoops to find the right rqstp, which we could have known in
advance.
So store the rqstp in some global wrapped in a spinlock so we can
access it safely and just:

	spin_lock(whatever);
	if (nlmsvc_rqstp)
		wake_up(&nlmsvc_rqstp->rq_wait);
	spin_unlock(whatever);


That seems a somewhat simpler way of avoiding the particular problem.


Hmmm.... I guess that nlmsvc_grant_callback could then be run after
the 'lockd' module had been unloaded.
Maybe nlm_shutdown_hosts could call rpc_killall_tasks(host->h_rpcclnt)
on each host. That should ensure the callback won't happen afterwards.

Maybe?

NeilBrown


2008-01-10 11:58:21

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Thu, 10 Jan 2008 14:29:22 +1100
Neil Brown <[email protected]> wrote:

> On Tuesday January 8, [email protected] wrote:
> > ...and only have lockd exit when the last reference is dropped.
> >
> > The problem is this:
> >
> > When a lock that a client is blocking on comes free, lockd does
> > this in nlmsvc_grant_blocked():
> >
> > nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG,
> > &nlmsvc_grant_ops);
> >
> > the callback from this call is nlmsvc_grant_callback(). That
> > function does this at the end to wake up lockd:
> >
> > svc_wake_up(block->b_daemon);
>
> Uhmmm... Maybe there is an easier way.
>
> block->b_daemon will always be nlmsvc_serv, so can we simply make this
>
> 	svc_wake_up(nlmsvc_serv);
>
> with a little locking to make sure nlmsvc_serv is valid?
>

That's very close to my original patch to fix this problem. I just
replaced svc_wake_up with a call to a new function that wakes up any
lockd that happens to be up. I'm not sure that my original patch was
careful enough with the locking though...

> Actually svc_wake_up is only called from lockd and goes through
> various hoops to find the right rqstp, which we could have known in
> advance.
> So store the rqstp in some global wrapped in a spinlock so we can
> access it safely and just:
>
> 	spin_lock(whatever);
> 	if (nlmsvc_rqstp)
> 		wake_up(&nlmsvc_rqstp->rq_wait);
> 	spin_unlock(whatever);
>
>
> That seems a somewhat simpler way of avoiding the particular problem.
>

Yes. Much.

>
> Hmmm.... I guess that nlmsvc_grant_callback could then be run after
> the 'lockd' module had been unloaded.
> Maybe nlm_shutdown_hosts could call rpc_killall_tasks(host->h_rpcclnt)
> on each host. That should ensure the callback won't happen afterwards.
>
> Maybe?
>

I think so. If we let lockd go down before all the RPCs are done,
then accessing lockd data from those callbacks remains a hazard. If
not now, then future changes could make it one.

IIRC, the reason nlm_destroy_host isn't called on each nlm_host in
this situation is that the h_count is too high. Doing
rpc_killall_tasks here might fix that, but the logic in all of this is
pretty convoluted. I'll see if I can cook up a new patchset that does
this instead.
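
For reference, a rough sketch of the rpc_killall_tasks idea (its
placement inside nlm_shutdown_hosts' per-host loop is hypothetical):

	/* kill any RPC tasks still outstanding for this host, so that
	 * nlmsvc_grant_callback() cannot run after lockd is gone */
	if (host->h_rpcclnt)
		rpc_killall_tasks(host->h_rpcclnt);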

--
Jeff Layton <[email protected]>

2008-01-13 13:27:35

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Wed, 9 Jan 2008 17:35:42 +0000
Christoph Hellwig <[email protected]> wrote:

> On Tue, Jan 08, 2008 at 02:33:15PM -0500, Jeff Layton wrote:
> > lockd_start_done is a global var that can be reused if lockd is
> > restarted, but it's never reinitialized. On all but the first use,
> > wait_for_completion isn't actually waiting on it since it has
> > already completed once.
>
> I don't think we'll need lockd_start_done anymore after the kthread
> conversion. When kthread_run returns the thread it created is
> guaranteed to have run until it scheduled away.
>

Christoph,
I've been hitting an intermittent null pointer dereference ever
since I've made this change:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000038
printing eip: e09ddee1 *pde = 1f377067 *pte = 00000000
Oops: 0000 [#1] SMP
Modules linked in: nfsd nfs_acl auth_rpcgss exportfs rfcomm l2cap bluetooth autofs4 lockd sunrpc nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 loop dm_multipath pcspkr 8139cp 8139too mii joydev i2c_piix4 i2c_core sr_mod sg cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 1946, comm: rpc.nfsd Not tainted (2.6.24-0.138.rc7.kthread.2.fc9 #1)
EIP: 0060:[<e09ddee1>] EFLAGS: 00010202 CPU: 0
EIP is at find_socket+0xa/0x3f [lockd]
EAX: 00000000 EBX: 00000006 ECX: 00000000 EDX: 00000011
ESI: 00000000 EDI: 00000011 EBP: df358ec4 ESP: df358eb8
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process rpc.nfsd (pid: 1946, ti=df358000 task=df38ae10 task.ti=df358000)
Stack: 00000006 00000000 00000006 df358ee0 e09de0cb 22222222 22222222 00000006
00000008 00000801 df358f04 e09de337 00000000 00000000 00000000 00000000
00000801 00000008 00000801 df358f1c e0aa05ac 00000000 df356200 00000008
Call Trace:
[<c040649a>] show_trace_log_lvl+0x1a/0x2f
[<c040654a>] show_stack_log_lvl+0x9b/0xa3
[<c04065f9>] show_registers+0xa7/0x178
[<c04067ff>] die+0x135/0x220
[<c063ff1b>] do_page_fault+0x553/0x631
[<c063e5a2>] error_code+0x72/0x78
[<e09de0cb>] make_socks+0x27/0xbe [lockd]
[<e09de337>] lockd_up+0x3b/0x148 [lockd]
[<e0aa05ac>] nfsd_svc+0xf2/0x107 [nfsd]
[<e0aa0b19>] write_svc+0x1a/0x20 [nfsd]
[<e0aa0d10>] nfsctl_transaction_write+0x39/0x63 [nfsd]
[<c04b97cf>] sys_nfsservctl+0x11f/0x160
[<c0405252>] syscall_call+0x7/0xb
=======================
Code: 89 d8 5b 5d c3 55 89 e5 c7 05 48 a4 9e e0 01 00 00 00 e8 a7 ff ff ff 8b 15 00 0c 76 c0 5d 01 d0 c3 55 89 e5 57 89 d7 56 89 c6 53 <8b> 48 38 83 e9 08 eb 15 8b 41 14 0f b6 40 29 39 f8 75 07 b8 01
EIP: [<e09ddee1>] find_socket+0xa/0x3f [lockd] SS:ESP 0068:df358eb8
---[ end trace 7d509b4c18b144aa ]---

The problem is that make_socks is occasionally being called with a
NULL nlmsvc_serv pointer. I think the problem occurs here in
lockd_up():

	if (nlmsvc_task) {
		if (proto)
			error = make_socks(nlmsvc_serv, proto);
		goto out;
	}

You pointed out earlier that this should really be checking that
nlmsvc_serv is non-NULL. I can and will make this change, and that
will likely fix this particular oops, but the fact that I'm hitting it
here suggests that kthread_run is returning before lockd has a chance
to set nlmsvc_serv. According to your statement above, it shouldn't
be.

Are you sure that kthread_run is working correctly? It seems like it
might not be doing the right thing here...

--
Jeff Layton <[email protected]>

2008-01-13 18:17:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Sun, Jan 13, 2008 at 08:27:18AM -0500, Jeff Layton wrote:
> I've been hitting an intermittent null pointer dereference ever
> since I've made this change:

The first thing lockd does is to call lock_kernel(). This may block
(or spin) when the BKL is contended and thus delay updating
nlmsvc_serv. Meanwhile lockd_up checks nlmsvc_task, which is already
non-NULL, and happily dereferences nlmsvc_serv. The patch below
updates nlmsvc_serv in lockd_up, where it is protected by
nlmsvc_mutex, and also checks for nlmsvc_serv being set instead of
nlmsvc_task to fix this problem.

The patch hasn't actually been tested but I'm sure it will fix this
issue.

Btw, lockd() takes BKL just after starting up and only implicitly drops
it when blocking. This seems very dangerous to me and badly wants
updating to some real locking scheme..


Signed-off-by: Christoph Hellwig <[email protected]>

Index: linux-2.6/fs/lockd/svc.c
===================================================================
--- linux-2.6.orig/fs/lockd/svc.c 2008-01-13 19:07:17.000000000 +0100
+++ linux-2.6/fs/lockd/svc.c 2008-01-13 19:13:23.000000000 +0100
@@ -118,7 +118,6 @@ lockd(void *vrqstp)

/* set up kernel thread */
lock_kernel();
- nlmsvc_serv = rqstp->rq_server;
set_freezable();

/* Allow SIGKILL to tell lockd to drop all of its locks */
@@ -253,7 +252,7 @@ lockd_up(int proto) /* Maybe add a 'fami
/*
* Check whether we're already up and running.
*/
- if (nlmsvc_task) {
+ if (nlmsvc_serv) {
if (proto)
error = make_socks(nlmsvc_serv, proto);
goto out;
@@ -290,6 +289,9 @@ lockd_up(int proto) /* Maybe add a 'fami
}

svc_sock_update_bufs(serv);
+
+ nlmsvc_serv = rqstp->rq_server;
+
nlmsvc_task = kthread_run(lockd, rqstp, serv->sv_name);
if (IS_ERR(nlmsvc_task)) {
error = PTR_ERR(nlmsvc_task);

2008-01-13 19:12:31

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Sun, Jan 13, 2008 at 06:17:43PM +0000, Christoph Hellwig wrote:
> Btw, lockd() takes BKL just after starting up and only implicitly drops
> it when blocking. This seems very dangerous to me and badly wants
> updating to some real locking scheme..

Yep.

--b.

2008-01-14 14:25:20

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Sun, 13 Jan 2008 18:17:43 +0000
Christoph Hellwig <[email protected]> wrote:

> On Sun, Jan 13, 2008 at 08:27:18AM -0500, Jeff Layton wrote:
> > I've been hitting an intermittent null pointer dereference ever
> > since I've made this change:
>
> The first thing lockd does is to call lock_kernel(). This may block
> (or spin) when the BKL is contended and thus delay updating
> nlmsvc_serv. Meanwhile lockd_up checks nlmsvc_task, which is already
> non-NULL, and happily dereferences nlmsvc_serv. The patch below
> updates nlmsvc_serv in lockd_up, where it is protected by
> nlmsvc_mutex, and also checks for nlmsvc_serv being set instead of
> nlmsvc_task to fix this problem.
>
> The patch hasn't actually been tested but I'm sure it will fix this
> issue.
>

Thanks Christoph. I incorporated this into my latest patchset. It does
seem to fix the issue (tested by bouncing NFS up and down for 30 mins
or so). Let me know if you want me to add a signed-off-by line for
you...

> Btw, lockd() takes BKL just after starting up and only implicitly
> drops it when blocking. This seems very dangerous to me and badly
> wants updating to some real locking scheme..
>

Yep -- it's ugly. I took a look a while back at what it would take to
change that. The problem is that it's very difficult to tell exactly
what the BKL is intended to protect. I assume lockd takes it for the
same reason that fs/locks.c uses it, but there may be other things
that need protection if it's removed.

It might be best to try to change this incrementally -- gradually audit
and move pieces of lockd() outside of the BKL, until it's clear that
it's no longer needed.

--
Jeff Layton <[email protected]>

2008-01-14 14:26:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Mon, Jan 14, 2008 at 09:24:54AM -0500, Jeff Layton wrote:
> Thanks Christoph. I incorporated this into my latest patchset. It does
> seem to fix the issue (tested by bouncing NFS up and down for 30 mins
> or so). Let me know if you want me to add a signed-off-by line for
> you...

No need to add anything, this was just a two-liner trivial fix..


2008-03-15 03:44:31

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

On Sun, Jan 13, 2008 at 1:17 PM, Christoph Hellwig <[email protected]> wrote:
> Btw, lockd() takes BKL just after starting up and only implicitly drops
> it when blocking. This seems very dangerous to me and badly wants
> updating to some real locking scheme..

Can you elaborate on what is meant by lockd "blocking"? Blocking in
svc_recv() or during a SETLKW or ???

I'm trying to come to terms with why nlmsvc_lock() wouldn't have the
BKL on entry.