2008-01-05 12:02:58

by Jeff Layton

Subject: [PATCH 0/6] Intro: convert lockd to kthread and fix use-after-free (try #5)

This is the fifth patchset to fix the use-after-free problem in lockd
which we originally discussed back in October. The main problem is
detailed in the last patch of the series. Along the way, Christoph
Hellwig mentioned that it would be advantageous to convert lockd to use
the kthread API. This patch set first makes that change and then
builds on it to fix the use-after-free problem. It also fixes a
couple of minor bugs in the current lockd implementation.

The main changes from the original patchset are:

+ dropped the new thread creation helper; lockd_up now calls
kthread_run directly.
+ dropped the first patch that changed svc_pool_map_set_cpumask, since
it's no longer needed.
+ added a warning message for the case where lockd_down is called for
the final time but lockd is still up.
+ done some style cleanups recommended by checkpatch.pl.

I've done some basic smoke testing and everything seems to work as
expected. I've also tested this against the reproducer that I have for
the use-after-free problem and this does fix it. I've tried to make
this cleanly bisectable, but have only really tested the final result.

Many thanks to Trond Myklebust, Chuck Lever and Christoph Hellwig for
their guidance on this.

Signed-off-by: Jeff Layton <[email protected]>



2008-01-05 12:02:57

by Jeff Layton

Subject: [PATCH 1/6] SUNRPC: spin svc_rqst initialization to its own function

Move the initialization in __svc_create_thread that happens prior to
thread creation to a new function. Export the function to allow
services to have better control over the svc_rqst structs.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 2 ++
net/sunrpc/svc.c | 43 +++++++++++++++++++++++++++++++------------
2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 8531a70..5f07300 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -382,6 +382,8 @@ struct svc_procedure {
*/
struct svc_serv * svc_create(struct svc_program *, unsigned int,
void (*shutdown)(struct svc_serv*));
+struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
+ struct svc_pool *pool);
int svc_create_thread(svc_thread_fn, struct svc_serv *);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index fca17d0..b29ed43 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -538,23 +538,14 @@ svc_release_buffer(struct svc_rqst *rqstp)
put_page(rqstp->rq_pages[i]);
}

-/*
- * Create a thread in the given pool. Caller must hold BKL.
- * On a NUMA or SMP machine, with a multi-pool serv, the thread
- * will be restricted to run on the cpus belonging to the pool.
- */
-static int
-__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
- struct svc_pool *pool)
+struct svc_rqst *
+svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool)
{
struct svc_rqst *rqstp;
- int error = -ENOMEM;
- int have_oldmask = 0;
- cpumask_t oldmask;

rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
if (!rqstp)
- goto out;
+ goto out_enomem;

init_waitqueue_head(&rqstp->rq_wait);

@@ -570,6 +561,34 @@ __svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
+ return rqstp;
+
+out_thread:
+ svc_exit_thread(rqstp);
+out_enomem:
+ return ERR_PTR(-ENOMEM);
+}
+EXPORT_SYMBOL(svc_prepare_thread);
+
+/*
+ * Create a thread in the given pool. Caller must hold BKL.
+ * On a NUMA or SMP machine, with a multi-pool serv, the thread
+ * will be restricted to run on the cpus belonging to the pool.
+ */
+static int
+__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
+ struct svc_pool *pool)
+{
+ struct svc_rqst *rqstp;
+ int error = -ENOMEM;
+ int have_oldmask = 0;
+ cpumask_t oldmask;
+
+ rqstp = svc_prepare_thread(serv, pool);
+ if (IS_ERR(rqstp)) {
+ error = PTR_ERR(rqstp);
+ goto out;
+ }

if (serv->sv_nrpools > 1)
have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
--
1.5.3.6


2008-01-05 12:03:00

by Jeff Layton

Subject: [PATCH 4/6] NLM: Have lockd call try_to_freeze

lockd makes itself freezable, but never calls try_to_freeze(). Have it
call try_to_freeze() within the main loop.
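
The freezer pattern for a kernel thread looks roughly like this (a
sketch; the hunk below only adds the try_to_freeze() call, since lockd
already calls set_freezable() during startup):

    set_freezable();                /* opt this thread in to the freezer */

    for (;;) {                      /* main request loop */
        if (try_to_freeze())        /* sleep here while the system is suspended */
            continue;
        /* ...check signals, svc_recv(), svc_process()... */
    }

Without the try_to_freeze() call the thread never actually enters the
refrigerator, which is the bug this patch fixes.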

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 0f4148a..03a83a0 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -155,6 +155,9 @@ lockd(struct svc_rqst *rqstp)
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];

+ if (try_to_freeze())
+ continue;
+
if (signalled()) {
flush_signals(current);
if (nlmsvc_ops) {
--
1.5.3.6


2008-01-05 12:02:59

by Jeff Layton

Subject: [PATCH 6/6] NLM: Add reference counting to lockd

...and only have lockd exit when the last reference is dropped.

The problem is this:

When a lock that a client is blocking on comes free, lockd does this in
nlmsvc_grant_blocked():

nlm_async_call(block->b_call, NLMPROC_GRANTED_MSG, &nlmsvc_grant_ops);

the callback from this call is nlmsvc_grant_callback(). That function
does this at the end to wake up lockd:

svc_wake_up(block->b_daemon);

However there is no guarantee that lockd will be up when this happens.
If someone shuts down or restarts lockd before the async call completes,
then the b_daemon pointer will point to freed memory and the kernel may
oops.

I first noticed this on older kernels and had mistakenly thought that
newer kernels weren't susceptible, but that's not correct. There's a bit
of a race to make sure that the nlm_host is bound when the async call is
done, but I can now reproduce this at will on current kernels.

This patch is based on Trond's suggestion to add a new reference counter
to lockd, and to only allow lockd to go down when that counter reaches 0.
With this change we can't use kthread_stop here: nlmsvc_unlink_block is
called by lockd itself, and a kthread can't call kthread_stop on itself.
So the patch changes lockd to check the refcount itself and to return
when it drops to 0. We do the check and the exit while holding the
nlmsvc_mutex to make sure that a new lockd is not started until the old
one is down.
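
The exit logic in the main loop then looks roughly like this (a sketch
only; the real code below also has to deal with freezing, signals and
svc_recv):

    /*
     * nlmsvc_ref holds one reference while there are lockd_up() users
     * and one while the nlm_blocked list is non-empty.
     */
    mutex_lock(&nlmsvc_mutex);
    while (atomic_read(&nlmsvc_ref) != 0) {
        mutex_unlock(&nlmsvc_mutex);

        /* ...service one request... */

        mutex_lock(&nlmsvc_mutex);
    }
    /*
     * The count hit zero: shut everything down while still holding the
     * mutex so that a new lockd can't start until this one is gone.
     */
    nlm_shutdown_hosts();
    mutex_unlock(&nlmsvc_mutex);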

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 51 +++++++++++++++++++++++++++++++++---------
fs/lockd/svclock.c | 5 ++++
include/linux/lockd/lockd.h | 1 +
3 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index d7209ea..0f56edf 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -51,6 +51,7 @@ static DEFINE_MUTEX(nlmsvc_mutex);
static unsigned int nlmsvc_users;
static struct task_struct *nlmsvc_task;
static struct svc_serv *nlmsvc_serv;
+atomic_t nlmsvc_ref = ATOMIC_INIT(0);
int nlmsvc_grace_period;
unsigned long nlmsvc_timeout;

@@ -134,7 +135,10 @@ lockd(void *vrqstp)

set_freezable();

- /* Process request with signals blocked, but allow SIGKILL. */
+ /*
+ * Process request with signals blocked, but allow SIGKILL which
+ * signifies that lockd should drop all of its locks.
+ */
allow_signal(SIGKILL);

dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");
@@ -147,15 +151,19 @@ lockd(void *vrqstp)

/*
* The main request loop. We don't terminate until the last
- * NFS mount or NFS daemon has gone away, and we've been sent a
- * signal, or else another process has taken over our job.
+ * NFS mount or NFS daemon has gone away, and the nlm_blocked
+ * list is empty. The nlmsvc_mutex ensures that we prevent a
+ * new lockd from being started before the old one is down.
*/
- while (!kthread_should_stop()) {
+ mutex_lock(&nlmsvc_mutex);
+ while (atomic_read(&nlmsvc_ref) != 0) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];

+ mutex_unlock(&nlmsvc_mutex);
+
if (try_to_freeze())
- continue;
+ goto again;

if (signalled()) {
flush_signals(current);
@@ -182,11 +190,12 @@ lockd(void *vrqstp)
*/
err = svc_recv(rqstp, timeout);
if (err == -EAGAIN || err == -EINTR)
- continue;
+ goto again;
if (err < 0) {
printk(KERN_WARNING
"lockd: terminating on error %d\n",
-err);
+ mutex_lock(&nlmsvc_mutex);
break;
}

@@ -194,19 +203,22 @@ lockd(void *vrqstp)
svc_print_addr(rqstp, buf, sizeof(buf)));

svc_process(rqstp);
+again:
+ mutex_lock(&nlmsvc_mutex);
}

- flush_signals(current);
-
/*
- * Check whether there's a new lockd process before
- * shutting down the hosts and clearing the slot.
+ * at this point lockd is committed to going down. We hold the
+ * nlmsvc_mutex until just before exit to prevent a new one
+ * from starting before it's down.
*/
+ flush_signals(current);
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
nlmsvc_task = NULL;
nlmsvc_serv = NULL;
+ mutex_unlock(&nlmsvc_mutex);

/* Exit the RPC thread */
svc_exit_thread(rqstp);
@@ -269,6 +281,10 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
int error = 0;

mutex_lock(&nlmsvc_mutex);
+
+ if (!nlmsvc_users)
+ atomic_inc(&nlmsvc_ref);
+
/*
* Check whether we're already up and running.
*/
@@ -328,6 +344,8 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
destroy_and_out:
svc_destroy(serv);
out:
+ if (!nlmsvc_users && error)
+ atomic_dec(&nlmsvc_ref);
if (!error)
nlmsvc_users++;
mutex_unlock(&nlmsvc_mutex);
@@ -357,7 +375,18 @@ lockd_down(void)
goto out;
}
warned = 0;
- kthread_stop(nlmsvc_task);
+ if (atomic_sub_return(1, &nlmsvc_ref) != 0)
+ printk(KERN_WARNING "lockd_down: lockd is waiting for "
+ "outstanding requests to complete before exiting.\n");
+
+ /*
+ * Sending a signal is necessary here. If we get to this point and
+ * nlm_blocked isn't empty then lockd may be held hostage by clients
+ * that are still blocking. Sending the signal makes sure that lockd
+ * invalidates all of its locks so that it's just waiting on RPC
+ * callbacks to complete
+ */
+ kill_proc(nlmsvc_task->pid, SIGKILL, 1);
out:
mutex_unlock(&nlmsvc_mutex);
}
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index d120ec3..b8fbda3 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -61,6 +61,9 @@ nlmsvc_insert_block(struct nlm_block *block, unsigned long when)
struct list_head *pos;

dprintk("lockd: nlmsvc_insert_block(%p, %ld)\n", block, when);
+ if (list_empty(&nlm_blocked))
+ atomic_inc(&nlmsvc_ref);
+
if (list_empty(&block->b_list)) {
kref_get(&block->b_count);
} else {
@@ -239,6 +242,8 @@ static int nlmsvc_unlink_block(struct nlm_block *block)
/* Remove block from list */
status = posix_unblock_lock(block->b_file->f_file, &block->b_call->a_args.lock.fl);
nlmsvc_remove_block(block);
+ if (list_empty(&nlm_blocked))
+ atomic_dec(&nlmsvc_ref);
return status;
}

diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
index e2d1ce3..7389553 100644
--- a/include/linux/lockd/lockd.h
+++ b/include/linux/lockd/lockd.h
@@ -154,6 +154,7 @@ extern struct svc_procedure nlmsvc_procedures4[];
extern int nlmsvc_grace_period;
extern unsigned long nlmsvc_timeout;
extern int nsm_use_hostnames;
+extern atomic_t nlmsvc_ref;

/*
* Lockd client functions
--
1.5.3.6


2008-01-05 12:03:00

by Jeff Layton

Subject: [PATCH 3/6] NLM: Initialize completion variable in lockd_up

lockd_start_done is a global var that can be reused if lockd is
restarted, but it's never reinitialized. On all but the first use,
wait_for_completion isn't actually waiting on it since it has
already completed once.
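
In other words, a completion object that gets reused across lockd
restarts has to be re-armed before each use. Roughly (a sketch, not the
exact lockd_up code):

    static DECLARE_COMPLETION(lockd_start_done);

    /*
     * Before each (re)start, clear out any leftover completion state so
     * that the wait below really waits for the new thread.
     */
    init_completion(&lockd_start_done);
    error = svc_create_thread(lockd, serv);
    if (!error)
        wait_for_completion(&lockd_start_done);

    /*
     * The new lockd thread calls complete(&lockd_start_done) once it has
     * finished setting itself up.
     */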

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 82e2192..0f4148a 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -300,6 +300,7 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
/*
* Create the kernel thread and wait for it to start.
*/
+ init_completion(&lockd_start_done);
error = svc_create_thread(lockd, serv);
if (error) {
printk(KERN_WARNING
--
1.5.3.6


2008-01-05 12:03:00

by Jeff Layton

Subject: [PATCH 2/6] SUNRPC: export svc_sock_update_bufs

Needed since the plan is to drop the svc_create_thread helper and have
its current users call kthread_run directly.

Signed-off-by: Jeff Layton <[email protected]>
---
net/sunrpc/svcsock.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 057c870..f2bef16 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1407,6 +1407,7 @@ svc_sock_update_bufs(struct svc_serv *serv)
}
spin_unlock_bh(&serv->sv_lock);
}
+EXPORT_SYMBOL(svc_sock_update_bufs);

/*
* Receive the next request on any socket. This code is carefully
--
1.5.3.6


2008-01-05 12:03:04

by Jeff Layton

Subject: [PATCH 5/6] NLM: Convert lockd to use kthreads

Have lockd_up start lockd using kthread_run. With this change,
lockd_down now blocks until lockd actually exits, so there's no longer
any need for the waitqueue code at the end of lockd_down. This also
means that only one lockd can be running at a time, which simplifies
the code within lockd's main loop.
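
The heart of the conversion is the standard kthread start/stop pairing,
roughly (a sketch; the real lockd_up/lockd_down below also handle the
svc_rqst setup and the error paths):

    /* lockd_up(): start the thread */
    nlmsvc_task = kthread_run(lockd, rqstp, serv->sv_name);
    if (IS_ERR(nlmsvc_task)) {
        error = PTR_ERR(nlmsvc_task);
        nlmsvc_task = NULL;
        /* ...clean up and bail out... */
    }

    /* lockd(): main request loop */
    while (!kthread_should_stop()) {
        /* ...service requests... */
    }

    /*
     * lockd_down(): kthread_stop() wakes the thread, makes
     * kthread_should_stop() return true, and blocks until lockd()
     * has returned.
     */
    kthread_stop(nlmsvc_task);

Because kthread_stop() waits for the thread function to return,
lockd_down() no longer needs the old wait queue and timeout dance.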

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 79 ++++++++++++++++++++++++++-----------------------------
1 files changed, 37 insertions(+), 42 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 03a83a0..d7209ea 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -25,6 +25,7 @@
#include <linux/smp.h>
#include <linux/smp_lock.h>
#include <linux/mutex.h>
+#include <linux/kthread.h>
#include <linux/freezer.h>

#include <linux/sunrpc/types.h>
@@ -48,13 +49,12 @@ EXPORT_SYMBOL(nlmsvc_ops);

static DEFINE_MUTEX(nlmsvc_mutex);
static unsigned int nlmsvc_users;
-static pid_t nlmsvc_pid;
+static struct task_struct *nlmsvc_task;
static struct svc_serv *nlmsvc_serv;
int nlmsvc_grace_period;
unsigned long nlmsvc_timeout;

static DECLARE_COMPLETION(lockd_start_done);
-static DECLARE_WAIT_QUEUE_HEAD(lockd_exit);

/*
* These can be set at insmod time (useful for NFS as root filesystem),
@@ -111,10 +111,11 @@ static inline void clear_grace_period(void)
/*
* This is the lockd kernel thread
*/
-static void
-lockd(struct svc_rqst *rqstp)
+static int
+lockd(void *vrqstp)
{
int err = 0;
+ struct svc_rqst *rqstp = vrqstp;
unsigned long grace_period_expire;

/* Lock module and set up kernel thread */
@@ -128,11 +129,9 @@ lockd(struct svc_rqst *rqstp)
/*
* Let our maker know we're running.
*/
- nlmsvc_pid = current->pid;
nlmsvc_serv = rqstp->rq_server;
complete(&lockd_start_done);

- daemonize("lockd");
set_freezable();

/* Process request with signals blocked, but allow SIGKILL. */
@@ -151,7 +150,7 @@ lockd(struct svc_rqst *rqstp)
* NFS mount or NFS daemon has gone away, and we've been sent a
* signal, or else another process has taken over our job.
*/
- while ((nlmsvc_users || !signalled()) && nlmsvc_pid == current->pid) {
+ while (!kthread_should_stop()) {
long timeout = MAX_SCHEDULE_TIMEOUT;
char buf[RPC_MAX_ADDRBUFLEN];

@@ -203,23 +202,19 @@ lockd(struct svc_rqst *rqstp)
* Check whether there's a new lockd process before
* shutting down the hosts and clearing the slot.
*/
- if (!nlmsvc_pid || current->pid == nlmsvc_pid) {
- if (nlmsvc_ops)
- nlmsvc_invalidate_all();
- nlm_shutdown_hosts();
- nlmsvc_pid = 0;
- nlmsvc_serv = NULL;
- } else
- printk(KERN_DEBUG
- "lockd: new process, skipping host shutdown\n");
- wake_up(&lockd_exit);
+ if (nlmsvc_ops)
+ nlmsvc_invalidate_all();
+ nlm_shutdown_hosts();
+ nlmsvc_task = NULL;
+ nlmsvc_serv = NULL;

/* Exit the RPC thread */
svc_exit_thread(rqstp);

/* Release module */
unlock_kernel();
- module_put_and_exit(0);
+ module_put(THIS_MODULE);
+ return 0;
}


@@ -269,14 +264,15 @@ static int make_socks(struct svc_serv *serv, int proto)
int
lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
{
- struct svc_serv * serv;
- int error = 0;
+ struct svc_serv *serv;
+ struct svc_rqst *rqstp;
+ int error = 0;

mutex_lock(&nlmsvc_mutex);
/*
* Check whether we're already up and running.
*/
- if (nlmsvc_pid) {
+ if (nlmsvc_task) {
if (proto)
error = make_socks(nlmsvc_serv, proto);
goto out;
@@ -303,11 +299,24 @@ lockd_up(int proto) /* Maybe add a 'family' option when IPv6 is supported ?? */
/*
* Create the kernel thread and wait for it to start.
*/
+ rqstp = svc_prepare_thread(serv, &serv->sv_pools[0]);
+ if (IS_ERR(rqstp)) {
+ error = PTR_ERR(rqstp);
+ printk(KERN_WARNING
+ "lockd_up: svc_rqst allocation failed, error=%d\n",
+ error);
+ goto destroy_and_out;
+ }
+
+ svc_sock_update_bufs(serv);
init_completion(&lockd_start_done);
- error = svc_create_thread(lockd, serv);
- if (error) {
+ nlmsvc_task = kthread_run(lockd, rqstp, serv->sv_name);
+ if (IS_ERR(nlmsvc_task)) {
+ error = PTR_ERR(nlmsvc_task);
+ nlmsvc_task = NULL;
printk(KERN_WARNING
- "lockd_up: create thread failed, error=%d\n", error);
+ "lockd_up: kthread_run failed, error=%d\n", error);
+ svc_exit_thread(rqstp);
goto destroy_and_out;
}
wait_for_completion(&lockd_start_done);
@@ -339,30 +348,16 @@ lockd_down(void)
if (--nlmsvc_users)
goto out;
} else
- printk(KERN_WARNING "lockd_down: no users! pid=%d\n", nlmsvc_pid);
+ printk(KERN_WARNING "lockd_down: no users! task=%p\n",
+ nlmsvc_task);

- if (!nlmsvc_pid) {
+ if (!nlmsvc_task) {
if (warned++ == 0)
printk(KERN_WARNING "lockd_down: no lockd running.\n");
goto out;
}
warned = 0;
-
- kill_proc(nlmsvc_pid, SIGKILL, 1);
- /*
- * Wait for the lockd process to exit, but since we're holding
- * the lockd semaphore, we can't wait around forever ...
- */
- clear_thread_flag(TIF_SIGPENDING);
- interruptible_sleep_on_timeout(&lockd_exit, HZ);
- if (nlmsvc_pid) {
- printk(KERN_WARNING
- "lockd_down: lockd failed to exit, clearing pid\n");
- nlmsvc_pid = 0;
- }
- spin_lock_irq(&current->sighand->siglock);
- recalc_sigpending();
- spin_unlock_irq(&current->sighand->siglock);
+ kthread_stop(nlmsvc_task);
out:
mutex_unlock(&nlmsvc_mutex);
}
--
1.5.3.6


2008-01-08 05:53:19

by NeilBrown

Subject: Re: [PATCH 1/6] SUNRPC: spin svc_rqst initialization to its own function

On Saturday January 5, [email protected] wrote:
> Move the initialzation in __svc_create_thread that happens prior to
> thread creation to a new function. Export the function to allow
> services to have better control over the svc_rqst structs.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> include/linux/sunrpc/svc.h | 2 ++
> net/sunrpc/svc.c | 43 +++++++++++++++++++++++++++++++------------
> 2 files changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 8531a70..5f07300 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -382,6 +382,8 @@ struct svc_procedure {
> */
> struct svc_serv * svc_create(struct svc_program *, unsigned int,
> void (*shutdown)(struct svc_serv*));
> +struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
> + struct svc_pool *pool);
> int svc_create_thread(svc_thread_fn, struct svc_serv *);
> void svc_exit_thread(struct svc_rqst *);
> struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index fca17d0..b29ed43 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -538,23 +538,14 @@ svc_release_buffer(struct svc_rqst *rqstp)
> put_page(rqstp->rq_pages[i]);
> }
>
> -/*
> - * Create a thread in the given pool. Caller must hold BKL.
> - * On a NUMA or SMP machine, with a multi-pool serv, the thread
> - * will be restricted to run on the cpus belonging to the pool.
> - */
> -static int
> -__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
> - struct svc_pool *pool)
> +struct svc_rqst *
> +svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool)
> {
> struct svc_rqst *rqstp;
> - int error = -ENOMEM;
> - int have_oldmask = 0;
> - cpumask_t oldmask;
>
> rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
> if (!rqstp)
> - goto out;
> + goto out_enomem;
>
> init_waitqueue_head(&rqstp->rq_wait);
>
> @@ -570,6 +561,34 @@ __svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
> spin_unlock_bh(&pool->sp_lock);
> rqstp->rq_server = serv;
> rqstp->rq_pool = pool;
> + return rqstp;
> +
> +out_thread:
> + svc_exit_thread(rqstp);

I realise that the bug existed before your change, but calling
svc_exit_thread at this point is not good.
The 'goto out_thread' is *before* "pool->sp_nrthreads++", but
svc_exit_thread does "pool->sp_nrthreads--;". Not good.

As you are playing in this code, do you feel like fixing that error
path??


Otherwise, patch looks good.
Acked-By: NeilBrown <[email protected]>

Thanks,
NeilBrown

2008-01-08 06:17:00

by NeilBrown

Subject: Re: [PATCH 5/6] NLM: Convert lockd to use kthreads

On Saturday January 5, [email protected] wrote:
> Have lockd_up start lockd using kthread_run. With this change,
> lockd_down now blocks until lockd actually exits, so there's no longer
> need for the waitqueue code at the end of lockd_down. This also means
> that only one lockd can be running at a time which simplifies the code
> within lockd's main loop.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---


> - module_put_and_exit(0);
> + module_put(THIS_MODULE);
> + return 0;

This change bothers me. Putting the last ref to a module in code
inside that module is not safe, which is why module_put_and_exit
exists.

So this module_put is either unsafe or not needed. I think the
latter.

As you say in the comment, lockd_down now blocks until lockd actually
exits. As every caller of lockd_down will own a reference to the
lockd module, the lockd thread no longer needs to own a reference too.
So I think it is safe to remove the module_put, and also remove the
__module_get at the top of the lockd function.
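
Put another way, the old self-pinning pattern is only safe if the exit
goes through module_put_and_exit(), roughly (a sketch, not the exact
lockd code):

    __module_get(THIS_MODULE);      /* at the top of lockd() */

    /* ...main request loop... */

    /*
     * Never returns: the final module_put() happens in core kernel
     * code, outside this module's own text.
     */
    module_put_and_exit(0);

With the kthread conversion, though, lockd_down() blocks until the
thread has exited, and every caller of lockd_down() holds a reference to
the lockd module for that whole window, so the thread no longer needs to
pin the module at all and can simply return from the thread function.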

Also, I suspect that the "no users" and "no lockd running" messages
should probably be changed to BUGs as they really should be
impossible, not just unlikely.

Also:

> * Check whether there's a new lockd process before
> * shutting down the hosts and clearing the slot.
> */

This comment should go as the actual check has gone.

The rest looks fine. It is a substantial improvement, thanks.

NeilBrown

2008-01-08 06:46:42

by NeilBrown

Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Saturday January 5, [email protected] wrote:
> @@ -357,7 +375,18 @@ lockd_down(void)
> goto out;
> }
> warned = 0;
> - kthread_stop(nlmsvc_task);
> + if (atomic_sub_return(1, &nlmsvc_ref) != 0)
> + printk(KERN_WARNING "lockd_down: lockd is waiting for "
> + "outstanding requests to complete before exiting.\n");

Why not "atomic_dec_and_test" ??

> +
> + /*
> + * Sending a signal is necessary here. If we get to this point and
> + * nlm_blocked isn't empty then lockd may be held hostage by clients
> + * that are still blocking. Sending the signal makes sure that lockd
> + * invalidates all of its locks so that it's just waiting on RPC
> + * callbacks to complete
> + */
> + kill_proc(nlmsvc_task->pid, SIGKILL, 1);

The previous patch removes a kill_proc(... SIGKILL), this one adds it
back.
That makes me wonder if the intermediate state is 'correct'.

But I also wonder what "correct" means.
Do we want all locks to be dropped when the last nfsd thread dies?
The answer is presumably either "yes" or "no".
If "yes", then we don't have that because if there are any NFS mounts
active, lockd will not be killed.
If "no", then we don't want this kill_proc here.

The comment in lockd() which currently reads:

/*
* The main request loop. We don't terminate until the last
* NFS mount or NFS daemon has gone away, and we've been sent a
* signal, or else another process has taken over our job.
*/

suggests that someone once thought that lockd could hang around after
all nfsd threads and nfs mounts had gone, but I don't think it does.

We really should think this through and get it right, because if lockd
ever drops its locks, then we really need to make sure sm_notify gets
run. So it needs to be a well-defined event.

Thoughts?

Also, it is sad that the inc/dec of nlmsvc_ref is called in somewhat
non-obvious ways.
e.g.

> + if (!nlmsvc_users && error)
> + atomic_dec(&nlmsvc_ref);

and

> + if (list_empty(&nlm_blocked))
> + atomic_inc(&nlmsvc_ref);
> +
> if (list_empty(&block->b_list)) {
> kref_get(&block->b_count);
> } else {

where if we moved the atomic_inc a little bit later next to the
"list_add_tail" (which seems to make more sense) it would actually be
wrong... But I think that code is correct as it is - just non-obvious.

NeilBrown


2008-01-08 12:11:29

by Jeff Layton

Subject: Re: [PATCH 1/6] SUNRPC: spin svc_rqst initialization to its own function

On Tue, 8 Jan 2008 16:53:09 +1100
Neil Brown <[email protected]> wrote:

> On Saturday January 5, [email protected] wrote:
> > Move the initialzation in __svc_create_thread that happens prior to
> > thread creation to a new function. Export the function to allow
> > services to have better control over the svc_rqst structs.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > include/linux/sunrpc/svc.h | 2 ++
> > net/sunrpc/svc.c | 43 +++++++++++++++++++++++++++++++------------
> > 2 files changed, 33 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 8531a70..5f07300 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -382,6 +382,8 @@ struct svc_procedure {
> > */
> > struct svc_serv * svc_create(struct svc_program *, unsigned int,
> > void (*shutdown)(struct svc_serv*));
> > +struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
> > + struct svc_pool *pool);
> > int svc_create_thread(svc_thread_fn, struct svc_serv *);
> > void svc_exit_thread(struct svc_rqst *);
> > struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index fca17d0..b29ed43 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -538,23 +538,14 @@ svc_release_buffer(struct svc_rqst *rqstp)
> > put_page(rqstp->rq_pages[i]);
> > }
> >
> > -/*
> > - * Create a thread in the given pool. Caller must hold BKL.
> > - * On a NUMA or SMP machine, with a multi-pool serv, the thread
> > - * will be restricted to run on the cpus belonging to the pool.
> > - */
> > -static int
> > -__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
> > - struct svc_pool *pool)
> > +struct svc_rqst *
> > +svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool)
> > {
> > struct svc_rqst *rqstp;
> > - int error = -ENOMEM;
> > - int have_oldmask = 0;
> > - cpumask_t oldmask;
> >
> > rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
> > if (!rqstp)
> > - goto out;
> > + goto out_enomem;
> >
> > init_waitqueue_head(&rqstp->rq_wait);
> >
> > @@ -570,6 +561,34 @@ __svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
> > spin_unlock_bh(&pool->sp_lock);
> > rqstp->rq_server = serv;
> > rqstp->rq_pool = pool;
> > + return rqstp;
> > +
> > +out_thread:
> > + svc_exit_thread(rqstp);
>
> I realise that the bug existed before your change, but calling
> svc_exit_thread at this point is not good.
> The 'goto out_thread' is *before* "pool->sp_nrthreads++", but
> svc_exit_thread does "pool->sp_nrthreads--;". Not good.
>
> As you are playing is this code, do you feel like fixing that error
> path??
>
>
> Otherwise, patch looks good.
> Acked-By: NeilBrown <[email protected]>
>
> Thanks,
> NeilBrown

Good catch. That bug will be a NULL pointer dereference since
rqstp->rq_pool isn't even set yet. There's a similar bug with
serv->sv_nrthreads too.

Still looking over your other comments, but I'll plan to post an
updated patchset that includes a fix for that.

--
Jeff Layton <[email protected]>

2008-01-08 13:26:13

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Tue, 8 Jan 2008 17:46:33 +1100
Neil Brown <[email protected]> wrote:

The comments about patch 5/6 seem sane. I'll plan to incorporate them
in the respin...

> On Saturday January 5, [email protected] wrote:
> > @@ -357,7 +375,18 @@ lockd_down(void)
> > goto out;
> > }
> > warned = 0;
> > - kthread_stop(nlmsvc_task);
> > + if (atomic_sub_return(1, &nlmsvc_ref) != 0)
> > + printk(KERN_WARNING "lockd_down: lockd is waiting for "
> > + "outstanding requests to complete before exiting.\n");
>
> Why not "atomic_dec_and_test" ??
>

Temporary amnesia? :-) I'll change that, atomic_dec_and_test will be
clearer.

> > +
> > + /*
> > + * Sending a signal is necessary here. If we get to this point and
> > + * nlm_blocked isn't empty then lockd may be held hostage by clients
> > + * that are still blocking. Sending the signal makes sure that lockd
> > + * invalidates all of its locks so that it's just waiting on RPC
> > + * callbacks to complete
> > + */
> > + kill_proc(nlmsvc_task->pid, SIGKILL, 1);
>
> The previous patch removes a kill_proc(... SIGKILL), this one adds it
> back.
> That makes me wonder if the intermediate state is 'correct'.
>
> But I also wonder what "correct" means.
> Do we want all locks to be dropped when the last nfsd thread dies?
> The answer is presumably either "yes" or "no".
> If "yes", then we don't have that because if there are any NFS mounts
> active, lockd will not be killed.
> If "no", then we don't want this kill_proc here.
>
> The comment in lockd() which currently reads:
>
> /*
> * The main request loop. We don't terminate until the last
> > * NFS mount or NFS daemon has gone away, and we've been sent a
> > * signal, or else another process has taken over our job.
> */
>
> suggests that someone once thought that lockd could hang around after
> all nfsd threads and nfs mounts had gone, but I don't think it does.
>
> We really should think this through and get it right, because if lockd
> ever drops it's locks, then we really need to make sure sm_notify gets
> run. So it needs to be a well defined event.
>
> Thoughts?
>

This is the part I've been struggling with the most -- defining what
proper behavior should be when lockd is restarted. As you point out,
restarting lockd without doing a sm_notify could be bad news for data
integrity.

Then again, we'd like someone to be able to shut down the NFS "service"
and be able to unmount underlying filesystems without jumping through
special hoops....

Overall, I think I'd vote "yes". We need to drop locks when the last
nfsd goes down. If userspace brings down nfsd, then it's userspace's
responsibility to make sure that a sm_notify is sent when nfsd and lockd
are restarted.

As a side note, I'm not thrilled with this design that mixes signals
and kthreads, but didn't see another way to do this. I'm open to
suggestions if anyone has them...

> Also, it is sad that the inc/dec of nlmsvc_ref is called in somewhat
> non-obvious ways.
> e.g.
>
> > + if (!nlmsvc_users && error)
> > + atomic_dec(&nlmsvc_ref);
>
> and
>
> > + if (list_empty(&nlm_blocked))
> > + atomic_inc(&nlmsvc_ref);
> > +
> > if (list_empty(&block->b_list)) {
> > kref_get(&block->b_count);
> > } else {
>
> where if we moved the atomic_inc a little bit later next to the
> "list_add_tail" (which seems to make more sense) it would actually be
> wrong... But I think that code is correct as it is - just non-obvious.
>

The nlmsvc_ref logic is pretty convoluted, unfortunately. I'll plan to
add some comments to clarify what I'm doing there.

Thanks for the review, Neil. I'll see if I can get a new patchset done
in the next few days.

Cheers,
--
Jeff Layton <[email protected]>

2008-01-08 15:55:43

by Wendy Cheng

Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

Jeff Layton wrote:
>
>> The previous patch removes a kill_proc(... SIGKILL), this one adds it
>> back.
>> That makes me wonder if the intermediate state is 'correct'.
>>
>> But I also wonder what "correct" means.
>> Do we want all locks to be dropped when the last nfsd thread dies?
>> The answer is presumably either "yes" or "no".
>> If "yes", then we don't have that because if there are any NFS mounts
>> active, lockd will not be killed.
>> If "no", then we don't want this kill_proc here.
>>
>> The comment in lockd() which currently reads:
>>
>> /*
>> * The main request loop. We don't terminate until the last
>> * NFS mount or NFS daemon has gone away, and we've been sent a
>> * signal, or else another process has taken over our job.
>> */
>>
>> suggests that someone once thought that lockd could hang around after
>> all nfsd threads and nfs mounts had gone, but I don't think it does.
>>
>> We really should think this through and get it right, because if lockd
>> ever drops it's locks, then we really need to make sure sm_notify gets
>> run. So it needs to be a well defined event.
>>
>> Thoughts?
>>
>>
>
> This is the part I've been struggling with the most -- defining what
> proper behavior should be when lockd is restarted. As you point out,
> restarting lockd without doing a sm_notify could be bad news for data
> integrity.
>
> Then again, we'd like someone to be able to shut down the NFS "service"
> and be able to unmount underlying filesystems without jumping through
> special hoops....
>
> Overall, I think I'd vote "yes". We need to drop locks when the last
> nfsd goes down. If userspace brings down nfsd, then it's userspace's
> responsibility to make sure that a sm_notify is sent when nfsd and lockd
> are restarted.
>

I would vote for "no", at least for nfs v3. Shutting down lockd would
require clients to reclaim the locks. With current status (protocol,
design, and even the implementation itself, etc), it is simply too
disruptive. I understand current logic (i.e. shutting down nfsd but
leaving lockd alone) is awkward but debugging multiple platforms
(remember clients may not be on linux boxes) is very non-trivial.

-- Wendy



2008-01-08 16:13:13

by Peter Staubach

Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

Jeff Layton wrote:
> On Tue, 8 Jan 2008 17:46:33 +1100
> Neil Brown <[email protected]> wrote:
>
> The comments about patch 5/6 seem sane. I'll plan to incorporate them
> in the respin...
>
>
>> On Saturday January 5, [email protected] wrote:
>>
>>> @@ -357,7 +375,18 @@ lockd_down(void)
>>> goto out;
>>> }
>>> warned = 0;
>>> - kthread_stop(nlmsvc_task);
>>> + if (atomic_sub_return(1, &nlmsvc_ref) != 0)
> >>> + printk(KERN_WARNING "lockd_down: lockd is waiting for "
> >>> + "outstanding requests to complete before exiting.\n");
>>>
>> Why not "atomic_dec_and_test" ??
>>
>>
>
> Temporary amnesia? :-) I'll change that, atomic_dec_and_test will be
> clearer.
>
>
>>> +
>>> + /*
> >>> + * Sending a signal is necessary here. If we get to this point and
> >>> + * nlm_blocked isn't empty then lockd may be held hostage by clients
> >>> + * that are still blocking. Sending the signal makes sure that lockd
> >>> + * invalidates all of its locks so that it's just waiting on RPC
> >>> + * callbacks to complete
>>> + */
>>> + kill_proc(nlmsvc_task->pid, SIGKILL, 1);
>>>
>> The previous patch removes a kill_proc(... SIGKILL), this one adds it
>> back.
>> That makes me wonder if the intermediate state is 'correct'.
>>
>> But I also wonder what "correct" means.
>> Do we want all locks to be dropped when the last nfsd thread dies?
>> The answer is presumably either "yes" or "no".
>> If "yes", then we don't have that because if there are any NFS mounts
>> active, lockd will not be killed.
>> If "no", then we don't want this kill_proc here.
>>
>> The comment in lockd() which currently reads:
>>
>> /*
>> * The main request loop. We don't terminate until the last
>> * NFS mount or NFS daemon has gone away, and we've been sent a
>> * signal, or else another process has taken over our job.
>> */
>>
>> suggests that someone once thought that lockd could hang around after
>> all nfsd threads and nfs mounts had gone, but I don't think it does.
>>
>> We really should think this through and get it right, because if lockd
>> ever drops it's locks, then we really need to make sure sm_notify gets
>> run. So it needs to be a well defined event.
>>
>> Thoughts?
>>
>>
>
> This is the part I've been struggling with the most -- defining what
> proper behavior should be when lockd is restarted. As you point out,
> restarting lockd without doing a sm_notify could be bad news for data
> integrity.
>
> Then again, we'd like someone to be able to shut down the NFS "service"
> and be able to unmount underlying filesystems without jumping through
> special hoops....
>
> Overall, I think I'd vote "yes". We need to drop locks when the last
> nfsd goes down. If userspace brings down nfsd, then it's userspace's
> responsibility to make sure that a sm_notify is sent when nfsd and lockd
> are restarted.
>

I would vote for the simplest possible model that makes sense.
We need a simple model for admins as well as a simple model
which is easy to implement in as bug-free a way as possible. The
trick is not making it too simple because that can cost
performance, but not making it too complicated to implement
reasonably and for admins to be able to figure out.

So, I would vote for "yes" as well. That will yield an
architecture where we can shut down systems cleanly and will
make it easy to understand when locks for clients exist and when
they do not.

Thanx...

ps



> As a side note, I'm not thrilled with this design that mixes signals
> and kthreads, but didn't see another way to do this. I'm open to
> suggestions if anyone has them...
>
>
>> Also, it is sad that the inc/dec of nlmsvc_ref is called in somewhat
>> non-obvious ways.
>> e.g.
>>
>>
>>> + if (!nlmsvc_users && error)
>>> + atomic_dec(&nlmsvc_ref);
>>>
>> and
>>
>>
>>> + if (list_empty(&nlm_blocked))
>>> + atomic_inc(&nlmsvc_ref);
>>> +
>>> if (list_empty(&block->b_list)) {
>>> kref_get(&block->b_count);
>>> } else {
>>>
>> where if we moved the atomic_inc a little bit later next to the
>> "list_add_tail" (which seems to make more sense) it would actually be
>> wrong... But I think that code is correct as it is - just non-obvious.
>>
>>
>
> The nlmsvc_ref logic is pretty convoluted, unfortunately. I'll plan to
> add some comments to clarify what I'm doing there.
>
> Thanks for the review, Neil. I'll see if I can get a new patchset done
> in the next few days.
>
> Cheers,
>


2008-01-08 16:13:55

by Jeff Layton

Subject: Re: [PATCH 6/6] NLM: Add reference counting to lockd

On Tue, 08 Jan 2008 10:52:19 -0500
Wendy Cheng <[email protected]> wrote:

> Jeff Layton wrote:
> >
> >> The previous patch removes a kill_proc(... SIGKILL), this one
> >> adds it back.
> >> That makes me wonder if the intermediate state is 'correct'.
> >>
> >> But I also wonder what "correct" means.
> >> Do we want all locks to be dropped when the last nfsd thread dies?
> >> The answer is presumably either "yes" or "no".
> >> If "yes", then we don't have that because if there are any NFS
> >> mounts active, lockd will not be killed.
> >> If "no", then we don't want this kill_proc here.
> >>
> >> The comment in lockd() which currently reads:
> >>
> >> /*
> >> * The main request loop. We don't terminate until the last
> >> * NFS mount or NFS daemon has gone away, and we've been sent a
> >> * signal, or else another process has taken over our job.
> >> */
> >>
> >> suggests that someone once thought that lockd could hang around
> >> after all nfsd threads and nfs mounts had gone, but I don't think
> >> it does.
> >>
> >> We really should think this through and get it right, because if
> >> lockd ever drops it's locks, then we really need to make sure
> >> sm_notify gets run. So it needs to be a well defined event.
> >>
> >> Thoughts?
> >>
> >>
> >
> > This is the part I've been struggling with the most -- defining what
> > proper behavior should be when lockd is restarted. As you point out,
> > restarting lockd without doing a sm_notify could be bad news for
> > data integrity.
> >
> > Then again, we'd like someone to be able to shut down the NFS
> > "service" and be able to unmount underlying filesystems without
> > jumping through special hoops....
> >
> > Overall, I think I'd vote "yes". We need to drop locks when the last
> > nfsd goes down. If userspace brings down nfsd, then it's userspace's
> > responsibility to make sure that a sm_notify is sent when nfsd and
> > lockd are restarted.
> >
>
> I would vote for "no", at least for nfs v3. Shutting down lockd would
> require clients to reclaim the locks. With current status (protocol,
> design, and even the implementation itself, etc), it is simply too
> disruptive. I understand current logic (i.e. shutting down nfsd but
> leaving lockd alone) is awkward but debugging multiple platforms
> (remember clients may not be on linux boxes) is very non-trivial.
>

The current lockd implementation already drops all locks if nfsd goes
down (providing there are no local NFS mounts). The last lockd_down call
will bring down lockd and it will drop all of its locks in the process.
My vote for "yes" is a vote to keep things the way they are. I don't
think I'd consider it disruptive.

Changing lockd to not drop locks will mean that userspace will need to
take extra steps if someone wants to bring down NFS and unmount an
underlying filesystem. Those extra steps could be a SIGKILL to lockd or
a call into the new interfaces your recent patchset adds. Either way,
that would mean a change in behavior that will have to be accounted for
in userspace.

--
Jeff Layton <[email protected]>