2013-02-01 11:27:42

by Stanislav Kinsbursky

Subject: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

After "NFS" (SUNRPC + NFSd actually) containerization work some basic
principles of SUNRPC service initialization and deinitialization has been
changed: now one service can be shared between different network namespaces
and network "resources" can be attached or detached from the running service.
This leads to races, described here:

https://bugzilla.redhat.com/show_bug.cgi?id=904870

which this small patch set aims to solve by using per-cpu rw semaphores
to synchronize per-net resource processing and shutdown.

The following series implements...
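
For illustration, here is a minimal sketch of the locking pattern this series
proposes (this is not the actual patch; the real code embeds the semaphore in
struct svc_serv as sv_xprt_sem and takes it around svc_get_next_xprt() and
svc_xprt_release(), see patch 2/2):

#include <linux/percpu-rwsem.h>

/* percpu_init_rwsem(&xprt_sem) must be called once at service creation */
static struct percpu_rw_semaphore xprt_sem;	/* per-service in the real patch */

static void service_thread_side(void)
{
	/* readers are cheap and per-cpu; taken around transport processing */
	percpu_down_read(&xprt_sem);
	/* ... dequeue a pending transport and handle it ... */
	percpu_up_read(&xprt_sem);
}

static void per_net_shutdown_side(void)
{
	/* the writer excludes all readers while per-net resources are torn down */
	percpu_down_write(&xprt_sem);
	/* ... close and clear this namespace's transports ... */
	percpu_up_write(&xprt_sem);
}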

---

Stanislav Kinsbursky (2):
per-cpu semaphores: export symbols to modules
SUNRPC: protect transport processing with per-cpu rw semaphore


include/linux/sunrpc/svc.h | 2 ++
lib/Makefile | 2 +-
lib/percpu-rwsem.c | 6 ++++++
net/sunrpc/Kconfig | 1 +
net/sunrpc/svc.c | 2 ++
net/sunrpc/svc_xprt.c | 33 +++++++++++++++++++++++++++------
6 files changed, 39 insertions(+), 7 deletions(-)



2013-02-11 00:26:03

by J. Bruce Fields

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

On Fri, Feb 01, 2013 at 02:28:21PM +0300, Stanislav Kinsbursky wrote:
> After "NFS" (SUNRPC + NFSd actually) containerization work some basic
> principles of SUNRPC service initialization and deinitialization has been
> changed: now one service can be shared between different network namespaces
> and network "resources" can be attached or detached from the running service.
> This leads to races, described here:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=904870
>
> which this small patch set aims to solve by using per-cpu rw semaphores
> to synchronize per-net resource processing and shutdown.

Sorry for the slow response. I think this is probably correct.

But I think we got into this mess because the server shutdown logic is
too complicated. So I'd prefer to find a way to fix the problem by
simplifying things rather than by adding another lock.

Do you see anything wrong with the following?

--b

commit e8202f39f84b8863337f55159dd18478b9ccb616
Author: J. Bruce Fields <[email protected]>
Date: Sun Feb 10 16:08:11 2013 -0500

svcrpc: fix and simplify server shutdown

Simplify server shutdown, and make it correct whether or not there are
still threads running (as will happen in the case we're only shutting
down the service in one network namespace).

Do that by doing what we'd do in normal circumstances: just CLOSE each
socket, then enqueue it.

Since there may not be threads to handle the resulting queued xprts,
also run a simplified version of the svc_recv() loop run by a server to
clean up any closed xprts afterwards.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 024a241..a98e818 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -966,12 +966,12 @@ static void svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, s
if (xprt->xpt_net != net)
continue;
set_bit(XPT_CLOSE, &xprt->xpt_flags);
- set_bit(XPT_BUSY, &xprt->xpt_flags);
+ svc_xprt_enqueue(xprt);
}
spin_unlock(&serv->sv_lock);
}

-static void svc_clear_pools(struct svc_serv *serv, struct net *net)
+static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net)
{
struct svc_pool *pool;
struct svc_xprt *xprt;
@@ -986,42 +986,31 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net)
if (xprt->xpt_net != net)
continue;
list_del_init(&xprt->xpt_ready);
+ spin_unlock_bh(&pool->sp_lock);
+ return xprt;
}
spin_unlock_bh(&pool->sp_lock);
}
+ return NULL;
}

-static void svc_clear_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net)
+static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
{
struct svc_xprt *xprt;
- struct svc_xprt *tmp;
- LIST_HEAD(victims);

- spin_lock(&serv->sv_lock);
- list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) {
- if (xprt->xpt_net != net)
- continue;
- list_move(&xprt->xpt_list, &victims);
- }
- spin_unlock(&serv->sv_lock);
-
- list_for_each_entry_safe(xprt, tmp, &victims, xpt_list)
+ while ((xprt = svc_dequeue_net(serv, net))) {
+ if (!test_bit(XPT_CLOSE, &xprt->xpt_flags))
+ pr_err("found un-closed xprt on service shutdown\n");
svc_delete_xprt(xprt);
+ }
}

void svc_close_net(struct svc_serv *serv, struct net *net)
{
- svc_close_list(serv, &serv->sv_tempsocks, net);
svc_close_list(serv, &serv->sv_permsocks, net);
-
- svc_clear_pools(serv, net);
- /*
- * At this point the sp_sockets lists will stay empty, since
- * svc_xprt_enqueue will not add new entries without taking the
- * sp_lock and checking XPT_BUSY.
- */
- svc_clear_list(serv, &serv->sv_tempsocks, net);
- svc_clear_list(serv, &serv->sv_permsocks, net);
+ svc_clean_up_xprts(serv, net);
+ svc_close_list(serv, &serv->sv_tempsocks, net);
+ svc_clean_up_xprts(serv, net);
}

/*

2013-02-13 05:16:32

by Stanislav Kinsbursky

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

13.02.2013 01:18, Peter Staubach wrote:
> The "+" thing seems a little odd. Why not use "||" instead? The sum of the two returns isn't really the important thing, is it? It is that either call to svc_close_list() returns non-zero.
>
> Thanx...
>

Yep, thanks for the notice.
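
For reference, the "||" variant Peter suggests would look roughly like this
(illustrative only, based on the hunk quoted below; note that "||"
short-circuits, so on iterations where the permsocks call returns non-zero the
tempsocks list is not scanned until a later pass, whereas "+" always scans
both lists):

	while (svc_close_list(serv, &serv->sv_permsocks, net) ||
	       svc_close_list(serv, &serv->sv_tempsocks, net)) {
		/* pick up and delete whatever was just closed and enqueued */
		svc_clean_up_xprts(serv, net);
		msleep(delay++);
	}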

> ps
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of J. Bruce Fields
> Sent: Tuesday, February 12, 2013 3:46 PM
> To: Stanislav Kinsbursky
> Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation
>
> On Tue, Feb 12, 2013 at 01:52:32PM +0400, Stanislav Kinsbursky wrote:
>> 12.02.2013 00:58, J. Bruce Fields wrote:
>> <snip>
>>> void svc_close_net(struct svc_serv *serv, struct net *net)
>>> {
>>> - svc_close_list(serv, &serv->sv_tempsocks, net);
>>> - svc_close_list(serv, &serv->sv_permsocks, net);
>>> -
>>> - svc_clear_pools(serv, net);
>>> - /*
>>> - * At this point the sp_sockets lists will stay empty, since
>>> - * svc_xprt_enqueue will not add new entries without taking the
>>> - * sp_lock and checking XPT_BUSY.
>>> - */
>>> - svc_clear_list(serv, &serv->sv_tempsocks, net);
>>> - svc_clear_list(serv, &serv->sv_permsocks, net);
>>> + int closed;
>>> + int delay = 0;
>>> +
>>> +again:
>>> + closed = svc_close_list(serv, &serv->sv_permsocks, net);
>>> + closed += svc_close_list(serv, &serv->sv_tempsocks, net);
>>> + if (closed) {
>>> + svc_clean_up_xprts(serv, net);
>>> + msleep(delay++);
>>> + goto again;
>>> + }
>>
>> Frankly, this hunk above makes me feel sick... :( But I have no better
>> idea right now...
>> Maybe make this hunk a bit less weird (this is from my POV only, of course), like this:
>>
>>> + while (svc_close_list(serv, &serv->sv_permsocks, net) +
>>> + svc_close_list(serv, &serv->sv_tempsocks, net)) {
>>> + svc_clean_up_xprts(serv, net);
>>> + msleep(delay++);
>>> + }
>>
>> ?
>
> OK, that's a little more compact at least.
>
> --b.
>
>>
>> Anyway, thanks!
>>
>> Acked-by: Stanislav Kinsbursky <[email protected]>
>>
>> --
>> Best regards,
>> Stanislav Kinsbursky


--
Best regards,
Stanislav Kinsbursky

2013-02-12 09:53:02

by Stanislav Kinsbursky

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

12.02.2013 00:58, J. Bruce Fields wrote:
<snip>
> void svc_close_net(struct svc_serv *serv, struct net *net)
> {
> - svc_close_list(serv, &serv->sv_tempsocks, net);
> - svc_close_list(serv, &serv->sv_permsocks, net);
> -
> - svc_clear_pools(serv, net);
> - /*
> - * At this point the sp_sockets lists will stay empty, since
> - * svc_xprt_enqueue will not add new entries without taking the
> - * sp_lock and checking XPT_BUSY.
> - */
> - svc_clear_list(serv, &serv->sv_tempsocks, net);
> - svc_clear_list(serv, &serv->sv_permsocks, net);
> + int closed;
> + int delay = 0;
> +
> +again:
> + closed = svc_close_list(serv, &serv->sv_permsocks, net);
> + closed += svc_close_list(serv, &serv->sv_tempsocks, net);
> + if (closed) {
> + svc_clean_up_xprts(serv, net);
> + msleep(delay++);
> + goto again;
> + }

Frankly, this hunk above makes me feel sick... :(
But I have no better idea right now...
Maybe make this hunk a bit less weird (this is from my POV only, of course), like this:

> + while (svc_close_list(serv, &serv->sv_permsocks, net) +
> + svc_close_list(serv, &serv->sv_tempsocks, net)) {
> + svc_clean_up_xprts(serv, net);
> + msleep(delay++);
> + }

?

Anyway, thanks!

Acked-by: Stanislav Kinsbursky <[email protected]>

--
Best regards,
Stanislav Kinsbursky

2013-02-11 16:37:21

by J. Bruce Fields

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

On Mon, Feb 11, 2013 at 10:18:18AM +0400, Stanislav Kinsbursky wrote:
> This one looks a bit complicated and confusing to me. Probably because
> I'm not that familiar with the service transport processing logic. So,
> as far as I can see, we now try to run over all per-net pool-assigned
> transports, remove them from the "ready" queue and delete them one by one.
> Then we try to enqueue all temporary sockets. But where is the enqueueing
> of permanent sockets? I.e. how do they get destroyed with this patch?
> Then we once again try to run over all per-net pool-assigned
> transports, remove them from the "ready" queue and delete them one by one. Why
> twice? I.e. why not just close them, then enqueue them and call
> svc_clean_up_xprts()?

I think you missed the first svc_close_list?:

> > svc_close_list(serv, &serv->sv_permsocks, net);
> >+ svc_clean_up_xprts(serv, net);
> >+ svc_close_list(serv, &serv->sv_tempsocks, net);
> >+ svc_clean_up_xprts(serv, net);

The idea is that before we'd like to close all the listeners first, so
that they aren't busy creating more tempsocks while we're trying to
close them.

I overlooked a race, though: if another thread was already handling an
accept for one of the listeners then it might not get closed by that
first svc_clean_up_xprts.

I guess we could do something like:

delay = 0;

again:
numclosed = svc_close_list(serv, &serv->sv_permsocks, net);
numclosed += svc_close_list(serv, &serv->sv_tempsocks, net);
if (numclosed) {
svc_clean_up_xprts(serv, net);
msleep(delay++);
goto again;
}

Seems a little cheesy, but if we don't care much about shutdown
performance in a rare corner case, maybe it's the simplest way out?

--b

2013-02-01 11:27:43

by Stanislav Kinsbursky

Subject: [PATCH 1/2] per-cpu semaphores: export symbols to modules

Per-cpu rw semaphores are wanted by the NFS kernel server.
As Bruce Fields suggested:

"The server rpc code goes to some care not to write to any global
structure, to prevent server threads running on multiple cores from
bouncing cache lines between them.

But my understanding is that even down_read() does modify the semaphore.
So we might want something like the percpu semaphore described in
Documentation/percpu-rw-semaphore.txt."


So add EXPORT_SYMBOL_GPL() for all publicly accessible per-cpu rw semaphore
functions and move percpu-rwsem.o from lib-y to obj-y so that modules may use
this library.

Signed-off-by: Stanislav Kinsbursky <[email protected]>
---
lib/Makefile | 2 +-
lib/percpu-rwsem.c | 6 ++++++
2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/lib/Makefile b/lib/Makefile
index 02ed6c0..22e0c03 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -40,7 +40,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
-lib-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o

CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
index 652a8ee..7f6aa71 100644
--- a/lib/percpu-rwsem.c
+++ b/lib/percpu-rwsem.c
@@ -7,6 +7,7 @@
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/errno.h>
+#include <linux/export.h>

int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
const char *name, struct lock_class_key *rwsem_key)
@@ -22,6 +23,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
init_waitqueue_head(&brw->write_waitq);
return 0;
}
+EXPORT_SYMBOL_GPL(__percpu_init_rwsem);

void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
{
@@ -87,6 +89,7 @@ void percpu_down_read(struct percpu_rw_semaphore *brw)
/* avoid up_read()->rwsem_release() */
__up_read(&brw->rw_sem);
}
+EXPORT_SYMBOL_GPL(percpu_down_read);

void percpu_up_read(struct percpu_rw_semaphore *brw)
{
@@ -99,6 +102,7 @@ void percpu_up_read(struct percpu_rw_semaphore *brw)
if (atomic_dec_and_test(&brw->slow_read_ctr))
wake_up_all(&brw->write_waitq);
}
+EXPORT_SYMBOL_GPL(percpu_up_read);

static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
{
@@ -150,6 +154,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
/* wait for all readers to complete their percpu_up_read() */
wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
}
+EXPORT_SYMBOL_GPL(percpu_down_write);

void percpu_up_write(struct percpu_rw_semaphore *brw)
{
@@ -163,3 +168,4 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
/* the last writer unblocks update_fast_ctr() */
atomic_dec(&brw->write_ctr);
}
+EXPORT_SYMBOL_GPL(percpu_up_write);
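
For context, a rough usage sketch of what these exports allow a module to do
(this is not part of the patch; the module and symbol names are made up, and
percpu_free_rwsem() does not appear to gain an export in this hunk, so the
sketch never frees the semaphore, just as the SUNRPC user in patch 2/2 never
does):

#include <linux/module.h>
#include <linux/percpu-rwsem.h>

static struct percpu_rw_semaphore demo_sem;	/* hypothetical */

static int __init demo_init(void)
{
	/* percpu_init_rwsem() expands to the exported __percpu_init_rwsem() */
	int err = percpu_init_rwsem(&demo_sem);
	if (err)
		return err;

	percpu_down_read(&demo_sem);	/* fast path: no shared-cacheline writes */
	percpu_up_read(&demo_sem);
	return 0;
}

static void __exit demo_exit(void)
{
	percpu_down_write(&demo_sem);	/* waits for all readers to drain */
	percpu_up_write(&demo_sem);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");	/* required: the new exports are EXPORT_SYMBOL_GPL */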


2013-02-11 20:58:49

by J. Bruce Fields

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

On Mon, Feb 11, 2013 at 11:37:15AM -0500, J. Bruce Fields wrote:
> On Mon, Feb 11, 2013 at 10:18:18AM +0400, Stanislav Kinsbursky wrote:
> > This one looks a bit complicated and confusing to me. Probably because
> > I'm not that familiar with the service transport processing logic. So,
> > as far as I can see, we now try to run over all per-net pool-assigned
> > transports, remove them from the "ready" queue and delete them one by one.
> > Then we try to enqueue all temporary sockets. But where is the enqueueing
> > of permanent sockets? I.e. how do they get destroyed with this patch?
> > Then we once again try to run over all per-net pool-assigned
> > transports, remove them from the "ready" queue and delete them one by one. Why
> > twice? I.e. why not just close them, then enqueue them and call
> > svc_clean_up_xprts()?
>
> I think you missed the first svc_close_list?:
>
> > > svc_close_list(serv, &serv->sv_permsocks, net);
> > >+ svc_clean_up_xprts(serv, net);
> > >+ svc_close_list(serv, &serv->sv_tempsocks, net);
> > >+ svc_clean_up_xprts(serv, net);
>
> The idea is that before we'd like to close all the listeners first, so
> that they aren't busy creating more tempsocks while we're trying to
> close them.
>
> I overlooked a race, though: if another thread was already handling an
> accept for one of the listeners then it might not get closed by that
> first svc_clean_up_xprts.
>
> I guess we could do something like:
>
> delay = 0;
>
> again:
> numclosed = svc_close_list(serv, &serv->sv_permsocks, net);
> numclosed += svc_close_list(serv, &serv->sv_tempsocks, net);
> if (numclosed) {
> svc_clean_up_xprts(serv, net);
> msleep(delay++);
> goto again;
> }
>
> Seems a little cheesy, but if we don't care much about shutdown
> performance in a rare corner case, maybe it's the simplest way out?

That ends up looking like this.

--b.

commit 8468ca5003356bbf5d6157807d4daed075fd438f
Author: J. Bruce Fields <[email protected]>
Date: Sun Feb 10 16:08:11 2013 -0500

svcrpc: fix rpc server shutdown races

Rewrite server shutdown to remove the assumption that there are no
longer any threads running (no longer true, for example, when shutting
down the service in one network namespace while it's still running in
others).

Do that by doing what we'd do in normal circumstances: just CLOSE each
socket, then enqueue it.

Since there may not be threads to handle the resulting queued xprts,
also run a simplified version of the svc_recv() loop run by a server to
clean up any closed xprts afterwards.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index dbf12ac..2d34b6b 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -515,15 +515,6 @@ EXPORT_SYMBOL_GPL(svc_create_pooled);

void svc_shutdown_net(struct svc_serv *serv, struct net *net)
{
- /*
- * The set of xprts (contained in the sv_tempsocks and
- * sv_permsocks lists) is now constant, since it is modified
- * only by accepting new sockets (done by service threads in
- * svc_recv) or aging old ones (done by sv_temptimer), or
- * configuration changes (excluded by whatever locking the
- * caller is using--nfsd_mutex in the case of nfsd). So it's
- * safe to traverse those lists and shut everything down:
- */
svc_close_net(serv, net);

if (serv->sv_shutdown)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 0b67409..0bd0b6f 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -948,21 +948,24 @@ void svc_close_xprt(struct svc_xprt *xprt)
}
EXPORT_SYMBOL_GPL(svc_close_xprt);

-static void svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net)
+static int svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net)
{
struct svc_xprt *xprt;
+ int ret = 0;

spin_lock(&serv->sv_lock);
list_for_each_entry(xprt, xprt_list, xpt_list) {
if (xprt->xpt_net != net)
continue;
+ ret++;
set_bit(XPT_CLOSE, &xprt->xpt_flags);
- set_bit(XPT_BUSY, &xprt->xpt_flags);
+ svc_xprt_enqueue(xprt);
}
spin_unlock(&serv->sv_lock);
+ return ret;
}

-static void svc_clear_pools(struct svc_serv *serv, struct net *net)
+static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net)
{
struct svc_pool *pool;
struct svc_xprt *xprt;
@@ -977,42 +980,49 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net)
if (xprt->xpt_net != net)
continue;
list_del_init(&xprt->xpt_ready);
+ spin_unlock_bh(&pool->sp_lock);
+ return xprt;
}
spin_unlock_bh(&pool->sp_lock);
}
+ return NULL;
}

-static void svc_clear_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net)
+static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
{
struct svc_xprt *xprt;
- struct svc_xprt *tmp;
- LIST_HEAD(victims);
-
- spin_lock(&serv->sv_lock);
- list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) {
- if (xprt->xpt_net != net)
- continue;
- list_move(&xprt->xpt_list, &victims);
- }
- spin_unlock(&serv->sv_lock);

- list_for_each_entry_safe(xprt, tmp, &victims, xpt_list)
+ while ((xprt = svc_dequeue_net(serv, net))) {
+ set_bit(XPT_CLOSE, &xprt->xpt_flags);
svc_delete_xprt(xprt);
+ }
}

+/*
+ * Server threads may still be running (especially in the case where the
+ * service is still running in other network namespaces).
+ *
+ * So we shut down sockets the same way we would on a running server, by
+ * setting XPT_CLOSE, enqueuing, and letting a thread pick it up to do
+ * the close. In the case there are no such other threads
+ * running, svc_clean_up_xprts() does a simple version of a
+ * server's main event loop, and in the case where there are other
+ * threads, we may need to wait a little while and then check again to
+ * see if they're done.
+ */
void svc_close_net(struct svc_serv *serv, struct net *net)
{
- svc_close_list(serv, &serv->sv_tempsocks, net);
- svc_close_list(serv, &serv->sv_permsocks, net);
-
- svc_clear_pools(serv, net);
- /*
- * At this point the sp_sockets lists will stay empty, since
- * svc_xprt_enqueue will not add new entries without taking the
- * sp_lock and checking XPT_BUSY.
- */
- svc_clear_list(serv, &serv->sv_tempsocks, net);
- svc_clear_list(serv, &serv->sv_permsocks, net);
+ int closed;
+ int delay = 0;
+
+again:
+ closed = svc_close_list(serv, &serv->sv_permsocks, net);
+ closed += svc_close_list(serv, &serv->sv_tempsocks, net);
+ if (closed) {
+ svc_clean_up_xprts(serv, net);
+ msleep(delay++);
+ goto again;
+ }
}

/*

2013-02-11 06:19:41

by Stanislav Kinsbursky

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

11.02.2013 04:25, J. Bruce Fields wrote:
> On Fri, Feb 01, 2013 at 02:28:21PM +0300, Stanislav Kinsbursky wrote:
>> After "NFS" (SUNRPC + NFSd actually) containerization work some basic
>> principles of SUNRPC service initialization and deinitialization has been
>> changed: now one service can be shared between different network namespaces
>> and network "resources" can be attached or detached from the running service.
>> This leads to races, described here:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=904870
>>
>> which this small patch set aims to solve by using per-cpu rw semaphores
>> to synchronize per-net resource processing and shutdown.
>
> Sorry for the slow response. I think this is probably correct.
>
> But I think we got into this mess because the server shutdown logic is
> too complicated. So I'd prefer to find a way to fix the problem by
> simplifying things rather than by adding another lock.
>

Yeah, I wasn't satisfied with the patch I sent. It was just an attempt to fix the issue quickly.
Simplifying the logic instead of adding one more lock (there are too many of them already) is much better.
Thanks!

> Do you see anything wrong with the following?
>

This one looks a bit complicated and confusing to me. Probably because I'm not that familiar with the service transport processing logic.
So, as far as I can see, we now try to run over all per-net pool-assigned transports, remove them from the "ready" queue and delete them one by one.
Then we try to enqueue all temporary sockets. But where is the enqueueing of permanent sockets? I.e. how do they get destroyed with this patch?
Then we once again try to run over all per-net pool-assigned transports, remove them from the "ready" queue and delete them one by one.
Why twice? I.e. why not just close them, then enqueue them and call svc_clean_up_xprts()?

> --b
>
> commit e8202f39f84b8863337f55159dd18478b9ccb616
> Author: J. Bruce Fields <[email protected]>
> Date: Sun Feb 10 16:08:11 2013 -0500
>
> svcrpc: fix and simplify server shutdown
>
> Simplify server shutdown, and make it correct whether or not there are
> still threads running (as will happen in the case we're only shutting
> down the service in one network namespace).
>
> Do that by doing what we'd do in normal circumstances: just CLOSE each
> socket, then enqueue it.
>
> Since there may not be threads to handle the resulting queued xprts,
> also run a simplified version of the svc_recv() loop run by a server to
> clean up any closed xprts afterwards.
>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 024a241..a98e818 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -966,12 +966,12 @@ static void svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, s
> if (xprt->xpt_net != net)
> continue;
> set_bit(XPT_CLOSE, &xprt->xpt_flags);
> - set_bit(XPT_BUSY, &xprt->xpt_flags);
> + svc_xprt_enqueue(xprt);
> }
> spin_unlock(&serv->sv_lock);
> }
>
> -static void svc_clear_pools(struct svc_serv *serv, struct net *net)
> +static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net)
> {
> struct svc_pool *pool;
> struct svc_xprt *xprt;
> @@ -986,42 +986,31 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net)
> if (xprt->xpt_net != net)
> continue;
> list_del_init(&xprt->xpt_ready);
> + spin_unlock_bh(&pool->sp_lock);
> + return xprt;
> }
> spin_unlock_bh(&pool->sp_lock);
> }
> + return NULL;
> }
>
> -static void svc_clear_list(struct svc_serv *serv, struct list_head *xprt_list, struct net *net)
> +static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
> {
> struct svc_xprt *xprt;
> - struct svc_xprt *tmp;
> - LIST_HEAD(victims);
>
> - spin_lock(&serv->sv_lock);
> - list_for_each_entry_safe(xprt, tmp, xprt_list, xpt_list) {
> - if (xprt->xpt_net != net)
> - continue;
> - list_move(&xprt->xpt_list, &victims);
> - }
> - spin_unlock(&serv->sv_lock);
> -
> - list_for_each_entry_safe(xprt, tmp, &victims, xpt_list)
> + while ((xprt = svc_dequeue_net(serv, net))) {
> + if (!test_bit(XPT_CLOSE, &xprt->xpt_flags))
> + pr_err("found un-closed xprt on service shutdown\n");
> svc_delete_xprt(xprt);
> + }
> }
>
> void svc_close_net(struct svc_serv *serv, struct net *net)
> {
> - svc_close_list(serv, &serv->sv_tempsocks, net);
> svc_close_list(serv, &serv->sv_permsocks, net);
> -
> - svc_clear_pools(serv, net);
> - /*
> - * At this point the sp_sockets lists will stay empty, since
> - * svc_xprt_enqueue will not add new entries without taking the
> - * sp_lock and checking XPT_BUSY.
> - */
> - svc_clear_list(serv, &serv->sv_tempsocks, net);
> - svc_clear_list(serv, &serv->sv_permsocks, net);
> + svc_clean_up_xprts(serv, net);
> + svc_close_list(serv, &serv->sv_tempsocks, net);
> + svc_clean_up_xprts(serv, net);
> }
>
> /*
>


--
Best regards,
Stanislav Kinsbursky

2013-02-12 20:45:37

by J. Bruce Fields

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

On Tue, Feb 12, 2013 at 01:52:32PM +0400, Stanislav Kinsbursky wrote:
> 12.02.2013 00:58, J. Bruce Fields wrote:
> <snip>
> > void svc_close_net(struct svc_serv *serv, struct net *net)
> > {
> >- svc_close_list(serv, &serv->sv_tempsocks, net);
> >- svc_close_list(serv, &serv->sv_permsocks, net);
> >-
> >- svc_clear_pools(serv, net);
> >- /*
> >- * At this point the sp_sockets lists will stay empty, since
> >- * svc_xprt_enqueue will not add new entries without taking the
> >- * sp_lock and checking XPT_BUSY.
> >- */
> >- svc_clear_list(serv, &serv->sv_tempsocks, net);
> >- svc_clear_list(serv, &serv->sv_permsocks, net);
> >+ int closed;
> >+ int delay = 0;
> >+
> >+again:
> >+ closed = svc_close_list(serv, &serv->sv_permsocks, net);
> >+ closed += svc_close_list(serv, &serv->sv_tempsocks, net);
> >+ if (closed) {
> >+ svc_clean_up_xprts(serv, net);
> >+ msleep(delay++);
> >+ goto again;
> >+ }
>
> Frankly, this hunk above makes me feel sick... :(
> But I have no better idea right now...
> Maybe make this hunk a bit less weird (this is from my POV only, of course), like this:
>
> > + while (svc_close_list(serv, &serv->sv_permsocks, net) +
> > + svc_close_list(serv, &serv->sv_tempsocks, net)) {
> > + svc_clean_up_xprts(serv, net);
> > + msleep(delay++);
> > + }
>
> ?

OK, that's a little more compact at least.

--b.

>
> Anyway, thanks!
>
> Acked-by: Stanislav Kinsbursky <[email protected]>
>
> --
> Best regards,
> Stanislav Kinsbursky

2013-02-12 21:27:41

by Peter Staubach

Subject: RE: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

The "+" thing seems a little odd. Why not use "||" instead? The sum of the two returns isn't really the important thing, is it? It is that either call to svc_close_list() returns non-zero.

Thanx...

ps


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of J. Bruce Fields
Sent: Tuesday, February 12, 2013 3:46 PM
To: Stanislav Kinsbursky
Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

On Tue, Feb 12, 2013 at 01:52:32PM +0400, Stanislav Kinsbursky wrote:
> 12.02.2013 00:58, J. Bruce Fields wrote:
> <snip>
> > void svc_close_net(struct svc_serv *serv, struct net *net)
> > {
> >- svc_close_list(serv, &serv->sv_tempsocks, net);
> >- svc_close_list(serv, &serv->sv_permsocks, net);
> >-
> >- svc_clear_pools(serv, net);
> >- /*
> >- * At this point the sp_sockets lists will stay empty, since
> >- * svc_xprt_enqueue will not add new entries without taking the
> >- * sp_lock and checking XPT_BUSY.
> >- */
> >- svc_clear_list(serv, &serv->sv_tempsocks, net);
> >- svc_clear_list(serv, &serv->sv_permsocks, net);
> >+ int closed;
> >+ int delay = 0;
> >+
> >+again:
> >+ closed = svc_close_list(serv, &serv->sv_permsocks, net);
> >+ closed += svc_close_list(serv, &serv->sv_tempsocks, net);
> >+ if (closed) {
> >+ svc_clean_up_xprts(serv, net);
> >+ msleep(delay++);
> >+ goto again;
> >+ }
>
> Frankly, this hunk above makes me feel sick... :( But I have no better
> idea right now...
> Maybe make this hunk a bit less weird (this is from my POV only, of course), like this:
>
> > + while (svc_close_list(serv, &serv->sv_permsocks, net) +
> > + svc_close_list(serv, &serv->sv_tempsocks, net)) {
> > + svc_clean_up_xprts(serv, net);
> > + msleep(delay++);
> > + }
>
> ?

OK, that's a little more compact at least.

--b.

>
> Anyway, thanks!
>
> Acked-by: Stanislav Kinsbursky <[email protected]>
>
> --
> Best regards,
> Stanislav Kinsbursky

2013-02-01 11:27:42

by Stanislav Kinsbursky

Subject: [PATCH 2/2] SUNRPC: protect transport processing with per-cpu rw semaphore

There could be a service transport that is being processed by a service thread
while racing, at the same time, with per-net service shutdown, as shown below:

CPU#0:                                          CPU#1:

svc_recv                                        svc_close_net
 svc_get_next_xprt (list_del_init(xpt_ready))
                                                 svc_close_list (set XPT_BUSY and XPT_CLOSE)
                                                 svc_clear_pools (xprt was gained on CPU#0 already)
                                                 svc_delete_xprt (set XPT_DEAD)
 svc_handle_xprt (XPT_CLOSE is set => svc_delete_xprt())
 BUG()

There can be different solutions to this problem.
This patch probably doesn't implement the best one, but I hope it is a simple one.
IOW, it protects the critical section (dequeuing a pending transport and
enqueuing it back to the pool) with a per-service rw semaphore, taken for read.
On per-net transport shutdown, this semaphore has to be taken for write.

Note: add the PERCPU_RWSEM config option selected by SUNRPC.

Signed-off-by: Stanislav Kinsbursky <[email protected]>
---
include/linux/sunrpc/svc.h | 2 ++
net/sunrpc/Kconfig | 1 +
net/sunrpc/svc.c | 2 ++
net/sunrpc/svc_xprt.c | 33 +++++++++++++++++++++++++++------
4 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 1f0216b..2031d14 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -103,6 +103,8 @@ struct svc_serv {
* entries in the svc_cb_list */
struct svc_xprt *sv_bc_xprt; /* callback on fore channel */
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
+
+ struct percpu_rw_semaphore sv_xprt_sem; /* sem to sync against per-net transports shutdown */
};

/*
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index 03d03e3..5e6548c 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -1,5 +1,6 @@
config SUNRPC
tristate
+ select PERCPU_RWSEM

config SUNRPC_GSS
tristate
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index b9ba2a8..ef74c72 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -483,6 +483,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
if (svc_uses_rpcbind(serv) && (!serv->sv_shutdown))
serv->sv_shutdown = svc_rpcb_cleanup;

+ percpu_init_rwsem(&serv->sv_xprt_sem);
+
return serv;
}

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 5a9d40c..855a6ba 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -471,6 +471,8 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
svc_reserve(rqstp, 0);
rqstp->rq_xprt = NULL;

+ percpu_up_read(&rqstp->rq_server->sv_xprt_sem);
+
svc_xprt_put(xprt);
}

@@ -618,12 +620,14 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
struct svc_pool *pool = rqstp->rq_pool;
DECLARE_WAITQUEUE(wait, current);
long time_left;
+ struct svc_serv *serv = rqstp->rq_server;

/* Normally we will wait up to 5 seconds for any required
* cache information to be provided.
*/
rqstp->rq_chandle.thread_wait = 5*HZ;

+ percpu_down_read(&serv->sv_xprt_sem);
spin_lock_bh(&pool->sp_lock);
xprt = svc_xprt_dequeue(pool);
if (xprt) {
@@ -640,7 +644,8 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
if (pool->sp_task_pending) {
pool->sp_task_pending = 0;
spin_unlock_bh(&pool->sp_lock);
- return ERR_PTR(-EAGAIN);
+ xprt = ERR_PTR(-EAGAIN);
+ goto out_err;
}
/* No data pending. Go to sleep */
svc_thread_enqueue(pool, rqstp);
@@ -661,16 +666,19 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
if (kthread_should_stop()) {
set_current_state(TASK_RUNNING);
spin_unlock_bh(&pool->sp_lock);
- return ERR_PTR(-EINTR);
+ xprt = ERR_PTR(-EINTR);
+ goto out_err;
}

add_wait_queue(&rqstp->rq_wait, &wait);
spin_unlock_bh(&pool->sp_lock);
+ percpu_up_read(&serv->sv_xprt_sem);

time_left = schedule_timeout(timeout);

try_to_freeze();

+ percpu_down_read(&serv->sv_xprt_sem);
spin_lock_bh(&pool->sp_lock);
remove_wait_queue(&rqstp->rq_wait, &wait);
if (!time_left)
@@ -681,13 +689,24 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
svc_thread_dequeue(pool, rqstp);
spin_unlock_bh(&pool->sp_lock);
dprintk("svc: server %p, no data yet\n", rqstp);
- if (signalled() || kthread_should_stop())
- return ERR_PTR(-EINTR);
- else
- return ERR_PTR(-EAGAIN);
+ if (signalled() || kthread_should_stop()) {
+ xprt = ERR_PTR(-EINTR);
+ goto out_err;
+ } else {
+ xprt = ERR_PTR(-EAGAIN);
+ goto out_err;
+ }
}
}
spin_unlock_bh(&pool->sp_lock);
+out_err:
+ if (IS_ERR(xprt))
+ /*
+ * Note: we release the semaphore only if an error occurred.
+ * Otherwise we hold it until transport processing is
+ * completed in svc_xprt_release().
+ */
+ percpu_up_read(&serv->sv_xprt_sem);
return xprt;
}

@@ -1020,10 +1039,12 @@ static void svc_clear_list(struct svc_serv *serv, struct list_head *xprt_list, s

void svc_close_net(struct svc_serv *serv, struct net *net)
{
+ percpu_down_write(&serv->sv_xprt_sem);
svc_close_list(serv, &serv->sv_tempsocks, net);
svc_close_list(serv, &serv->sv_permsocks, net);

svc_clear_pools(serv, net);
+ percpu_up_write(&serv->sv_xprt_sem);
/*
* At this point the sp_sockets lists will stay empty, since
* svc_xprt_enqueue will not add new entries without taking the


2013-02-12 06:49:42

by Stanislav Kinsbursky

Subject: Re: [PATCH 0/2] NFSD: fix races in service per-net resources allocation

11.02.2013 20:37, J. Bruce Fields wrote:
> On Mon, Feb 11, 2013 at 10:18:18AM +0400, Stanislav Kinsbursky wrote:
>> This one looks a bit complicated and confusing to me. Probably because
>> I'm not that familiar with the service transport processing logic. So,
>> as far as I can see, we now try to run over all per-net pool-assigned
>> transports, remove them from the "ready" queue and delete them one by one.
>> Then we try to enqueue all temporary sockets. But where is the enqueueing
>> of permanent sockets? I.e. how do they get destroyed with this patch?
>> Then we once again try to run over all per-net pool-assigned
>> transports, remove them from the "ready" queue and delete them one by one. Why
>> twice? I.e. why not just close them, then enqueue them and call
>> svc_clean_up_xprts()?
>
> I think you missed the first svc_close_list?:
>

Yeah, thanks! Too many deleted lines confused me.

>>> svc_close_list(serv, &serv->sv_permsocks, net);
>>> + svc_clean_up_xprts(serv, net);
>>> + svc_close_list(serv, &serv->sv_tempsocks, net);
>>> + svc_clean_up_xprts(serv, net);
>
> The idea is that before we'd like to close all the listeners first, so
> that they aren't busy creating more tempsocks while we're trying to
> close them.
>
> I overlooked a race, though: if another thread was already handling an
> accept for one of the listeners then it might not get closed by that
> first svc_clean_up_xprts.
>
> I guess we could do something like:
>
> delay = 0;
>
> again:
> numclosed = svc_close_list(serv, &serv->sv_permsocks, net);
> numclosed += svc_close_list(serv, &serv->sv_tempsocks, net);
> if (numclosed) {
> svc_clean_up_xprts(serv, net);
> msleep(delay++);
> goto again;
> }
>
> Seems a little cheesy, but if we don't care much about shutdown
> performance in a rare corner case, maybe it's the simplest way out?
>

Agreed. This part (per-net shutdown) has enough logical complexity already, and it would be great
not to increase it.


--
Best regards,
Stanislav Kinsbursky