Use READ_ONCE() to tell the compiler not to optimise away the read of
xprt->xpt_flags in svc_xprt_release_slot().
Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/svc_xprt.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 51d36230b6e3..94d344325e22 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -363,9 +363,11 @@ static void svc_xprt_release_slot(struct svc_rqst *rqstp)
 
 static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
 {
-	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
+	unsigned long xpt_flags = READ_ONCE(xprt->xpt_flags);
+
+	if (xpt_flags & (BIT(XPT_CONN) | BIT(XPT_CLOSE)))
 		return true;
-	if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED))) {
+	if (xpt_flags & (BIT(XPT_DATA) | BIT(XPT_DEFERRED))) {
 		if (xprt->xpt_ops->xpo_has_wspace(xprt) &&
 		    svc_xprt_slots_in_range(xprt))
 			return true;
--
2.20.1
On Thu, Jan 03, 2019 at 09:17:12AM -0500, Trond Myklebust wrote:
> Use READ_ONCE() to tell the compiler not to optimise away the read of
> xprt->xpt_flags in svc_xprt_release_slot().
What exactly is the possible race here? And why is a READ_ONCE()
sufficient, as opposed to some memory barriers?
I may need to shut myself in a room with memory-barriers.txt, I'm pretty
hazy on these things.
--b.
On Thu, 2019-01-03 at 17:45 -0500, J Bruce Fields wrote:
> On Thu, Jan 03, 2019 at 09:17:12AM -0500, Trond Myklebust wrote:
> > Use READ_ONCE() to tell the compiler not to optimise away the read of
> > xprt->xpt_flags in svc_xprt_release_slot().
>
> What exactly is the possible race here? And why is a READ_ONCE()
> sufficient, as opposed to some memory barriers?
>
> I may need to shut myself in a room with memory-barriers.txt, I'm
> pretty hazy on these things.
>
It's not about fixing any races. It is about ensuring that the compiler
does not optimise away the read if the function is ever called from
inside a loop. Not an important fix, since I'm not aware of any cases
where this has happened. However, strictly speaking, we should use
READ_ONCE() here because that variable is volatile; it can be changed
by a background action.
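As an illustration of the failure mode (the loop below is hypothetical;
nothing in the current sunrpc code spins like this), a plain read can
legally be hoisted out of a loop by the compiler:

	/* Plain read: the compiler may load xpt_flags once, cache it
	 * in a register, and never see later updates: */
	while (!(xprt->xpt_flags & BIT(XPT_DATA)))
		cpu_relax();

	/* READ_ONCE() forces a fresh load on every iteration: */
	while (!(READ_ONCE(xprt->xpt_flags) & BIT(XPT_DATA)))
		cpu_relax();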
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, Jan 03, 2019 at 11:40:21PM +0000, Trond Myklebust wrote:
> It's not about fixing any races. It is about ensuring that the compiler
> does not optimise away the read if the function is ever called from
> inside a loop. Not an important fix, since I'm not aware of any cases
> where this has happened. However, strictly speaking, we should use
> READ_ONCE() here because that variable is volatile; it can be changed
> by a background action.
I wonder if there's a race here independent of that change:

svc_xprt_enqueue() callers all do something like:

	1. change some condition
	2. call svc_xprt_enqueue() to check whether the xprt should
	   now be enqueued.

where the conditions are settings of the xpt_flags, or socket wspace, or
xpt_nr_rqsts.
In theory if we miss some concurrent change we're OK because whoever's
making that change will then also call svc_xprt_enqueue. But that's not
enough; e.g.:
	task 1					task 2
	------					------
	set XPT_DATA
						atomic_dec(xpt_nr_rqsts)

	check XPT_DATA && check xpt_nr_rqsts

						check XPT_DATA && check xpt_nr_rqsts
If the tasks only see their local changes, then neither sees both
conditions true, so the socket doesn't get enqueued. (And a request
that was ready to be processed will sit around until someone else comes
along and calls svc_xprt_enqueue() on that xprt.)
The code's more complicated than that and maybe there's some reason that
can't happen.
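As a sketch in code (a hypothetical reduction of the above, not the real
sunrpc code: flag_data and nr_rqsts stand in for the XPT_DATA bit and
xpt_nr_rqsts, and enqueue() for svc_xprt_enqueue()):

	int flag_data;				/* XPT_DATA stand-in */
	atomic_t nr_rqsts = ATOMIC_INIT(1);	/* xpt_nr_rqsts stand-in */

	/* task 1 */
	WRITE_ONCE(flag_data, 1);
	if (READ_ONCE(flag_data) && atomic_read(&nr_rqsts) == 0)
		enqueue();		/* may still see nr_rqsts == 1 */

	/* task 2 */
	atomic_dec(&nr_rqsts);
	if (READ_ONCE(flag_data) && atomic_read(&nr_rqsts) == 0)
		enqueue();		/* may still see flag_data == 0 */

With no ordering between each task's write and its subsequent reads,
both tasks can take the not-enqueue path; this is the classic
store-buffering pattern.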
--b.
On Fri, Jan 04, 2019 at 12:39:12PM -0500, [email protected] wrote:
> If the tasks only see their local changes, then neither sees both
> conditions true, so the socket doesn't get enqueued. (And a request
> that was ready to be processed will sit around until someone else comes
> along and calls svc_xprt_enqueue() on that xprt.)
So maybe we actually need

static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
{
+	mb();
	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
		return true;
	if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED))) {

Then whichever memory barrier executes second guarantees that the
following check sees the result of both the XPT_DATA and xpt_nr_rqsts
changes. I think....
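Concretely, in terms of the earlier hypothetical sketch (flag_data,
nr_rqsts, and enqueue() as before), that gives the standard
store-buffering fix:

	WRITE_ONCE(flag_data, 1);	/* or: atomic_dec(&nr_rqsts); */
	smp_mb();			/* order own write before the reads */
	if (READ_ONCE(flag_data) && atomic_read(&nr_rqsts) == 0)
		enqueue();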
--b.
On Mon, 2019-01-07 at 16:32 -0500, [email protected] wrote:
> So maybe we actually need
>
> static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> {
> +	mb();
You would at best need a 'smp_rmb()'. There is nothing to gain from
adding a write barrier here, and you don't even need a read barrier in
the non-smp case.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Mon, Jan 07, 2019 at 10:06:19PM +0000, Trond Myklebust wrote:
>
> You would at best need a 'smp_rmb()'. There is nothing to gain from
> adding a write barrier here,
That's not my understanding.
What we have is basically:
	1				2
	----				----
	WRITE to A			WRITE to B

	READ from A and B		READ from A and B
and we want to guarantee that at least one of those two reads will see
both of the writes.
A read barrier only orders reads with respect to the barrier; it doesn't
do anything about writes, so it doesn't guarantee anything here.
--b.
On Tue, 2019-01-08 at 10:01 -0500, [email protected] wrote:
In this context, 'WRITE to A' and/or 'WRITE to B' are presumably the
operations of setting the flag bits in xprt->xpt_flags, no? That's not
occurring here; it is occurring elsewhere.
The test_and_set_bit(XPT_DATA, &xprt->xpt_flags) in svc_data_ready()
performs an explicit barrier, so we shouldn't really care. The other
cases where we do set_bit(XPT_DATA) don't matter since the socket has
its own locking, and so the XPT_DATA is really just a test for whether
or not we need to enqueue the svc_xprt.
In the only place where XPT_DEFERRED is set, you have an implicit write
barrier (due to a spin_unlock) between the call to set_bit() and the
call to svc_xprt_enqueue(), so all data writes are guaranteed to be
complete before any attempt to enqueue the socket.
I can't see that you really care for the case of XPT_CONN, since the
just-created socket isn't going to be visible to other cpus until
you've added it to &pool->sp_sockets (which also has implicit write
barriers due to spin locks).
I don't think you really care for the case of XPT_CLOSE either since
svc_delete_xprt() doesn't depend on any other data writes that aren't
already protected by spinlocks.
So the conclusion would be to add smp_rmb() in
svc_xprt_has_something_to_do(). No extra write barriers are needed
AFAICS.
You may still need the READ_ONCE() in order to add a data dependency
barrier (i.e. to ensure that alpha processors don't reorder reads of
the xpt_flags with other speculative reads). That should reduce to a
standard read on all non-alpha architectures.
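As a sketch of that conclusion (abridged, with hypothetical comment
text; Bruce's patch further down implements a variant of this under the
name svc_xprt_ready()):

	static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
	{
		unsigned long xpt_flags;

		/* Pair with the (mostly implicit) write barriers in the
		 * paths that change the conditions tested below: */
		smp_rmb();
		xpt_flags = READ_ONCE(xprt->xpt_flags);

		if (xpt_flags & (BIT(XPT_CONN) | BIT(XPT_CLOSE)))
			return true;
		if (xpt_flags & (BIT(XPT_DATA) | BIT(XPT_DEFERRED))) {
			if (xprt->xpt_ops->xpo_has_wspace(xprt) &&
			    svc_xprt_slots_in_range(xprt))
				return true;
		}
		return false;
	}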
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, Jan 08, 2019 at 04:21:40PM +0000, Trond Myklebust wrote:
> In this context, 'WRITE to A' and/or 'WRITE to B' are presumably the
> operations of setting the flag bits in xprt->xpt_flags, no?
Right, or I guess sk_sock->flags, or an atomic operation on xpt_reserved
or xpt_nr_rqsts.
> That's not occurring here; it is occurring elsewhere.
Right. And I hadn't tried to verify whether there were corresponding
(possibly implicit) write barriers in those places; thanks for doing
that work:
> The test_and_set_bit(XPT_DATA, &xprt->xpt_flags) in svc_data_ready()
> performs an explicit barrier, so we shouldn't really care.
OK.
> The other cases where we do set_bit(XPT_DATA) don't matter since the
> socket has its own locking, and so the XPT_DATA is really just a test
> for whether or not we need to enqueue the svc_xprt.
I'm not following, apologies.
In any case it's set only on initialization or in recvfrom, and in the
recvfrom case I think the

	smp_mb__before_atomic();
	clear_bit(XPT_BUSY, &xprt->xpt_flags);

in svc_xprt_received() provides the necessary write barrier.
But there are some exceptions in the rdma code, in svc_rdma_wc_receive
and svc_rdma_wc_read_done.
> In the only place where XPT_DEFERRED is set, you have an implicit write
> barrier (due to a spin_unlock) between the call to set_bit() and the
> call to svc_xprt_enqueue(), so all data writes are guaranteed to be
> complete before any attempt to enqueue the socket.
OK.
> I can't see that you really care for the case of XPT_CONN, since the
> just-created socket isn't going to be visible to other cpus until
> you've added it to &pool->sp_sockets (which also has implicit write
> barriers due to spin locks).
>
> I don't think you really care for the case of XPT_CLOSE either since
> svc_delete_xprt() doesn't depend on any other data writes that aren't
> already protected by spinlocks.
OK. Yes, I'm not worried about XPT_CONN or XPT_CLOSE.
> So the conclusion would be to add smp_rmb() in
> svc_xprt_has_something_to_do(). No extra write barriers are needed
> AFAICS.
> You may still need the READ_ONCE() in order to add a data dependency
> barrier (i.e. to ensure that alpha processors don't reorder reads of
> the xpt_flags with other speculative reads). That should reduce to a
> standard read on all non-alpha architectures.
That looks unnecessary; memory-barriers.txt says "Read memory barriers
imply data dependency barriers", and later "As of v4.15 of the Linux
kernel, an smp_read_barrier_depends() was added to READ_ONCE()".
I still wonder about:

	- the RDMA cases above.
	- svc_xprt_release_slot: no write barrier after writing to
	  xprt->xpt_nr_rqsts.
	- svc_reserve: no barrier after writing to xpt_reserved
Also svc_write_space is setting SOCK_NOSPACE and then calling
svc_xprt_enqueue. I'm pretty sure the sk_write_space method has to have
a write barrier after that, though, so this is OK.
--b.
On Wed, 2019-01-09 at 11:51 -0500, [email protected] wrote:
> On Tue, Jan 08, 2019 at 04:21:40PM +0000, Trond Myklebust wrote:
> > On Tue, 2019-01-08 at 10:01 -0500, [email protected] wrote:
> > > On Mon, Jan 07, 2019 at 10:06:19PM +0000, Trond Myklebust wrote:
> > > > On Mon, 2019-01-07 at 16:32 -0500, [email protected] wrote:
> > > > > So maybe we actually need
> > > > >
> > > > > static bool (struct svc_xprt *xprt)
> > > > > {
> > > > > + mb();
> > > >
> > > > You would at best need a 'smp_rmb()'. There is nothing to gain
> > > > from
> > > > adding a write barrier here,
> > >
> > > That's not my understanding.
> > >
> > > What we have is basically:
> > >
> > > 1 2
> > > ---- ----
> > > WRITE to A WRITE to B
> > >
> > > READ from A and B READ from A and B
> > >
> > > and we want to guarantee that at least one of those two reads
> > > will
> > > see
> > > both of the writes.
> > >
> > > A read barrier only orders reads with respect to the barrier, it
> > > doesn't
> > > do anything about writes, so doesn't guarantee anything here.
> >
> > In this context 'WRITE to A' and/or 'WRITE to B' are presumably the
> > operations of setting the flag bits in xprt->xpt_flags, no?
>
> Right, or I guess sk_sock->flags, or an atomic operation on
> xpt_reserved
> or xpt_nr_rqsts.
>
> > That's not occurring here, it is occurring elsewhere.
>
> Right. And I hadn't tried to verify whether there were corresponding
> (possibly implicit) write barriers in those places, thanks for doing
> that work:
>
> > The test_and_set_bit(XPT_DATA, &xprt->xpt_flags) in
> > svc_data_ready()
> > performs an explicit barrier, so we shouldn't really care.
>
> OK.
>
> > The other cases where we do set_bit(XPT_DATA) don't matter since
> > the
> > socket has its own locking, and so the XPT_DATA is really just a
> > test
> > for whether or not we need to enqueue the svc_xprt.
>
> I'm not following, apologies.
>
> In any case it's set only on initialization or in recvfrom, and in
> the
> recvfrom case I think the
>
> smp_mb__before_atomic();
> clear_bit(XPT_BUSY, &xprt->xpt_flags);
>
> in svc_xprt_received() provides the necessary write barrier.
>
> But there are some exceptions in the rdma code, in
> svc_rdma_wc_receive
> and svc_rdma_wc_done.
>
> > In the only place where XPT_DEFERRED is set, you have an implicit
> > write
> > barrier (due to a spin_unlock) between the call to set_bit() and
> > the
> > call to svc_xprt_enqueue(), so all data writes are guaranteed to be
> > complete before any attempt to enqueue the socket.
>
> OK.
>
> > I can't see that you really care for the case of XPT_CONN, since
> > the
> > just-created socket isn't going to be visible to other cpus until
> > you've added it to &pool->sp_sockets (which also has implicit write
> > barriers due to spin locks).
> >
> > I don't think you really care for the case of XPT_CLOSE either
> > since
> > svc_delete_xprt() doesn't depend on any other data writes that
> > aren't
> > already protected by spinlocks.
>
> OK. Yes, I'm not worried about XPT_CONN or XPT_CLOSE.
>
> > So the conclusion would be to add smp_rmb() in
> > svc_xprt_has_something_to_do(). No extra write barriers are needed
> > AFAICS.
> > You may still need the READ_ONCE() in order to add a data
> > dependency
> > barrier (i.e. to ensure that alpha processors don't reorder reads
> > of
> > the xpt_flags with other speculative reads). That should reduce to
> > a
> > standard read on all non-alpha architectures.
>
> That looks unnecessary; memory-barriers.txt says "Read memory barriers
> imply data dependency barriers", and later "As of v4.15 of the Linux
> kernel, an smp_read_barrier_depends() was added to READ_ONCE()".
>
The above is stating that

	smp_rmb();
	smp_read_barrier_depends();
	if (xprt->xpt_flags & ....)

is redundant and can be replaced with just

	smp_rmb();
	if (xprt->xpt_flags & ....)

However, that's not the case for smp_rmb() followed by READ_ONCE(). That
would expand to

	smp_rmb();
	if (xprt->xpt_flags & ...) {
		smp_read_barrier_depends();
	} else
		smp_read_barrier_depends();

which is not redundant. It is ensuring (on alpha only) that the read of
xprt->xpt_flags is also not re-ordered w.r.t. other data reads that
follow.

See, for instance, kernel/events/core.c, which has several examples, or
kernel/exit.c.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, Jan 09, 2019 at 05:41:36PM +0000, Trond Myklebust wrote:
You're right, I was confused.
So, I think we need your patch plus something like this.
Chuck, maybe you could help me with the "XXX: Chuck:" parts?
(This applies on top of your patch plus another that just renames the
stupidly long svc_xprt_has_something_to_do() to svc_xprt_ready().)
--b.
commit d7356c3250d4
Author: J. Bruce Fields <[email protected]>
Date: Fri Jan 11 15:36:40 2019 -0500
    svcrpc: fix unlikely races preventing queueing of sockets

    In the rpc server, when something happens that might be a reason to
    wake up a thread to do something, what we do is

	- modify xpt_flags, sk_sock->flags, xpt_reserved, or
	  xpt_nr_rqsts to indicate the new situation
	- call svc_xprt_enqueue() to decide whether to wake up a thread.

    svc_xprt_enqueue may require multiple conditions to be true before
    queueing up a thread to handle the xprt. In the SMP case, one of the
    other CPUs may have set another required condition, and in that case,
    although both CPUs run svc_xprt_enqueue(), it's possible that neither
    call sees the writes done by the other CPU in time, and neither one
    recognizes that all the required conditions have been set. A socket
    could therefore be ignored indefinitely.

    Add memory barriers to ensure that any svc_xprt_enqueue() call will
    always see the conditions changed by other CPUs before deciding to
    ignore a socket.

    I've never seen this race reported. In the unlikely event it happens,
    another event will usually come along and the problem will fix itself.
    So I don't think this is worth backporting to stable.

    Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index d410ae512b02..2af21b84b3b6 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -357,6 +357,7 @@ static void svc_xprt_release_slot(struct svc_rqst *rqstp)
 	struct svc_xprt *xprt = rqstp->rq_xprt;
 	if (test_and_clear_bit(RQ_DATA, &rqstp->rq_flags)) {
 		atomic_dec(&xprt->xpt_nr_rqsts);
+		smp_wmb(); /* See smp_rmb() in svc_xprt_ready() */
 		svc_xprt_enqueue(xprt);
 	}
 }
@@ -365,6 +366,15 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
 {
 	unsigned long xpt_flags;
 
+	/*
+	 * If another cpu has recently updated xpt_flags,
+	 * sk_sock->flags, xpt_reserved, or xpt_nr_rqsts, we need to
+	 * know about it; otherwise it's possible that both that cpu and
+	 * this one could call svc_xprt_enqueue() without either
+	 * svc_xprt_enqueue() recognizing that the conditions below
+	 * are satisfied, and we could stall indefinitely:
+	 */
+	smp_rmb();
 	xpt_flags = READ_ONCE(xprt->xpt_flags);
 
 	if (xpt_flags & (BIT(XPT_CONN) | BIT(XPT_CLOSE)))
@@ -479,7 +489,7 @@ void svc_reserve(struct svc_rqst *rqstp, int space)
 	if (xprt && space < rqstp->rq_reserved) {
 		atomic_sub((rqstp->rq_reserved - space), &xprt->xpt_reserved);
 		rqstp->rq_reserved = space;
-
+		smp_wmb(); /* See smp_rmb() in svc_xprt_ready() */
 		svc_xprt_enqueue(xprt);
 	}
 }
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 828b149eaaef..377244992ae8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
 	list_add_tail(&ctxt->rc_list, &rdma->sc_rq_dto_q);
 	spin_unlock(&rdma->sc_rq_dto_lock);
 	set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
+	/* XXX: Chuck: do we need an smp_mb__after_atomic() here? */
 	if (!test_bit(RDMAXPRT_CONN_PENDING, &rdma->sc_flags))
 		svc_xprt_enqueue(&rdma->sc_xprt);
 	goto out;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index dc1951759a8e..e1a790487d69 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -290,6 +290,7 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc)
 	spin_unlock(&rdma->sc_rq_dto_lock);
 
 	set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
+	/* XXX: Chuck: do we need a smp_mb__after_atomic() here? */
 	svc_xprt_enqueue(&rdma->sc_xprt);
 }
 
> On Jan 11, 2019, at 4:12 PM, [email protected] wrote:
> So, I think we need your patch plus something like this.
>
> Chuck, maybe you could help me with the "XXX: Chuck:" parts?
I haven't been following. Why do you think those are necessary?
We've had set_bit and atomic_{inc,dec} in this code for ages,
and I've never noticed a problem.
Rather than adding another CPU pipeline bubble in the RDMA code,
though, could you simply move the set_bit() call site inside the
critical sections?
--
Chuck Lever
> On Jan 11, 2019, at 4:52 PM, Chuck Lever <[email protected]> wrote:
> Rather than adding another CPU pipeline bubble in the RDMA code,
> though, could you simply move the set_bit() call site inside the
> critical sections?
er, inside the preceding critical section. Just reverse the order
of the spin_unlock and the set_bit.
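For svc_rdma_wc_receive, for example, that would look like (a sketch of
the suggestion, untested):

	spin_lock(&rdma->sc_rq_dto_lock);
	list_add_tail(&ctxt->rc_list, &rdma->sc_rq_dto_q);
	/* Set the bit before the unlock so that the unlock's implicit
	 * write barrier publishes it along with the dto_q update: */
	set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
	spin_unlock(&rdma->sc_rq_dto_lock);
	if (!test_bit(RDMAXPRT_CONN_PENDING, &rdma->sc_flags))
		svc_xprt_enqueue(&rdma->sc_xprt);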
--
Chuck Lever
On Fri, Jan 11, 2019 at 04:54:01PM -0500, Chuck Lever wrote:
> > On Jan 11, 2019, at 4:52 PM, Chuck Lever <[email protected]> wrote:
> >> So, I think we need your patch plus something like this.
> >>
> >> Chuck, maybe you could help me with the "XXX: Chuck:" parts?
> >
> > I haven't been following. Why do you think those are necessary?
I'm worried something like this could happen:
	CPU 1				CPU 2
	-----				-----

	set XPT_DATA			dec xpt_nr_rqsts

	svc_xprt_enqueue		svc_xprt_enqueue
And both decide nothing should be done if neither sees the change that
the other made.
Maybe I'm still missing some reason that couldn't happen.
Even if it can happen, it's an unlikely race that will likely be fixed
when another event comes along a little later, which would explain why
we've never seen any reports.
> > We've had set_bit and atomic_{inc,dec} in this code for ages,
> > and I've never noticed a problem.
> >
> > Rather than adding another CPU pipeline bubble in the RDMA code,
> > though, could you simply move the set_bit() call site inside the
> > critical sections?
>
> er, inside the preceding critical section. Just reverse the order
> of the spin_unlock and the set_bit.
That'd do it, thanks!
--b.
>
>
> >
> >
> >> (This applies on top of your patch plus another that just renames the
> >> stupidly long svc_xprt_has_something_to_do() to svc_xprt_read().)
> >>
> >> --b.
> >>
> >> commit d7356c3250d4
> >> Author: J. Bruce Fields <[email protected]>
> >> Date: Fri Jan 11 15:36:40 2019 -0500
> >>
> >> svcrpc: fix unlikely races preventing queueing of sockets
> >>
> >> In the rpc server, When something happens that might be reason to wake
> >> up a thread to do something, what we do is
> >>
> >> - modify xpt_flags, sk_sock->flags, xpt_reserved, or
> >> xpt_nr_rqsts to indicate the new situation
> >> - call svc_xprt_enqueue() to decide whether to wake up a thread.
> >>
> >> svc_xprt_enqueue may require multiple conditions to be true before
> >> queueing up a thread to handle the xprt. In the SMP case, one of the
> >> other CPU's may have set another required condition, and in that case,
> >> although both CPUs run svc_xprt_enqueue(), it's possible that neither
> >> call sees the writes done by the other CPU in time, and neither one
> >> recognizes that all the required conditions have been set. A socket
> >> could therefore be ignored indefinitely.
> >>
> >> Add memory barries to ensure that any svc_xprt_enqueue() call will
> >> always see the conditions changed by other CPUs before deciding to
> >> ignore a socket.
> >>
> >> I've never seen this race reported. In the unlikely event it happens,
> >> another event will usually come along and the problem will fix itself.
> >> So I don't think this is worth backporting to stable.
> >>
> >> Signed-off-by: J. Bruce Fields <[email protected]>
> >>
> >> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >> index d410ae512b02..2af21b84b3b6 100644
> >> --- a/net/sunrpc/svc_xprt.c
> >> +++ b/net/sunrpc/svc_xprt.c
> >> @@ -357,6 +357,7 @@ static void svc_xprt_release_slot(struct svc_rqst *rqstp)
> >> struct svc_xprt *xprt = rqstp->rq_xprt;
> >> if (test_and_clear_bit(RQ_DATA, &rqstp->rq_flags)) {
> >> atomic_dec(&xprt->xpt_nr_rqsts);
> >> + smp_wmb(); /* See smp_rmb() in svc_xprt_ready() */
> >> svc_xprt_enqueue(xprt);
> >> }
> >> }
> >> @@ -365,6 +366,15 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
> >> {
> >> unsigned long xpt_flags;
> >>
> >> + /*
> >> + * If another cpu has recently updated xpt_flags,
> >> + * sk_sock->flags, xpt_reserved, or xpt_nr_rqsts, we need to
> >> + * know about it; otherwise it's possible that both that cpu and
> >> + * this one could call svc_xprt_enqueue() without either
> >> + * svc_xprt_enqueue() recognizing that the conditions below
> >> + * are satisfied, and we could stall indefinitely:
> >> + */
> >> + smp_rmb();
> >> READ_ONCE(xprt->xpt_flags);
> >>
> >> if (xpt_flags & (BIT(XPT_CONN) | BIT(XPT_CLOSE)))
> >> @@ -479,7 +489,7 @@ void svc_reserve(struct svc_rqst *rqstp, int space)
> >> if (xprt && space < rqstp->rq_reserved) {
> >> atomic_sub((rqstp->rq_reserved - space), &xprt->xpt_reserved);
> >> rqstp->rq_reserved = space;
> >> -
> >> + smp_wmb(); /* See smp_rmb() in svc_xprt_ready() */
> >> svc_xprt_enqueue(xprt);
> >> }
> >> }
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >> index 828b149eaaef..377244992ae8 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >> @@ -316,6 +316,7 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
> >> list_add_tail(&ctxt->rc_list, &rdma->sc_rq_dto_q);
> >> spin_unlock(&rdma->sc_rq_dto_lock);
> >> set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
> >> + /* XXX: Chuck: do we need an smp_mb__after_atomic() here? */
> >> if (!test_bit(RDMAXPRT_CONN_PENDING, &rdma->sc_flags))
> >> svc_xprt_enqueue(&rdma->sc_xprt);
> >> goto out;
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> >> index dc1951759a8e..e1a790487d69 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> >> @@ -290,6 +290,7 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc)
> >> spin_unlock(&rdma->sc_rq_dto_lock);
> >>
> >> set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
> >> + /* XXX: Chuck: do we need a smp_mb__after_atomic() here? */
> >> svc_xprt_enqueue(&rdma->sc_xprt);
> >> }
> >>
> >
> > --
> > Chuck Lever
>
> --
> Chuck Lever
>
>
> On Jan 11, 2019, at 5:10 PM, Bruce Fields <[email protected]> wrote:
>>> Rather than adding another CPU pipeline bubble in the RDMA code,
>>> though, could you simply move the set_bit() call site inside the
>>> critical sections?
>>
>> er, inside the preceding critical section. Just reverse the order
>> of the spin_unlock and the set_bit.
>
> That'd do it, thanks!
I can try that here and see if it results in a performance regression.
--
Chuck Lever
On Fri, Jan 11, 2019 at 05:27:30PM -0500, Chuck Lever wrote:
> I can try that here and see if it results in a performance regression.
Thanks, I've got a version with a typo fixed at
git://linux-nfs.org/~bfields/linux.git nfsd-next
--b.
> On Jan 11, 2019, at 7:56 PM, Bruce Fields <[email protected]> wrote:
>> I can try that here and see if it results in a performance regression.
>
> Thanks, I've got a version with a typo fixed at
>
> git://linux-nfs.org/~bfields/linux.git nfsd-next
Applied all four patches here. I don't see any performance regressions,
but my server has only a single last-level CPU cache.
--
Chuck Lever
On Mon, Jan 14, 2019 at 12:24:24PM -0500, Chuck Lever wrote:
>
> Applied all four patches here. I don't see any performance regressions,
> but my server has only a single last-level CPU cache.
Thanks!
I'm adding a Tested-by: for you if that's OK.
--b.
> On Jan 25, 2019, at 12:30 PM, Bruce Fields <[email protected]> wrote:
> Thanks!
>
> I'm adding a Tested-by: for you if that's OK.
Sorry, yes! that's fine with me.
--
Chuck Lever