2019-06-02 05:58:44

by Herbert Xu

[permalink] [raw]
Subject: rcu_read_lock lost its compiler barrier

Digging up an old email because I was not aware of this previously
but Paul pointed me to it during another discussion.

On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote:
>
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index d63bb77..6c3cece 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -297,12 +297,14 @@ void synchronize_rcu(void);
> > >
> > > static inline void __rcu_read_lock(void)
> > > {
> > > - preempt_disable();
> > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
> > > + preempt_disable();
> >
> > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right?
> > Or rather it's a barrier(), which is anyway implied by rcu_read_lock().
> >
> > So perhaps we can get rid of the IS_ENABLED() check?
>
> Actually, barrier() is not intended to be implied by rcu_read_lock().
> In a non-preemptible RCU implementation, it doesn't help anything
> to have the compiler flush its temporaries upon rcu_read_lock()
> and rcu_read_unlock().

This is seriously broken. RCU has been around for years and is
used throughout the kernel while the compiler barrier existed.

You can't then go and decide to remove the compiler barrier! To do
that you'd need to audit every single use of rcu_read_lock in the
kernel to ensure that they're not depending on the compiler barrier.

This is also contrary to the definition of almost every other
*_lock primitive in the kernel where the compiler barrier is
included.
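
For reference, preempt_disable() today carries at least a compiler
barrier in every configuration; a rough sketch from memory of
include/linux/preempt.h (exact details vary by kernel version):

	#ifdef CONFIG_PREEMPT_COUNT
	#define preempt_disable() \
	do { \
		preempt_count_inc(); \
		barrier(); \
	} while (0)
	#else
	/* No preemption accounting, but still a compiler barrier: */
	#define preempt_disable() barrier()
	#endif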

So please revert this patch.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


2019-06-02 21:16:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu <[email protected]> wrote:
>
> You can't then go and decide to remove the compiler barrier! To do
> that you'd need to audit every single use of rcu_read_lock in the
> kernel to ensure that they're not depending on the compiler barrier.

What's the possible case where it would matter when there is no preemption?

Linus

2019-06-03 01:39:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sun, Jun 02, 2019 at 01:56:07PM +0800, Herbert Xu wrote:
> Digging up an old email because I was not aware of this previously
> but Paul pointed me to it during another discussion.
>
> On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote:
> >
> > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > > index d63bb77..6c3cece 100644
> > > > --- a/include/linux/rcupdate.h
> > > > +++ b/include/linux/rcupdate.h
> > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void);
> > > >
> > > > static inline void __rcu_read_lock(void)
> > > > {
> > > > - preempt_disable();
> > > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
> > > > + preempt_disable();
> > >
> > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right?
> > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock().
> > >
> > > So perhaps we can get rid of the IS_ENABLED() check?
> >
> > Actually, barrier() is not intended to be implied by rcu_read_lock().
> > In a non-preemptible RCU implementation, it doesn't help anything
> > to have the compiler flush its temporaries upon rcu_read_lock()
> > and rcu_read_unlock().
>
> This is seriously broken. RCU has been around for years and is
> used throughout the kernel while the compiler barrier existed.

Please note that preemptible Tree RCU has lacked the compiler barrier on
all but the outermost rcu_read_unlock() for years before Boqun's patch.

So exactly where in the code that we are currently discussing
are you relying on compiler barriers in either rcu_read_lock() or
rcu_read_unlock()?

The grace-period guarantee allows the compiler ordering to be either in
the readers (SMP&&PREEMPT), in the grace-period mechanism (SMP&&!PREEMPT),
or both (SRCU).

> You can't then go and decide to remove the compiler barrier! To do
> that you'd need to audit every single use of rcu_read_lock in the
> kernel to ensure that they're not depending on the compiler barrier.
>
> This is also contrary to the definition of almost every other
> *_lock primitive in the kernel where the compiler barrier is
> included.
>
> So please revert this patch.

I do not believe that reverting that patch will help you at all.

But who knows? So please point me at the full code body that was being
debated earlier on this thread. It will no doubt take me quite a while to
dig through it, given my being on the road for the next couple of weeks,
but so it goes.

Thanx, Paul

2019-06-03 02:49:27

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote:
> On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu <[email protected]> wrote:
> >
> > You can't then go and decide to remove the compiler barrier! To do
> > that you'd need to audit every single use of rcu_read_lock in the
> > kernel to ensure that they're not depending on the compiler barrier.
>
> What's the possible case where it would matter when there is no preemption?

The case we were discussing is from net/ipv4/inet_fragment.c from
the net-next tree:

void fqdir_exit(struct fqdir *fqdir)
{
	...
	fqdir->dead = true;

	/* call_rcu is supposed to provide memory barrier semantics,
	 * separating the setting of fqdir->dead with the destruction
	 * work.  This implicit barrier is paired with inet_frag_kill().
	 */

	INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
	queue_rcu_work(system_wq, &fqdir->destroy_rwork);
}

and

void inet_frag_kill(struct inet_frag_queue *fq)
{
	...
	rcu_read_lock();
	/* The RCU read lock provides a memory barrier
	 * guaranteeing that if fqdir->dead is false then
	 * the hash table destruction will not start until
	 * after we unlock.  Paired with inet_frags_exit_net().
	 */
	if (!fqdir->dead) {
		rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
				       fqdir->f->rhash_params);
		...
	}
	...
	rcu_read_unlock();
	...
}

I simplified this to

Initial values:

a = 0
b = 0

CPU1                 CPU2
----                 ----
a = 1                rcu_read_lock
synchronize_rcu      if (a == 0)
b = 2                    b = 1
                     rcu_read_unlock

On exit we want this to be true:
b == 2

Now what Paul was telling me is that unless every memory operation
is done with READ_ONCE/WRITE_ONCE then his memory model shows that
the exit constraint won't hold. IOW, we need

CPU1                 CPU2
----                 ----
WRITE_ONCE(a, 1)     rcu_read_lock
synchronize_rcu      if (READ_ONCE(a) == 0)
WRITE_ONCE(b, 2)         WRITE_ONCE(b, 1)
                     rcu_read_unlock

Now I think this is bullshit because if we really needed these compiler
barriers then we surely would need real memory barriers to go with
them.

In fact, the sole purpose of the RCU mechanism is to provide those
memory barriers. Quoting from
Documentation/RCU/Design/Requirements/Requirements.html:

<li> Each CPU that has an RCU read-side critical section that
begins before <tt>synchronize_rcu()</tt> starts is
guaranteed to execute a full memory barrier between the time
that the RCU read-side critical section ends and the time that
<tt>synchronize_rcu()</tt> returns.
Without this guarantee, a pre-existing RCU read-side critical section
might hold a reference to the newly removed <tt>struct foo</tt>
after the <tt>kfree()</tt> on line&nbsp;14 of
<tt>remove_gp_synchronous()</tt>.
<li> Each CPU that has an RCU read-side critical section that ends
after <tt>synchronize_rcu()</tt> returns is guaranteed
to execute a full memory barrier between the time that
<tt>synchronize_rcu()</tt> begins and the time that the RCU
read-side critical section begins.
Without this guarantee, a later RCU read-side critical section
running after the <tt>kfree()</tt> on line&nbsp;14 of
<tt>remove_gp_synchronous()</tt> might
later run <tt>do_something_gp()</tt> and find the
newly deleted <tt>struct foo</tt>.

My review of the RCU code shows that these memory barriers are
indeed present (at least when we're not in tiny mode where all
this discussion would be moot anyway). For example, in call_rcu
we eventually get down to rcu_segcblist_enqueue which has an smp_mb.
On the reader side (correct me if I'm wrong Paul) the memory
barrier is implicitly coming from the scheduler.

My point is that within our kernel whenever we have a CPU memory
barrier we always have a compiler barrier too. Therefore my code
example above does not need any extra compiler barriers such as
the ones provided by READ_ONCE/WRITE_ONCE.
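
To illustrate, a sketch from memory of the x86 case (other arches use
different instructions, but the asm "memory" clobber is the common
thread):

	/* The "memory" clobber is what makes smp_mb() act as a
	 * barrier() in addition to being a CPU memory barrier.
	 */
	#define smp_mb() \
		asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")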

I think Paul was perhaps thinking that I'm expecting
rcu_read_lock/rcu_read_unlock themselves to provide the memory
or compiler barriers. That would indeed be wrong but this is
not what I need. All I need is the RCU semantics as documented
for there to be memory and compiler barriers around the whole
grace period.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-03 03:05:11

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote:
>
> Please note that preemptible Tree RCU has lacked the compiler barrier on
> all but the outermost rcu_read_unlock() for years before Boqun's patch.

Actually this is not true. Boqun's patch (commit bb73c52bad36) does
not add a barrier() to __rcu_read_lock. In fact I dug into the git
history and this compiler barrier() has existed in preemptible tree
RCU since the very start in 2009:

: commit f41d911f8c49a5d65c86504c19e8204bb605c4fd
: Author: Paul E. McKenney <[email protected]>
: Date: Sat Aug 22 13:56:52 2009 -0700
:
: rcu: Merge preemptable-RCU functionality into hierarchical RCU
:
: +/*
: + * Tree-preemptable RCU implementation for rcu_read_lock().
: + * Just increment ->rcu_read_lock_nesting, shared state will be updated
: + * if we block.
: + */
: +void __rcu_read_lock(void)
: +{
: +	ACCESS_ONCE(current->rcu_read_lock_nesting)++;
: +	barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c */
: +}
: +EXPORT_SYMBOL_GPL(__rcu_read_lock);

However, you are correct that in the non-preempt tree RCU case,
the compiler barrier in __rcu_read_lock was not always present.
In fact it was added by:

: commit 386afc91144b36b42117b0092893f15bc8798a80
: Author: Linus Torvalds <[email protected]>
: Date: Tue Apr 9 10:48:33 2013 -0700
:
: spinlocks and preemption points need to be at least compiler barriers

I suspect this is what prompted you to remove it in 2015.

> I do not believe that reverting that patch will help you at all.
>
> But who knows? So please point me at the full code body that was being
> debated earlier on this thread. It will no doubt take me quite a while to
> dig through it, given my being on the road for the next couple of weeks,
> but so it goes.

Please refer to my response to Linus for the code in question.

In any case, I am now even more certain that compiler barriers are
not needed in the code in question. The reasoning is quite simple.
If you need those compiler barriers then you surely need real memory
barriers.

Vice versa, if real memory barriers are already present thanks to
RCU, then you don't need those compiler barriers.

In fact this calls into question the use of READ_ONCE/WRITE_ONCE in
RCU primitives such as rcu_dereference and rcu_assign_pointer. IIRC
when RCU was first added to the Linux kernel we did not have compiler
barriers in rcu_dereference and rcu_assign_pointer. They were added
later on.

As compiler barriers per se are useless, these are surely meant to
be coupled with the memory barriers provided by RCU grace periods
and synchronize_rcu. But then those real memory barriers would have
compiler barriers too. So why do we need the compiler barriers in
rcu_dereference and rcu_assign_pointer?

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-03 03:50:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote:
> > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu <[email protected]> wrote:
> > >
> > > You can't then go and decide to remove the compiler barrier! To do
> > > that you'd need to audit every single use of rcu_read_lock in the
> > > kernel to ensure that they're not depending on the compiler barrier.
> >
> > What's the possible case where it would matter when there is no preemption?
>
> The case we were discussing is from net/ipv4/inet_fragment.c from
> the net-next tree:
>
> void fqdir_exit(struct fqdir *fqdir)
> {
> 	...
> 	fqdir->dead = true;
>
> 	/* call_rcu is supposed to provide memory barrier semantics,
> 	 * separating the setting of fqdir->dead with the destruction
> 	 * work.  This implicit barrier is paired with inet_frag_kill().
> 	 */
>
> 	INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
> 	queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> }
>
> and
>
> void inet_frag_kill(struct inet_frag_queue *fq)
> {
> 	...
> 	rcu_read_lock();
> 	/* The RCU read lock provides a memory barrier
> 	 * guaranteeing that if fqdir->dead is false then
> 	 * the hash table destruction will not start until
> 	 * after we unlock.  Paired with inet_frags_exit_net().
> 	 */
> 	if (!fqdir->dead) {
> 		rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
> 				       fqdir->f->rhash_params);
> 		...
> 	}
> 	...
> 	rcu_read_unlock();
> 	...
> }
>
> I simplified this to
>
> Initial values:
>
> a = 0
> b = 0
>
> CPU1                 CPU2
> ----                 ----
> a = 1                rcu_read_lock
> synchronize_rcu      if (a == 0)
> b = 2                    b = 1
>                      rcu_read_unlock
>
> On exit we want this to be true:
> b == 2
>
> Now what Paul was telling me is that unless every memory operation
> is done with READ_ONCE/WRITE_ONCE then his memory model shows that
> the exit constraint won't hold.

But please note that the plain-variable portion of the memory model is
very new and likely still has a bug or two. In fact, see below.

> IOW, we need
>
> CPU1                 CPU2
> ----                 ----
> WRITE_ONCE(a, 1)     rcu_read_lock
> synchronize_rcu      if (READ_ONCE(a) == 0)
> WRITE_ONCE(b, 2)         WRITE_ONCE(b, 1)
>                      rcu_read_unlock
>
> Now I think this is bullshit because if we really needed these compiler
> barriers then we surely would need real memory barriers to go with
> them.

On the one hand, you have no code before your rcu_read_lock() and also
no code after your rcu_read_unlock(). So in this particular example,
adding compiler barriers to these guys won't help you.

On the other hand, on CPU 1's write to "b", I agree with you and disagree
with the model, though perhaps my partners in LKMM crime will show me the
error of my ways on this point. On CPU 2's write to "b", I can see the
memory model's point, but getting there requires some gymnastics on the
part of both the compiler and the CPU. The WRITE_ONCE() and READ_ONCE()
for "a" is the normal requirement for variables that are concurrently
loaded and stored.

Please note that garden-variety uses of RCU have similar requirements,
namely the rcu_assign_pointer() on the one side and the rcu_dereference()
on the other. Your use case allows rcu_assign_pointer() to be weakened
to WRITE_ONCE() and rcu_dereference() to be weakened to READ_ONCE()
(not that this last is all that much of a weakening these days).

> In fact, the sole purpose of the RCU mechanism is to provide those
> memory barriers. Quoting from
> Documentation/RCU/Design/Requirements/Requirements.html:
>
> <li> Each CPU that has an RCU read-side critical section that
> begins before <tt>synchronize_rcu()</tt> starts is
> guaranteed to execute a full memory barrier between the time
> that the RCU read-side critical section ends and the time that
> <tt>synchronize_rcu()</tt> returns.
> Without this guarantee, a pre-existing RCU read-side critical section
> might hold a reference to the newly removed <tt>struct foo</tt>
> after the <tt>kfree()</tt> on line&nbsp;14 of
> <tt>remove_gp_synchronous()</tt>.
> <li> Each CPU that has an RCU read-side critical section that ends
> after <tt>synchronize_rcu()</tt> returns is guaranteed
> to execute a full memory barrier between the time that
> <tt>synchronize_rcu()</tt> begins and the time that the RCU
> read-side critical section begins.
> Without this guarantee, a later RCU read-side critical section
> running after the <tt>kfree()</tt> on line&nbsp;14 of
> <tt>remove_gp_synchronous()</tt> might
> later run <tt>do_something_gp()</tt> and find the
> newly deleted <tt>struct foo</tt>.

Ahem.

1. These guarantees are of full memory barriers, -not- compiler
barriers.

2. These rules don't say exactly where these full memory barriers
go. SRCU is at one extreme, placing those full barriers in
srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
at the other, placing these barriers entirely within the callback
queueing/invocation, grace-period computation, and the scheduler.
Preemptible Tree RCU is in the middle, with rcu_read_unlock()
sometimes including a full memory barrier, but other times with
the full memory barrier being confined as it is with !PREEMPT
Tree RCU.

So let's place those memory barriers in your example, interleaving:

CPU2: rcu_read_lock();
CPU1: WRITE_ONCE(a, 1) | CPU2: if (READ_ONCE(a) == 0)
CPU2: WRITE_ONCE(b, 1)
CPU2: rcu_read_unlock
/* Could put a full memory barrier here, but it wouldn't help. */
CPU1: synchronize_rcu
CPU1: b = 2;

Here the accesses to "a" are concurrent, and there are legal placements
for the required full memory barrier that don't change that. In fact,
I cannot see how any memory-barrier placement can help here. So the
WRITE_ONCE() and READ_ONCE() should be used for "a". And again, in
garden-variety RCU use cases, these would be rcu_assign_pointer() and
rcu_dereference(), respectively. So again, I advise using READ_ONCE()
and WRITE_ONCE() for the accesses to "a", for documentation purposes,
even ignoring the future proofing.

Now let's do the (admittedly quite crazy, at least here in 2019)
profiling-directed transformation of the "b = 1" on a weakly ordered
system:

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu
CPU1: b = 2;

CPU2: rcu_read_lock();
CPU2: if (READ_ONCE(a) == 0)
CPU2: if (b != 1)
CPU2: b = 1;
CPU2: rcu_read_unlock

Interleaving and inserting full memory barriers as per the rules above:

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu
/* Could put a full memory barrier here, but it wouldn't help. */

CPU2: rcu_read_lock();
CPU1: b = 2;
CPU2: if (READ_ONCE(a) == 0)
CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
CPU2: b = 1;
CPU2: rcu_read_unlock

In fact, CPU2's load from b might be moved up to race with CPU1's store,
which (I believe) is why the model complains in this case.

I still cannot imagine why CPU1's "b = 2" needs to be WRITE_ONCE(),
perhaps due to a failure of imagination on my part.

> My review of the RCU code shows that these memory barriers are
> indeed present (at least when we're not in tiny mode where all
> this discussion would be moot anyway). For example, in call_rcu
> we eventually get down to rcu_segcblist_enqueue which has an smp_mb.
> On the reader side (correct me if I'm wrong Paul) the memory
> barrier is implicitly coming from the scheduler.
>
> My point is that within our kernel whenever we have a CPU memory
> barrier we always have a compiler barrier too. Therefore my code
> example above does not need any extra compiler barriers such as
> the ones provided by READ_ONCE/WRITE_ONCE.

For the garden-variety RCU use cases, this is true. Instead, they
require things that are as strong or stronger than READ_ONCE/WRITE_ONCE.

For example:

CPU 0:
	p = kzalloc(sizeof(*p), GFP_KERNEL);
	BUG_ON(!p);
	p->a = 42;
	r1 = p->b;
	rcu_assign_pointer(gp, p);

CPU 1:
	rcu_read_lock();
	q = rcu_dereference(gp);
	r2 = q->a;
	q->b = 202;
	rcu_read_unlock();

In this case, the plain C-language loads and stores cannot possibly
execute concurrently, and the model agrees with this. Again, we didn't
use WRITE_ONCE() and READ_ONCE() -- we instead used rcu_assign_pointer()
and rcu_dereference().

Similar examples involving (say) list_del_rcu() and synchronize_rcu()
are also safe for plain C-language loads and stores. Plain C-language
accesses by readers to the item being deleted cannot race with plain
C-language accesses after the synchronize_rcu() returns. But here we
are using list_del_rcu() in the update-side code along with the
synchronize_rcu() and the readers are again using rcu_dereference().
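
For concreteness, a sketch of that pattern (the lock, list head, and
helper names here are made up):

	/* Updater: */
	spin_lock(&mylock);
	list_del_rcu(&p->list);
	spin_unlock(&mylock);
	synchronize_rcu();	/* Wait for pre-existing readers. */
	kfree(p);		/* Plain accesses to *p can no longer race. */

	/* Readers: */
	rcu_read_lock();
	list_for_each_entry_rcu(q, &myhead, list)
		do_something_with(q);	/* Plain C-language accesses. */
	rcu_read_unlock();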

> I think Paul was perhaps thinking that I'm expecting
> rcu_read_lock/rcu_read_unlock themselves to provide the memory
> or compiler barriers. That would indeed be wrong but this is
> not what I need. All I need is the RCU semantics as documented
> for there to be memory and compiler barriers around the whole
> grace period.

Whew!!! That was -exactly- what I was thinking, and I am glad that
you are not advocating rcu_read_lock() and rcu_read_unlock() supplying
those barriers. ;-)

And again, it is early days for plain accesses for the Linux-kernel
memory model. I am confident that READ_ONCE() and WRITE_ONCE() for
"a" is a very good thing to do, but less confident for "b", most
especially for CPU1's store to "b".

Either way, your example is an excellent test case for the plain
C-language access capability of the Linux-kernel memory model, so thank
you for that!

Thanx, Paul

2019-06-03 04:02:58

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
>
> CPU2: if (b != 1)
> CPU2: b = 1;

Stop right there. The kernel is full of code that assumes that
assignment to an int/long is atomic. If your compiler breaks this
assumption then we can kiss the kernel good-bye.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-03 04:19:52

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> >
> > CPU2: if (b != 1)
> > CPU2: b = 1;
>
> Stop right there. The kernel is full of code that assumes that
> assignment to an int/long is atomic. If your compiler breaks this
> assumption then we can kiss the kernel good-bye.

The slippery slope apparently started here:

: commit ea435467500612636f8f4fb639ff6e76b2496e4b
: Author: Matthew Wilcox <[email protected]>
: Date: Tue Jan 6 14:40:39 2009 -0800
:
: atomic_t: unify all arch definitions
:
: diff --git a/arch/x86/include/asm/atomic_32.h b/arch/x86/include/asm/atomic_32.h
: index ad5b9f6ecddf..85b46fba4229 100644
: --- a/arch/x86/include/asm/atomic_32.h
: +++ b/arch/x86/include/asm/atomic_32.h
: ...
: @@ -10,15 +11,6 @@
: * resource counting etc..
: */
:
: -/*
: - * Make sure gcc doesn't try to be clever and move things around
: - * on us. We need to use _exactly_ the address the user gave us,
: - * not some alias that contains the same information.
: - */
: -typedef struct {
: -	int counter;
: -} atomic_t;
:
: diff --git a/include/linux/types.h b/include/linux/types.h
: index 121f349cb7ec..3b864f2d9560 100644
: --- a/include/linux/types.h
: +++ b/include/linux/types.h
: @@ -195,6 +195,16 @@ typedef u32 phys_addr_t;
:
: typedef phys_addr_t resource_size_t;
:
: +typedef struct {
: +	volatile int counter;
: +} atomic_t;
: +

Before evolving into the READ_ONCE/WRITE_ONCE that we have now.

Linus, are we now really supporting a compiler where an assignment
to (or a read from) an int/long/pointer can be non-atomic without the
volatile marker? Because if that's the case then we have a lot of
code to audit.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-03 05:29:58

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
>
> 1. These guarantees are of full memory barriers, -not- compiler
> barriers.

What I'm saying is that wherever they are, they must come with
compiler barriers. I'm not aware of any synchronisation mechanism
in the kernel that gives a memory barrier without a compiler barrier.

> 2. These rules don't say exactly where these full memory barriers
> go. SRCU is at one extreme, placing those full barriers in
> srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
> at the other, placing these barriers entirely within the callback
> queueing/invocation, grace-period computation, and the scheduler.
> Preemptible Tree RCU is in the middle, with rcu_read_unlock()
> sometimes including a full memory barrier, but other times with
> the full memory barrier being confined as it is with !PREEMPT
> Tree RCU.

The rules do say that the (full) memory barrier must precede any
RCU read-side critical section that occurs after the synchronize_rcu,
and must follow the end of any RCU read-side critical section that
occurs before the synchronize_rcu.

All I'm arguing is that wherever that full mb is, as long as it
also carries with it a barrier() (which it must do if it's done
using an existing kernel mb/locking primitive), then we're fine.

> Interleaving and inserting full memory barriers as per the rules above:
>
> CPU1: WRITE_ONCE(a, 1)
> CPU1: synchronize_rcu
> /* Could put a full memory barrier here, but it wouldn't help. */

CPU1: smp_mb();
CPU2: smp_mb();

Let's put them in because I think they are critical. smp_mb() also
carries with it a barrier().

> CPU2: rcu_read_lock();
> CPU1: b = 2;
> CPU2: if (READ_ONCE(a) == 0)
> CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> CPU2: b = 1;
> CPU2: rcu_read_unlock
>
> In fact, CPU2's load from b might be moved up to race with CPU1's store,
> which (I believe) is why the model complains in this case.

Let's put aside my doubt over how we're even allowing a compiler
to turn

b = 1

into

if (b != 1)
b = 1

Since you seem to be assuming that (a == 0) is true in this case
(as the assignment b = 1 is carried out), then because of the
presence of the full memory barrier, the RCU read-side section
must have started prior to the synchronize_rcu. This means that
synchronize_rcu is not allowed to return until at least the end
of the grace period, or at least until the end of rcu_read_unlock.

So it actually should be:

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu called
/* Could put a full memory barrier here, but it wouldn't help. */

CPU1: smp_mb();
CPU2: smp_mb();

CPU2: grace period starts
...time passes...
CPU2: rcu_read_lock();
CPU2: if (READ_ONCE(a) == 0)
CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
CPU2: b = 1;
CPU2: rcu_read_unlock
...time passes...
CPU2: grace period ends

/* This full memory barrier is also guaranteed by RCU. */
CPU2: smp_mb();

CPU1: synchronize_rcu returns
CPU1: b = 2;

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-03 07:17:06

by Boqun Feng

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> >
> > 1. These guarantees are of full memory barriers, -not- compiler
> > barriers.
>
> What I'm saying is that wherever they are, they must come with
> compiler barriers. I'm not aware of any synchronisation mechanism
> in the kernel that gives a memory barrier without a compiler barrier.
>
> > 2. These rules don't say exactly where these full memory barriers
> > go. SRCU is at one extreme, placing those full barriers in
> > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
> > at the other, placing these barriers entirely within the callback
> > queueing/invocation, grace-period computation, and the scheduler.
> > Preemptible Tree RCU is in the middle, with rcu_read_unlock()
> > sometimes including a full memory barrier, but other times with
> > the full memory barrier being confined as it is with !PREEMPT
> > Tree RCU.
>
> The rules do say that the (full) memory barrier must precede any
> RCU read-side that occur after the synchronize_rcu and after the
> end of any RCU read-side that occur before the synchronize_rcu.
>
> All I'm arguing is that wherever that full mb is, as long as it
> also carries with it a barrier() (which it must do if it's done
> using an existing kernel mb/locking primitive), then we're fine.
>
> > Interleaving and inserting full memory barriers as per the rules above:
> >
> > CPU1: WRITE_ONCE(a, 1)
> > CPU1: synchronize_rcu
> > /* Could put a full memory barrier here, but it wouldn't help. */
>
> CPU1: smp_mb();
> CPU2: smp_mb();
>
> Let's put them in because I think they are critical. smp_mb() also
> carries with it a barrier().
>
> > CPU2: rcu_read_lock();
> > CPU1: b = 2;
> > CPU2: if (READ_ONCE(a) == 0)
> > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> > CPU2: b = 1;
> > CPU2: rcu_read_unlock
> >
> > In fact, CPU2's load from b might be moved up to race with CPU1's store,
> > which (I believe) is why the model complains in this case.
>
> Let's put aside my doubt over how we're even allowing a compiler
> to turn
>
> b = 1
>
> into
>
> if (b != 1)
> b = 1
>
> Since you seem to be assuming that (a == 0) is true in this case

I think Paul's example assumes (a == 0) is false, and maybe
speculative writes (by compilers) need to be taken into consideration?
Please consider the following case (I added a few smp_mb()s); it may
be a little bit crazy, you have been warned ;-)

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu called

CPU1: smp_mb(); /* let assume there is one here */

CPU2: rcu_read_lock();
CPU2: smp_mb(); /* let assume there is one here */

/* "if (b != 1) b = 1" reordered */
CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */
CPU2: if (r0 != 1) /* true */
CPU2: b = 1; /* b == 1 now, a speculative write by the compiler */

CPU1: b = 2; /* b == 2 */

CPU2: if (READ_ONCE(a) == 0) /* false */
CPU2: ...
CPU2: else /* undo the speculative write */
CPU2: b = r0; /* b == 0 */

CPU2: smp_mb();
CPU2: rcu_read_unlock();

I know it is too crazy to expect a compiler to behave like this, but
this might be the reason why the model complains about this case.

Paul, did I get this right? Or did you mean something else?

Regards,
Boqun



> (as the assignment b = 1 is carried out), then because of the
> presence of the full memory barrier, the RCU read-side section
> must have started prior to the synchronize_rcu. This means that
> synchronize_rcu is not allowed to return until at least the end
> of the grace period, or at least until the end of rcu_read_unlock.
>
> So it actually should be:
>
> CPU1: WRITE_ONCE(a, 1)
> CPU1: synchronize_rcu called
> /* Could put a full memory barrier here, but it wouldn't help. */
>
> CPU1: smp_mb();
> CPU2: smp_mb();
>
> CPU2: grace period starts
> ...time passes...
> CPU2: rcu_read_lock();
> CPU2: if (READ_ONCE(a) == 0)
> CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> CPU2: b = 1;
> CPU2: rcu_read_unlock
> ...time passes...
> CPU2: grace period ends
>
> /* This full memory barrier is also guaranteed by RCU. */
> CPU2: smp_mb();
>
> CPU1 synchronize_rcu returns
> CPU1: b = 2;
>
> Cheers,
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



2019-06-03 07:28:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> >
> > CPU2: if (b != 1)
> > CPU2: b = 1;
>
> Stop right there. The kernel is full of code that assumes that
> assignment to an int/long is atomic. If your compiler breaks this
> assumption then we can kiss the kernel good-bye.

Here you go:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028

TL;DR: On x86, if you are doing a plain store of a 32-bit constant
that has bits set only in the lower few bits of each 16-bit half of
that constant, the compiler is plenty happy to use a pair of 16-bit
store-immediate instructions to carry out that store. This is also
known as "store tearing".

The two bugs were filed (and after some back and forth, fixed) because
someone forgot to exclude C11 atomics and volatile accesses from this
store tearing.
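
For example, a sketch based on those bug reports (the exact
instruction selection of course depends on the compiler version):

	p->val = 0x00010002;	/* bits only in the low part of each half */

	/* ...may be emitted as two 16-bit store-immediate instructions:
	 *
	 *	movw $0x2, (%rdi)
	 *	movw $0x1, 2(%rdi)
	 *
	 * so a concurrent reader can observe a half-written value.
	 */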

Thanx, Paul

2019-06-03 08:44:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 12:23:39AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote:
> > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> > >
> > > CPU2: if (b != 1)
> > > CPU2: b = 1;
> >
> > Stop right there. The kernel is full of code that assumes that
> > assignment to an int/long is atomic. If your compiler breaks this
> > assumption then we can kiss the kernel good-bye.
>
> Here you go:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028
>
> TL;DR: On x86, if you are doing a plain store of a 32-bit constant
> that has bits set only in the lower few bits of each 16-bit half of
> that constant, the compiler is plenty happy to use a pair of 16-bit
> store-immediate instructions to carry out that store. This is also
> known as "store tearing".
>
> The two bugs were filed (and after some back and forth, fixed) because
> someone forgot to exclude C11 atomics and volatile accesses from this
> store tearing.

I should hasten to add that I have not seen load tearing, nor have I seen
store tearing when storing a value unknown to the compiler. However,
plain C-language loads and stores can be invented, fused, and a few other
"interesting" optimizations can be applied.

On kissing the kernel goodbye, a reasonable strategy might be to
identify the transformations that are actually occurring (like the
stores of certain constants called out above) and fix those. We do
occasionally use READ_ONCE() to prevent load-fusing optimizations that
would otherwise cause the compiler to turn while-loops into if-statements
guarding infinite loops. There is also the possibility of having the
compiler guys give us more command-line arguments.
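
For example (a sketch; "need_to_stop" is a made-up flag):

	/* A plain load may be hoisted out of the loop, turning the
	 * while-loop into an if-statement guarding an infinite loop:
	 */
	while (!need_to_stop)
		cpu_relax();

	/* READ_ONCE() forces a fresh load on every iteration: */
	while (!READ_ONCE(need_to_stop))
		cpu_relax();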

Thanx, Paul

2019-06-03 09:29:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 11:03:24AM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote:
> >
> > Please note that preemptible Tree RCU has lacked the compiler barrier on
> > all but the outermost rcu_read_unlock() for years before Boqun's patch.
>
> Actually this is not true. Boqun's patch (commit bb73c52bad36) does
> not add a barrier() to __rcu_read_lock. In fact I dug into the git
> history and this compiler barrier() has existed in preemptible tree
> RCU since the very start in 2009:

I said rcu_read_unlock() and you said __rcu_read_lock().

> : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd
> : Author: Paul E. McKenney <[email protected]>
> : Date: Sat Aug 22 13:56:52 2009 -0700
> :
> : rcu: Merge preemptable-RCU functionality into hierarchical RCU
> :
> : +/*
> : + * Tree-preemptable RCU implementation for rcu_read_lock().
> : + * Just increment ->rcu_read_lock_nesting, shared state will be updated
> : + * if we block.
> : + */
> : +void __rcu_read_lock(void)
> : +{
> : +	ACCESS_ONCE(current->rcu_read_lock_nesting)++;
> : +	barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c */
> : +}
> : +EXPORT_SYMBOL_GPL(__rcu_read_lock);

Thank you for finding this! This particular version does have an
unconditional barrier() in __rcu_read_unlock(), for whatever that
is worth:

+void __rcu_read_unlock(void)
+{
+	struct task_struct *t = current;
+
+	barrier(); /* needed if we ever invoke rcu_read_unlock in rcutree.c */
+	if (--ACCESS_ONCE(t->rcu_read_lock_nesting) == 0 &&
+	    unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
+		rcu_read_unlock_special(t);
+}

I would not have seen the point of a compiler barrier in the non-outermost
__rcu_read_unlock(), since the completion of an inner __rcu_read_unlock()
does not permit the grace period to complete.

> However, you are correct that in the non-preempt tree RCU case,
> the compiler barrier in __rcu_read_lock was not always present.
> In fact it was added by:
>
> : commit 386afc91144b36b42117b0092893f15bc8798a80
> : Author: Linus Torvalds <[email protected]>
> : Date: Tue Apr 9 10:48:33 2013 -0700
> :
> : spinlocks and preemption points need to be at least compiler barriers
>
> I suspect this is what prompted you to remove it in 2015.

If I remember correctly, it was pointed out to me that in !PREEMPT kernels,
the compiler barrier in the preempt_disable() invoked in rcu_read_lock()
(and similar on the rcu_read_unlock() side) wasn't helping anything.

> > I do not believe that reverting that patch will help you at all.
> >
> > But who knows? So please point me at the full code body that was being
> > debated earlier on this thread. It will no doubt take me quite a while to
> > dig through it, given my being on the road for the next couple of weeks,
> > but so it goes.
>
> Please refer to my response to Linus for the code in question.
>
> In any case, I am now even more certain that compiler barriers are
> not needed in the code in question. The reasoning is quite simple.
> If you need those compiler barriers then you surely need real memory
> barriers.

OK, we are in agreement on that point, then!

> Vice versa, if real memory barriers are already present thanks to
> RCU, then you don't need those compiler barriers.

For users of RCU, this seems reasonable.

On the other hand, the compiler barriers in PREEMPT Tree RCU's outermost
__rcu_read_lock() and __rcu_read_unlock() invocations really are needed
by RCU internals. This is because RCU makes use of interrupt handlers that
access per-task and per-CPU variables, and these need to be able to
sense the edges of the nested set of RCU read-side critical sections.
It is OK for these interrupt handlers to think that the critical section
is larger than it really is, but fatal for them to think that the critical
sections are smaller than they really are.

> In fact this calls into question the use of READ_ONCE/WRITE_ONCE in
> RCU primitives such as rcu_dereference and rcu_assign_pointer.

No, these are -not- called into question, or if they are, the question
gets quickly answered in a way that supports current Linux-kernel code.
As mentioned in earlier emails, the traditional uses of RCU that involve
rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() all work
just fine.

In fact, from what I can see, the issue stems from having developed
intuitions from working with the traditional rcu_dereference(),
rcu_assign_pointer(), and synchronize_rcu() linked-structure use cases,
and then attempting to apply these intuition to use cases that have
neither rcu_dereference() nor rcu_assign_pointer(). Don't get me wrong,
it is only natural to try to extend your intuitions to something that
admittedly looks pretty similar to the traditional use cases. But this
is one of those cases where "pretty similar" might not be good enough.

> IIRC
> when RCU was first added to the Linux kernel we did not have compiler
> barriers in rcu_dereference and rcu_assign_pointer. They were added
> later on.

From what I can see, rcu_dereference() still does not have a compiler
barrier. Please note that the pair of barrier() calls in READ_ONCE()
only apply when READ_ONCE()ing something larger than the machine can load.
And if your platform cannot load and store pointers with a single access,
the Linux kernel isn't going to do very well regardless. Ditto for
WRITE_ONCE().
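
A very rough sketch of that era's READ_ONCE(), eliding the sparse
annotations and the u8/u16/u32/u64 dispatch helper:

	#define READ_ONCE(x)						\
	({								\
		union { typeof(x) __val; char __c[1]; } __u;		\
		switch (sizeof(x)) {					\
		case 1: case 2: case 4: case 8:				\
			/* One volatile access, no barrier() pair. */	\
			__u.__val = *(volatile typeof(x) *)&(x);	\
			break;						\
		default:						\
			/* Too big for one load: here is the pair. */	\
			barrier();					\
			__builtin_memcpy(__u.__c, &(x), sizeof(x));	\
			barrier();					\
		}							\
		__u.__val;						\
	})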

> As compiler barriers per se are useless, these are surely meant to
> be coupled with the memory barriers provided by RCU grace periods
> and synchronize_rcu. But then those real memory barriers would have
> compiler barriers too. So why do we need the compiler barriers in
> rcu_dereference and rcu_assign_pointer?

In rcu_dereference(), RCU does not need them. They are instead
inherited from READ_ONCE() for when it is used on a data structure too
big for any single load instruction available on the system in question.
These barrier() calls are in a case that rcu_dereference() had better
not be using -- after all, using them would mean that the hardware didn't
have a load instruction big enough to handle a pointer.

In rcu_assign_pointer(), RCU just needs this to act like a release
store, that is, the store itself must not be reordered with any earlier
memory accesses. The Linux kernel's smp_store_release() currently
over-approximates this using a barrier() or equivalent inline-assembly
directive, which enforces compiler ordering for not only the release
store, but also for all memory accesses following the release store.
Obviously, barrier() is not enough for weakly ordered systems, which
must also emit an appropriate memory-barrier instruction (or a special
store-release instruction on architectures like ARMv8 that provide
such a thing).
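
Ignoring the type checking and the special handling of constant NULL,
roughly:

	#define rcu_assign_pointer(p, v) smp_store_release(&(p), (v))

	/* ...where on a strongly ordered arch such as x86 the release
	 * store needs only compiler ordering (sketch):
	 */
	#define smp_store_release(p, v)		\
	do {					\
		barrier();			\
		WRITE_ONCE(*(p), (v));		\
	} while (0)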

The compiler barriers in __rcu_read_lock() and __rcu_read_unlock() are
there so that preemptible Tree RCU can use its various tricks to make
readers perform and scale well. Read-side state is confined to the CPU
and/or task in the common case, thus avoiding heavy synchronization
overhead in the common case (or, in the case of !PREEMPT RCU, thus
avoiding -any- synchronization overhead in the common case). For example,
the compiler barriers ensure that RCU's scheduler-clock code and softirq
code can trust per-CPU/task state indicating whether or not there is an
RCU read-side critical section in effect.

Does that help? Or am I missing your point?

Thanx, Paul

2019-06-03 09:37:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> >
> > 1. These guarantees are of full memory barriers, -not- compiler
> > barriers.
>
> What I'm saying is that wherever they are, they must come with
> compiler barriers. I'm not aware of any synchronisation mechanism
> in the kernel that gives a memory barrier without a compiler barrier.

Yes, if a given synchronization mechanism requires that memory references
need to be ordered, both the compiler and the CPU must maintain that
ordering.

> > 2. These rules don't say exactly where these full memory barriers
> > go. SRCU is at one extreme, placing those full barriers in
> > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
> > at the other, placing these barriers entirely within the callback
> > queueing/invocation, grace-period computation, and the scheduler.
> > Preemptible Tree RCU is in the middle, with rcu_read_unlock()
> > sometimes including a full memory barrier, but other times with
> > the full memory barrier being confined as it is with !PREEMPT
> > Tree RCU.
>
> The rules do say that the (full) memory barrier must precede any
> RCU read-side critical section that occurs after the synchronize_rcu,
> and must follow the end of any RCU read-side critical section that
> occurs before the synchronize_rcu.
>
> All I'm arguing is that wherever that full mb is, as long as it
> also carries with it a barrier() (which it must do if it's done
> using an existing kernel mb/locking primitive), then we're fine.

Fair enough, and smp_mb() does provide what is needed.

> > Interleaving and inserting full memory barriers as per the rules above:
> >
> > CPU1: WRITE_ONCE(a, 1)
> > CPU1: synchronize_rcu
> > /* Could put a full memory barrier here, but it wouldn't help. */
>
> CPU1: smp_mb();
> CPU2: smp_mb();

What is CPU2's smp_mb() ordering? In other words, what comment would
you put on each of the above smp_mb() calls?

> Let's put them in because I think they are critical. smp_mb() also
> carries with it a barrier().

Again, agreed, smp_mb() implies barrier().

> > CPU2: rcu_read_lock();
> > CPU1: b = 2;
> > CPU2: if (READ_ONCE(a) == 0)
> > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> > CPU2: b = 1;
> > CPU2: rcu_read_unlock
> >
> > In fact, CPU2's load from b might be moved up to race with CPU1's store,
> > which (I believe) is why the model complains in this case.
>
> Let's put aside my doubt over how we're even allowing a compiler
> to turn
>
> b = 1
>
> into
>
> if (b != 1)
> b = 1
>
> Since you seem to be assuming that (a == 0) is true in this case
> (as the assignment b = 1 is carried out), then because of the
> presence of the full memory barrier, the RCU read-side section
> must have started prior to the synchronize_rcu. This means that
> synchronize_rcu is not allowed to return until at least the end
> of the grace period, or at least until the end of rcu_read_unlock.
>
> So it actually should be:
>
> CPU1: WRITE_ONCE(a, 1)
> CPU1: synchronize_rcu called
> /* Could put a full memory barrier here, but it wouldn't help. */
>
> CPU1: smp_mb();
> CPU2: smp_mb();
>
> CPU2: grace period starts
> ...time passes...
> CPU2: rcu_read_lock();
> CPU2: if (READ_ONCE(a) == 0)
> CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> CPU2: b = 1;
> CPU2: rcu_read_unlock
> ...time passes...
> CPU2: grace period ends
>
> /* This full memory barrier is also guaranteed by RCU. */
> CPU2: smp_mb();

But in this case, given that there are no more statements for CPU2,
what is this smp_mb() ordering?

Thanx, Paul

> CPU1 synchronize_rcu returns
> CPU1: b = 2;
>
> Cheers,
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>

2019-06-03 15:28:43

by David Laight

[permalink] [raw]
Subject: RE: rcu_read_lock lost its compiler barrier

From: Paul E. McKenney
> Sent: 03 June 2019 09:42
...
> On kissing the kernel goodbye, a reasonable strategy might be to
> identify the transformations that are actually occurring (like the
> stores of certain constants called out above) and fix those.

> We do
> occasionally use READ_ONCE() to prevent load-fusing optimizations that
> would otherwise cause the compiler to turn while-loops into if-statements
> guarding infinite loops.

In that case the variable ought to be volatile...

> There is also the possibility of having the
> compiler guys give us more command-line arguments.

I wonder how much the code size (of anything) would increase
if the compiler:
1) Never read a value into a local more than once.
2) Never write a value that wasn't requested by the code.
3) Never used multiple memory accesses for 'machine word' (and
smaller) items.

(1) makes all reads READ_ONCE() except that the actual read
can be delayed until further down the code.
If I have a naive #define bswap32() I'd expect:
v = bswap32(foo->bar)
to possibly read foo->bar multiple times, but not:
int foo_bar = foo->bar;
v = bswap32(foo_bar);

(2) makes all writes WRITE_ONCE() except that if there are
multiple writes to the same location, only the last need
be done.
In particular it stops speculative writes and the use of
locations that are going to be written to as temporaries.
It also stops foo->bar = ~0; from being implemented as a clear
then decrement.

(3) I'd never imagined the compiler would write the two halves
of a word separately!

If the compiler behaved like that (as one might expect it would)
then READ_ONCE() would be a READ_NOW() for when the sequencing
mattered.

I was also recently surprised by the code I got from this loop:
	for (i = 0; i < limit; i++)
		sum64 += array32[i];
(as in the IP checksum sum without add carry support).
The compiler unrolled it to use horrid sequences of sse3/avx
instructions.
This might be a gain for large enough buffers and 'hot cache'
but for small buffers and a likely cold cache it is horrid.
I guess such optimisations are valid, but I wonder how often
they are an actual win for real programs.

David


2019-06-03 15:44:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 3, 2019 at 8:26 AM David Laight <[email protected]> wrote:
>
> From: Paul E. McKenney
>
> > We do
> > occasionally use READ_ONCE() to prevent load-fusing optimizations that
> > would otherwise cause the compiler to turn while-loops into if-statements
> > guarding infinite loops.
>
> In that case the variable ought to be volatile...

No.

We do not use volatile on variables.

The C people got the semantics wrong, probably because 'volatile' was
historically for IO, not for access atomicity without locking.

It's not the memory location that is volatile, it is really the
_access_ that is volatile.

The same memory location might be completely stable in other access
situations (ie when done under a lock).

In other words, we should *never* use volatile in the kernel. It's
fundamentally mis-designed for modern use.

(Of course, we then can use volatile in a cast in code, which drives
some compiler people crazy, but that's because said compiler people
don't care about reality, they care about some paperwork).

Linus

2019-06-03 15:57:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sun, Jun 2, 2019 at 8:03 PM Herbert Xu <[email protected]> wrote:
>
> In any case, I am now even more certain that compiler barriers are
> not needed in the code in question. The reasoning is quite simple.
> If you need those compiler barriers then you surely need real memory
> barriers.

So the above statement is not necessarily correct.

Compiler barriers very much can have real effects even in the absense
of "real" memory barriers.

But those effects are obviously not about multiple CPU's - they are
about code generation and can be about ordering on _one_ CPU. Those
effects definitely happen, though.

So a compiler barrier without a memory barrier may make a difference if you

(a) compile for UP (when a compiler barrier basically _is_ a memory barrier)

(b) have deadlock or ordering avoidance with only the local CPU
taking interrupts.

(c) need to re-load a value in a loop, but ordering isn't a concern

and possibly other situations.

In the above, (a) may be pointless and trivial, but (b) and (c) are
issues even on SMP. Some things only matter for the local CPU - an
interrupt or a code sequence that happens on another CPU can just
block, but an interrupt that comes in on the same CPU may dead-lock and
depend on particular access ordering. And (c) is for things like
cpu_relax() in loops that read stuff (although honestly, READ_ONCE()
is generally a much better pattern).

But it sounds like in this case at least, Herbert's and Paul's
disagreement isn't really about the memory barriers and locking so
much as about the fact that in general the only thing that RCU
protects against is single accesses, and thus any RCU region
technically should use something that guarantees that the compiler
cannot do stupid things.

We do require a _minimum_ of compiler sanity, but the compiler turning
a non-marked read into two reads has definitely happened, and can be
very subtle. Even if on a C level it looks like a single access, and
correctness doesn't care about _what_ the value we read is, it might
be turned by the compiler into two separate accesses that get two
different values, and then the end result may be insane and
unreliable.

So on the whole, I do think that it's usually a good idea to show
_which_ access is protected by RCU. Perhaps with a
READ_ONCE/WRITE_ONCE pattern, although it can be other things.
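
Something like this made-up sketch of the failure mode ("shared" and
"do_stuff" are invented for illustration):

	/* Plain access: the compiler may legitimately reload the
	 * pointer, so the second load can see NULL even though the
	 * first test passed:
	 */
	if (shared->ptr && shared->ptr->field)
		do_stuff();

	/* Marked access: exactly one load, one consistent value: */
	struct obj *p = READ_ONCE(shared->ptr);

	if (p && p->field)
		do_stuff();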

I don't believe that it would necessarily help to turn a
rcu_read_lock() into a compiler barrier, because for the non-preempt
case rcu_read_lock() doesn't need to actually _do_ anything, and
anything that matters for the RCU read lock will already be a compiler
barrier for other reasons (ie a function call that can schedule).

Anyway, I suspect the code is correct in practice, but I do have some
sympathy for the "RCU protected accesses should be marked".

Linus

2019-06-03 16:11:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds
<[email protected]> wrote:
>
> I don't believe that it would necessarily help to turn a
> rcu_read_lock() into a compiler barrier, because for the non-preempt
> case rcu_read_lock() doesn't need to actually _do_ anything, and
> anything that matters for the RCU read lock will already be a compiler
> barrier for other reasons (ie a function call that can schedule).

Actually, thinking a bit more about this, and trying to come up with
special cases, I'm not at all convinced.

Even if we don't have preemption enabled, it turns out that we *do*
have things that can cause scheduling without being compiler barriers.

In particular, user accesses are not necessarily full compiler
barriers. One common pattern (x86) is

asm volatile("call __get_user_%P4"

which explicitly has an "asm volatile" so that it doesn't re-order wrt
other asms (and thus other user accesses), but it does *not* have a
"memory" clobber, because the user access doesn't actually change
kernel memory. Not even if it's a "put_user()".

So we've made those fairly relaxed on purpose. And they might be
relaxed enough that they'd allow re-ordering wrt something that does a
rcu read lock, unless the rcu read lock has some compiler barrier in
it.

IOW, imagine completely made up code like

get_user(val, ptr)
rcu_read_lock();
WRITE_ONCE(state, 1);

and unless the rcu lock has a barrier in it, I actually think that
write to 'state' could migrate to *before* the get_user().

I'm not convinced we have anything that remotely looks like the above,
but I'm actually starting to think that yes, all RCU barriers had
better be compiler barriers.

Because this is very much an example of something where you don't
necessarily need a memory barrier, but there's a code generation
barrier needed because of local ordering requirements. The possible
faulting behavior of "get_user()" must not migrate into the RCU
critical region.

Paul?

So I think the rule really should be: every single form of locking
that has any semantic meaning at all, absolutely needs to be at least
a compiler barrier.

(That "any semantic meaning" weaselwording is because I suspect that
we have locking that truly and intentionally becomes no-ops because
it's based on things that aren't relevant in some configurations. But
generally compiler barriers are really pretty damn cheap, even from a
code generation standpoint, and can help make the resulting code more
legible, so I think we should not try to aggressively remove them
without _very_ good reasons)

Linus

2019-06-03 19:55:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 09:07:29AM -0700, Linus Torvalds wrote:
> On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > I don't believe that it would necessarily help to turn a
> > rcu_read_lock() into a compiler barrier, because for the non-preempt
> > case rcu_read_lock() doesn't need to actually _do_ anything, and
> > anything that matters for the RCU read lock will already be a compiler
> > barrier for other reasons (ie a function call that can schedule).
>
> Actually, thinking a bit more about this, and trying to come up with
> special cases, I'm not at all convinced.
>
> Even if we don't have preemption enabled, it turns out that we *do*
> have things that can cause scheduling without being compiler barriers.
>
> In particular, user accesses are not necessarily full compiler
> barriers. One common pattern (x86) is
>
> asm volatile("call __get_user_%P4"
>
> which explicitly has an "asm volatile" so that it doesn't re-order wrt
> other asms (and thus other user accesses), but it does *not* have a
> "memory" clobber, because the user access doesn't actually change
> kernel memory. Not even if it's a "put_user()".
>
> So we've made those fairly relaxed on purpose. And they might be
> relaxed enough that they'd allow re-ordering wrt something that does a
> rcu read lock, unless the rcu read lock has some compiler barrier in
> it.
>
> IOW, imagine completely made up code like
>
> get_user(val, ptr)
> rcu_read_lock();
> WRITE_ONCE(state, 1);
>
> and unless the rcu lock has a barrier in it, I actually think that
> write to 'state' could migrate to *before* the get_user().
>
> I'm not convinced we have anything that remotely looks like the above,
> but I'm actually starting to think that yes, all RCU barriers had
> better be compiler barriers.
>
> Because this is very much an example of something where you don't
> necessarily need a memory barrier, but there's a code generation
> barrier needed because of local ordering requirements. The possible
> faulting behavior of "get_user()" must not migrate into the RCU
> critical region.
>
> Paul?

I agree that !PREEMPT rcu_read_lock() would not affect compiler code
generation, but given that get_user() is a volatile asm, isn't the
compiler already forbidden from reordering it with the volatile-casted
WRITE_ONCE() access, even if there was nothing at all between them?
Or are asms an exception to the rule that volatile executions cannot
be reordered?

> So I think the rule really should be: every single form of locking
> that has any semantic meaning at all, absolutely needs to be at least
> a compiler barrier.
>
> (That "any semantic meaning" weaselwording is because I suspect that
> we have locking that truly and intentionally becomes no-ops because
> it's based on things that aren't relevant in some configurations. But
> generally compiler barriers are really pretty damn cheap, even from a
> code generation standpoint, and can help make the resulting code more
> legible, so I think we should not try to aggressively remove them
> without _very_ good reasons)

We can of course put them back in, but this won't help in the typical
rcu_assign_pointer(), rcu_dereference(), and synchronize_rcu() situation
(nor do I see how it helps in Herbert's example). And in other RCU
use cases, the accesses analogous to the rcu_assign_pointer() and
rcu_dereference() (in Herbert's example, the accesses to variable "a")
really need to be READ_ONCE()/WRITE_ONCE() or stronger, correct?

Thanx, Paul

2019-06-03 20:06:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote:
> On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote:
> > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> > >
> > > 1. These guarantees are of full memory barriers, -not- compiler
> > > barriers.
> >
> > What I'm saying is that wherever they are, they must come with
> > compiler barriers. I'm not aware of any synchronisation mechanism
> > in the kernel that gives a memory barrier without a compiler barrier.
> >
> > > 2. These rules don't say exactly where these full memory barriers
> > > go. SRCU is at one extreme, placing those full barriers in
> > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
> > > at the other, placing these barriers entirely within the callback
> > > queueing/invocation, grace-period computation, and the scheduler.
> > > Preemptible Tree RCU is in the middle, with rcu_read_unlock()
> > > sometimes including a full memory barrier, but other times with
> > > the full memory barrier being confined as it is with !PREEMPT
> > > Tree RCU.
> >
> > The rules do say that the (full) memory barrier must precede any
> > RCU read-side that occur after the synchronize_rcu and after the
> > end of any RCU read-side that occur before the synchronize_rcu.
> >
> > All I'm arguing is that wherever that full mb is, as long as it
> > also carries with it a barrier() (which it must do if it's done
> > using an existing kernel mb/locking primitive), then we're fine.
> >
> > > Interleaving and inserting full memory barriers as per the rules above:
> > >
> > > CPU1: WRITE_ONCE(a, 1)
> > > CPU1: synchronize_rcu
> > > /* Could put a full memory barrier here, but it wouldn't help. */
> >
> > CPU1: smp_mb();
> > CPU2: smp_mb();
> >
> > Let's put them in because I think they are critical. smp_mb() also
> > carries with it a barrier().
> >
> > > CPU2: rcu_read_lock();
> > > CPU1: b = 2;
> > > CPU2: if (READ_ONCE(a) == 0)
> > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> > > CPU2: b = 1;
> > > CPU2: rcu_read_unlock
> > >
> > > In fact, CPU2's load from b might be moved up to race with CPU1's store,
> > > which (I believe) is why the model complains in this case.
> >
> > Let's put aside my doubt over how we're even allowing a compiler
> > to turn
> >
> > b = 1
> >
> > into
> >
> > if (b != 1)
> > b = 1
> >
> > Since you seem to be assuming that (a == 0) is true in this case
>
> I think Paul's example assumes (a == 0) is false, and maybe

Yes, otherwise, P0()'s write to "b" cannot have happened.

> speculative writes (by compilers) need to be added into consideration?

I would instead call it the compiler eliminating needless writes
by inventing reads -- if the variable already has the correct value,
no write happens. So no compiler speculation.

However, it is difficult to create a solid defensible example. Yes,
from LKMM's viewpoint, the weakly reordered invented read from "b"
can be concurrent with P0()'s write to "b", but in that case the value
loaded would have to manage to be equal to 1 for anything bad to happen.
This does feel wrong to me, but again, it is difficult to create a solid
defensible example.

> Please consider the following case (I add a few smp_mb()s), the case may
> be a little bit crazy, you have been warned ;-)
>
> CPU1: WRITE_ONCE(a, 1)
> CPU1: synchronize_rcu called
>
> CPU1: smp_mb(); /* let assume there is one here */
>
> CPU2: rcu_read_lock();
> CPU2: smp_mb(); /* let assume there is one here */
>
> /* "if (b != 1) b = 1" reordered */
> CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */
> CPU2: if (r0 != 1) /* true */
> CPU2: b = 1; /* b == 1 now; a speculative write by the compiler */
>
> CPU1: b = 2; /* b == 2 */
>
> CPU2: if (READ_ONCE(a) == 0) /* false */
> CPU2: ...
> CPU2: else /* undo the speculative write */
> CPU2: b = r0; /* b == 0 */
>
> CPU2: smp_mb();
> CPU2: rcu_read_unlock();
>
> I know this is too crazy for us to think a compiler would do this, but this
> might be the reason why the model complains about this.
>
> Paul, did I get this right? Or you mean something else?

Mostly there, except that I am not yet desperate enough to appeal to
compilers speculating stores. ;-)

Thanx, Paul

> Regards,
> Boqun
>
>
>
> > (as the assignment b = 1 is carried out), then because of the
> > presence of the full memory barrier, the RCU read-side section
> > must have started prior to the synchronize_rcu. This means that
> > synchronize_rcu is not allowed to return until at least the end
> > of the grace period, or at least until the end of rcu_read_unlock.
> >
> > So it actually should be:
> >
> > CPU1: WRITE_ONCE(a, 1)
> > CPU1: synchronize_rcu called
> > /* Could put a full memory barrier here, but it wouldn't help. */
> >
> > CPU1: smp_mb();
> > CPU2: smp_mb();
> >
> > CPU2: grace period starts
> > ...time passes...
> > CPU2: rcu_read_lock();
> > CPU2: if (READ_ONCE(a) == 0)
> > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> > CPU2: b = 1;
> > CPU2: rcu_read_unlock
> > ...time passes...
> > CPU2: grace period ends
> >
> > /* This full memory barrier is also guaranteed by RCU. */
> > CPU2: smp_mb();
> >
> > CPU1 synchronize_rcu returns
> > CPU1: b = 2;
> >
> > Cheers,
> > --
> > Email: Herbert Xu <[email protected]>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


2019-06-03 20:27:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney <[email protected]> wrote:
>
> I agree that !PREEMPT rcu_read_lock() would not affect compiler code
> generation, but given that get_user() is a volatile asm, isn't the
> compiler already forbidden from reordering it with the volatile-casted
> WRITE_ONCE() access, even if there was nothing at all between them?
> Or are asms an exception to the rule that volatile executions cannot
> be reordered?

Paul, you MAKE NO SENSE.

What is wrong with you?

I just showed you an example of where rcu_read_lock() needs to be a
compiler barrier, and then you make incoherent noises about
WRITE_ONCE() that do not even exist in that example.

Forget about your READ_ONCE/WRITE_ONCE theories. Herbert already
showed code that doesn't have those accessors, so reality doesn't
match your fevered imagination.

And sometimes it's not even possible, since you can't do a bitfield
access, for example, with READ_ONCE().

> We can of course put them back in,

Stop the craziness. It's not "we can". It is a "we will".

So I will add that barrier, and you need to stop arguing against it
based on specious theoretical arguments that do not match reality. And
we will not ever remove that barrier again. Herbert already pointed to
me having to do this once before in commit 386afc91144b ("spinlocks
and preemption points need to be at least compiler barriers"), and
rcu_read_lock() clearly has at a minimum that same preemption point
issue.

Linus

2019-06-04 14:46:07

by Alan Stern

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, 3 Jun 2019, Paul E. McKenney wrote:

> On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote:
> > On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote:
> > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> > > >
> > > > 1. These guarantees are of full memory barriers, -not- compiler
> > > > barriers.
> > >
> > > What I'm saying is that wherever they are, they must come with
> > > compiler barriers. I'm not aware of any synchronisation mechanism
> > > in the kernel that gives a memory barrier without a compiler barrier.
> > >
> > > > 2. These rules don't say exactly where these full memory barriers
> > > > go. SRCU is at one extreme, placing those full barriers in
> > > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
> > > > at the other, placing these barriers entirely within the callback
> > > > queueing/invocation, grace-period computation, and the scheduler.
> > > > Preemptible Tree RCU is in the middle, with rcu_read_unlock()
> > > > sometimes including a full memory barrier, but other times with
> > > > the full memory barrier being confined as it is with !PREEMPT
> > > > Tree RCU.
> > >
> > > The rules do say that the (full) memory barrier must precede any
> > > RCU read-side that occur after the synchronize_rcu and after the
> > > end of any RCU read-side that occur before the synchronize_rcu.
> > >
> > > All I'm arguing is that wherever that full mb is, as long as it
> > > also carries with it a barrier() (which it must do if it's done
> > > using an existing kernel mb/locking primitive), then we're fine.
> > >
> > > > Interleaving and inserting full memory barriers as per the rules above:
> > > >
> > > > CPU1: WRITE_ONCE(a, 1)
> > > > CPU1: synchronize_rcu
> > > > /* Could put a full memory barrier here, but it wouldn't help. */
> > >
> > > CPU1: smp_mb();
> > > CPU2: smp_mb();
> > >
> > > Let's put them in because I think they are critical. smp_mb() also
> > > carries with it a barrier().
> > >
> > > > CPU2: rcu_read_lock();
> > > > CPU1: b = 2;
> > > > CPU2: if (READ_ONCE(a) == 0)
> > > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> > > > CPU2: b = 1;
> > > > CPU2: rcu_read_unlock
> > > >
> > > > In fact, CPU2's load from b might be moved up to race with CPU1's store,
> > > > which (I believe) is why the model complains in this case.
> > >
> > > Let's put aside my doubt over how we're even allowing a compiler
> > > to turn
> > >
> > > b = 1
> > >
> > > into
> > >
> > > if (b != 1)
> > > b = 1

Even if you don't think the compiler will ever do this, the C standard
gives compilers the right to invent read accesses if a plain (i.e.,
non-atomic and non-volatile) write is present. The Linux Kernel Memory
Model has to assume that compilers will sometimes do this, even if it
doesn't take the exact form of checking a variable's value before
writing to it.

(Incidentally, regardless of whether the compiler will ever do this, I
have seen examples in the kernel where people did exactly this
manually, in order to avoid dirtying a cache line unnecessarily.)

> > > Since you seem to be assuming that (a == 0) is true in this case
> >
> > I think Paul's example assumes (a == 0) is false, and maybe
>
> Yes, otherwise, P0()'s write to "b" cannot have happened.
>
> > speculative writes (by compilers) need to be added into consideration?

On the other hand, the C standard does not allow compilers to add
speculative writes. The LKMM assumes they will never occur.

> I would instead call it the compiler eliminating needless writes
> by inventing reads -- if the variable already has the correct value,
> no write happens. So no compiler speculation.
>
> However, it is difficult to create a solid defensible example. Yes,
> from LKMM's viewpoint, the weakly reordered invented read from "b"
> can be concurrent with P0()'s write to "b", but in that case the value
> loaded would have to manage to be equal to 1 for anything bad to happen.
> This does feel wrong to me, but again, it is difficult to create a solid
> defensible example.
>
> > Please consider the following case (I add a few smp_mb()s), the case may
> > be a little bit crazy, you have been warned ;-)
> >
> > CPU1: WRITE_ONCE(a, 1)
> > CPU1: synchronize_rcu called
> >
> > CPU1: smp_mb(); /* let assume there is one here */
> >
> > CPU2: rcu_read_lock();
> > CPU2: smp_mb(); /* let assume there is one here */
> >
> > /* "if (b != 1) b = 1" reordered */
> > CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */
> > CPU2: if (r0 != 1) /* true */
> > CPU2: b = 1; /* b == 1 now; a speculative write by the compiler */
> >
> > CPU1: b = 2; /* b == 2 */
> >
> > CPU2: if (READ_ONCE(a) == 0) /* false */
> > CPU2: ...
> > CPU2: else /* undo the speculative write */
> > CPU2: b = r0; /* b == 0 */
> >
> > CPU2: smp_mb();
> > CPU2: rcu_read_unlock();
> >
> > I know this is too crazy for us to think a compiler would do this, but this
> > might be the reason why the model complains about this.
> >
> > Paul, did I get this right? Or you mean something else?
>
> Mostly there, except that I am not yet desperate enough to appeal to
> compilers speculating stores. ;-)

This example really does point out a weakness in the LKMM's handling of
data races. Herbert's litmus test is a great starting point:


C xu

{}

P0(int *a, int *b)
{
        WRITE_ONCE(*a, 1);
        synchronize_rcu();
        *b = 2;
}

P1(int *a, int *b)
{
        rcu_read_lock();
        if (READ_ONCE(*a) == 0)
                *b = 1;
        rcu_read_unlock();
}

exists (~b=2)


Currently the LKMM says the test is allowed and there is a data race,
but this answer clearly is wrong since it would violate the RCU
guarantee.

The problem is that the LKMM currently requires all ordering/visibility
of plain accesses to be mediated by marked accesses. But in this case,
the visibility is mediated by RCU. Technically, we need to add a
relation like

([M] ; po ; rcu-fence ; po ; [M])

into the definitions of ww-vis, wr-vis, and rw-xbstar. Doing so
changes the litmus test's result to "not allowed" and no data race.
However, I'm not certain that this single change is the entire fix;
more thought is needed.
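
(For anyone who wants to experiment: assuming a checkout of the kernel's
tools/memory-model, a litmus test like the one above can be checked with
something like

        herd7 -conf linux-kernel.cfg xu.litmus

where "xu.litmus" is whatever file the test is saved in; herd7 reports
whether the "exists" clause is satisfiable and whether LKMM flags a
data race.)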

Alan

2019-06-04 16:07:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Tue, Jun 4, 2019 at 7:44 AM Alan Stern <[email protected]> wrote:
>
> Even if you don't think the compiler will ever do this, the C standard
> gives compilers the right to invent read accesses if a plain (i.e.,
> non-atomic and non-volatile) write is present.

Note that for the kernel, it's not like we go strictly by what the
standard says anyway.

The reason for the "read for write" thing is that obviously you
sometimes have broken architectures (*cough*alpha*cough*) that need to
do some common writes as read-mask-modify-write accesses, and for
bitfields you obviously *always* have that issue.

In general, a sane compiler (as opposed to "we just read the
standard") had better not add random reads, because people might have
some mappings be write-only when the hardware supports it.

And we do generally expect sanity from our compilers, and will add
compiler flags to disable bad behavior if required - even if said
behavior would be "technically allowed by the standard".

> (Incidentally, regardless of whether the compiler will ever do this, I
> have seen examples in the kernel where people did exactly this
> manually, in order to avoid dirtying a cache line unnecessarily.)

I really hope and expect that this is not something that the compiler ever does.

If a compiler turns "x = y" into if (x != y) x = y" (like we do
manually at times, as you say), that compiler is garbage that we
cannot use for the kernel. It would break things like "smp_wmb()"
ordering guarantees, I'm pretty sure.

And as always, if we're doing actively stupid things, and the compiler
then turns our stupid code into something we don't expect, the
corollary is that then it's on _us_. IOW, if we do

if (x != 1) {
...
}
x = 1;

and the compiler goes "oh, we already checked that 'x == 1'" and moves
that "unconditional" 'x = 1' into the conditional section like

if (x != 1) {
..
x = 1;
}

then that's not something we can then complain about.

So our expectation is that the compiler is _sane_, not that it's some
"C as assembler". Adding spurious reads is not ok. But if we already
had reads in the code and the compiler combines them with other ops,
that's on us.

End result: in general, we do expect that the compiler turns a regular
assignment into a single plain write when that's what the code does,
and does not add extra logic over and beyond that.

In fact, the alpha port was always subtly buggy exactly because of the
"byte write turns into a read-and-masked-write", even if I don't think
anybody ever noticed (we did fix cases where people _did_ notice,
though, and we might still have some cases where we use 'int' for
booleans because of alpha issues.).

So I don't technically disagree with anything you say, I just wanted
to point out that as far as the kernel is concerned, we do have higher
quality expectations from the compiler than just "technically valid
according to the C standard".

Linus

2019-06-04 17:02:01

by Alan Stern

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Tue, 4 Jun 2019, Linus Torvalds wrote:

> So I don't technically disagree with anything you say,

That's good to know!

> I just wanted
> to point out that as far as the kernel is concerned, we do have higher
> quality expectations from the compiler than just "technically valid
> according to the C standard".

Which suggests asking whether these higher expectations should be
reflected in the Linux Kernel Memory Model. So far we have largely
avoided doing that sort of thing, although there are a few exceptions.

(For example, we assume the compiler does not destroy address
dependencies from volatile reads -- but we also warn that this
assumption may fail if the programmer does not follow some rules
described in one of Paul's documentation files.)

Alan

2019-06-04 17:31:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Tue, Jun 4, 2019 at 10:00 AM Alan Stern <[email protected]> wrote:
>
> Which suggests asking whether these higher expectations should be
> reflected in the Linux Kernel Memory Model. So far we have largely
> avoided doing that sort of thing, although there are a few exceptions.

I think they might be hard to describe - which in turn may be why the
standard leaves it open and only describes the simple cases.

Exactly that "we expect an assignment to be done as a single write" is
probably a good example. Yes, we do have that expectation for the
simple cases. But it's not an absolute rule, obviously, because it's
clearly violated by bitfield writes, and similarly it's obviously
violated for data that is bigger than the word size (ie a "u64"
assignment is obviously not a single write when you're on a 32-bit
target).

And while those cases are static and could be described separately,
the "oh, and we have _other_ accesses to the same variable nearby, and
we expect that the compiler might take those into account unless we
explicitly use WRITE_ONCE()" things make for much more hard to
describe issues.

Doing writes with speculative values is clearly bogus and garbage
(although compiler writers _have_ tried to do that too: "we then
overwrite it with the right value later, so it's ok"), and thankfully
even user space people complain about that due to threading and
signals. But eliding writes entirely by combining them with a later
one is clearly ok - and eliding them when there was an explicit earlier
read and value comparison (like in my example) sounds reasonable to me
too. Yet silently adding the elision that wasn't visible in the source
due to other accesses would be bad.

How do you say "sane and reasonable compiler" in a spec? You usually
don't - or you make the rules so odd and complex than nobody really
understands them any more, and make it all moot ;)

Linus

2019-06-04 21:17:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 01:24:32PM -0700, Linus Torvalds wrote:
> On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney <[email protected]> wrote:
> >
> > I agree that !PREEMPT rcu_read_lock() would not affect compiler code
> > generation, but given that get_user() is a volatile asm, isn't the
> > compiler already forbidden from reordering it with the volatile-casted
> > WRITE_ONCE() access, even if there was nothing at all between them?
> > Or are asms an exception to the rule that volatile executions cannot
> > be reordered?
>
> Paul, you MAKE NO SENSE.
>
> What is wrong with you?

Mostly that I didn't check all architectures' definitions of get_user().
Had I done so, I would have seen that not all of the corresponding asms
have the "volatile" keyword. And of course, without that keyword, there
is absolutely nothing preventing the compiler from reordering the asm
with pretty much anything. The only things that would be absolutely
guaranteed to prevent reordering would be things like memory clobbers
(barrier()) or accesses that overlap the asm's input/output list.

Yeah, I know, even with the "volatile" keyword, it is not entirely clear
how much reordering the compiler is allowed to do. I was relying on
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:

Qualifiers

volatile

The typical use of extended asm statements is to
manipulate input values to produce output values. However,
your asm statements may also produce side effects. If so,
you may need to use the volatile qualifier to disable
certain optimizations. See Volatile.

But the linked-to "Volatile" section later in that same web page mostly
talks about the compiler's ability to hoist asms out of loops.

> I just showed you an example of where rcu_read_lock() needs to be a
> compiler barrier, and then you make incoherent noises about
> WRITE_ONCE() that do not even exist in that example.

I thought we were discussing this example, but it doesn't matter because
I was missing your point about get_user() and page faults:

get_user(val, ptr)
rcu_read_lock();
WRITE_ONCE(state, 1);

But regardless, given that some architectures omit volatile from their
asms implementing get_user(), even an optimistic interpretation of that
part of the GCC documentation would still permit reordering the above.
And again, I was missing your point about get_user() causing page faults
and thus context switches.

> Forget about your READ_ONCE/WRITE_ONCE theories. Herbert already
> showed code that doesn't have those accessors, so reality doesn't
> match your fevered imagination.

I get the feeling that you believe that I want LKMM to be some sort of
final judge and absolute arbiter of what code is legal and not from
a memory-ordering perspective. This is absolutely -not- the case.
The current state of the art, despite the recent impressive progress,
simply cannot reasonably do this. So all I can claim is that LKMM
dispenses advice, hopefully good advice. (It is early days for LKMM's
handling of plain accesses, so some work might be required to deliver
on the "good advice" promise, but we have to start somewhere. Plus it
is progressing nicely.)

The places where long-standing RCU patterns require rcu_dereference()
and rcu_assign_pointer() do require some attention to avoid compiler
optimizations, and {READ,WRITE}_ONCE() is one way of addressing this.
But not the only way, nor always the best way. For example, some
fastpaths might need the optimizations that {READ,WRITE}_ONCE()
suppresses. Therefore, Linux kernel hackers have a number of other
ways of paying attention. For example, accesses might be constrained
via barrier() and friends. For another example, some developers might
check assembly output (hopefully scripted somehow).

Again, the Linux-kernel memory model dispenses advice, not absolutes.
Furthermore, the way it dispenses advice is currently a bit limited.
It can currently say that it is nervous about lack of {READ,WRITE}_ONCE(),
as in "Flag data-race", but it would be difficult to make it recommend
the other options in an intelligent way. So we should interpret "Flag
data-race" as LKMM saying "I am nervous about your unmarked accesses"
rather than "You absolutely must call {READ,WRITE}_ONCE() more often!!!"
Again, advice, not absolutes.

So the idea is that you add and remove {READ,WRITE}_ONCE() to/from the
-litmus- -tests- to determine which accesses LKMM is nervous about.
But that doesn't necessarily mean that {READ,WRITE}_ONCE() goes into
the corresponding places in the Linux kernel.

Does that help, or am I still confused?

> And sometimes it's not even possible, since you can't do a bitfield
> access, for example, with READ_ONCE().

Ah, good point. So the Linux kernel uses bitfields to communicate
between mainline and interrupt handlers. New one on me. :-/

> > We can of course put them back in,
>
> Stop the craziness. It's not "we can". It is a "we will".
>
> So I will add that barrier, and you need to stop arguing against it
> based on specious theoretical arguments that do not match reality. And
> we will not ever remove that barrier again. Herbert already pointed to
> me having to do this once before in commit 386afc91144b ("spinlocks
> and preemption points need to be at least compiler barriers"), and
> rcu_read_lock() clearly has at a minimum that same preemption point
> issue.

And the lack of "volatile" allows get_user() to migrate page faults
(and thus context switches) into RCU read-side critical sections
in CONFIG_PREEMPT=n. Yes, this would be very bad.

OK, I finally got it, so please accept my apologies for my earlier
confusion.

I don't yet see a commit from you, so I queued the one below locally
and started testing.

Thanx, Paul

------------------------------------------------------------------------

commit 9b4766c5523efb8d3d52b2ba2a29fd69cdfc65bb
Author: Paul E. McKenney <[email protected]>
Date: Tue Jun 4 14:05:52 2019 -0700

rcu: Restore barrier() to rcu_read_lock() and rcu_read_unlock()

Commit bb73c52bad36 ("rcu: Don't disable preemption for Tiny and Tree
RCU readers") removed the barrier() calls from rcu_read_lock() and
rcu_read_unlock() in CONFIG_PREEMPT=n&&CONFIG_PREEMPT_COUNT=n kernels.
Within RCU, this commit was OK, but it failed to account for things like
get_user() that can pagefault and that can be reordered by the compiler.
Lack of the barrier() calls in rcu_read_lock() and rcu_read_unlock()
can cause these page faults to migrate into RCU read-side critical
sections, which in CONFIG_PREEMPT=n kernels could result in too-short
grace periods and arbitrary misbehavior. Please see commit 386afc91144b
("spinlocks and preemption points need to be at least compiler barriers")
for more details.

This commit therefore restores the barrier() call to both rcu_read_lock()
and rcu_read_unlock(). It also removes them from places in the RCU update
machinery that used to need compensatory barrier() calls, effectively
reverting commit bb73c52bad36 ("rcu: Don't disable preemption for Tiny
and Tree RCU readers").

Reported-by: Herbert Xu <[email protected]>
Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0c9b92799abc..8f7167478c1d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -56,14 +56,12 @@ void __rcu_read_unlock(void);

static inline void __rcu_read_lock(void)
{
- if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
- preempt_disable();
+ preempt_disable();
}

static inline void __rcu_read_unlock(void)
{
- if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
- preempt_enable();
+ preempt_enable();
}

static inline int rcu_preempt_depth(void)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3f52d8438e0f..841060fce33c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -288,7 +288,6 @@ void rcu_note_context_switch(bool preempt)
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
struct rcu_node *rnp;

- barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
lockdep_assert_irqs_disabled();
WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
@@ -340,7 +339,6 @@ void rcu_note_context_switch(bool preempt)
if (rdp->exp_deferred_qs)
rcu_report_exp_rdp(rdp);
trace_rcu_utilization(TPS("End context switch"));
- barrier(); /* Avoid RCU read-side critical sections leaking up. */
}
EXPORT_SYMBOL_GPL(rcu_note_context_switch);

@@ -828,11 +826,6 @@ static void rcu_qs(void)
* dyntick-idle quiescent state visible to other CPUs, which will in
* some cases serve for expedited as well as normal grace periods.
* Either way, register a lightweight quiescent state.
- *
- * The barrier() calls are redundant in the common case when this is
- * called externally, but just in case this is called from within this
- * file.
- *
*/
void rcu_all_qs(void)
{
@@ -847,14 +840,12 @@ void rcu_all_qs(void)
return;
}
this_cpu_write(rcu_data.rcu_urgent_qs, false);
- barrier(); /* Avoid RCU read-side critical sections leaking down. */
if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
local_irq_save(flags);
rcu_momentary_dyntick_idle();
local_irq_restore(flags);
}
rcu_qs();
- barrier(); /* Avoid RCU read-side critical sections leaking up. */
preempt_enable();
}
EXPORT_SYMBOL_GPL(rcu_all_qs);
@@ -864,7 +855,6 @@ EXPORT_SYMBOL_GPL(rcu_all_qs);
*/
void rcu_note_context_switch(bool preempt)
{
- barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
rcu_qs();
/* Load rcu_urgent_qs before other flags. */
@@ -877,7 +867,6 @@ void rcu_note_context_switch(bool preempt)
rcu_tasks_qs(current);
out:
trace_rcu_utilization(TPS("End context switch"));
- barrier(); /* Avoid RCU read-side critical sections leaking up. */
}
EXPORT_SYMBOL_GPL(rcu_note_context_switch);
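
(Why the now-unconditional preempt_disable()/preempt_enable() restores
the compiler barrier even in CONFIG_PREEMPT_COUNT=n kernels: in that
configuration the two macros emit no instructions but are still defined
as compiler barriers, roughly as in include/linux/preempt.h:

        #define preempt_disable()       barrier()
        #define preempt_enable()        barrier()

so dropping the IS_ENABLED() checks reinstates the barrier() that
commit bb73c52bad36 had removed.)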


2019-06-05 02:22:57

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote:
>
> Yeah, I know, even with the "volatile" keyword, it is not entirely clear
> how much reordering the compiler is allowed to do. I was relying on
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:

The volatile keyword doesn't give any guarantees of this kind.
The key to ensuring ordering between unrelated variable/register
reads/writes is the memory clobber:

6.47.2.6 Clobbers and Scratch Registers

...

"memory" The "memory" clobber tells the compiler that the assembly
code performs memory reads or writes to items other than those
listed in the input and output operands (for example, accessing
the memory pointed to by one of the input parameters). To ensure
memory contains correct values, GCC may need to flush specific
register values to memory before executing the asm. Further,
the compiler does not assume that any values read from memory
before an asm remain unchanged after that asm; it reloads them as
needed. Using the "memory" clobber effectively forms a read/write
memory barrier for the compiler.

Note that this clobber does not prevent the processor from
doing speculative reads past the asm statement. To prevent that,
you need processor-specific fence instructions.

IOW you need a barrier().
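
For reference, the kernel spells that as an empty asm with a "memory"
clobber (include/linux/compiler.h, modulo kernel-version differences):

        /* No instructions are emitted; the "memory" clobber alone
         * makes this a compiler barrier. */
        #define barrier() __asm__ __volatile__("" : : : "memory")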

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-05 03:32:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Wed, Jun 05, 2019 at 10:21:17AM +0800, Herbert Xu wrote:
> On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote:
> >
> > Yeah, I know, even with the "volatile" keyword, it is not entirely clear
> > how much reordering the compiler is allowed to do. I was relying on
> > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:
>
> The volatile keyword doesn't give any guarantees of this kind.
> The key to ensuring ordering between unrelated variable/register
> reads/writes is the memory clobber:
>
> 6.47.2.6 Clobbers and Scratch Registers
>
> ...
>
> "memory" The "memory" clobber tells the compiler that the assembly
> code performs memory reads or writes to items other than those
> listed in the input and output operands (for example, accessing
> the memory pointed to by one of the input parameters). To ensure
> memory contains correct values, GCC may need to flush specific
> register values to memory before executing the asm. Further,
> the compiler does not assume that any values read from memory
> before an asm remain unchanged after that asm; it reloads them as
> needed. Using the "memory" clobber effectively forms a read/write
> memory barrier for the compiler.
>
> Note that this clobber does not prevent the processor from
> doing speculative reads past the asm statement. To prevent that,
> you need processor-specific fence instructions.
>
> IOW you need a barrier().

Understood. Does the patch I sent out a few hours ago cover it? Or is
something else needed?

Other than updates to the RCU requirements documentation, which is
forthcoming.

Thanx, Paul

2019-06-06 04:40:26

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Tue, Jun 04, 2019 at 08:30:39PM -0700, Paul E. McKenney wrote:
>
> Understood. Does the patch I sent out a few hours ago cover it? Or is
> something else needed?

It looks good to me.

> Other than updates to the RCU requirements documentation, which is
> forthcoming.

Thanks Paul.
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-06 04:52:41

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote:
>
> Currently the LKMM says the test is allowed and there is a data race,
> but this answer clearly is wrong since it would violate the RCU
> guarantee.

Thank you! This is what I tried to say all along in this thread
but you expressed it in a much better way :)
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-06 06:07:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 12:51:09PM +0800, Herbert Xu wrote:
> On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote:
> >
> > Currently the LKMM says the test is allowed and there is a data race,
> > but this answer clearly is wrong since it would violate the RCU
> > guarantee.
>
> Thank you! This is what I tried to say all along in this thread
> but you expressed it in a much better way :)

In case you were wondering, the reason that I was giving you such
a hard time was that from what I could see, you were pushing for no
{READ,WRITE}_ONCE() at all. ;-)

Thanx, Paul

2019-06-06 06:16:25

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote:
>
> In case you were wondering, the reason that I was giving you such
> a hard time was that from what I could see, you were pushing for no
> {READ,WRITE}_ONCE() at all. ;-)

Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c.
We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at
all. Even if the compiler dices and slices the reads/writes of
"a" into a thousand pieces, it should still work if the RCU
primitives are worth their salt.

But I do concede that in the general RCU case you must have the
READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-06 08:18:35

by Andrea Parri

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

> This example really does point out a weakness in the LKMM's handling of
> data races. Herbert's litmus test is a great starting point:
>
>
> C xu
>
> {}
>
> P0(int *a, int *b)
> {
>         WRITE_ONCE(*a, 1);
>         synchronize_rcu();
>         *b = 2;
> }
>
> P1(int *a, int *b)
> {
>         rcu_read_lock();
>         if (READ_ONCE(*a) == 0)
>                 *b = 1;
>         rcu_read_unlock();
> }
>
> exists (~b=2)
>
>
> Currently the LKMM says the test is allowed and there is a data race,
> but this answer clearly is wrong since it would violate the RCU
> guarantee.
>
> The problem is that the LKMM currently requires all ordering/visibility
> of plain accesses to be mediated by marked accesses. But in this case,
> the visibility is mediated by RCU. Technically, we need to add a
> relation like
>
> ([M] ; po ; rcu-fence ; po ; [M])
>
> into the definitions of ww-vis, wr-vis, and rw-xbstar. Doing so
> changes the litmus test's result to "not allowed" and no data race.
> However, I'm not certain that this single change is the entire fix;
> more thought is needed.

This seems a sensible change to me: looking forward to seeing a patch,
on top of -rcu/dev, for further review and testing!

We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
discussed in this thread (maybe once the RCU code and the informal doc
will have settled in such direction).

It seems worth stressing the fact that _neither_ of these changes will
prevent the test below from being racy, considering that the two accesses
to "a" happen concurrently / without synchronization.

Thanks,
Andrea

C xu-2

{}

P0(int *a, int *b)
{
        *a = 1;
        synchronize_rcu();
        WRITE_ONCE(*b, 2);
}

P1(int *a, int *b)
{
        rcu_read_lock();
        if (*a == 0)
                WRITE_ONCE(*b, 1);
        rcu_read_unlock();
}

exists (~b=2)

2019-06-06 08:41:45

by Andrea Parri

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:

> The case we were discussing is from net/ipv4/inet_fragment.c from
> the net-next tree:

BTW, thank you for keeping me and other people who intervened in that
discussion in Cc:...

Andrea


>
> void fqdir_exit(struct fqdir *fqdir)
> {
> ...
> fqdir->dead = true;
>
> /* call_rcu is supposed to provide memory barrier semantics,
> * separating the setting of fqdir->dead with the destruction
> * work. This implicit barrier is paired with inet_frag_kill().
> */
>
> INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
> queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> }
>
> and
>
> void inet_frag_kill(struct inet_frag_queue *fq)
> {
> ...
> rcu_read_lock();
> /* The RCU read lock provides a memory barrier
> * guaranteeing that if fqdir->dead is false then
> * the hash table destruction will not start until
> * after we unlock. Paired with inet_frags_exit_net().
> */
> if (!fqdir->dead) {
> rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
> fqdir->f->rhash_params);
> ...
> }
> ...
> rcu_read_unlock();
> ...
> }
>
> I simplified this to
>
> Initial values:
>
> a = 0
> b = 0
>
> CPU1                    CPU2
> ----                    ----
> a = 1                   rcu_read_lock
> synchronize_rcu         if (a == 0)
> b = 2                       b = 1
>                         rcu_read_unlock
>
> On exit we want this to be true:
> b == 2

2019-06-06 09:09:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 02:14:38PM +0800, Herbert Xu wrote:
> On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote:
> >
> > In case you were wondering, the reason that I was giving you such
> > a hard time was that from what I could see, you were pushing for no
> > {READ,WRITE}_ONCE() at all. ;-)
>
> Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c.
> We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at
> all. Even if the compiler dices and slices the reads/writes of
> "a" into a thousand pieces, it should still work if the RCU
> primitives are worth their salt.

OK, so I take it that there is additional synchronization in there
somewhere that is not captured in your simplified example code?

Or is your point instead that given the initial value of "a" being
zero and the value stored to "a" being one, there is no way that
any possible load and store tearing (your slicing and dicing) could
possibly mess up the test of the value loaded from "a"?

> But I do concede that in the general RCU case you must have the
> READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.

OK, good that we are in agreement on this part, at least! ;-)

Thanx, Paul

2019-06-06 09:31:57

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote:
>
> Or is your point instead that given the initial value of "a" being
> zero and the value stored to "a" being one, there is no way that
> any possible load and store tearing (your slicing and dicing) could
> possibly mess up the test of the value loaded from "a"?

Exactly. If you can dream up a scenario where the compiler can
get this wrong, I'm all ears.

> > But I do concede that in the general RCU case you must have the
> > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.
>
> OK, good that we are in agreement on this part, at least! ;-)

Well only because we're allowing crazy compilers that can turn
a simple word-aligned word assignment (a = b) into two stores.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-06 09:35:48

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 10:38:56AM +0200, Andrea Parri wrote:
> On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:
>
> > The case we were discussing is from net/ipv4/inet_fragment.c from
> > the net-next tree:
>
> BTW, thank you for keeping me and other people who intervened in that
> discussion in Cc:...

FWIW I didn't drop you from the Cc list. The email discussion was
taken off-list by someone else and I simply kept that Cc list when
I brought it back onto lkml. On a second look I did end up dropping
Eric but I think he's had enough of this discussion :)
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-06 12:59:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 05:28:55PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote:
> >
> > Or is your point instead that given the initial value of "a" being
> > zero and the value stored to "a" being one, there is no way that
> > any possible load and store tearing (your slicing and dicing) could
> > possibly mess up the test of the value loaded from "a"?
>
> Exactly. If you can dream up of a scenario where the compiler can
> get this wrong I'm all ears.

I believe that this is safe in practice, as long as you exercise
constant vigilance. (OK, OK, I might be overdramatizing...)

I cannot immediately think of a way that the compiler could get this
wrong even in theory, but similar code sequences can be messed up.
The reason for this is that in theory, the compiler could use the
stored-to location as temporary storage, like this:

a = whatever; // Compiler uses "a" as a temporary
do_something();
whatever = a;
a = 1; // Intended store

The compiler is allowed to do this (again, in theory and according to a
strict interpretation of the standard) because you haven't told it that
anything else is paying attention to variable "a". As a result, the
compiler is within its rights to use "a" as temporary storage immediately
prior to any plain C-language store to "a".

In practice, I don't know of any compilers that actually do this, nor
have I heard anyone suggesting that they might soon actually do this.

And even if they could, your example would still work because your example
doesn't care about anything other than zero and non-zero, so wouldn't
get confused by the compiler storing a temporary value of 42 or whatever.

> > > But I do concede that in the general RCU case you must have the
> > > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.
> >
> > OK, good that we are in agreement on this part, at least! ;-)
>
> Well only because we're allowing crazy compilers that can turn
> a simple word-aligned word assignment (a = b) into two stores.

In my experience, the insanity of compilers increases with time,
but yes.

Thanx, Paul

2019-06-06 13:40:15

by Herbert Xu

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote:
>
> I cannot immediately think of a way that the compiler could get this
> wrong even in theory, but similar code sequences can be messed up.
> The reason for this is that in theory, the compiler could use the
> stored-to location as temporary storage, like this:
>
> a = whatever; // Compiler uses "a" as a temporary
> do_something();
> whatever = a;
> a = 1; // Intended store

Well if the compiler is going to do this then surely it would
continue to do this even if you used WRITE_ONCE. Remember a is
not volatile, only the access of a through WRITE_ONCE is volatile.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-06 13:51:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 09:38:24PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote:
> >
> > I cannot immediately think of a way that the compiler could get this
> > wrong even in theory, but similar code sequences can be messed up.
> > The reason for this is that in theory, the compiler could use the
> > stored-to location as temporary storage, like this:
> >
> > a = whatever; // Compiler uses "a" as a temporary
> > do_something();
> > whatever = a;
> > a = 1; // Intended store
>
> Well if the compiler is going to do this then surely it would
> continue to do this even if you used WRITE_ONCE. Remember a is
> not volatile, only the access of a through WRITE_ONCE is volatile.

I disagree. Given a volatile store, the compiler cannot assume that the
stored-to location is normal memory at that point in time, and therefore
cannot assume that it is safe to invent a store to that location (as
shown above). Thus far, the C++ standards committee seems on-board with
this, though time will tell.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r1.pdf
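
To illustrate, a simplified sketch of the volatile-cast idea (the
kernel's real {READ,WRITE}_ONCE() are more elaborate, also handling
sizes and tearing; the SKETCH_ name is made up):

        #define SKETCH_WRITE_ONCE(x, val) \
                (*(volatile typeof(x) *)&(x) = (val))

The volatile qualifier tells the compiler that the store may have side
effects it cannot see, so the compiler may not invent an extra earlier
store to the same location the way it could for a plain assignment.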

Thanx, Paul

2019-06-06 17:13:45

by Alan Stern

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, 6 Jun 2019, Andrea Parri wrote:

> This seems a sensible change to me: looking forward to seeing a patch,
> on top of -rcu/dev, for further review and testing!
>
> We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> discussed in this thread (maybe once the RCU code and the informal doc
> will have settled in such direction).

Yes. Also for SRCU. That point had not escaped me.

Alan

2019-06-07 14:15:22

by Herbert Xu

[permalink] [raw]
Subject: inet: frags: Turn fqdir->dead into an int for old Alphas

On Tue, Jun 04, 2019 at 09:04:55AM -0700, Linus Torvalds wrote:
>
> In fact, the alpha port was always subtly buggy exactly because of the
> "byte write turns into a read-and-masked-write", even if I don't think
> anybody ever noticed (we did fix cases where people _did_ notice,
> though, and we might still have some cases where we use 'int' for
> booleans because of alpha issues.).

This is in fact a real bug in the code in question that no amount
of READ_ONCE/WRITE_ONCE would have caught. The field fqdir->dead is
declared as boolean so writing to it is not atomic (on old Alphas).

I don't think it currently matters because padding would ensure
that it is in fact 64 bits long. However, should someone add another
char/bool/bitfield in this struct in future it could become an issue.

So let's fix it.

---8<--
The field fqdir->dead is meant to be written (and read) atomically.
As old Alpha CPUs can't write a single byte atomically, we need at
least an int for it to work.

Signed-off-by: Herbert Xu <[email protected]>

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index e91b79ad4e4a..8c458fba74ad 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -14,7 +14,9 @@ struct fqdir {
int max_dist;
struct inet_frags *f;
struct net *net;
- bool dead;
+
+ /* We can't use boolean because this needs atomic writes. */
+ int dead;

struct rhashtable rhashtable ____cacheline_aligned_in_smp;

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 35e9784fab4e..05aa7c145817 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -193,7 +193,7 @@ void fqdir_exit(struct fqdir *fqdir)
{
fqdir->high_thresh = 0; /* prevent creation of new frags */

- fqdir->dead = true;
+ fqdir->dead = 1;

/* call_rcu is supposed to provide memory barrier semantics,
* separating the setting of fqdir->dead with the destruction
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-07 16:15:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas



On 6/7/19 8:32 AM, Herbert Xu wrote:
> On Fri, Jun 07, 2019 at 08:26:12AM -0700, Eric Dumazet wrote:
>>
>> There is common knowledge among us programmers that bit fields
>> (or bool) sharing a common 'word' need to be protected
>> with a common lock.
>>
>> Converting all bit fields to plain int/long would be quite a waste of memory.
>>
>> In this case, fqdir_exit() is called right before the whole
>> struct fqdir is dismantled, and the only cpu that could possibly
>> change the thing is ourself, and we are going to start an RCU grace period.
>>
>> Note that first cache line in 'struct fqdir' is read-only.
>> Only ->dead field is flipped to one at exit time.
>>
>> Your patch would send a strong signal to programmers to not even try using
>> bit fields.
>>
>> Do we really want that ?
>
> If this were a bitfield then I'd think it would be safer because
> anybody adding a new bitfield is unlikely to try modifying both
> fields without locking or atomic ops.
>
> However, because this is a boolean, I can certainly see someone
> else coming along and adding another bool right next to it and
> expecting writes to them to still be atomic.
>
> As it stands, my patch has zero impact on memory usage because
> it's simply using existing padding. Should this become an issue
> in future, we can always revisit this and use a more appropriate
> method of addressing it.
>
> But the point is to alert future developers that this field is
> not an ordinary boolean.

Okay, but you added a quite redundant comment.

/* We can't use boolean because this needs atomic writes. */

Should we add a similar comment in front of all bit-fields,
or could we factorize this in a proper Documentation perhaps ?

Can we just add a proper bit-field and not the comment ?

unsigned int dead:1;

This way, the next programmer can just apply the normal rules to add a new bit.

Thanks !

2019-06-07 16:21:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas

On Fri, Jun 7, 2019 at 8:26 AM Eric Dumazet <[email protected]> wrote:
>
> There is common knowledge among us programmers that bit fields
> (or bool) sharing a common 'word' need to be protected
> with a common lock.
>
> Converting all bit fields to plain int/long would be quite a waste of memory.

Yeah, and we really don't care about alpha. So 'char' should be safe.

No compiler actually turns a 'bool' in a struct into a bitfield,
afaik, because you're still supposed to be able to take the address of
a boolean.

But on the whole, I do not believe that we should ever use 'bool' in
structures anyway, because it's such a badly defined type. I think
it's 'char' in practice on just about all architectures, but there
really were traditional use cases where 'bool' was int.

But:

- we shouldn't turn them into 'int' anyway - alpha is dead, and no
sane architecture will make the same mistake anyway. People learnt.

- we might want to make sure 'bool' really is 'char' in practice, to
double-check that the compiler doesn't do anything stupid.

- bitfields obviously do need locks. 'char' does not.
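
(To sketch why bitfields need locks: adjacent bitfields share one
machine word, so each write compiles to a read-modify-write of the
whole word, and unsynchronized updates can be lost:

        struct s { unsigned int a:1, b:1; };    /* share one word */

        /*
         * CPU1's "s.a = 1" and CPU2's "s.b = 1" each compile to
         * roughly: load the word, OR in one bit, store the word.
         * If both loads read the same old word, one update is lost.
         */

A plain 'char' has its own addressable storage unit, so byte stores
don't have this problem on any non-alpha architecture.)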

If there's somebody who really notices the alpha issue in PRACTICE, we
can then bother to fix it. But there is approximately one user, and
it's not a heavy-duty one.

Linus

2019-06-07 17:01:28

by Eric Dumazet

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas



On 6/7/19 7:09 AM, Herbert Xu wrote:
> On Tue, Jun 04, 2019 at 09:04:55AM -0700, Linus Torvalds wrote:
>>
>> In fact, the alpha port was always subtly buggy exactly because of the
>> "byte write turns into a read-and-masked-write", even if I don't think
>> anybody ever noticed (we did fix cases where people _did_ notice,
>> though, and we might still have some cases where we use 'int' for
>> booleans because of alpha issues.).
>
> This is in fact a real bug in the code in question that no amount
> of READ_ONCE/WRITE_ONCE would have caught. The field fqdir->dead is
> declared as boolean so writing to it is not atomic (on old Alphas).
>
> I don't think it currently matters because padding would ensure
> that it is in fact 64 bits long. However, should someone add another
> char/bool/bitfield in this struct in future it could become an issue.
>
> So let's fix it.


There is common knowledge among us programmers that bit fields
(or bool) sharing a common 'word' need to be protected
with a common lock.

Converting all bit fields to plain int/long would be quite a waste of memory.

In this case, fqdir_exit() is called right before the whole
struct fqdir is dismantled, and the only cpu that could possibly
change the thing is ourself, and we are going to start an RCU grace period.

Note that first cache line in 'struct fqdir' is read-only.
Only ->dead field is flipped to one at exit time.

Your patch would send a strong signal to programmers to not even try using
bit fields.

Do we really want that ?

>
> ---8<--
> The field fqdir->dead is meant to be written (and read) atomically.
> As old Alpha CPUs can't write a single byte atomically, we need at
> least an int for it to work.
>
> Signed-off-by: Herbert Xu <[email protected]>
>
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index e91b79ad4e4a..8c458fba74ad 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -14,7 +14,9 @@ struct fqdir {
> int max_dist;
> struct inet_frags *f;
> struct net *net;
> - bool dead;
> +
> + /* We can't use boolean because this needs atomic writes. */
> + int dead;
>
> struct rhashtable rhashtable ____cacheline_aligned_in_smp;
>
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 35e9784fab4e..05aa7c145817 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -193,7 +193,7 @@ void fqdir_exit(struct fqdir *fqdir)
> {
> fqdir->high_thresh = 0; /* prevent creation of new frags */
>
> - fqdir->dead = true;
> + fqdir->dead = 1;
>
> /* call_rcu is supposed to provide memory barrier semantics,
> * separating the setting of fqdir->dead with the destruction
>

2019-06-07 17:25:29

by Herbert Xu

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas

On Fri, Jun 07, 2019 at 08:26:12AM -0700, Eric Dumazet wrote:
>
> It is common knowledge among us programmers that bit fields
> (or bool) sharing a common 'word' need to be protected
> with a common lock.
>
> Converting all bit fields to plain int/long would be quite a waste of memory.
>
> In this case, fqdir_exit() is called right before the whole
> struct fqdir is dismantled, and the only cpu that could possibly
> change the thing is ourself, and we are going to start an RCU grace period.
>
> Note that the first cache line in 'struct fqdir' is read-only.
> Only the ->dead field is flipped to one at exit time.
>
> Your patch would send a strong signal to programmers to not even try using
> bit fields.
>
> Do we really want that?

If this were a bitfield then I'd think it would be safer because
anybody adding a new bitfield is unlikely to try modifying both
fields without locking or atomic ops.

However, because this is a boolean, I can certainly see someone
else coming along and adding another bool right next to it and
expecting writes to them to still be atomic.
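
A sketch of that hazard (the second field is hypothetical): on a
CPU without byte stores, each plain assignment compiles to a
read-modify-write of the containing word, so the two stores below
can silently undo each other:

	struct fqdir_like {		/* hypothetical, for illustration */
		bool dead;		/* written by CPU 0 */
		bool frozen;		/* added later, written by CPU 1 */
	};

	/* CPU 0: d->dead = true;   load word, set byte 0, store word
	 * CPU 1: d->frozen = true; load word, set byte 1, store word
	 * On old Alpha, one store can overwrite the other's update. */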

As it stands, my patch has zero impact on memory usage because
it's simply using existing padding. Should this become an issue
in future, we can always revisit this and use a more appropriate
method of addressing it.

But the point is to alert future developers that this field is
not an ordinary boolean.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2019-06-08 15:34:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> On Thu, 6 Jun 2019, Andrea Parri wrote:
>
> > This seems a sensible change to me: looking forward to seeing a patch,
> > on top of -rcu/dev, for further review and testing!
> >
> > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > discussed in this thread (maybe once the RCU code and the informal doc
> > will have settled in such direction).
>
> Yes. Also for SRCU. That point had not escaped me.

And it does seem pretty settled. There are quite a few examples where
there are normal accesses at either end of the RCU read-side critical
sections, for example, the one in the requirements diffs below.

For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
barriers since 2006. ;-)
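
For reference, the SRCU read-side pattern whose lock/unlock carry
those implied barriers looks like this (a usage sketch; the domain
and struct names are made up):

	#include <linux/srcu.h>

	struct foo { int value; };
	DEFINE_SRCU(my_srcu);		/* hypothetical SRCU domain */

	static int read_value(struct foo __rcu **gp)
	{
		int idx, v;

		idx = srcu_read_lock(&my_srcu);		/* implies barrier() */
		v = srcu_dereference(*gp, &my_srcu)->value;
		srcu_read_unlock(&my_srcu, idx);	/* implies barrier() */
		return v;
	}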

Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..080b39cc1dbb 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
<li> <a href="#Hotplug CPU">Hotplug CPU</a>.
<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
+<li> <a href="#Accesses to User Mamory and RCU">
+Accesses to User Mamory and RCU</a>.
<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
<li> <a href="#Scheduling-Clock Interrupts and RCU">
Scheduling-Clock Interrupts and RCU</a>.
@@ -2521,6 +2523,75 @@ cannot be used.
The tracing folks both located the requirement and provided the
needed fix, so this surprise requirement was relatively painless.

+<h3><a name="Accesses to User Mamory and RCU">
+Accesses to User Mamory and RCU</a></h3>
+
+<p>
+The kernel needs to access user-space memory, for example, to access
+data referenced by system-call parameters.
+The <tt>get_user()</tt> macro does this job.
+
+<p>
+However, user-space memory might well be paged out, which means
+that <tt>get_user()</tt> might well page-fault and thus block while
+waiting for the resulting I/O to complete.
+It would be a very bad thing for the compiler to reorder
+a <tt>get_user()</tt> invocation into an RCU read-side critical
+section.
+For example, suppose that the source code looked like this:
+
+<blockquote>
+<pre>
+ 1 rcu_read_lock();
+ 2 p = rcu_dereference(gp);
+ 3 v = p-&gt;value;
+ 4 rcu_read_unlock();
+ 5 get_user(user_v, user_p);
+ 6 do_something_with(v, user_v);
+</pre>
+</blockquote>
+
+<p>
+The compiler must not be permitted to transform this source code into
+the following:
+
+<blockquote>
+<pre>
+ 1 rcu_read_lock();
+ 2 p = rcu_dereference(gp);
+ 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+ 4 v = p-&gt;value;
+ 5 rcu_read_unlock();
+ 6 do_something_with(v, user_v);
+</pre>
+</blockquote>
+
+<p>
+If the compiler did make this transformation in a
+<tt>CONFIG_PREEMPT=n</tt> kernel build, and if <tt>get_user()</tt> did
+page fault, the result would be a quiescent state in the middle
+of an RCU read-side critical section.
+This misplaced quiescent state could result in line&nbsp;4 being
+a use-after-free access, which could be bad for your kernel's
+actuarial statistics.
+Similar examples can be constructed with the call to <tt>get_user()</tt>
+preceding the <tt>rcu_read_lock()</tt>.
+
+<p>
+Unfortunately, <tt>get_user()</tt> doesn't have any particular
+ordering properties, and in some architectures the underlying <tt>asm</tt>
+isn't even marked <tt>volatile</tt>.
+And even if it were marked <tt>volatile</tt>, the above access to
+<tt>p-&gt;value</tt> is not volatile, so the compiler would not have any
+reason to keep those two accesses in order.
+
+<p>
+Therefore, the Linux-kernel definitions of <tt>rcu_read_lock()</tt>
+and <tt>rcu_read_unlock()</tt> must act as compiler barriers,
+at least for outermost instances of <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> within a nested set of RCU read-side critical
+sections.
+
<h3><a name="Energy Efficiency">Energy Efficiency</a></h3>

<p>

2019-06-08 15:35:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas

On Fri, Jun 07, 2019 at 09:19:42AM -0700, Linus Torvalds wrote:
> On Fri, Jun 7, 2019 at 8:26 AM Eric Dumazet <[email protected]> wrote:
> >
> > It is common knowledge among us programmers that bit fields
> > (or bool) sharing a common 'word' need to be protected
> > with a common lock.
> >
> > Converting all bit fields to plain int/long would be quite a waste of memory.
>
> Yeah, and we really don't care about alpha. So 'char' should be safe.
>
> No compiler actually turns a 'bool' in a struct into a bitfield,
> afaik, because you're still supposed to be able to take the address of
> a boolean.
>
> But on the whole, I do not believe that we should ever use 'bool' in
> structures anyway, because it's such a badly defined type. I think
> it's 'char' in practice on just about all architectures, but there
> really were traditional use cases where 'bool' was int.
>
> But:
>
> - we shouldn't turn them into 'int' anyway - alpha is dead, and no
> sane architecture will make the same mistake anyway. People learnt.
>
> - we might want to make sure 'bool' really is 'char' in practice, to
> double-check that the compiler doesn't do anything stupid.
>
> - bitfields obviously do need locks. 'char' does not.
>
> If there's somebody who really notices the alpha issue in PRACTICE, we
> can then bother to fix it. But there is approximately one user, and
> it's not a heavy-duty one.

C11 and later compilers are supposed to use read-modify-write atomic
operations in this sort of situation anyway because they are not supposed
to introduce data races. So if this problem comes up, the fix should
be in GCC rather than the Linux kernel, right?

Thanx, Paul

2019-06-08 15:57:59

by Alan Stern

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sat, 8 Jun 2019, Paul E. McKenney wrote:

> On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> > On Thu, 6 Jun 2019, Andrea Parri wrote:
> >
> > > This seems a sensible change to me: looking forward to seeing a patch,
> > > on top of -rcu/dev, for further review and testing!
> > >
> > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > > discussed in this thread (maybe once the RCU code and the informal doc
> > > will have settled in such direction).
> >
> > Yes. Also for SRCU. That point had not escaped me.
>
> And it does seem pretty settled. There are quite a few examples where
> there are normal accesses at either end of the RCU read-side critical
> sections, for example, the one in the requirements diffs below.
>
> For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
> barriers since 2006. ;-)
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
> index 5a9238a2883c..080b39cc1dbb 100644
> --- a/Documentation/RCU/Design/Requirements/Requirements.html
> +++ b/Documentation/RCU/Design/Requirements/Requirements.html
> @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
> <li> <a href="#Hotplug CPU">Hotplug CPU</a>.
> <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
> <li> <a href="#Tracing and RCU">Tracing and RCU</a>.
> +<li> <a href="#Accesses to User Mamory and RCU">
------------------------------------^
> +Accesses to User Mamory and RCU</a>.
---------------------^
> <li> <a href="#Energy Efficiency">Energy Efficiency</a>.
> <li> <a href="#Scheduling-Clock Interrupts and RCU">
> Scheduling-Clock Interrupts and RCU</a>.
> @@ -2521,6 +2523,75 @@ cannot be used.
> The tracing folks both located the requirement and provided the
> needed fix, so this surprise requirement was relatively painless.
>
> +<h3><a name="Accesses to User Mamory and RCU">
----------------------------------^
> +Accesses to User Mamory and RCU</a></h3>
---------------------^

Are these issues especially notable for female programmers? :-)

Alan

2019-06-08 16:32:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: rcu_read_lock lost its compiler barrier

On Sat, Jun 08, 2019 at 11:56:04AM -0400, Alan Stern wrote:
> On Sat, 8 Jun 2019, Paul E. McKenney wrote:
>
> > On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> > > On Thu, 6 Jun 2019, Andrea Parri wrote:
> > >
> > > > This seems a sensible change to me: looking forward to seeing a patch,
> > > > on top of -rcu/dev, for further review and testing!
> > > >
> > > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > > > discussed in this thread (maybe once the RCU code and the informal doc
> > > > will have settled in such direction).
> > >
> > > Yes. Also for SRCU. That point had not escaped me.
> >
> > And it does seem pretty settled. There are quite a few examples where
> > there are normal accesses at either end of the RCU read-side critical
> > sections, for example, the one in the requirements diffs below.
> >
> > For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
> > barriers since 2006. ;-)
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
> > index 5a9238a2883c..080b39cc1dbb 100644
> > --- a/Documentation/RCU/Design/Requirements/Requirements.html
> > +++ b/Documentation/RCU/Design/Requirements/Requirements.html
> > @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
> > <li> <a href="#Hotplug CPU">Hotplug CPU</a>.
> > <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
> > <li> <a href="#Tracing and RCU">Tracing and RCU</a>.
> > +<li> <a href="#Accesses to User Mamory and RCU">
> ------------------------------------^
> > +Accesses to User Mamory and RCU</a>.
> ---------------------^
> > <li> <a href="#Energy Efficiency">Energy Efficiency</a>.
> > <li> <a href="#Scheduling-Clock Interrupts and RCU">
> > Scheduling-Clock Interrupts and RCU</a>.
> > @@ -2521,6 +2523,75 @@ cannot be used.
> > The tracing folks both located the requirement and provided the
> > needed fix, so this surprise requirement was relatively painless.
> >
> > +<h3><a name="Accesses to User Mamory and RCU">
> ----------------------------------^
> > +Accesses to User Mamory and RCU</a></h3>
> ---------------------^
>
> Are these issues especially notable for female programmers? :-)

*Red face* Some days it just doesn't pay to get up in the morning...

Well, those issues certainly seem a bit inconsiderate to non-mammalian
programmers. :-/

How about the updated version shown below?

Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..f04c467e55c5 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
<li> <a href="#Hotplug CPU">Hotplug CPU</a>.
<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
+<li> <a href="#Accesses to User Memory and RCU">
+Accesses to User Memory and RCU</a>.
<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
<li> <a href="#Scheduling-Clock Interrupts and RCU">
Scheduling-Clock Interrupts and RCU</a>.
@@ -2521,6 +2523,75 @@ cannot be used.
The tracing folks both located the requirement and provided the
needed fix, so this surprise requirement was relatively painless.

+<h3><a name="Accesses to User Memory and RCU">
+Accesses to User Memory and RCU</a></h3>
+
+<p>
+The kernel needs to access user-space memory, for example, to access
+data referenced by system-call parameters.
+The <tt>get_user()</tt> macro does this job.
+
+<p>
+However, user-space memory might well be paged out, which means
+that <tt>get_user()</tt> might well page-fault and thus block while
+waiting for the resulting I/O to complete.
+It would be a very bad thing for the compiler to reorder
+a <tt>get_user()</tt> invocation into an RCU read-side critical
+section.
+For example, suppose that the source code looked like this:
+
+<blockquote>
+<pre>
+ 1 rcu_read_lock();
+ 2 p = rcu_dereference(gp);
+ 3 v = p-&gt;value;
+ 4 rcu_read_unlock();
+ 5 get_user(user_v, user_p);
+ 6 do_something_with(v, user_v);
+</pre>
+</blockquote>
+
+<p>
+The compiler must not be permitted to transform this source code into
+the following:
+
+<blockquote>
+<pre>
+ 1 rcu_read_lock();
+ 2 p = rcu_dereference(gp);
+ 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+ 4 v = p-&gt;value;
+ 5 rcu_read_unlock();
+ 6 do_something_with(v, user_v);
+</pre>
+</blockquote>
+
+<p>
+If the compiler did make this transformation in a
+<tt>CONFIG_PREEMPT=n</tt> kernel build, and if <tt>get_user()</tt> did
+page fault, the result would be a quiescent state in the middle
+of an RCU read-side critical section.
+This misplaced quiescent state could result in line&nbsp;4 being
+a use-after-free access, which could be bad for your kernel's
+actuarial statistics.
+Similar examples can be constructed with the call to <tt>get_user()</tt>
+preceding the <tt>rcu_read_lock()</tt>.
+
+<p>
+Unfortunately, <tt>get_user()</tt> doesn't have any particular
+ordering properties, and in some architectures the underlying <tt>asm</tt>
+isn't even marked <tt>volatile</tt>.
+And even if it were marked <tt>volatile</tt>, the above access to
+<tt>p-&gt;value</tt> is not volatile, so the compiler would not have any
+reason to keep those two accesses in order.
+
+<p>
+Therefore, the Linux-kernel definitions of <tt>rcu_read_lock()</tt>
+and <tt>rcu_read_unlock()</tt> must act as compiler barriers,
+at least for outermost instances of <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> within a nested set of RCU read-side critical
+sections.
+
<h3><a name="Energy Efficiency">Energy Efficiency</a></h3>

<p>

2019-06-08 17:45:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas

On Sat, Jun 8, 2019 at 8:32 AM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Jun 07, 2019 at 09:19:42AM -0700, Linus Torvalds wrote:
> >
> > - bitfields obviously do need locks. 'char' does not.
> >
> > If there's somebody who really notices the alpha issue in PRACTICE, we
> > can then bother to fix it. But there is approximately one user, and
> > it's not a heavy-duty one.
>
> C11 and later compilers are supposed to use read-modify-write atomic
> operations in this sort of situation anyway because they are not supposed
> to introduce data races.

I don't think that's possible on any common architecture. The
bitfields themselves will need locking, to serialize writes of
different fields against each other.

There are no atomic rmw sequences that have reasonable performance for
the bitfield updates themselves.

The fields *around* the bitfields had better be safe, but that's
something we already depend on, and which falls under the heading of
"we don't accept garbage compilers".

Linus

2019-06-08 17:52:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas

On Sat, Jun 8, 2019 at 10:42 AM Linus Torvalds
<[email protected]> wrote:
>
> There are no atomic rmw sequences that have reasonable performance for
> the bitfield updates themselves.

Note that this is purely about the writing side. Reads of bitfield
values can be (and generally _should_ be) atomic, and hopefully C11
means that you wouldn't see intermediate values.

But I'm not convinced about that either: one natural way to update a
bitfield is to first do the masking, and then do the insertion of new
bits, so a bitfield assignment very easily exposes non-real values to
a concurrent read on another CPU.
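
Spelled out as C, the two-step lowering being described might look
like this (illustrative only; the word/mask layout is made up):

	/* Source: s->field1 = v;  with field1 in bits 8..12 of s->word. */
	s->word &= ~0x1f00;		/* step 1: mask -- a concurrent
					 * reader may now see field1 == 0,
					 * a value never actually assigned */
	s->word |= (v & 0x1f) << 8;	/* step 2: insert the new bits */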

What I think C11 is supposed to protect is from compilers doing
horribly bad things, and accessing bitfields with bigger types than
the field itself, ie when you have

struct {
char c;
int field1:5;
};

then a write to "field1" had better not touch "char c" as part of the
rmw operation, because that would indeed introduce a data-race with a
completely independent field that might have completely independent
locking rules.

But

struct {
int c:8;
int field1:5;
};

would not sanely have the same guarantees, even if the layout in
memory might be identical. Once you have bitfields next to each other,
and use a base type that means they can be combined together, they
can't be sanely modified without locking.

(And I don't know if C11 took up the "base type of the bitfield"
thing. Maybe you still need to use the ":0" thing to force alignment,
and maybe the C standards people still haven't made the underlying
type be meaningful other than for sign handling).

Linus

2019-06-08 18:16:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas

On Sat, Jun 08, 2019 at 10:42:41AM -0700, Linus Torvalds wrote:
> On Sat, Jun 8, 2019 at 8:32 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Fri, Jun 07, 2019 at 09:19:42AM -0700, Linus Torvalds wrote:
> > >
> > > - bitfields obviously do need locks. 'char' does not.
> > >
> > > If there's somebody who really notices the alpha issue in PRACTICE, we
> > > can then bother to fix it. But there is approximately one user, and
> > > it's not a heavy-duty one.
> >
> > C11 and later compilers are supposed to use read-modify-write atomic
> > operations in this sort of situation anyway because they are not supposed
> > to introduce data races.

Apologies, I should have explicitly stated that I was talking about char
stores, not bitfield stores. And last I checked, the C11 standard's
prohibition against data races did not extend to individual fields within
a bitfield. So, yes, for bitfields, the programmer must use a lock or
similar if it is necessary for updates to fields within a bitfield to
be atomic.

> I don't think that's possible on any common architecture. The
> bitfields themselves will need locking, to serialize writes of
> different fields against each other.

Yes, and again the C standard doesn't make any atomicity guarantees
regarding storing to different fields within a bitfield. The compiler is
free to assume that nothing else is happening anywhere in the bitfield
when storing to a field within that bitfield. Which gets back to your
"bitfields obviously do need locks", and it is of course the developer
(not the compiler) who must supply those locks. Plus a given lock must
cover the entire bitfield -- having one lock for half the fields within
a given bitfield and another lock for the other half will break.

Switching from bitfields to char, the C standard -does- require that
storing to one char must avoid even momentary corruption of adjacent
char, so given an old Alpha the compiler would need to use something
like an LL/SC loop. If it fails to do so, that compiler is failing to
comply with the standard.
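
Expressed in C rather than Alpha assembly, that sequence is a retry
loop on the containing word; here is a sketch using a GCC builtin in
place of the LL/SC pair, assuming little-endian byte numbering as on
Alpha:

	/* Store one byte without disturbing its neighbours on a machine
	 * whose smallest atomic store is a 32-bit word. */
	static void emulated_byte_store(unsigned char *p, unsigned char v)
	{
		unsigned int *word = (unsigned int *)((unsigned long)p & ~3UL);
		unsigned int shift = ((unsigned long)p & 3UL) * 8;
		unsigned int old, new;

		do {			/* the LL/SC-style retry loop */
			old = *word;
			new = (old & ~(0xffU << shift)) |
			      ((unsigned int)v << shift);
		} while (!__sync_bool_compare_and_swap(word, old, new));
	}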

> There are no atomic rmw sequences that have reasonable performance for
> the bitfield updates themselves.

Agreed, in the general case. In a few specific special cases, we do
sometimes hand-craft bitfields using shifts and masks, and sometimes
we use atomic RMW operations to update them. I suppose we could use
unions as an alternative, but it is not clear to me that this would
help anything.
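
A sketch of that hand-crafted style (flag names are hypothetical):
the "fields" live in an unsigned long and are updated with the
kernel's atomic bitops rather than C bitfields:

	#include <linux/bitops.h>

	#define MYDEV_DEAD	0	/* bit positions: a hand-rolled */
	#define MYDEV_FROZEN	1	/* "bitfield" layout */

	static void mark_dead(unsigned long *flags)
	{
		set_bit(MYDEV_DEAD, flags);	/* atomic RMW: safe without
						 * a lock even though the
						 * bits share one word */
	}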

> The fields *around* the bitfields had better be safe, but that's
> something we already depend on, and which falls under the heading of
> "we don't accept garbage compilers".

And the C standard does require the compiler to make that guarantee, so
for once the standard is even on our side. ;-)

Thanx, Paul

2019-06-08 18:53:34

by Paul E. McKenney

[permalink] [raw]
Subject: Re: inet: frags: Turn fqdir->dead into an int for old Alphas

On Sat, Jun 08, 2019 at 10:50:51AM -0700, Linus Torvalds wrote:
> On Sat, Jun 8, 2019 at 10:42 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > There are no atomic rmw sequences that have reasonable performance for
> > the bitfield updates themselves.
>
> Note that this is purely about the writing side. Reads of bitfield
> values can be (and generally _should_ be) atomic, and hopefully C11
> means that you wouldn't see intermediate values.
>
> But I'm not convinced about that either: one natural way to update a
> bitfield is to first do the masking, and then do the insertion of new
> bits, so a bitfield assignment very easily exposes non-real values to
> a concurrent read on another CPU.

Agreed on the "not convinced" part (though perhaps most implementations
would handle concurrent reads and writes involving different fields of
the same bitfield). And the C standard does not guarantee this, because
data races are defined in terms of memory locations. So as far as the
C standard is concerned, if there are two concurrent accesses to fields
within a bitfield that are not separated by ":0", there is a data race
and so the compiler can do whatever it wants.

But do we really care about this case?

> What I think C11 is supposed to protect is from compilers doing
> horribly bad things, and accessing bitfields with bigger types than
> the field itself, ie when you have
>
> struct {
> char c;
> int field1:5;
> };
>
> then a write to "field1" had better not touch "char c" as part of the
> rmw operation, because that would indeed introduce a data-race with a
> completely independent field that might have completely independent
> locking rules.
>
> But
>
> struct {
> int c:8;
> int field1:5;
> };
>
> would not sanely have the same guarantees, even if the layout in
> memory might be identical. Once you have bitfields next to each other,
> and use a base type that means they can be combined together, they
> can't be sanely modified without locking.
>
> (And I don't know if C11 took up the "base type of the bitfield"
> thing. Maybe you still need to use the ":0" thing to force alignment,
> and maybe the C standards people still haven't made the underlying
> type be meaningful other than for sign handling).

The C standard draft (n2310) gives similar examples:

EXAMPLE A structure declared as

struct {
char a;
int b:5, c:11,:0, d:8;
struct { int ee:8; } e;
}

contains four separate memory locations: The member a, and
bit-fields d and e.ee are each separate memory locations,
and can be modified concurrently without interfering with each
other. The bit-fields b and c together constitute the fourth
memory location. The bit-fields b and c cannot be concurrently
modified, but b and a, for example, can be.

So yes, ":0" still forces alignment to the next storage unit. And it
can be used to allow concurrent accesses to fields within a bitfield,
but only when those two fields are separated by ":0".

On the underlying type, according to J.3.9 of the current C working draft,
the following are implementation-defined behavior:

- Whether a "plain" int bit-field is treated as a signed int
bit-field or as an unsigned int bit-field (6.7.2, 6.7.2.1).

- Whether atomic types are permitted for bit-fields (6.7.2.1).

This last is strange because you are not allowed to take the address of
a bit field, and the various operations on atomic types take addresses.
Search me!

Thanx, Paul