LinuxLists.cc - [PATCH 17/17] RCU'd vfsmounts

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 3, 2013 at 10:44 AM, Al Viro <[email protected]> wrote:
>
> Anyway, I've done nicer variants of that protection for everything except
> fuse (hadn't gotten around to it yet); see vfs.git#experimental now:

Ok, I did a quick test, and it looks ok here, so looking good for 3.13.

However, the new smp_mb() in mntput_no_expire() is quite noticeable in
the path lookup stress-test profiles. And I'm not seeing what that
allegedly protects against, especially if mnt_ns is NULL (ie all the
common important cases).

Linus

2013-10-03 19:43:54

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 12:06:04PM -0700, Linus Torvalds wrote:
> On Thu, Oct 3, 2013 at 10:44 AM, Al Viro <[email protected]> wrote:
> >
> > Anyway, I've done nicer variants of that protection for everything except
> > fuse (hadn't gotten around to it yet); see vfs.git#experimental now:
>
> Ok, I did a quick test, and it looks ok here, so looking good for 3.13.
>
> However, the new smp_mb() in mntput_no_expire() is quite noticeable in
> the path lookup stress-test profiles. And I'm not seeing what that
> allegedly protects against, especially if mnt_ns is NULL (ie all the
> common important cases).

In the common case it's ->mnt_ns is *not* NULL; that's what we get if
the damn thing is still mounted.

What we need to avoid is this:

mnt_ns non-NULL, mnt_count is 2
CPU1: umount -l CPU2: mntput
umount_tree() clears mnt_ns
drop mount_lock.lock
namespace_unlock() calls mntput()
decrement mnt_count
see that mnt_ns is NULL
grab mount_lock.lock
check mnt_count
decrement mnt_count
see old value of mnt_ns
decide to bugger off
see it equal to 1 (i.e. miss decrement on CPU2)
decide to bugger off

The barrier in mntput() is to prevent that combination, so that either CPU2
would see mnt_ns cleared by CPU1, or CPU1 would see mnt_count decrement done
by CPU2. Its counterpart on CPU1 is provided by spin_unlock/spin_lock we've
done between clearing mnt_ns and checking mnt_count. Note that
synchronize_rcu() in namespace_unlock() and rcu_read_lock() in mntput() are
irrelevant here - the latter on CPU2 might very well have happened after the
former on CPU1, so umount -l did *not* wait for CPU2 to do anything.

Any suggestions re getting rid of that barrier?

2013-10-03 20:19:19

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 3, 2013 at 12:43 PM, Al Viro <[email protected]> wrote:
>
> In the common case it's ->mnt_ns is *not* NULL; that's what we get if
> the damn thing is still mounted.

Yeah, I misread the profile assembly code. The point being that the
nice fast case now has the smp_mb() in it, and it accounts for about
60% of the cost of that function on my performance profile.

> What we need to avoid is this:
>
> mnt_ns non-NULL, mnt_count is 2
> CPU1: umount -l CPU2: mntput
> umount_tree() clears mnt_ns
> drop mount_lock.lock
> namespace_unlock() calls mntput()
> decrement mnt_count
> see that mnt_ns is NULL
> grab mount_lock.lock
> check mnt_count
> decrement mnt_count
> see old value of mnt_ns
> decide to bugger off
> see it equal to 1 (i.e. miss decrement on CPU2)
> decide to bugger off
>
> The barrier in mntput() is to prevent that combination, so that either CPU2
> would see mnt_ns cleared by CPU1, or CPU1 would see mnt_count decrement done
> by CPU2. Its counterpart on CPU1 is provided by spin_unlock/spin_lock we've
> done between clearing mnt_ns and checking mnt_count. Note that
> synchronize_rcu() in namespace_unlock() and rcu_read_lock() in mntput() are
> irrelevant here - the latter on CPU2 might very well have happened after the
> former on CPU1, so umount -l did *not* wait for CPU2 to do anything.
>
> Any suggestions re getting rid of that barrier?

Hmm. The CPU2 mntput can only happen under RCU readlock, right? After
the RCU grace period _and_ if the umount is going ahead, nothing
should have a mnt pointer, right?

So I'm wondering if you couldn't just have a synchronize_rcu() in that
umount path, after clearing mnt_ns. At that point you _know_ you're
the only one that should have access to the mnt.

You'd need to drop the mount-hash lock for that. But I think you can
do it in umount_tree(), right? IOW, you could make the rule be that
umount_tree() must be called with the namespace lock and the
mount-hash lock, and it will drop both. Or does that get too painful
too?

Linus

2013-10-03 20:41:44

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 01:19:16PM -0700, Linus Torvalds wrote:

> Hmm. The CPU2 mntput can only happen under RCU readlock, right? After
> the RCU grace period _and_ if the umount is going ahead, nothing
> should have a mnt pointer, right?

umount -l doesn't care.

> So I'm wondering if you couldn't just have a synchronize_rcu() in that
> umount path, after clearing mnt_ns. At that point you _know_ you're
> the only one that should have access to the mnt.

We have it there. See namespace_unlock(). And you are right about the
locking rules for umount_tree(), except that caller is responsible
for dropping those. With (potentially final) mntput() happening after
both (well, as part of namespace_unlock(), done after synchronize_rcu()).

The problem is this:
A = 1, B = 1
CPU1:
A = 0
<full barrier>
synchronize_rcu()
read B

CPU2:
rcu_read_lock()
B = 0
read A

Are we guaranteed that we won't get both of them seeing ones, in situation
when that rcu_read_lock() comes too late to be noticed by synchronize_rcu()?

2013-10-03 20:52:48

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
>
> The problem is this:
> A = 1, B = 1
> CPU1:
> A = 0
> <full barrier>
> synchronize_rcu()
> read B
>
> CPU2:
> rcu_read_lock()
> B = 0
> read A
>
> Are we guaranteed that we won't get both of them seeing ones, in situation
> when that rcu_read_lock() comes too late to be noticed by synchronize_rcu()?

Yeah, I think we should be guaranteed that, because the
synchronize_rcu() will guarantee that all other CPU's go through an
idle period. So the "read A" on CPU2 cannot possibly see a 1 _unless_
it happens so early that synchronize_rcu() definitely sees it (ie it's
a "preexisting reader" by definition), in which case synchronize_rcu()
will be waiting for a subsequent idle period, in which case the B=0 on
CPU2 is not only guaranteed to happen but also be visible out, so the
"read B" on CPU1 will see 0. And that's true even if CPU2 doesn't have
an explicit memory barrier, because the "RCU idle" state implies that
it has gone through a barrier.

So I don't see how they could possibly see ones. Modulo terminal bugs
in synchronize_barrier() (which can be very slow, but for umount I
wouldn't worry). Or modulo my brain being fried.

Linus

2013-10-03 21:14:51

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:

> Yeah, I think we should be guaranteed that, because the
> synchronize_rcu() will guarantee that all other CPU's go through an
> idle period. So the "read A" on CPU2 cannot possibly see a 1 _unless_
> it happens so early that synchronize_rcu() definitely sees it (ie it's
> a "preexisting reader" by definition), in which case synchronize_rcu()
> will be waiting for a subsequent idle period, in which case the B=0 on
> CPU2 is not only guaranteed to happen but also be visible out, so the
> "read B" on CPU1 will see 0. And that's true even if CPU2 doesn't have
> an explicit memory barrier, because the "RCU idle" state implies that
> it has gone through a barrier.
>
> So I don't see how they could possibly see ones. Modulo terminal bugs
> in synchronize_barrier() (which can be very slow, but for umount I
> wouldn't worry). Or modulo my brain being fried.

There's one more place similar to that - kern_unmount(). There we also
go from "longterm vfsmount, mntput() doesn't need to bother checking"
to NULL ->mnt_ns. We can, of course, slap synchronize_rcu() there as
well, but that might make pid_ns and ipc_ns destruction slow...

2013-10-03 23:28:37

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
> On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
> >
> > The problem is this:
> > A = 1, B = 1
> > CPU1:
> > A = 0
> > <full barrier>
> > synchronize_rcu()
> > read B
> >
> > CPU2:
> > rcu_read_lock()
> > B = 0
> > read A
> >
> > Are we guaranteed that we won't get both of them seeing ones, in situation
> > when that rcu_read_lock() comes too late to be noticed by synchronize_rcu()?
>
> Yeah, I think we should be guaranteed that, because the
> synchronize_rcu() will guarantee that all other CPU's go through an
> idle period. So the "read A" on CPU2 cannot possibly see a 1 _unless_
> it happens so early that synchronize_rcu() definitely sees it (ie it's
> a "preexisting reader" by definition), in which case synchronize_rcu()
> will be waiting for a subsequent idle period, in which case the B=0 on
> CPU2 is not only guaranteed to happen but also be visible out, so the
> "read B" on CPU1 will see 0. And that's true even if CPU2 doesn't have
> an explicit memory barrier, because the "RCU idle" state implies that
> it has gone through a barrier.

I think the reasoning in one direction is actually quite a bit less
obvious than that.

rcu_read_unlock() does *not* necessarily imply a memory barrier (so the
B=0 can actually move logically outside the rcu_read_unlock()), but
synchronize_rcu() *does* imply (and enforce) that a memory barrier has
occurred on all CPUs as part of quiescence. However, likewise,
rcu_read_lock() doesn't imply anything in particular about writes; it
does enforce either that reads can't leak earlier or that if they do a
synchronize_rcu() will still wait for them, but I don't think the safety
interaction between a *write* in the RCU reader and a *read* in the RCU
writer necessarily follows from that enforcement.

(Also, to the best of my knowledge, you don't even need a barrier on
CPU1; synchronize_rcu() should imply one.)

If synchronize_rcu() on CPU1 sees rcu_read_lock() on CPU2, then
synchronize_rcu() will wait for CPU2's read-side critical section and a
memory barrier before reading B, so CPU1 will see B==0.

The harder direction: If synchronize_rcu() on CPU1 does not see
rcu_read_lock() on CPU2, then it won't necessarily wait for anything,
and since rcu_read_lock() itself does not imply any CPU write barriers,
it's not at all obvious that rcu_read_lock() prevents B=0 from occurring
before CPU1's read of B.

In short, the interaction between RCU's ordering guarantees and CPU
memory barriers in the presence of writes on the read side and reads on
the write side does not seem sufficiently clear to support the portable
use of the above pattern without an smp_wmb() on CPU2 between
rcu_read_lock() and B=0. I think it might happen to work with the
current implementations of RCU (with which synchronize_rcu() won't
actually notice a quiescent state and return until after either
the rcu_read_unlock() or a preemption point), but by the strict semantic
guarantees of the RCU primitives I think you could write a legitimate
RCU implementation that would break the above code.

That said, I believe this pattern *will* work with every existing
implementation of RCU. Thus, I'd suggest documenting it as a warning to
prospective RCU optimizers to avoid breaking the above pattern.

- Josh Triplett

2013-10-03 23:51:03

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 3, 2013 at 4:28 PM, Josh Triplett <[email protected]> wrote:
> On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
>> On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
>> >
>> > The problem is this:
>> > A = 1, B = 1
>> > CPU1:
>> > A = 0
>> > <full barrier>
>> > synchronize_rcu()
>> > read B
>> >
>> > CPU2:
>> > rcu_read_lock()
>> > B = 0
>> > read A
>> >
>> > Are we guaranteed that we won't get both of them seeing ones, in situation
>> > when that rcu_read_lock() comes too late to be noticed by synchronize_rcu()?
>>
>> Yeah, I think we should be guaranteed that, because the
>> synchronize_rcu() will guarantee that all other CPU's go through an
>> idle period. So the "read A" on CPU2 cannot possibly see a 1 _unless_
>> it happens so early that synchronize_rcu() definitely sees it (ie it's
>> a "preexisting reader" by definition), in which case synchronize_rcu()
>> will be waiting for a subsequent idle period, in which case the B=0 on
>> CPU2 is not only guaranteed to happen but also be visible out, so the
>> "read B" on CPU1 will see 0. And that's true even if CPU2 doesn't have
>> an explicit memory barrier, because the "RCU idle" state implies that
>> it has gone through a barrier.
>
> I think the reasoning in one direction is actually quite a bit less
> obvious than that.
>
> rcu_read_unlock() does *not* necessarily imply a memory barrier

Don't think of it in those terms.

The only thing that matters is semantics. The semantics of
synchronize_rcu() is that it needs to wait for all RCU users. It's
that simple. By definition, anything inside a "rcu_read_lock()" is a
RCU user, so if we have a read of memory (memory barrier or not), then
synchronize_rcu() needs to wait for it. Otherwise serialize_rcu() is
clearly totally broken.

Now, the fact that the normal rcu_read_lock() is just a compiler
barrier may make you think "oh, it cannot work", but the thing is, the
way things happens is that synchronize_rcu() ends up relying on the
_scheduler_ data structures, rather than anything else. It requires
seeing an idle scheduler state for each CPU after being called.

Linus

2013-10-04 00:41:17

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 04:51:00PM -0700, Linus Torvalds wrote:
> On Thu, Oct 3, 2013 at 4:28 PM, Josh Triplett <[email protected]> wrote:
> > On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
> >> On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
> >> >
> >> > The problem is this:
> >> > A = 1, B = 1
> >> > CPU1:
> >> > A = 0
> >> > <full barrier>
> >> > synchronize_rcu()
> >> > read B
> >> >
> >> > CPU2:
> >> > rcu_read_lock()
> >> > B = 0
> >> > read A
> >> >
> >> > Are we guaranteed that we won't get both of them seeing ones, in situation
> >> > when that rcu_read_lock() comes too late to be noticed by synchronize_rcu()?
> >>
> >> Yeah, I think we should be guaranteed that, because the
> >> synchronize_rcu() will guarantee that all other CPU's go through an
> >> idle period. So the "read A" on CPU2 cannot possibly see a 1 _unless_
> >> it happens so early that synchronize_rcu() definitely sees it (ie it's
> >> a "preexisting reader" by definition), in which case synchronize_rcu()
> >> will be waiting for a subsequent idle period, in which case the B=0 on
> >> CPU2 is not only guaranteed to happen but also be visible out, so the
> >> "read B" on CPU1 will see 0. And that's true even if CPU2 doesn't have
> >> an explicit memory barrier, because the "RCU idle" state implies that
> >> it has gone through a barrier.
> >
> > I think the reasoning in one direction is actually quite a bit less
> > obvious than that.
> >
> > rcu_read_unlock() does *not* necessarily imply a memory barrier
>
> Don't think of it in those terms.
>
> The only thing that matters is semantics. The semantics of
> synchronize_rcu() is that it needs to wait for all RCU users. It's
> that simple. By definition, anything inside a "rcu_read_lock()" is a
> RCU user, so if we have a read of memory (memory barrier or not), then
> synchronize_rcu() needs to wait for it. Otherwise serialize_rcu() is
> clearly totally broken.

Read, yes, but I don't think that's enough to force your example above
to work in all cases. That requires semantics beyond what RCU's
primitives guarantee, and I don't think you can draw conclusions about
those semantics without talking about CPU memory barriers.

synchronize_rcu() says it'll wait for any in-progress reader, which
includes whatever that reader might be doing. That guarantees one
direction of your example above. It's a tool for controlling what the
*reader* can observe. The vast majority of RCU usage models assume the
writers will use a lock to guard against each other, and that readers
don't make modifications. synchronize_rcu() doesn't make any particular
guarantees in the other direction about what the *writer* can observe,
especially in the case where synchronize_rcu() does not observe the
rcu_read_lock() and doesn't need to count that reader. (Also note that
B=0 is a blind write, with no read.)

The example above will work on x86 with any reasonable implementation
(due to write ordering guarantees), and it should work on any
architecture with all the current implementations of RCU in the Linux
kernel. If we want to require that all future implementations of RCU
allow the above example to work, we should document that example and
associated assumption for future reference, because I don't think the
semantics of the RCU primitives inherently guarantee it.

> Now, the fact that the normal rcu_read_lock() is just a compiler
> barrier may make you think "oh, it cannot work", but the thing is, the
> way things happens is that synchronize_rcu() ends up relying on the
> _scheduler_ data structures, rather than anything else. It requires
> seeing an idle scheduler state for each CPU after being called.

That's a detail of implementation. In practice, the required semantics
of synchronize_rcu() would be perfectly satisfied by an implementation
that can divine that no readers are currently running without waiting
for everyone to pass through the scheduler. synchronize_sched() means
"everyone has scheduled"; synchronize_rcu() *only* means "any readers
in-progress when I called synchronize_rcu() have finished when
synchronize_rcu() returns", which does not have to imply a trip through
the scheduler.

- Josh Triplett

2013-10-04 00:45:25

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 3, 2013 at 5:41 PM, Josh Triplett <[email protected]> wrote:
>
> Read, yes, but I don't think that's enough to force your example above
> to work in all cases. That requires semantics beyond what RCU's
> primitives guarantee, and I don't think you can draw conclusions about
> those semantics without talking about CPU memory barriers.

We seriosly depend on nothing leaking out. Not just reads. The "U" in
RCU is "update". So it's reads, and it's writes. The fact that it says
"read" in "rcu_read_lock()" doesn't mean that only reads would be
affected.

And no, this still has nothing to do with memory barriers. Every
single RCU user depends on the memory freeing being delayed by RCU,
for example. And again, that's not just reads. It's people taking
spinlocks on things that are RCU-protected etc.

So no, there is no question about this. The only question would be
whether we have some RCU mode that is _buggy_, not whether you need
extra memory barriers. And that is certainly possible.

Linus

2013-10-04 02:53:56

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 10:14:48PM +0100, Al Viro wrote:

> > So I don't see how they could possibly see ones. Modulo terminal bugs
> > in synchronize_barrier() (which can be very slow, but for umount I
> > wouldn't worry). Or modulo my brain being fried.
>
> There's one more place similar to that - kern_unmount(). There we also
> go from "longterm vfsmount, mntput() doesn't need to bother checking"
> to NULL ->mnt_ns. We can, of course, slap synchronize_rcu() there as
> well, but that might make pid_ns and ipc_ns destruction slow...

OK, fuse side of things done, smp_mb() in mntput_no_expire() dropped,
kern_umount() got synchronize_rcu() (I'm not happy about the last one,
but... hell knows; I want to see profiles before deciding what to do
about it).

Updated branch force-pushed. BTW, brlock defines can go after that;
we still two instances of lg_lock, but they spell the primitives
out instead of using br_{read,write}_lock aliases.

Speaking of those two - I really want to see file_table.c one killed.
Christoph, do you have anything along the lines of getting rid of the
mark_files_ro() nonsense? After all, a combination of r/w vfsmount
and a superblock with MS_RDONLY in flags should do about the right thing
these days... I can probably knock something together tomorrow, but
you've brought that thing up quite a few times, so if you happen to have
a patch more or less ready...

2013-10-04 05:30:07

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 04:28:27PM -0700, Josh Triplett wrote:
> On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
> > On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
> > >
> > > The problem is this:
> > > A = 1, B = 1
> > > CPU1:
> > > A = 0
> > > <full barrier>
> > > synchronize_rcu()
> > > read B
> > >
> > > CPU2:
> > > rcu_read_lock()
> > > B = 0
> > > read A

/me scratches his head...

OK, for CPU2 to see 1 from its read from A, the corresponding RCU
read-side critical section must have started before CPU1 did A=0. This
means that this same RCU read-side critical section must have started
before CPU1's synchronize_rcu(), which means that it must complete
before that synchronize_rcu() returns. Therefore, CPU2's B=0 must
execute before CPU1's read of B, hence that read of B must return zero.

Conversely, if CPU1's read from B returns 1, we know that CPU2's
RCU read-side critical section must not have completed until after
CPU1's synchronize_rcu() returned, which means that the RCU read-side
critical section must have started after that synchronize_rcu() started,
so CPU1's assignment to A must also have already happened. Therefore,
CPU2's read from A must return zero.

> > > Are we guaranteed that we won't get both of them seeing ones, in situation
> > > when that rcu_read_lock() comes too late to be noticed by synchronize_rcu()?

Yes, at least one of CPU1's and CPU2's reads must return zero.

For whatever it is worth, it is easier to talk about if you write it
this way (easier for me, anyway):

A = 1, B = 1

CPU1:
A = 0
<full barrier>
synchronize_rcu()
r1 = B

CPU2:
rcu_read_lock()
B = 0
r2 = A

Then we are guaranteed that r1==0||r2==0.

> > Yeah, I think we should be guaranteed that, because the
> > synchronize_rcu() will guarantee that all other CPU's go through an
> > idle period. So the "read A" on CPU2 cannot possibly see a 1 _unless_
> > it happens so early that synchronize_rcu() definitely sees it (ie it's
> > a "preexisting reader" by definition), in which case synchronize_rcu()
> > will be waiting for a subsequent idle period, in which case the B=0 on
> > CPU2 is not only guaranteed to happen but also be visible out, so the
> > "read B" on CPU1 will see 0. And that's true even if CPU2 doesn't have
> > an explicit memory barrier, because the "RCU idle" state implies that
> > it has gone through a barrier.

I agree that the "<full barrier>" has no effect in this case.

The memory-barrier guarantees of RCU's grace-period primitives are
documented in the excessively long header comment for synchronize_sched(),
which documents an extended LKML conversation between Oleg Nesterov
and myself. Here is a summary, which applies on systems with more than
one CPU:

o When synchronize_sched() returns, each CPU is guaranteed to have
executed a full memory barrier since the end of its last RCU-sched
read-side critical section whose beginning preceded the call
to synchronize_sched().

o Each CPU having an RCU read-side critical section that extends
beyond the return from synchronize_sched() is guaranteed
to have executed a full memory barrier after the beginning
of synchronize_sched() and before the beginning of that RCU
read-side critical section.

o If CPU A invoked synchronize_sched(), which returned to its
caller on CPU B, then both CPU A and CPU B are guaranteed to
have executed a full memory barrier during the execution of
synchronize_sched(). This also applies when CPUs A and B are
the same CPU.

These guarantees apply to -all- CPUs, even those that are currently
offline or idle. Analogous guarantees of course apply to all flavors
of RCU.

> I think the reasoning in one direction is actually quite a bit less
> obvious than that.
>
> rcu_read_unlock() does *not* necessarily imply a memory barrier (so the
> B=0 can actually move logically outside the rcu_read_unlock()), but
> synchronize_rcu() *does* imply (and enforce) that a memory barrier has
> occurred on all CPUs as part of quiescence. However, likewise,
> rcu_read_lock() doesn't imply anything in particular about writes; it
> does enforce either that reads can't leak earlier or that if they do a
> synchronize_rcu() will still wait for them, but I don't think the safety
> interaction between a *write* in the RCU reader and a *read* in the RCU
> writer necessarily follows from that enforcement.
>
> (Also, to the best of my knowledge, you don't even need a barrier on
> CPU1; synchronize_rcu() should imply one.)
>
> If synchronize_rcu() on CPU1 sees rcu_read_lock() on CPU2, then
> synchronize_rcu() will wait for CPU2's read-side critical section and a
> memory barrier before reading B, so CPU1 will see B==0.
>
> The harder direction: If synchronize_rcu() on CPU1 does not see
> rcu_read_lock() on CPU2, then it won't necessarily wait for anything,
> and since rcu_read_lock() itself does not imply any CPU write barriers,
> it's not at all obvious that rcu_read_lock() prevents B=0 from occurring
> before CPU1's read of B.
>
> In short, the interaction between RCU's ordering guarantees and CPU
> memory barriers in the presence of writes on the read side and reads on
> the write side does not seem sufficiently clear to support the portable
> use of the above pattern without an smp_wmb() on CPU2 between
> rcu_read_lock() and B=0. I think it might happen to work with the
> current implementations of RCU (with which synchronize_rcu() won't
> actually notice a quiescent state and return until after either
> the rcu_read_unlock() or a preemption point), but by the strict semantic
> guarantees of the RCU primitives I think you could write a legitimate
> RCU implementation that would break the above code.
>
> That said, I believe this pattern *will* work with every existing
> implementation of RCU. Thus, I'd suggest documenting it as a warning to
> prospective RCU optimizers to avoid breaking the above pattern.

I believe that the comment headers for synchronize_sched() and call_rcu()
do document the required restrictions. Could you please take a look at
them and let me know if I left some wiggle room that needs to be taken
care of?

Thanx, Paul

2013-10-04 06:03:15

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 10:29:59PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 03, 2013 at 04:28:27PM -0700, Josh Triplett wrote:
> > On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
> > > On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
> > > >
> > > > The problem is this:
> > > > A = 1, B = 1
> > > > CPU1:
> > > > A = 0
> > > > <full barrier>
> > > > synchronize_rcu()
> > > > read B
> > > >
> > > > CPU2:
> > > > rcu_read_lock()
> > > > B = 0
> > > > read A
>
> /me scratches his head...
>
> OK, for CPU2 to see 1 from its read from A, the corresponding RCU
> read-side critical section must have started before CPU1 did A=0. This
> means that this same RCU read-side critical section must have started
> before CPU1's synchronize_rcu(), which means that it must complete
> before that synchronize_rcu() returns. Therefore, CPU2's B=0 must
> execute before CPU1's read of B, hence that read of B must return zero.
>
> Conversely, if CPU1's read from B returns 1, we know that CPU2's
> RCU read-side critical section must not have completed until after
> CPU1's synchronize_rcu() returned, which means that the RCU read-side
> critical section must have started after that synchronize_rcu() started,
> so CPU1's assignment to A must also have already happened. Therefore,
> CPU2's read from A must return zero.

Yeah, that makes sense.

I think too much time spent staring at the *implementation* of RCU and
the exciting assumptions it has to make about barriers or memory
operations leaking out of the implementations of the RCU primitives (for
instance, the fun needed to guarantee a memory barrier on all CPUs, or
to safely use non-atomic operations inside RCU itself) makes it entirely
too difficult to look at a perfectly ordinary *use* of RCU primitives
and see the obvious. :)

- Josh Triplett

2013-10-04 06:15:11

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 11:03:05PM -0700, Josh Triplett wrote:
> On Thu, Oct 03, 2013 at 10:29:59PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 03, 2013 at 04:28:27PM -0700, Josh Triplett wrote:
> > > On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
> > > > On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
> > > > >
> > > > > The problem is this:
> > > > > A = 1, B = 1
> > > > > CPU1:
> > > > > A = 0
> > > > > <full barrier>
> > > > > synchronize_rcu()
> > > > > read B
> > > > >
> > > > > CPU2:
> > > > > rcu_read_lock()
> > > > > B = 0
> > > > > read A
> >
> > /me scratches his head...
> >
> > OK, for CPU2 to see 1 from its read from A, the corresponding RCU
> > read-side critical section must have started before CPU1 did A=0. This
> > means that this same RCU read-side critical section must have started
> > before CPU1's synchronize_rcu(), which means that it must complete
> > before that synchronize_rcu() returns. Therefore, CPU2's B=0 must
> > execute before CPU1's read of B, hence that read of B must return zero.
> >
> > Conversely, if CPU1's read from B returns 1, we know that CPU2's
> > RCU read-side critical section must not have completed until after
> > CPU1's synchronize_rcu() returned, which means that the RCU read-side
> > critical section must have started after that synchronize_rcu() started,
> > so CPU1's assignment to A must also have already happened. Therefore,
> > CPU2's read from A must return zero.
>
> Yeah, that makes sense.
>
> I think too much time spent staring at the *implementation* of RCU and
> the exciting assumptions it has to make about barriers or memory
> operations leaking out of the implementations of the RCU primitives (for
> instance, the fun needed to guarantee a memory barrier on all CPUs, or
> to safely use non-atomic operations inside RCU itself) makes it entirely
> too difficult to look at a perfectly ordinary *use* of RCU primitives
> and see the obvious. :)

I must confess that my first thought upon seeing Al's example was "but
of course CPU2's write to B and read from A can be reordered by either
the compiler or the CPU!" I had to look again myself. ;-)

Thanx, Paul

2013-10-04 06:41:12

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

* Linus Torvalds <[email protected]> wrote:

> On Thu, Oct 3, 2013 at 5:41 PM, Josh Triplett <[email protected]> wrote:
> >
> > Read, yes, but I don't think that's enough to force your example above
> > to work in all cases. That requires semantics beyond what RCU's
> > primitives guarantee, and I don't think you can draw conclusions about
> > those semantics without talking about CPU memory barriers.
>
> We seriosly depend on nothing leaking out. Not just reads. The "U" in
> RCU is "update". So it's reads, and it's writes. The fact that it says
> "read" in "rcu_read_lock()" doesn't mean that only reads would be
> affected.
>
> And no, this still has nothing to do with memory barriers. Every single
> RCU user depends on the memory freeing being delayed by RCU, for
> example. And again, that's not just reads. It's people taking spinlocks
> on things that are RCU-protected etc.
>
> So no, there is no question about this. The only question would be
> whether we have some RCU mode that is _buggy_, not whether you need
> extra memory barriers. And that is certainly possible.

Broken RCU modes are not unheard of, but Paul is extremely methodical
about testing and reviewing all the details - including formal proof
testing methods. There are lots of high-profile, high-frequency RCU users
in the kernel that make use of every aspect of RCU semantics and any
breakage would affect them as well.

There are over 2000 RCU critical sections in the kernel today, so the
likelyhood of the VFS triggering an unknown bug, without other users
breaking already, is fairly low. If it happens it will be fixed like other
RCU bugs.

So I really wouldn't worry about it too much. If you don't mind the
additional sys_umount() delay then RCU is goodness.

Thanks,

Ingo

2013-10-04 07:04:57

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Thu, Oct 03, 2013 at 11:15:03PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 03, 2013 at 11:03:05PM -0700, Josh Triplett wrote:
> > On Thu, Oct 03, 2013 at 10:29:59PM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 03, 2013 at 04:28:27PM -0700, Josh Triplett wrote:
> > > > On Thu, Oct 03, 2013 at 01:52:45PM -0700, Linus Torvalds wrote:
> > > > > On Thu, Oct 3, 2013 at 1:41 PM, Al Viro <[email protected]> wrote:
> > > > > >
> > > > > > The problem is this:
> > > > > > A = 1, B = 1
> > > > > > CPU1:
> > > > > > A = 0
> > > > > > <full barrier>
> > > > > > synchronize_rcu()
> > > > > > read B
> > > > > >
> > > > > > CPU2:
> > > > > > rcu_read_lock()
> > > > > > B = 0
> > > > > > read A
> > >
> > > /me scratches his head...
> > >
> > > OK, for CPU2 to see 1 from its read from A, the corresponding RCU
> > > read-side critical section must have started before CPU1 did A=0. This
> > > means that this same RCU read-side critical section must have started
> > > before CPU1's synchronize_rcu(), which means that it must complete
> > > before that synchronize_rcu() returns. Therefore, CPU2's B=0 must
> > > execute before CPU1's read of B, hence that read of B must return zero.
> > >
> > > Conversely, if CPU1's read from B returns 1, we know that CPU2's
> > > RCU read-side critical section must not have completed until after
> > > CPU1's synchronize_rcu() returned, which means that the RCU read-side
> > > critical section must have started after that synchronize_rcu() started,
> > > so CPU1's assignment to A must also have already happened. Therefore,
> > > CPU2's read from A must return zero.
> >
> > Yeah, that makes sense.
> >
> > I think too much time spent staring at the *implementation* of RCU and
> > the exciting assumptions it has to make about barriers or memory
> > operations leaking out of the implementations of the RCU primitives (for
> > instance, the fun needed to guarantee a memory barrier on all CPUs, or
> > to safely use non-atomic operations inside RCU itself) makes it entirely
> > too difficult to look at a perfectly ordinary *use* of RCU primitives
> > and see the obvious. :)
>
> I must confess that my first thought upon seeing Al's example was "but
> of course CPU2's write to B and read from A can be reordered by either
> the compiler or the CPU!" I had to look again myself. ;-)

Exactly.

- Josh Triplett

2013-10-04 08:37:37

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 17/17] RCU'd vfsmounts

On Fri, Oct 04, 2013 at 03:53:51AM +0100, Al Viro wrote:
> Speaking of those two - I really want to see file_table.c one killed.
> Christoph, do you have anything along the lines of getting rid of the
> mark_files_ro() nonsense? After all, a combination of r/w vfsmount
> and a superblock with MS_RDONLY in flags should do about the right thing
> these days... I can probably knock something together tomorrow, but
> you've brought that thing up quite a few times, so if you happen to have
> a patch more or less ready...

I used to have a patch for it that was also sent to the list long ago,
but it got rid of the sysrq emergency remount r/o, which people didn't like.

2013-10-04 12:58:50