2008-02-23 16:23:18

by Oleg Nesterov

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree

(s/mm-commits/lkml, cc Steven and Linus).

On 02/22, Andrew Morton wrote:
>
> From: Dmitry Adamushko <[email protected]>
>
> We must ensure that kthread_stop_info.k has been updated before kthread's
> wakeup. This is required to properly support the use of kthread_should_stop()
> in the main loop of kthread.
>
> wake_up_process() doesn't imply a full memory barrier,
> so we add an explicit one.

I tried to raise a similar issue twice, without success.

> --- a/kernel/kthread.c~kthread-add-a-missing-memory-barrier-to-kthread_stop
> +++ a/kernel/kthread.c
> @@ -53,6 +53,19 @@ static struct kthread_stop_info kthread_
> * When someone calls kthread_stop() on your kthread, it will be woken
> * and this will return true. You should then return, and your return
> * value will be passed through to kthread_stop().
> + *
> + * In order to safely use kthread_stop() for a kthread, there is a requirement
> + * on how its main loop has to be organized. Namely, the sequence of
> + * events that lead to kthread being blocked (schedule()) has to be
> + * ordered as follows:
> + *
> + * - set_current_state(TASK_INTERRUPTIBLE);
> + * - if (kthread_should_stop()) break;
> + * - schedule() or similar.
> + *
> + * set_current_state() implies a full memory barrier. kthread_stop()
> + * has a matching barrier right after an update of kthread_stop_info.k
> + * and before kthread's wakeup.
> */
> int kthread_should_stop(void)
> {
> @@ -211,6 +224,15 @@ int kthread_stop(struct task_struct *k)
>
> /* Now set kthread_should_stop() to true, and wake it up. */
> kthread_stop_info.k = k;
> +
> + /*
> + * We must ensure that kthread_stop_info.k has been updated before
> + * the following wakeup. This is required to properly support the use
> + * of kthread_should_stop() in the main loop of kthread
> + * (see description of kthread_should_stop() for more details).
> + */
> + smp_mb();
> +
> wake_up_process(k);
> put_task_struct(k);

I think we should fix wake_up_process() instead.

Please look at http://marc.info/?l=linux-kernel&m=118503598307267
and at this thread: http://marc.info/?t=116275561700001

In short: wake_up_process() doesn't imply mb(), this means that _in theory_
the commonly used code like

set_current_state(TASK_INTERRUPTIBLE);
if (CONDITION)
return;
schedule();

is racy wrt

CONDITION = 1;
wake_up_process(p);
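
To make the race concrete, a minimal sketch of both sides (the names
condition, waiter and waker are made up for illustration; the waiter is the
usual kthread-style loop):

static int condition;			/* shared flag, set by the waker */

static int waiter(void *unused)
{
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);	/* implies smp_mb() */
		if (condition)				/* LOAD(condition)  */
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

static void waker(struct task_struct *p)
{
	condition = 1;			/* STORE(condition)           */
	wake_up_process(p);		/* loads p->state internally  */
}

If nothing orders STORE(condition) against the load of p->state inside
try_to_wake_up(), the waker can see TASK_RUNNING and do nothing, while the
waiter still sees condition == 0 and calls schedule() -- a lost wakeup.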

I'll be happy to be wrong though, please correct me.

Oleg.


2008-02-23 17:37:22

by Steven Rostedt

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree


On Sat, 23 Feb 2008, Oleg Nesterov wrote:

> (s/mm-commits/lkml, cc Steven and Linus).

Thanks,

>
> On 02/22, Andrew Morton wrote:
> >
> > From: Dmitry Adamushko <[email protected]>
> >
> > We must ensure that kthread_stop_info.k has been updated before kthread's
> > wakeup. This is required to properly support the use of kthread_should_stop()
> > in the main loop of kthread.
> >
> > wake_up_process() doesn't imply a full memory barrier,
> > so we add an explicit one.
>
> I tried to raise a similar issue twice, without success.
>


> I think we should fix wake_up_process() instead.

At first I was thinking that this may be too much on such a highly used
API. But you may be right. I did a quick search on who uses
wake_up_process(). I randomly picked one: ptrace.

I think we have a bug there. And this was just by randomly looking at
it.

In kernel/ptrace.c: ptrace_resume

	child->exit_code = data;
	wake_up_process(child);


Again, there's no guarantee that exit_code will equal data when the child
wakes up.

And in something like do_syscall_trace, we have

	ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) ?
				 0x80 : 0));

	/*
	 * this isn't the same as continuing with a signal, but it will do
	 * for normal use. strace only continues with a signal if the
	 * stopping signal is not SIGTRAP. -brl
	 */
	if (current->exit_code) {


ptrace_notify() eventually calls ptrace_stop(), which schedules; this is what
gets woken up. There is a chance that current->exit_code will not equal the
data set in ptrace_resume(). That is, if wake_up_process() implies no barrier.

I would recommend having a wake_up_process() version that does not imply a
barrier, so we can keep straightforward wakeups fast and not add barriers
unnecessarily.
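
For illustration only, one possible shape of such a split (the _light name,
the barrier placement and the state mask -- meant to mirror the existing
wake_up_process() -- are all assumptions here, not an existing API):

/* cheap variant for wakeups that do not depend on prior stores */
int wake_up_process_light(struct task_struct *p)
{
	return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
				 TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 0);
}

/* default variant: callers that publish data before the wakeup get ordering */
int wake_up_process(struct task_struct *p)
{
	smp_mb();	/* order prior stores vs. the load of p->state */
	return wake_up_process_light(p);
}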

Good catch Oleg!

-- Steve


>
> Please look at http://marc.info/?l=linux-kernel&m=118503598307267
> and at this thread: http://marc.info/?t=116275561700001
>
> In short: wake_up_process() doesn't imply mb(), this means that _in theory_
> the commonly used code like
>
> set_current_state(TASK_INTERRUPTIBLE);
> if (CONDITION)
> return;
> schedule();
>
> is racy wrt
>
> CONDITION = 1;
> wake_up_process(p);
>
> I'll be happy to be wrong though, please correct me.
>
> Oleg.
>
>

2008-02-23 17:56:05

by Linus Torvalds

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree



On Sat, 23 Feb 2008, Oleg Nesterov wrote:
>
> In short: wake_up_process() doesn't imply mb(), this means that _in theory_
> the commonly used code like
>
> set_current_state(TASK_INTERRUPTIBLE);
> if (CONDITION)
> return;
> schedule();
>
> is racy wrt
>
> CONDITION = 1;
> wake_up_process(p);
>
> I'll be happy to be wrong though, please correct me.

Well, you should be wrong on x86, because the spinlock at the head of
wake_up_process() (well, "try_to_wake_up()" to be exact) will be a full
memory barrier.

But yeah, in general spinlocks can have weaker semantics, and let
preceding writes percolate into the critical section and thus past the
point that actually sets task->state.

And I do agree that we should *not* add a memory barrier in the caller
(that's just going to be really confusing for everybody, and make these
things much harder than they should be), and we should make sure that the
above sequence is always race-free.

I also think that a full memory barrier is overkill. We should be ok with
just adding a write barrier to the top of wake_up_process(), no? That way,
we know that the CONDITION will be seen on the other CPU before it sees
task->state change to TASK_RUNNING, so with the memory barrier in the
"set_current_state()", we know that the other side will see the new
condition _or_ it will see TASK_RUNNING when it hits schedule.

No?

(smp_wmb() also has the advantage of being a no-op on x86, where it's not
needed).

But memory ordering is hard. Maybe I'm not thinking right about this.

Linus

2008-02-23 18:19:22

by Oleg Nesterov

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree

On 02/23, Linus Torvalds wrote:
>
> On Sat, 23 Feb 2008, Oleg Nesterov wrote:
> >
> > In short: wake_up_process() doesn't imply mb(), this means that _in theory_
> > the commonly used code like
> >
> > set_current_state(TASK_INTERRUPTIBLE);
> > if (CONDITION)
> > return;
> > schedule();
> >
> > is racy wrt
> >
> > CONDITION = 1;
> > wake_up_process(p);
> >
> > I'll be happy to be wrong though, please correct me.
>
> Well, you should be wrong on x86, because the spinlock at the head of
> wake_up_process() (well, "try_to_wake_up()" to be exact) will be a full
> memory barrier.
>
> But yeah, in general spinlocks can have weaker semantics, and let
> preceding writes percolate into the critical section and thus past the
> point that actually sets task->state.

Yes. (I mean, this matches my understanding)

> And I do agree that we should *not* add a memory barrier in the caller
> (that's just going to be really confusing for everybody, and make these
> things much harder than they should be), and we should make sure that the
> above sequence is always race-free.
>
> I also think that a full memory barrier is overkill. We should be ok with
> just adding a write barrier to the top of wake_up_process(), no? That way,
> we know that the CONDITION will be seen on the other CPU before it sees
> task->state change to TASK_RUNNING, so with the memory barrier in the
> "set_current_state()", we know that the other side will see the new
> condition _or_ it will see TASK_RUNNING when it hits schedule.

Yes, but still I suspect wmb() is not enough. Note that try_to_wake_up()
first checks (reads) the task->state,

if (!(old_state & state))
goto out;

without the full mb() it is (in theory) possible that try_to_wake_up()
first reads TASK_RUNNING and only then sets CONDITION. IOW, STORE and
LOAD could be re-ordered.
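
Schematically, the waker's side without any added barrier (CONDITION stands
for whatever the caller sets before the wakeup; only the relevant lines of
try_to_wake_up() are shown):

	CONDITION = 1;			/* store issued before taking the lock */
	spin_lock(&rq->lock);		/* acquire: the store may sink past it */
	old_state = p->state;		/* the load inside try_to_wake_up()    */
	if (!(old_state & state))
		goto out;		/* may see TASK_RUNNING although the   */
					/* waiter has not yet seen CONDITION   */

Nothing in this sequence orders the store to CONDITION against the load of
p->state, so on a weakly ordered CPU the two can complete in either order.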

> (smp_wmb() also has the advantage of being a no-op on x86, where it's not
> needed).

Can't we introduce smp_mb_before_spinlock() (or whatever) as iirc Steven
suggested some time ago?


A bit offtopic, but let's take another example, __queue_work()->insert_work().
With some trivial modification we can move set_wq_data() outside of cwq->lock.
But according to linux's memory model we still need wmb(), which is a bit
annoying. Perhaps we can also add smp_wmb_before_spinlock(). Not sure this
is not too ugly though.
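
A rough sketch of the modification meant here, with set_wq_data() hoisted out
of the locked region (modelled on the current insert_work(); untested, only
to show where the wmb would still sit):

static void __queue_work(struct cpu_workqueue_struct *cwq,
			 struct work_struct *work)
{
	unsigned long flags;

	set_wq_data(work, cwq);		/* publish work->data ...              */
	smp_wmb();			/* ... before the work appears on      */
					/* cwq->worklist; still needed even    */
					/* with set_wq_data() outside the lock */
	spin_lock_irqsave(&cwq->lock, flags);
	list_add_tail(&work->entry, &cwq->worklist);
	wake_up(&cwq->more_work);
	spin_unlock_irqrestore(&cwq->lock, flags);
}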

Oleg.

2008-02-23 19:00:00

by Linus Torvalds

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree



On Sat, 23 Feb 2008, Oleg Nesterov wrote:
>
> Yes, but still I suspect wmb() is not enough. Note that try_to_wake_up()
> first checks (reads) the task->state,
>
> if (!(old_state & state))
> goto out;
>
> without the full mb() it is (in theory) possible that try_to_wake_up()
> first reads TASK_RUNNING and only then sets CONDITION. IOW, STORE and
> LOAD could be re-ordered.

No. The spinlock can have preceding stores (and loads, for that matter)
percolate *into* the locked region, but a spinlock can *not* have loads
(and stores) escape *out* of the region without being totally broken.

So the spinlock that precedes that load of "old_state" absolutely *must*
be a memory barrier as far as that load is concerned.

Spinlocks are just "one-way" permeable. They stop both reads and writes,
but they stop them from escaping up to earlier code (and the spin-unlock
in turn stops reads and writes within the critical section from escaping
down past the unlock).

But they definitely must stop that direction, and no loads or stores that
are inside the critical section can possibly be allowed to be done outside
of it, or the spinlock would be pointless.

[ Yeah, yeah, I guess that in theory you could serialize spinlocks only
against each other, and their serialization would be in a totally
different "domain" from normal reads and writes, and then the spinlock
wouldn't act as a barrier at all except for stuff that is literally
inside another spinlock, but I don't think anybody really does that ]

So I think a simple smp_wmb() should be ok.

That said, I do not *mind* the notion of "smp_mb_before/after_spinlock()",
and just have architectures do whatever they want (with x86 just being a
no-op). I don't think it's wrong in any way, and may be the really
generic solution for some theoretical case where locking is done not by
using the normal cache coherency, but over a separate lock network. But I
would suggest against adding new abstractions unless there are real cases
of it mattering.

(The biggest reason to do it in practice is probably architectures that
have a really slow smp_wmb() and that also have a spinlock that is already
serializing enough, but if this is only for one single use - in
try_to_wake_up() - even that doesn't really seem to be a very strong
argument).

> A bit offtopic, but let's take another example, __queue_work()->insert_work().
> With some trivial modification we can move set_wq_data() outside of cwq->lock.
> But according to linux's memory model we still need wmb(), which is a bit
> annoying. Perhaps we can also add smp_wmb_before_spinlock(). Not sure this
> is not too ugly though.

Is that really so performance-critical? We still need the spinlock for the
actual list manipulation, so why would we care?

And the smp_wmb() really should be free. It's literally free on x86, but
doing a write barrier is generally a cheap operation on any sane
architecture (ie generally cheaper than a read barrier or a full barrier:
it should mostly be a matter of inserting a magic entry in the write
buffers).

Linus

2008-02-23 19:37:08

by Oleg Nesterov

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree

On 02/23, Linus Torvalds wrote:
>
> On Sat, 23 Feb 2008, Oleg Nesterov wrote:
> >
> > Yes, but still I suspect wmb() is not enough. Note that try_to_wake_up()
> > first checks (reads) the task->state,
> >
> > if (!(old_state & state))
> > goto out;
> >
> > without the full mb() it is (in theory) possible that try_to_wake_up()
> > first reads TASK_RUNNING and only then sets CONDITION. IOW, STORE and
> > LOAD could be re-ordered.
>
> No. The spinlock can have preceding stores (and loads, for that matter)
> percolate *into* the locked region, but a spinlock can *not* have loads
> (and stores) escape *out* of the region without being totally broken.
>
> So the spinlock that precedes that load of "old_state" absolutely *must*
> be a memory barrier as far as that load is concerned.
>
> Spinlocks are just "one-way" permeable. They stop both reads and writes,
> but they stop them from escaping up to earlier code (and the spin-unlock
> in turn stops reads and writes within the critical section from escaping
> down past the unlock).
>
> But they definitely must stop that direction, and no loads or stores that
> are inside the critical section can possibly be allowed to be done outside
> of it, or the spinlock would be pointless.

Yes sure. But I meant that the STORE (set CONDITION) can leak into the critical
section, and it could be re-ordered with the LOAD (check ->state) _inside_
the critical section.

However,

> So I think a simple smp_wmb() should be ok.

now I think you are probably right. Because (please ack/nack my understanding)
this smp_wmb() ensures that the preceding STORE can't actually go into
the locked region.

> That said, I do not *mind* the notion of "smp_mb_before/after_spinlock()",
> and just have architectures do whatever they want (with x86 just being a
> no-op). I don't think it's wrong in any way, and may be the really
> generic solution for some theoretical case where locking is done not by
> using the normal cache coherency, but over a separate lock network. But I
> would suggest against adding new abstractions unless there are real cases
> of it mattering.
>
> (The biggest reason to do it in practice is probably architectures that
> have a really slow smp_wmb() and that also have a spinlock that is already
> serializing enough, but if this is only for one single use - in
> try_to_wake_up() - even that doesn't really seem to be a very strong
> argument).

Great.

> > A bit offtopic, but let's take another example, __queue_work()->insert_work().
> > With some trivial modification we can move set_wq_data() outside of cwq->lock.
> > But according to linux's memory model we still need wmb(), which is a bit
> > annoying. Perhaps we can also add smp_wmb_before_spinlock(). Not sure this
> > is not too ugly though.
>
> Is that really so performance-critical? We still need the spinlock for the
> actual list manipulation, so why would we care?
>
> And the smp_wmb() really should be free. It's literally free on x86, but
> doing a write barrier is generally a cheap operation on any sane
> architecture (ie generally cheaper than a read barrier or a full barrier:
> it should mostly be a matter of inserting a magic entry in the write
> buffers).

Yes, yes, this was just a "on a related note", thanks!

Oleg.

2008-02-23 19:41:27

by Dmitry Adamushko

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree

On 23/02/2008, Linus Torvalds <[email protected]> wrote:
> On Sat, 23 Feb 2008, Oleg Nesterov wrote:
> >
>
> > In short: wake_up_process() doesn't imply mb(), this means that _in theory_
> > the commonly used code like
> >
> > set_current_state(TASK_INTERRUPTIBLE);
> > if (CONDITION)
> > return;
> > schedule();
> >
> > is racy wrt
> >
> > CONDITION = 1;
> > wake_up_process(p);
> >
> > I'll be happy to be wrong though, please correct me.
>
>
> Well, you should be wrong on x86, because the spinlock at the head of
> wake_up_process() (well, "try_to_wake_up()" to be exact) will be a full
> memory barrier.
>
> But yeah, in general spinlocks can have weaker semantics, and let
> preceding writes percolate into the critical section and thus past the
> point that actually sets task->state.
>
> And I do agree that we should *not* add a memory barrier in the caller
> (that's just going to be really confusing for everybody, and make these
> things much harder than they should be), and we should make sure that the
> above sequence is always race-free.
>
> I also think that a full memory barrier is overkill. We should be ok with
> just adding a write barrier to the top of wake_up_process(), no?

No, wmb is not enough. I've provided an explanation in the original thread.
(http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/44c45685680585fc/e58785df0eeee6f8?lnk=raot)

Actually, there seems to be _no_ problem at all, provided a task to be
woken up is _not_ running on another CPU at the exact moment of
wakeup.

Why?

shared_data = new;
wake_up_task(p);

(1) try_to_wake_up() holds a lock of the runqueue on which 'p' is to-be-placed ;

(2) it's _guaranteed_ that 'shared_data' will be updated by the moment
any UNLOCK is called -- in our case, try_to_wake_up(p) calls
unlock(&rq->lock);

(3) for 'p' to start running, something must call schedule() -> which
will take 'rq->lock'... and 'rq' is the same as in (2).

IOW, 'p' can't start running until try_to_wake_up(p) releases
'rq->lock', and as said in (2), that implies that 'shared_data' will
be up-to-date by this moment.

IOW #2, 'rq->lock' is kind of a synchronization point/'barrier' in this case.

does it make sense now?

Another point is if 'p' is actually _running_ on another CPU at the
time when we do try_to_wake_up()... and I guess, the potential problem
is only relevant for situations like below:

CPU #0:

EIP_1 ---> (*) /* so 'p' is at this point now */

set_current_state(TASK_INTERRUPTIBLE);

if (shared_data != magic)
schedule();


CPU #1:

shared_data = magic;
try_to_wake_up(p);

now the problem is if 'p->state' (inside try_to_wake_up()) will be
loaded _before_ 'shared_data' has been updated.

recall, we are about to execute set_current_state(TASK_INTERRUPTIBLE)
on CPU #0...

so

p->state is still TASK_RUNNING, meaning that try_to_wake_up() just
exits ! (nothing to be done)

in the mean time,

on CPU #0: set_current_state(TASK_INTERRUPTIBLE) is called.

shared_data is checked _but_ it's still an old value (CPU #1 is still
inside try_to_wake_up(p))
so we call schedule() and go to sleep...

i.e. we actually lost a wakeup.

to sum it up, for the following scheme to work:

set_current_state(TASK_INTERRUPTIBLE); <--- here we have a smp_mb()
if (condition)
schedule();

effectively => (1) MODIFY(current->state) ; (2) LOAD(condition)

and a wake_up path must ensure access (LOAD or MODIFY) to the same
data happens in the _reverse_ order:

condition = new;
smp_mb();
try_to_wake_up();

=> (1) MODIFY(condition); (2) LOAD(current->state)

try_to_wake_up() does not need to be a full mb per se; the only
requirement (and only for situations like the above) is that there is a
full mb between possible write ops that have taken place before
try_to_wake_up() _and_ the load of p->state inside try_to_wake_up().

does it make sense #2 ? :-)

(yeah, maybe I'm just too paranoid :-)


>
> Linus
>

--
Best regards,
Dmitry Adamushko

2008-02-23 19:51:36

by Dmitry Adamushko

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree

On 23/02/2008, Linus Torvalds <[email protected]> wrote:
>
>
> On Sat, 23 Feb 2008, Oleg Nesterov wrote:
> >
>
> > Yes, but still I suspect wmb() is not enough. Note that try_to_wake_up()
> > first checks (reads) the task->state,
> >
> > if (!(old_state & state))
> > goto out;
> >
> > without the full mb() it is (in theory) possible that try_to_wake_up()
> > first reads TASK_RUNNING and only then sets CONDITION. IOW, STORE and
> > LOAD could be re-ordered.
>
>
> No. The spinlock can have preceding stores (and loads, for that matter)
> percolate *into* the locked region, but a spinlock can *not* have loads
> (and stores) escape *out* of the region withou being totally broken.

it's not a LOAD that escapes *out* of the region. It's a MODIFY that gets *in*:

(1)

MODIFY(a);

LOCK

LOAD(b);

UNLOCK


can become:

(2)

LOCK

MODIFY(a)
LOAD(b);

UNLOCK

and (reordered)

(3)

LOCK

LOAD(a)
MODIFY(b)

UNLOCK

and this last one is a problem. No?


>
> Linus
>

--
Best regards,
Dmitry Adamushko

2008-02-23 20:03:41

by Linus Torvalds

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree



On Sat, 23 Feb 2008, Dmitry Adamushko wrote:
>
> No, wmb is not enough. I've provided an explanation in the original thread.

Here, let me answer your explanation from this thread (lots snipped to
keep it concise):

First off:

> Actually, there seems to be _no_ problem at all, provided a task to be
> woken up is _not_ running on another CPU at the exact moment of
> wakeup.

Agreed. The runqueue spinlock will protect the case of the target actually
sleeping. So that's not the case that anybody has worried about.

So to look at the "concurrently running on another CPU case":

> to sum it up, for the following scheme to work:
>
> set_current_state(TASK_INTERRUPTIBLE); <--- here we have a smp_mb()
> if (condition)
> schedule();
>
> effectively => (1) MODIFY(current->state) ; (2) LOAD(condition)
>
> and a wake_up path must ensure access (LOAD or MODIFY) to the same
> data happens in the _reverse_ order:

Yes.

> condition = new;
> smp_mb();
> try_to_wake_up();
>
> => (1) MODIFY(condition); (2) LOAD(current->state)
>
> try_to_wake_up() does not need to be a full mb per se; the only
> requirement (and only for situations like the above) is that there is a
> full mb between possible write ops that have taken place before
> try_to_wake_up() _and_ the load of p->state inside try_to_wake_up().
>
> does it make sense #2 ? :-)

.. and this is why I think a smp_wmb() is sufficient.

The spinlock already guarantees that the load cannot move up past the
spinlock (that would make a spinlock pointless), and the smp_wmb()
guarantees that the store cannot move down past the spinlock.

Now, I realize that a smp_wmb() only protects a write against another
write, but if it's at the top of try_to_wake_up(), the "other write" in
question is the spinlock itself. So while the smp_wmb() doesn't enforce
the order between STORE(condition) and LOAD(current->state) *directly*, it
does end up doing so through the interaction with the spinlock.

At least I think so. Odd CPU memory ordering can actually break the notion
of "causality" (see, for example, the fact that we actually need to have
"smp_read_barrier_depends()" on some architectures), which is one reason
why it's hard to really think about them. I think I'm better than most on
memory ordering, but I won't guarantee that there cannot be some
architecture that could in theory break that

store -> wmb -> spinlock -> read

chain in some really odd way.

Note that the position of the wmb does matter: if it is *after* the
spinlock, then it has zero impact on the read. My argument literally
depends on the wmb serializing with the write of the spinlock itself.
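
In code form, schematically:

	CONDITION = 1;			/* caller's store                      */
	smp_wmb();			/* orders that store before ...        */
	spin_lock(&rq->lock);		/* ... the store that takes the lock,  */
	old_state = p->state;		/* and the lock keeps this load inside */

which is the store -> wmb -> spinlock -> read chain above; an smp_wmb()
placed after the spin_lock() would have zero impact on the load of p->state.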

Linus

2008-02-23 20:09:48

by Linus Torvalds

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree



On Sat, 23 Feb 2008, Dmitry Adamushko wrote:
>
> it's not a LOAD that escapes *out* of the region. It's a MODIFY that gets *in*:

Not with the smp_wmb(). That's the whole point.

I.e. the patch that I'm suggesting is sufficient is appended, and the point is
that any write before the critical region will be ordered wrt the critical
region because of the write barrier before the spinlock (which itself is a
write).

This is also why I mentioned that if you have a really odd architecture
that considers spinlocks to be "outside" the normal cache coherency
domain, that would be broken, but I cannot think of a single valid case of
that, and I consider it insane.

(There are supercomputers that have a separate "barrier" network that can
be used to do global ordering, but they don't generally do cache coherency
and dependable memory ordering at all, which is why they need the separate
barrier network in the first place. So that odd case isn't relevant to
this discussion, even if it's historically interesting ;^)

But I'd love to hear a good argument why this wouldn't work on some
architecture that we actually support (or might want to support in the
future).. Memory ordering is interesting (even if the only people who do
it right are Intel and AMD).

Linus

---
 kernel/sched.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f28f19e..3bceb3b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1831,6 +1831,7 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 	long old_state;
 	struct rq *rq;
 
+	smp_wmb();
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
 	if (!(old_state & state))

2008-02-23 21:04:22

by Oleg Nesterov

Subject: [PATCH, 3rd resend] documentation: atomic_add_unless() doesn't imply mb() on failure

(sorry for being offtpoic, but while experts are here...)

A "typical" implementation of atomic_add_unless() can return 0 immediately
after the first atomic_read() (before doing cmpxchg). In that case it doesn't
provide any barrier semantics. See include/asm-ia64/atomic.h as an example.
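
For reference, a sketch of the kind of implementation meant here (a cmpxchg
loop in the style of the generic/ia64 version of the time; not a verbatim
copy):

static inline int atomic_add_unless(atomic_t *v, int a, int u)
{
	int c, old;

	c = atomic_read(v);		/* plain read, no barrier implied */
	for (;;) {
		if (unlikely(c == u))
			break;		/* failure path: return without   */
					/* having executed any barrier    */
		old = atomic_cmpxchg(v, c, c + a);	/* implies mb()   */
		if (likely(old == c))
			break;
		c = old;
	}
	return c != u;
}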

We should either change the implementation, or fix the docs.

Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Nick Piggin <[email protected]>

Documentation/atomic_ops.txt | 3 ++-
Documentation/memory-barriers.txt | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)

--- t/Documentation/atomic_ops.txt~doc_aau	2008-01-14 23:43:11.000000000 +0300
+++ t/Documentation/atomic_ops.txt	2008-02-23 23:53:12.000000000 +0300
@@ -186,7 +186,8 @@ If the atomic value v is not equal to u,
 returns non zero. If v is equal to u then it returns zero. This is done as
 an atomic operation.

-atomic_add_unless requires explicit memory barriers around the operation.
+atomic_add_unless requires explicit memory barriers around the operation
+unless it fails (returns 0).

 atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0)

--- t/Documentation/memory-barriers.txt~doc_aau	2008-01-14 23:43:11.000000000 +0300
+++ t/Documentation/memory-barriers.txt	2008-02-23 23:53:12.000000000 +0300
@@ -1493,7 +1493,7 @@ explicit lock operations, described late
 	atomic_dec_and_test();
 	atomic_sub_and_test();
 	atomic_add_negative();
-	atomic_add_unless();
+	atomic_add_unless();	/* when succeeds (returns 1) */
 	test_and_set_bit();
 	test_and_clear_bit();
 	test_and_change_bit();

2008-02-23 21:08:19

by Dmitry Adamushko

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree

On 23/02/2008, Linus Torvalds <[email protected]> wrote:
>
> On Sat, 23 Feb 2008, Dmitry Adamushko wrote:
> >
> > it's not a LOAD that escapes *out* of the region. It's a MODIFY that gets *in*:
>
>
> Not with the smp_wmb(). That's the whole point.
>
> Ie the patch I'm suggesting is sufficient is appended, and the point is
> that any write before the critical region will be ordered wrt the critical
> region because of the write barrier before the spinlock (which itself is a
> write).

Yeah, good point!

(heh... I wouldn't dare to say this 'obvious thing' only to Anton
Blanchard who is "the only person who always 'has a point' by
definition" :-))

> This is also why I mentioned that if you have a really odd architecture
> that considers spinlocks to be "outside" the normal cache coherency
> domain, that would be broken, but I cannot think of a single valid case of
> that, and I consider it insane.

Yeah, some potential implementations come to mind but, I guess,
they are as far away from real hardware as science-fiction from
science :-/

So how should we proceed with this issue?

let's use your patch and declare try_to_wake_up() a 'full' mb for the case:

MODIFY
try_to_wake_up
LOAD or MODIFY (that take place either after or inside try_to_wake_up())

so we'll fix (lots of) potentially problematic cases with a single shot.

and

LOAD
try_to_wake_up()
LOAD or MODIFY

is probably not that common so we don't care.


--
Best regards,
Dmitry Adamushko

2008-02-25 14:04:32

by David Howells

Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch added to -mm tree

Dmitry Adamushko <[email protected]> wrote:

> (3)
>
> LOCK
>
> LOAD(a)
> MODIFY(b)
>
> UNLOCK
>
> and this last one is a problem. No?

I assume you meant:

LOCK
LOAD(b)
MODIFY(a)
UNLOCK

David