LinuxLists.cc - Possible memory ordering bug in page reclaim?

2005-10-15 03:27:57

Subject: Possible memory ordering bug in page reclaim?

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -511,7 +511,12 @@ static int shrink_list(struct list_head
* PageDirty _after_ making sure that the page is freeable and
* not in use by anybody. (pagecache + us == 2)
*/
- if (page_count(page) != 2 || PageDirty(page)) {
+ if (page_count(page) != 2) {
+ write_unlock_irq(&mapping->tree_lock);
+ goto keep_locked;
+ }
+ smp_rmb();
+ if (PageDirty(page)) {
write_unlock_irq(&mapping->tree_lock);
goto keep_locked;
}

Attachments:

mm-reclaim-memorder-fix.patch (604.00 B)

2005-10-15 06:18:28

by Hugh Dickins

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sat, 15 Oct 2005, Nick Piggin wrote:
>
> Is there anything that prevents PageDirty from theoretically being
> speculatively loaded before page_count here? (see patch)
>
> It would result in pagecache corruption in the following situation:
>
> 1 2
> find_get_page();
> write to page write_lock(tree_lock);
> SetPageDirty(); if (page_count != 2
> put_page(); || PageDirty())
>
> Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
page->flags
> 1, and page_count *after* put_page in 1.

I think you're right. But I'm the last person to ask
barrier/ordering questions of. CC'ed Ben and Andrea.

Hugh

> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -511,7 +511,12 @@ static int shrink_list(struct list_head
> * PageDirty _after_ making sure that the page is freeable and
> * not in use by anybody. (pagecache + us == 2)
> */
> - if (page_count(page) != 2 || PageDirty(page)) {
> + if (page_count(page) != 2) {
> + write_unlock_irq(&mapping->tree_lock);
> + goto keep_locked;
> + }
> + smp_rmb();
> + if (PageDirty(page)) {
> write_unlock_irq(&mapping->tree_lock);
> goto keep_locked;
> }

2005-10-15 07:44:34

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sat, 2005-10-15 at 07:17 +0100, Hugh Dickins wrote:
> On Sat, 15 Oct 2005, Nick Piggin wrote:
> >
> > Is there anything that prevents PageDirty from theoretically being
> > speculatively loaded before page_count here? (see patch)
> >
> > It would result in pagecache corruption in the following situation:
> >
> > 1 2
> > find_get_page();
> > write to page write_lock(tree_lock);
> > SetPageDirty(); if (page_count != 2
> > put_page(); || PageDirty())
> >
> > Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
> page->flags
> > 1, and page_count *after* put_page in 1.
>
> I think you're right. But I'm the last person to ask
> barrier/ordering questions of. CC'ed Ben and Andrea.

yup, now the question is wether PG_Dirty will be visible to CPU 2 before
the page count is decremented right ? That depends on put_page, I
suppose. If it's doing a simple atomic, there is an issue. But atomics
with return has been so often abused as locks that they may have been
implemented with a barrier... (On ppc64, it will do an eieio, thus I
think it should be ok).

There is also a problem the other way around. Write to page, then set
page dirty... those writes may be visible to CPU 2 (that is the page
content be dirty) before find_get_page even increased the page count,
unless there is a barrier in there too.

Paul, Anton ?

Ben.

2005-10-15 08:01:20

by Herbert Xu

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

Benjamin Herrenschmidt <[email protected]> wrote:
>
>> > 1 2
>> > find_get_page();
>> > write to page write_lock(tree_lock);
>> > SetPageDirty(); if (page_count != 2
>> > put_page(); || PageDirty())
>> >
>> > Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
>> page->flags
>> > 1, and page_count *after* put_page in 1.
>
> yup, now the question is wether PG_Dirty will be visible to CPU 2 before
> the page count is decremented right ? That depends on put_page, I
> suppose. If it's doing a simple atomic, there is an issue. But atomics
> with return has been so often abused as locks that they may have been
> implemented with a barrier... (On ppc64, it will do an eieio, thus I
> think it should be ok).

Yes atomic_add_negative should always be a barrier.

> There is also a problem the other way around. Write to page, then set
> page dirty... those writes may be visible to CPU 2 (that is the page
> content be dirty) before find_get_page even increased the page count,
> unless there is a barrier in there too.

find_get_page does a read_unlock_irq after the increment which also
serves as a barrier.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-10-15 08:58:24

by Nick Piggin

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

Benjamin Herrenschmidt wrote:
> On Sat, 2005-10-15 at 07:17 +0100, Hugh Dickins wrote:
>
>>On Sat, 15 Oct 2005, Nick Piggin wrote:
>>

>>>1 2
>>>find_get_page();
>>>write to page write_lock(tree_lock);
>>>SetPageDirty(); if (page_count != 2
>>>put_page(); || PageDirty())
>>>
>>>Now I'm worried that 2 might see PageDirty *before* SetPageDirty in
>>
>> page->flags
>>
>>>1, and page_count *after* put_page in 1.
>>
>>I think you're right. But I'm the last person to ask
>>barrier/ordering questions of. CC'ed Ben and Andrea.
>
>
> yup, now the question is wether PG_Dirty will be visible to CPU 2 before
> the page count is decremented right ? That depends on put_page, I
> suppose. If it's doing a simple atomic, there is an issue. But atomics
> with return has been so often abused as locks that they may have been
> implemented with a barrier... (On ppc64, it will do an eieio, thus I
> think it should be ok).
>

Well yes, that's on the store side (1, above). However can't a CPU
still speculatively (eg. guess the branch) load the page->flags
cacheline which might be satisfied from memory before the page->count
cacheline loads? Ie. you can still have the correct write ordering
but have incorrect read ordering?

Because neither PageDirty nor page_count is a barrier, and there is
no read barrier between them.

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-10-15 12:09:27

by Herbert Xu

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

Nick Piggin <[email protected]> wrote:
>
> Well yes, that's on the store side (1, above). However can't a CPU
> still speculatively (eg. guess the branch) load the page->flags
> cacheline which might be satisfied from memory before the page->count
> cacheline loads? Ie. you can still have the correct write ordering
> but have incorrect read ordering?
>
> Because neither PageDirty nor page_count is a barrier, and there is
> no read barrier between them.

Yes you're right. A read barrier is required here.

I think Ben was actually agreeing with you. He's just questioning
whether the corresponding write barrier existed on CPU 1 (the answer
to which is affirmative).

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-10-15 13:35:05

by Nick Piggin

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

In mm/vmscan.c, the page reclaim may have the following sequence 2
running concurrently with sequence 1 on another CPU:

1 2
find_get_page();
write to page write_lock(tree_lock);
SetPageDirty(); if (page_count != 2
put_page(); || PageDirty())
/* page dirty or busy */
else
/* free it */

The comment indicates that PageDirty must be checked *after* page_count
indicates there are no users of this page, which prevents the dirty bit
from being lost in the case that that sequence 2 might see the state of
PageDirty() *before* SetPageDirty() in 1, but page_count *after* put_page
in 1.

However, there is no read memory barrier there, and so nothing to stop a
CPU from loading page_count before PageDirty (ie. ->flags). Theoretically,
data corruption is possible.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -511,7 +511,12 @@ static int shrink_list(struct list_head
* PageDirty _after_ making sure that the page is freeable and
* not in use by anybody. (pagecache + us == 2)
*/
- if (page_count(page) != 2 || PageDirty(page)) {
+ if (page_count(page) != 2) {
+ write_unlock_irq(&mapping->tree_lock);
+ goto keep_locked;
+ }
+ smp_rmb();
+ if (PageDirty(page)) {
write_unlock_irq(&mapping->tree_lock);
goto keep_locked;
}

Attachments:

mm-reclaim-memorder-fix.patch (1.52 kB)

2005-10-15 16:58:10

by Linus Torvalds

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sat, 15 Oct 2005, Herbert Xu wrote:
>
> Benjamin Herrenschmidt <[email protected]> wrote:
>
> > yup, now the question is wether PG_Dirty will be visible to CPU 2 before
> > the page count is decremented right ? That depends on put_page, I
> > suppose. If it's doing a simple atomic, there is an issue. But atomics
> > with return has been so often abused as locks that they may have been
> > implemented with a barrier... (On ppc64, it will do an eieio, thus I
> > think it should be ok).
>
> Yes atomic_add_negative should always be a barrier.

I disagree. That would be very expensive on anything but x86, where it
just happens to be true for other reasons. Atomics do _not_ implement
barriers.

Normally, we really don't care about any other decrement than the one that
goes to zero, and that one tends to come with its own serialization logic.

In this case, we don't even care.

I agree, however, that it looks like PG_dirty is racy. Probably not in
practice, but still.

So I'd suggest adding a smp_wmb() into set_page_dirty, and the rmb where
Nick suggested.

So I'd suggest a patch something more like this.. Marking the dirty/count
cases unlikely too in mm/page-writeback.c, since we should have tested for
these conditions optimistically outside the lock.

Comments? Nick, did you have some test-case that you think might actually
have been impacted by this?

Linus

---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0166ea1..722779c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -668,8 +668,14 @@ int fastcall set_page_dirty(struct page
return (*spd)(page);
return __set_page_dirty_buffers(page);
}
- if (!PageDirty(page))
+ if (!PageDirty(page)) {
SetPageDirty(page);
+ /*
+ * Make sure the dirty bit percolates out
+ * to other CPU's before we release the page
+ */
+ smp_wmb();
+ }
return 0;
}
EXPORT_SYMBOL(set_page_dirty);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0ea71e8..64f9570 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -511,10 +511,11 @@ static int shrink_list(struct list_head
* PageDirty _after_ making sure that the page is freeable and
* not in use by anybody. (pagecache + us == 2)
*/
- if (page_count(page) != 2 || PageDirty(page)) {
- write_unlock_irq(&mapping->tree_lock);
- goto keep_locked;
- }
+ if (unlikely(page_count(page) != 2))
+ goto cannot_free;
+ smp_rmb();
+ if (unlikely(PageDirty(page)))
+ goto cannot_free;

#ifdef CONFIG_SWAP
if (PageSwapCache(page)) {
@@ -538,6 +539,10 @@ free_it:
__pagevec_release_nonlru(&freed_pvec);
continue;

+cannot_free:
+ write_unlock_irq(&mapping->tree_lock);
+ goto keep_locked;
+
activate_locked:
SetPageActive(page);
pgactivate++;

2005-10-15 18:00:22

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sat, Oct 15, 2005 at 10:08:08PM +1000, Herbert Xu wrote:
> Nick Piggin <[email protected]> wrote:
> >
> > Well yes, that's on the store side (1, above). However can't a CPU
> > still speculatively (eg. guess the branch) load the page->flags
> > cacheline which might be satisfied from memory before the page->count
> > cacheline loads? Ie. you can still have the correct write ordering
> > but have incorrect read ordering?
> >
> > Because neither PageDirty nor page_count is a barrier, and there is
> > no read barrier between them.
>
> Yes you're right. A read barrier is required here.

Even a write barrier is required on the left side, the read barrier on
the right side is useless if there is no write barrier on the left side.

Note that the barrier in atomic_add_negative is useless here because it
happens way too late, _after_ the count is decremented (not _before_)
so the decreased count could be already visible to the other cpu.

Not all archs are like x86 where a barrier happens implicitly both
before and after the instruction, and the way atomic_add_negative is
implemented the barrier from a common code point of view is only added
_after_ the instruction.

2005-10-15 19:30:24

by David Miller

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

From: Linus Torvalds <[email protected]>
Date: Sat, 15 Oct 2005 09:57:47 -0700 (PDT)

> On Sat, 15 Oct 2005, Herbert Xu wrote:
> >
> > Yes atomic_add_negative should always be a barrier.
>
> I disagree. That would be very expensive on anything but x86, where it
> just happens to be true for other reasons. Atomics do _not_ implement
> barriers.

When they return values, they are defined to be barriers.
It's even on the documentation :-)

2005-10-15 19:49:47

by Herbert Xu

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sat, Oct 15, 2005 at 06:00:18PM +0000, Andrea Arcangeli wrote:
>
> Note that the barrier in atomic_add_negative is useless here because it
> happens way too late, _after_ the count is decremented (not _before_)
> so the decreased count could be already visible to the other cpu.

Could you please point me to an architecture that does this?

This assumption is in fact made in a number of places in the kernel
where constructs such as atomic_add_negative or atomic_dec_and_test
are used and assumed to imply a memory barrier.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-10-15 20:07:05

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sun, Oct 16, 2005 at 05:48:55AM +1000, Herbert Xu wrote:
> On Sat, Oct 15, 2005 at 06:00:18PM +0000, Andrea Arcangeli wrote:
> >
> > Note that the barrier in atomic_add_negative is useless here because it
> > happens way too late, _after_ the count is decremented (not _before_)
> > so the decreased count could be already visible to the other cpu.
>
> Could you please point me to an architecture that does this?

sure see alpha:

__asm__ __volatile__(
"1: ldq_l %0,%1\n"
" addq %0,%3,%2\n"
" addq %0,%3,%0\n"
" stq_c %0,%1\n"
" beq %0,2f\n"
" mb\n"

the memory barrier is applied way after the write is visible to other
cpus, you can even get an irq before the mb and block there for some
usec.

>From a common code point of view, the barrier you mentioned in
atomic_add_negative is absolutely useless for this specific case
(setpagedirty+put_page)

2005-10-15 22:18:15

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sat, 2005-10-15 at 12:29 -0700, David S. Miller wrote:
> From: Linus Torvalds <[email protected]>
> Date: Sat, 15 Oct 2005 09:57:47 -0700 (PDT)
>
> > On Sat, 15 Oct 2005, Herbert Xu wrote:
> > >
> > > Yes atomic_add_negative should always be a barrier.
> >
> > I disagree. That would be very expensive on anything but x86, where it
> > just happens to be true for other reasons. Atomics do _not_ implement
> > barriers.
>
> When they return values, they are defined to be barriers.
> It's even on the documentation :-)

Ahhh, good to know :)

/me should read the documentation sometimes....

Ben.

2005-10-15 22:17:58

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

> Even a write barrier is required on the left side, the read barrier on
> the right side is useless if there is no write barrier on the left side.
>
> Note that the barrier in atomic_add_negative is useless here because it
> happens way too late, _after_ the count is decremented (not _before_)
> so the decreased count could be already visible to the other cpu.

Not on ppc64. Our atomic*return functions have a write barrier before
the atomic operation. I think there is just too much code out there that
either expects implicit barriers done by atomics (abuse them as locks)
or simply written by people who don't understand those ordering issues
(to be fair, they can be fairly brain damaging).

I agree this is not a good generic solution though.

> Not all archs are like x86 where a barrier happens implicitly both
> before and after the instruction, and the way atomic_add_negative is
> implemented the barrier from a common code point of view is only added
> _after_ the instruction.

2005-10-15 23:07:21

by David Miller

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

From: Andrea Arcangeli <[email protected]>
Date: Sat, 15 Oct 2005 22:07:01 +0200

> sure see alpha:
>
> __asm__ __volatile__(
> "1: ldq_l %0,%1\n"
> " addq %0,%3,%2\n"
> " addq %0,%3,%0\n"
> " stq_c %0,%1\n"
> " beq %0,2f\n"
> " mb\n"
>
> the memory barrier is applied way after the write is visible to other
> cpus, you can even get an irq before the mb and block there for some
> usec.

For atomic operations returning values, there must be a memory
barrier both before and after the atomic operation. This is
defined in Documentation/atomic_ops.txt, so Alpha needs to be
fixed to add a memory barrier at the beginning of these
assembler sequences.

2005-10-15 23:13:32

by David Miller

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

From: Benjamin Herrenschmidt <[email protected]>
Date: Sun, 16 Oct 2005 08:16:30 +1000

> > Even a write barrier is required on the left side, the read barrier on
> > the right side is useless if there is no write barrier on the left side.
> >
> > Note that the barrier in atomic_add_negative is useless here because it
> > happens way too late, _after_ the count is decremented (not _before_)
> > so the decreased count could be already visible to the other cpu.
>
> Not on ppc64.

Or sparc64, or ppc, or ... any platform that adheres to the
rules specified in Documentation/atomic_ops.txt :-)

One really needs the barriers both before and after the atomic
operation when you return a value, because this is the only way to get
sane semantics for dropping the final reference to some object. When
one does:

if (atomic_dec_and_test(&obj->count))
__free_stuff(obj);

All other modifications to object state before the atomic
decrement must reach global visibility _first_, and then
the atomic decrement itself must be strongly ordered wrt.
future memory operations.

That's why it is documented that the memory barriers both before and
after the atomic operation are required when the operation returns a
result value of some kind.

There is a lot of code in the kernel that depends upon this.

2005-10-16 00:04:12

by Nick Piggin

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

Linus Torvalds wrote:

> I agree, however, that it looks like PG_dirty is racy. Probably not in
> practice, but still.
>
> So I'd suggest adding a smp_wmb() into set_page_dirty, and the rmb where
> Nick suggested.
>
> So I'd suggest a patch something more like this.. Marking the dirty/count
> cases unlikely too in mm/page-writeback.c, since we should have tested for
> these conditions optimistically outside the lock.
>

As Dave suggested, I think there is too much other code that depends on
these atomics to be barriers for us to change it (at least not in this
patch! :)).

> Comments? Nick, did you have some test-case that you think might actually
> have been impacted by this?
>

I guess your vmscan.c hunks are slightly nicer, though I might put
'cannot_free' right at the end, because it will be a very uncommon case.

And no, I don't have a test case. In fact, I wouldn't be surprised if
nobody anywhere has ever hit it :) I was just browsing code...

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-10-16 19:36:29

by Ivan Kokshaysky

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sat, Oct 15, 2005 at 04:07:02PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Sat, 15 Oct 2005 22:07:01 +0200
>
> > sure see alpha:
> >
> > __asm__ __volatile__(
> > "1: ldq_l %0,%1\n"
> > " addq %0,%3,%2\n"
> > " addq %0,%3,%0\n"
> > " stq_c %0,%1\n"
> > " beq %0,2f\n"
> > " mb\n"
> >
> > the memory barrier is applied way after the write is visible to other
> > cpus, you can even get an irq before the mb and block there for some
> > usec.
>
> For atomic operations returning values, there must be a memory
> barrier both before and after the atomic operation. This is
> defined in Documentation/atomic_ops.txt, so Alpha needs to be
> fixed to add a memory barrier at the beginning of these
> assembler sequences.

My opinion is that we don't need a barrier even _after_ ll/sc sequences
on Alpha - it's redundant; perhaps it's just a historical baggage.
I strongly believe that ll/sc _does_ imply an SMP memory barrier, as can
be seen from instructions timing: a standalone mb takes tens of cycles,
the mb before/after ll/sc takes 0 to 1 cycle on ev6 (a bit more on ev56)
depending on instruction slotting.
Note that superfluous mb's around atomic stuff still can hurt -
Alpha mb instruction also flushes IO write buffers, so it can
be _extremely_ expensive under heavy IO...

Ivan.

2005-10-17 04:29:43

by David Miller

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

From: Ivan Kokshaysky <[email protected]>
Date: Sun, 16 Oct 2005 23:36:00 +0400

> My opinion is that we don't need a barrier even _after_ ll/sc sequences
> on Alpha - it's redundant; perhaps it's just a historical baggage.
> I strongly believe that ll/sc _does_ imply an SMP memory barrier, as can
> be seen from instructions timing: a standalone mb takes tens of cycles,
> the mb before/after ll/sc takes 0 to 1 cycle on ev6 (a bit more on ev56)
> depending on instruction slotting.
> Note that superfluous mb's around atomic stuff still can hurt -
> Alpha mb instruction also flushes IO write buffers, so it can
> be _extremely_ expensive under heavy IO...

Even a quick google tells me that on Alpha, LL/SC have implicit
barriers only with respect to the locations being acted upon
by the LL/SC, but don't have more general barrier properties.

This is in line with how PPC is defined as well.

I truly believe that you would be removing those Alpha memory barriers
only at your own peril. :-) And since the PPC and Alpha defined
semantics aparently match, I think it would be wise to add the missing
memory barriers from the front of the LL/SC sequences which need them,
just as PPC does.

I just spent 6 months tracking down a bug that turned out to be a
couple of missing memory barriers on sparc64. So, this has taught me
to be extra careful in this area :-) I really think this is an area
in which it's better to be safe than sorry.

2005-10-17 07:23:50

by Ivan Kokshaysky

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sun, Oct 16, 2005 at 09:29:17PM -0700, David S. Miller wrote:
> Even a quick google tells me that on Alpha, LL/SC have implicit
> barriers only with respect to the locations being acted upon
> by the LL/SC, but don't have more general barrier properties.

Argh, you are right - I should have read the manual before posting...

Ivan.

2005-10-17 09:31:49

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Possible memory ordering bug in page reclaim?

On Sun, Oct 16, 2005 at 11:36:00PM +0400, Ivan Kokshaysky wrote:
> Note that superfluous mb's around atomic stuff still can hurt -
> Alpha mb instruction also flushes IO write buffers, so it can
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's what needed to implement the wmb on alpha, this is why mb is
needed there and we need to add it at the top as well to comply with
docs (and especially for atomic_dec_and_test kind of usage like Dave
said).

Ivan I assume you'll take care of fixing it, thanks!