LinuxLists.cc - [PATCH] workqueue: Fix memory ordering race in queue

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

Hector Martin <[email protected]> wrote:
>
> This has been broken since the dawn of time, and it was incompletely
> fixed by 346c09f80459, which added the necessary barriers in the work
> execution path but failed to account for the missing barrier in the
> test_and_set_bit() failure case. Fix it by switching to
> atomic_long_fetch_or(), which does have unconditional barrier semantics
> regardless of whether the bit was already set or not (this is actually
> just test_and_set_bit() minus the early exit path).

test_and_set_bit is supposed to contain a full memory barrier.
If it doesn't then your arch is broken and needs to be fixed.

Changing this one spot is pointless because such assumptions
are all over the kernel.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2022-08-16 07:41:10

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

Tejun Heo <[email protected]> wrote:
>
> Oh, tricky one and yeah you're absolutely right that it makes no sense to
> not guarantee barrier semantics when already pending. I didn't even know
> test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting
> down and fixing this. Applied to wq/for-6.0-fixes.

Please revert this as test_and_set_bit was always supposed to be
a full memory barrier. This is an arch bug.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2022-08-16 08:24:08

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On 2022/08/16 14:52, Linus Torvalds wrote:
> I think I understand *why* it's broken - it looks like a "harmless"
> optimization. After all, if the bitop doesn't do anything, there's
> nothing to order it with.
>
> It makes a certain amount of sense - as long as you don't think about
> it too hard.
>
> The reason it is completely and utterly broken is that it's not
> actually just "the bitop doesn't do anything". Even when it doesn't
> change the bit value, just the ordering of the read of the old bit
> value can be meaningful, exactly for that case of "I added more work
> to the queue, I need to set the bit to tell the consumers, and if I'm
> the first person to set the bit I may need to wake the consumer up".

This is the same reason I argued queue_work() itself needs to have a
similar guarantee, even when it doesn't queue work (and I updated the
doc to match). If test_and_set_bit() is used in this kind of context
often in the kernel, clearly the current implementation/doc clashes with
that.

As I said, I don't have any particular beef in this fight, but this is
horribly broken on M1/2 right now, so I'll send a patch to change the
bitops instead and you all can fight it out over which way is correct :)

--
Hector Martin ([email protected])
Public Key: https://mrcn.st/pub

2022-08-16 08:25:25

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On 16/08/2022 14.27, Linus Torvalds wrote:
> On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
>>
>> Please revert this as test_and_set_bit was always supposed to be
>> a full memory barrier. This is an arch bug.
>
<snip>
> From looking at it, the asm-generic implementation is a bit
> questionable, though. In particular, it does
>
> if (READ_ONCE(*p) & mask)
> return 1;
>
> so it's *entirely* unordered for the "bit was already set" case.

These ops are documented in Documentation/atomic_bitops.txt as being
unordered in the failure ("bit was already set" case), and that matches
the generic implementation (which arm64 uses).

On the other hand, Twitter just pointed out that contradicting
documentation exists (I believe this was the source of the third party
doc I found that claimed it's always a barrier):

include/asm-generic/bitops/instrumented-atomic.h

So either one doc and the implementation are wrong, or the other doc is
wrong.

> --- a/include/asm-generic/bitops/atomic.h
> +++ b/include/asm-generic/bitops/atomic.h
> @@ -39,9 +39,6 @@ arch_test_and_set_bit(
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (READ_ONCE(*p) & mask)
> - return 1;
> -
> old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
> @@ -53,9 +50,6 @@ arch_test_and_clear_bit
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (!(READ_ONCE(*p) & mask))
> - return 0;
> -
> old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
>
> but the above is not just whitespace-damaged, it's entirely untested
> and based purely on me looking at that code.

That does work, I in fact tried that fix first to prove that the
early-exit was the problem.

I don't have a particularly strong opinion on which way is right here, I
just saw the implementation matched the docs (that I found) and Will
commented on the other thread implying this was all deliberate.

- Hector

2022-08-16 08:46:11

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
>
> Please revert this as test_and_set_bit was always supposed to be
> a full memory barrier. This is an arch bug.

Yes, the bitops are kind of strange for various legacy reasons:

- set/clear_bit is atomic, but without a memory barrier, and need a
"smp_mb__before_atomic()" or similar for barriers

- test_and_set/clear_bit() are atomic, _and_ are memory barriers

- test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
_weaker_ than full memory barriers, but sufficient for locking (ie
acquire/release)

Does any of this make any sense at all? No. But those are the
documented semantics exactly because people were worried about
test_and_set_bit being used for locking, since on x86 all the atomics
are also memory barriers.

From looking at it, the asm-generic implementation is a bit
questionable, though. In particular, it does

if (READ_ONCE(*p) & mask)
return 1;

so it's *entirely* unordered for the "bit was already set" case.

That looks very wrong to me, since it basically means that the
test_and_set_bit() can return "bit was already set" based on an old
value - not a memory barrier at all.

So if you use "test_and_set_bit()" as some kind of "I've done my work,
now I am going to set the bit to tell people to pick it up", then that
early "bit was already set" code completely breaks it.

Now, arguably our atomic bitop semantics are very very odd, and it
might be time to revisit them. But that code looks very very buggy to
me.

The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
fix looks to be as simple as just removing that early READ_ONCE return
case (test_and_clear has the same bug).

Will?

IOW, the proper fix for this should, I think, look something like this
(whitespace mangled) thing:

--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(
unsigned long mask = BIT_MASK(nr);

p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
- return 1;
-
old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
return !!(old & mask);
}
@@ -53,9 +50,6 @@ arch_test_and_clear_bit
unsigned long mask = BIT_MASK(nr);

p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
- return 0;
-
old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
return !!(old & mask);
}

but the above is not just whitespace-damaged, it's entirely untested
and based purely on me looking at that code.

Linus

2022-08-16 08:50:05

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Mon, Aug 15, 2022 at 10:36 PM Hector Martin <[email protected]> wrote:
>
> These ops are documented in Documentation/atomic_bitops.txt as being
> unordered in the failure ("bit was already set" case), and that matches
> the generic implementation (which arm64 uses).

Yeah, that documentation is pure garbage. We need to fix it.

I think that "unordered on failure" was added at the same time that
the generic implementation was rewritten.

IOW, the documentation was changed to match that broken
implementation, but it's clearly completely broken.

I think I understand *why* it's broken - it looks like a "harmless"
optimization. After all, if the bitop doesn't do anything, there's
nothing to order it with.

It makes a certain amount of sense - as long as you don't think about
it too hard.

The reason it is completely and utterly broken is that it's not
actually just "the bitop doesn't do anything". Even when it doesn't
change the bit value, just the ordering of the read of the old bit
value can be meaningful, exactly for that case of "I added more work
to the queue, I need to set the bit to tell the consumers, and if I'm
the first person to set the bit I may need to wake the consumer up".

> On the other hand, Twitter just pointed out that contradicting
> documentation exists (I believe this was the source of the third party
> doc I found that claimed it's always a barrier):

It's not just that other documentation exists - it's literally that
the unordered semantics don't even make sense, and don't match reality
and history.

And nobody thought about it or caught it at the time.

The Xen people seem to have noticed at some point, and tried to
introduce a "sync_set_set()"

> So either one doc and the implementation are wrong, or the other doc is
> wrong.

That doc and the generic implementation is clearly wrong.

Linus

2022-08-16 08:55:12

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
>
> The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> fix looks to be as simple as just removing that early READ_ONCE return
> case (test_and_clear has the same bug).
>
> Will?

I think this is the source of all this:

commit 61e02392d3c7ecac1f91c0a90a8043d67e081846
Author: Will Deacon <[email protected]>
Date: Tue Feb 13 13:30:19 2018 +0000

locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()

Unfortunately it doesn't work because lots of kernel code rely on
the memory barrier semantics of test_and_set_bit.

If ARM really wants this change, then eitehr create a new API
for it or audit every single existing use in the kernel.

Patching the documentation and then relying on it is magical thinking.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2022-08-16 08:55:25

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Mon, Aug 15, 2022 at 10:49 PM Herbert Xu <[email protected]> wrote:
>
> I think this is the source of all this: [ commit 61e02392d3c7 ]

Yes, that doc update seems to actually predate the broken code.

But you can actually see how that commit clearly was only meant to
change the "test_and_set_bit_lock()" case. IOW, in that commit, the
test_and_set_bit() comments clearly say

* test_and_set_bit - Set a bit and return its old value
* This operation is atomic and cannot be reordered.
* It may be reordered on other architectures than x86.
* It also implies a memory barrier.

(that "It may be reordered on other architectures than x86" does show
clear confusion, since it also says "It also implies a memory
barrier")

And the very commit that adds that

- RMW operations that are conditional are unordered on FAILURE,

language then updates the comment only above test_and_set_bit_lock().

And yes, on failure to lock, ordering doesn't matter. You didn't get
the bit lock, there is no ordering for you.

But then I think that mistake in the documentation (the relaxed
semantics was only for the "lock" version of the bitops, not for the
rest ot it) then later led to the mistake in the code.

End result: test_and_set_bit_lock() does indeed only imply an ordering
if it got the lock (ie it was successful).

But test_and_set_bit() itself is always "succesful". If the bit was
already set, that's not any indication of any "failure". It's just an
indication that it was set. Nothing more, nothing less, and the memory
barrier is still required regardless.

Linus

2022-08-16 09:09:23

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On 16/08/2022 13.14, Herbert Xu wrote:
> Hector Martin <[email protected]> wrote:
>>
>> This has been broken since the dawn of time, and it was incompletely
>> fixed by 346c09f80459, which added the necessary barriers in the work
>> execution path but failed to account for the missing barrier in the
>> test_and_set_bit() failure case. Fix it by switching to
>> atomic_long_fetch_or(), which does have unconditional barrier semantics
>> regardless of whether the bit was already set or not (this is actually
>> just test_and_set_bit() minus the early exit path).
>
> test_and_set_bit is supposed to contain a full memory barrier.
> If it doesn't then your arch is broken and needs to be fixed.
>
> Changing this one spot is pointless because such assumptions
> are all over the kernel.

Documentation/atomic_bitops.txt and the asm-generic implementaton
disagree with you, so this isn't quite as simple as "your arch is
broken" :-)

- Hector

2022-08-16 10:06:33

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Tue, Aug 16, 2022 at 03:28:50PM +0900, Hector Martin wrote:
>
> This is the same reason I argued queue_work() itself needs to have a
> similar guarantee, even when it doesn't queue work (and I updated the
> doc to match). If test_and_set_bit() is used in this kind of context
> often in the kernel, clearly the current implementation/doc clashes with
> that.

Kernel code all over the place rely on the fact that test_and_set_bit
provides a memory barrier. So this bug that you've discovered is
not at all isolated to the workqeueue system. It'll break the kernel
in lots of places in exactly the same way.

> As I said, I don't have any particular beef in this fight, but this is
> horribly broken on M1/2 right now, so I'll send a patch to change the
> bitops instead and you all can fight it out over which way is correct :)

Please do.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2022-08-16 10:56:30

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On 2022/08/16 16:48, Herbert Xu wrote:
> On Tue, Aug 16, 2022 at 03:28:50PM +0900, Hector Martin wrote:
>>
>> This is the same reason I argued queue_work() itself needs to have a
>> similar guarantee, even when it doesn't queue work (and I updated the
>> doc to match). If test_and_set_bit() is used in this kind of context
>> often in the kernel, clearly the current implementation/doc clashes with
>> that.
>
> Kernel code all over the place rely on the fact that test_and_set_bit
> provides a memory barrier. So this bug that you've discovered is
> not at all isolated to the workqeueue system. It'll break the kernel
> in lots of places in exactly the same way.

Now I'm surprised this isn't failing all over the place, given that...
these things are annoyingly subtle.

Still would want Will & Peter to chime in, of course.

>> As I said, I don't have any particular beef in this fight, but this is
>> horribly broken on M1/2 right now, so I'll send a patch to change the
>> bitops instead and you all can fight it out over which way is correct :)
>
> Please do.

Already did, but I just realized I forgot to Cc you. Sorry about that,
hope you can pick it up through the MLs:

https://lore.kernel.org/asahi/[email protected]/T/#u

- Hector

2022-08-16 13:57:10

by Will Deacon

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
> >
> > Please revert this as test_and_set_bit was always supposed to be
> > a full memory barrier. This is an arch bug.
>
> Yes, the bitops are kind of strange for various legacy reasons:
>
> - set/clear_bit is atomic, but without a memory barrier, and need a
> "smp_mb__before_atomic()" or similar for barriers
>
> - test_and_set/clear_bit() are atomic, _and_ are memory barriers
>
> - test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
> _weaker_ than full memory barriers, but sufficient for locking (ie
> acquire/release)
>
> Does any of this make any sense at all? No. But those are the
> documented semantics exactly because people were worried about
> test_and_set_bit being used for locking, since on x86 all the atomics
> are also memory barriers.
>
> From looking at it, the asm-generic implementation is a bit
> questionable, though. In particular, it does
>
> if (READ_ONCE(*p) & mask)
> return 1;
>
> so it's *entirely* unordered for the "bit was already set" case.
>
> That looks very wrong to me, since it basically means that the
> test_and_set_bit() can return "bit was already set" based on an old
> value - not a memory barrier at all.
>
> So if you use "test_and_set_bit()" as some kind of "I've done my work,
> now I am going to set the bit to tell people to pick it up", then that
> early "bit was already set" code completely breaks it.
>
> Now, arguably our atomic bitop semantics are very very odd, and it
> might be time to revisit them. But that code looks very very buggy to
> me.
>
> The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> fix looks to be as simple as just removing that early READ_ONCE return
> case (test_and_clear has the same bug).
>
> Will?

Right, this looks like it's all my fault, so sorry about that.

In an effort to replace the spinlock-based atomic bitops with a version
based on atomic instructions in e986a0d6cb36, I inadvertently added this
READ_ONCE() shortcut to test_and_set_bit() because at the time that's
what we had (incorrectly) documented in our attempts at cleaning things
up in this area. I confess that I have never been comfortable with the
comment for test_and_set_bit() prior to my problematic patch:

/**
* test_and_set_bit - Set a bit and return its old value
* @nr: Bit to set
* @addr: Address to count from
*
* This operation is atomic and cannot be reordered.
* It may be reordered on other architectures than x86.
* It also implies a memory barrier.
*/

so while Peter and I were trying to improve the documentation for
atomics and memory barriers we clearly ended up making the wrong call
trying to treat this like e.g. a cmpxchg() (which has the
unordered-on-failure semantics).

It's worth noting that with the spinlock-based implementation (i.e.
prior to e986a0d6cb36) then we would have the same problem on
architectures that implement spinlocks with acquire/release semantics;
accesses from outside of the critical section can drift in and reorder
with each other there, so the conversion looked legitimate to me in
isolation and I vaguely remember going through callers looking for
potential issues. Alas, I obviously missed this case.

So it looks to me like we need to:

1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
regardless of the value which is read from memory. The lock/unlock
flavours can remain as-is.

2. Fix the documentation

3. Figure out what to do about architectures building atomics out of
spinlocks (probably ok as lock+unlock == full barrier there?)

4. Accept my sincerest apologies for the mess!

> IOW, the proper fix for this should, I think, look something like this
> (whitespace mangled) thing:
>
> --- a/include/asm-generic/bitops/atomic.h
> +++ b/include/asm-generic/bitops/atomic.h
> @@ -39,9 +39,6 @@ arch_test_and_set_bit(
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (READ_ONCE(*p) & mask)
> - return 1;
> -
> old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
> @@ -53,9 +50,6 @@ arch_test_and_clear_bit
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (!(READ_ONCE(*p) & mask))
> - return 0;
> -
> old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
>
> but the above is not just whitespace-damaged, it's entirely untested
> and based purely on me looking at that code.

Yes, I think that's step 1, thanks! I'm a bit worried about the perf
numbers on the other thread, but we can get to the bottom of that
separately.

Will

2022-08-16 15:10:54

by Boqun Feng

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
> > On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
> > >
> > > Please revert this as test_and_set_bit was always supposed to be
> > > a full memory barrier. This is an arch bug.
> >
> > Yes, the bitops are kind of strange for various legacy reasons:
> >
> > - set/clear_bit is atomic, but without a memory barrier, and need a
> > "smp_mb__before_atomic()" or similar for barriers
> >
> > - test_and_set/clear_bit() are atomic, _and_ are memory barriers
> >
> > - test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
> > _weaker_ than full memory barriers, but sufficient for locking (ie
> > acquire/release)
> >
> > Does any of this make any sense at all? No. But those are the
> > documented semantics exactly because people were worried about
> > test_and_set_bit being used for locking, since on x86 all the atomics
> > are also memory barriers.
> >
> > From looking at it, the asm-generic implementation is a bit
> > questionable, though. In particular, it does
> >
> > if (READ_ONCE(*p) & mask)
> > return 1;
> >
> > so it's *entirely* unordered for the "bit was already set" case.
> >
> > That looks very wrong to me, since it basically means that the
> > test_and_set_bit() can return "bit was already set" based on an old
> > value - not a memory barrier at all.
> >
> > So if you use "test_and_set_bit()" as some kind of "I've done my work,
> > now I am going to set the bit to tell people to pick it up", then that
> > early "bit was already set" code completely breaks it.
> >
> > Now, arguably our atomic bitop semantics are very very odd, and it
> > might be time to revisit them. But that code looks very very buggy to
> > me.
> >
> > The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> > asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> > fix looks to be as simple as just removing that early READ_ONCE return
> > case (test_and_clear has the same bug).
> >
> > Will?
>
> Right, this looks like it's all my fault, so sorry about that.
>
> In an effort to replace the spinlock-based atomic bitops with a version
> based on atomic instructions in e986a0d6cb36, I inadvertently added this
> READ_ONCE() shortcut to test_and_set_bit() because at the time that's
> what we had (incorrectly) documented in our attempts at cleaning things
> up in this area. I confess that I have never been comfortable with the
> comment for test_and_set_bit() prior to my problematic patch:
>
> /**
> * test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> *
> * This operation is atomic and cannot be reordered.
> * It may be reordered on other architectures than x86.
> * It also implies a memory barrier.
> */
>
> so while Peter and I were trying to improve the documentation for
> atomics and memory barriers we clearly ended up making the wrong call
> trying to treat this like e.g. a cmpxchg() (which has the
> unordered-on-failure semantics).
>
> It's worth noting that with the spinlock-based implementation (i.e.
> prior to e986a0d6cb36) then we would have the same problem on
> architectures that implement spinlocks with acquire/release semantics;
> accesses from outside of the critical section can drift in and reorder
> with each other there, so the conversion looked legitimate to me in
> isolation and I vaguely remember going through callers looking for
> potential issues. Alas, I obviously missed this case.
>

I just to want to mention that although spinlock-based atomic bitops
don't provide the full barrier in test_and_set_bit(), but they don't
have the problem spotted by Hector, because test_and_set_bit() and
clear_bit() sync with each other via locks:

test_and_set_bit():
lock(..);
old = *p; // mask is already set by other test_and_set_bit()
*p = old | mask;
unlock(...);
clear_bit():
lock(..);
*p ~= mask;
unlock(..);

so "having a full barrier before test_and_set_bit()" may not be the
exact thing we need here, as long as a test_and_set_bit() can sync with
a clear_bit() uncondiontally, then the world is safe. For example, we
can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:

test_and_set_bit():
atomic_long_fetch_or_release(..); // pair with clear_bit()
// guarantee everything is
// observed.
clear_bit():
atomic_long_fetch_andnot_acquire(..);

, maybe that's somewhat cheaper than a full barrier implementation.

Thoughts? Just to find the exact ordering requirement for bitops.

Regards,
Boqun

> So it looks to me like we need to:
>
> 1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
> regardless of the value which is read from memory. The lock/unlock
> flavours can remain as-is.
>
> 2. Fix the documentation
>
> 3. Figure out what to do about architectures building atomics out of
> spinlocks (probably ok as lock+unlock == full barrier there?)
>
> 4. Accept my sincerest apologies for the mess!
>
> > IOW, the proper fix for this should, I think, look something like this
> > (whitespace mangled) thing:
> >
> > --- a/include/asm-generic/bitops/atomic.h
> > +++ b/include/asm-generic/bitops/atomic.h
> > @@ -39,9 +39,6 @@ arch_test_and_set_bit(
> > unsigned long mask = BIT_MASK(nr);
> >
> > p += BIT_WORD(nr);
> > - if (READ_ONCE(*p) & mask)
> > - return 1;
> > -
> > old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> > return !!(old & mask);
> > }
> > @@ -53,9 +50,6 @@ arch_test_and_clear_bit
> > unsigned long mask = BIT_MASK(nr);
> >
> > p += BIT_WORD(nr);
> > - if (!(READ_ONCE(*p) & mask))
> > - return 0;
> > -
> > old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> > return !!(old & mask);
> > }
> >
> > but the above is not just whitespace-damaged, it's entirely untested
> > and based purely on me looking at that code.
>
> Yes, I think that's step 1, thanks! I'm a bit worried about the perf
> numbers on the other thread, but we can get to the bottom of that
> separately.
>
> Will

2022-08-16 16:52:33

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

Hello,

On Tue, Aug 16, 2022 at 12:15:38PM +0800, Herbert Xu wrote:
> Tejun Heo <[email protected]> wrote:
> >
> > Oh, tricky one and yeah you're absolutely right that it makes no sense to
> > not guarantee barrier semantics when already pending. I didn't even know
> > test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting
> > down and fixing this. Applied to wq/for-6.0-fixes.
>
> Please revert this as test_and_set_bit was always supposed to be
> a full memory barrier. This is an arch bug.

Alright, reverting.

Thanks.

--
tejun

2022-08-16 16:57:34

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Tue, Aug 16, 2022 at 6:42 AM Will Deacon <[email protected]> wrote:
>
> > Will?
>
> Right, this looks like it's all my fault, so sorry about that.

It's interesting how this bug has existed for basically four years.

I suspect that the thing is fairly hard to hit in practice,
particularly on the kinds of devices most common (ie limited OoO
windows on most phone SoCs).

Together with phone loads probably not generally being all that
exciting from a kernel standpoint (a lot of driver work, probably not
a lot of workqueue stuff).

> 1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
> regardless of the value which is read from memory. The lock/unlock
> flavours can remain as-is.

I've applied Hector's "locking/atomic: Make test_and_*_bit() ordered
on failure".

> 2. Fix the documentation

That patch had a bit of a fix, but it is probably worth looking at more.

> 3. Figure out what to do about architectures building atomics out of
> spinlocks (probably ok as lock+unlock == full barrier there?)

Yeah, I wonder how we should describe the memory ordering here. We've
always punted on it, saying it's a "memory barrier", but that has been
partly a "look, that's the traditional x86 model".

And the traditional x86 model really sees the locked RMW operations as
being one single operation - in ways that most other architectures
don't.

So on x86, it's more than "this implies a memory barrier" - it's also
that there is no visible load-modify-store sequence where you can
start asking "what about the ordering of the _load_ wrt the preceding
memory operations".

That makes the x86 behavior very easy to think about, but means that
when you have bitops implemented other ways, you have questions that
are much harder to answer.

So in a Lock+read+op+write+unlock sequence, you can have preceding
operations move into the locked region, and mix with the read+op+write
side.

> 4. Accept my sincerest apologies for the mess!

I don't think you were the source of the mess. The *source* of the
mess is that we've always had very messy rules about the bitops in
particular.

I think the *code* is fixed (at least wrt the generic implementation,
I think the other models are up for discussion), but I think the real
issue is how we should describe the requirements.

So I think we have at least three cases we need to deal with:

(a) the people who just want atomics, and don't care about any
ordering. They are bound to exist.

(b) the people who want "acquire"-like semantics. I think the
existing test_and_set_bit_lock() is fine

(c) the people who want "release"-like semantics, where the
"test-and-set-bit" is for announcing "I'm done, was I the first one?".

(d) the full barrier case

I think we currently actively have (b) and (d), but it's possible that
the workqueue case really is only (c).

And I think the spinlock implementation really most naturally has that
(c) semantics - you don't necessarily get some theoretical full memory
barrier, but you *do* get those "handover" semantics.

Maybe we never really want or need (d) at all?

So I get the feeling that we really shouldn't specify
"test_and_set_bit()" in terms of memory barriers at all. I *think* it
would be more natural to describe them in terms of "handover" (ie
acquire/release), and then the spinlock semantics are obviously fine.

So I htink the code problem is easy, I think the real problem here has
always been bad documentation, and it would be really good to clarify
that.

Comments?

Linus

2022-08-16 17:08:44

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On 16/08/2022 23.55, Boqun Feng wrote:
> On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
>> It's worth noting that with the spinlock-based implementation (i.e.
>> prior to e986a0d6cb36) then we would have the same problem on
>> architectures that implement spinlocks with acquire/release semantics;
>> accesses from outside of the critical section can drift in and reorder
>> with each other there, so the conversion looked legitimate to me in
>> isolation and I vaguely remember going through callers looking for
>> potential issues. Alas, I obviously missed this case.
>>
>
> I just to want to mention that although spinlock-based atomic bitops
> don't provide the full barrier in test_and_set_bit(), but they don't
> have the problem spotted by Hector, because test_and_set_bit() and
> clear_bit() sync with each other via locks:
>
> test_and_set_bit():
> lock(..);
> old = *p; // mask is already set by other test_and_set_bit()
> *p = old | mask;
> unlock(...);
> clear_bit():
> lock(..);
> *p ~= mask;
> unlock(..);
>
> so "having a full barrier before test_and_set_bit()" may not be the
> exact thing we need here, as long as a test_and_set_bit() can sync with
> a clear_bit() uncondiontally, then the world is safe. For example, we
> can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:
>
> test_and_set_bit():
> atomic_long_fetch_or_release(..); // pair with clear_bit()
> // guarantee everything is
> // observed.
> clear_bit():
> atomic_long_fetch_andnot_acquire(..);
>
> , maybe that's somewhat cheaper than a full barrier implementation.
>
> Thoughts? Just to find the exact ordering requirement for bitops.

It's worth pointing out that the workqueue code does *not* pair
test_and_set_bit() with clear_bit(). It does an atomic_long_set()
instead (and then there are explicit barriers around it, which are
expected to pair with the implicit barrier in test_and_set_bit()). If we
define test_and_set_bit() to only sync with clear_bit() and not
necessarily be a true barrier, that breaks the usage of the workqueue code.

- Hector

2022-08-16 17:20:33

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Tue, Aug 16, 2022 at 9:22 AM Hector Martin <[email protected]> wrote:
>
> It's worth pointing out that the workqueue code does *not* pair
> test_and_set_bit() with clear_bit(). It does an atomic_long_set()
> instead

Yes. That code is much too subtle.

And yes, I think those barriers are

(a) misleading

(b) don't work with the "serialize bits using spinlock" model at all

It's a good example of "we need to really have a better model for this".

Linus

2022-08-16 17:22:26

by Boqun Feng

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On Wed, Aug 17, 2022 at 01:22:09AM +0900, Hector Martin wrote:
> On 16/08/2022 23.55, Boqun Feng wrote:
> > On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> >> It's worth noting that with the spinlock-based implementation (i.e.
> >> prior to e986a0d6cb36) then we would have the same problem on
> >> architectures that implement spinlocks with acquire/release semantics;
> >> accesses from outside of the critical section can drift in and reorder
> >> with each other there, so the conversion looked legitimate to me in
> >> isolation and I vaguely remember going through callers looking for
> >> potential issues. Alas, I obviously missed this case.
> >>
> >
> > I just to want to mention that although spinlock-based atomic bitops
> > don't provide the full barrier in test_and_set_bit(), but they don't
> > have the problem spotted by Hector, because test_and_set_bit() and
> > clear_bit() sync with each other via locks:
> >
> > test_and_set_bit():
> > lock(..);
> > old = *p; // mask is already set by other test_and_set_bit()
> > *p = old | mask;
> > unlock(...);
> > clear_bit():
> > lock(..);
> > *p ~= mask;
> > unlock(..);
> >
> > so "having a full barrier before test_and_set_bit()" may not be the
> > exact thing we need here, as long as a test_and_set_bit() can sync with
> > a clear_bit() uncondiontally, then the world is safe. For example, we
> > can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:
> >
> > test_and_set_bit():
> > atomic_long_fetch_or_release(..); // pair with clear_bit()
> > // guarantee everything is
> > // observed.
> > clear_bit():
> > atomic_long_fetch_andnot_acquire(..);
> >
> > , maybe that's somewhat cheaper than a full barrier implementation.
> >
> > Thoughts? Just to find the exact ordering requirement for bitops.
>
> It's worth pointing out that the workqueue code does *not* pair
> test_and_set_bit() with clear_bit(). It does an atomic_long_set()
> instead (and then there are explicit barriers around it, which are
> expected to pair with the implicit barrier in test_and_set_bit()). If we
> define test_and_set_bit() to only sync with clear_bit() and not
> necessarily be a true barrier, that breaks the usage of the workqueue code.
>

Ah, I miss that, but that means the old spinlock-based atomics are
totally broken unless spinlock means full barriers on these archs.

But still, if we define test_and_set_bit() as RELEASE atomic instead of
a full barrier + atomic, it should work for workqueue, right? Do we
actually need extra ordering here?

WRITE_ONCE(*x, 1); // A
test_and_set_bit(..); // a full barrier will order A & B
WRITE_ONCE(*y, 1); // B

That's something I want to figure out.

Regards,
Boqun
> - Hector

2022-08-16 17:23:58

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

Hello, Will.

On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> /**
> * test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> *
> * This operation is atomic and cannot be reordered.
> * It may be reordered on other architectures than x86.
> * It also implies a memory barrier.
> */
>
> so while Peter and I were trying to improve the documentation for
> atomics and memory barriers we clearly ended up making the wrong call
> trying to treat this like e.g. a cmpxchg() (which has the
> unordered-on-failure semantics).

I think the doc can be improved here. atomic_t.txt says under ORDERING:

- RMW operations that have a return value are fully ordered;

- RMW operations that are conditional are unordered on FAILURE,
otherwise the above rules apply.

But nothing spells out what's conditional. Maybe it's okay to expect people
to read this doc and extrapolate how it applies, but I think it'd be better
if we spell out clearly per operaiton so that readers can search for a
speicific operation and then follow what the rules are from there.

It bothers me that there doesn't seem to be a comprehensive
operation-indexed doc on the subject. memory-barrier.txt doesn't cover which
operations do what barriers. atomic_t.txt and atomic_bitops.txt cover the
atomic_t operations but not in a comprehensive or searchable manner (e.g.
does test_and_set_bit() return 0 or 1 on success?) and it's not clear where
to look for non-atomic_t atomic operations. I guess it's implied that they
follow the same rules as atomic_t counterparts but I can't seem to find that
spelled out anywhere. The source code docs are layered and dispersed for
generic and arch implemetnations making them difficult to follow.

It'd be awesome if the documentation situation can be improved.

Thanks.

--
tejun

2022-08-16 17:50:57

[permalink] [raw]

Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()

On 17/08/2022 01.26, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 16, 2022 at 12:15:38PM +0800, Herbert Xu wrote:
>> Tejun Heo <[email protected]> wrote:
>>>
>>> Oh, tricky one and yeah you're absolutely right that it makes no sense to
>>> not guarantee barrier semantics when already pending. I didn't even know
>>> test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting
>>> down and fixing this. Applied to wq/for-6.0-fixes.
>>
>> Please revert this as test_and_set_bit was always supposed to be
>> a full memory barrier. This is an arch bug.
>
> Alright, reverting.

FYI, Linus applied the alternate fix directly to his tree already, so
this is fixed on M1 either way. However, you may want to pay attention
to the discussion about the subtlety of the workqueue code pairing
test_and_set_bit() not with clear_bit(), but rather atomic_long_set(),
since it *could* imply it is still broken or might be broken in the
future, on other architectures.

- Hector

2022-08-17 05:34:32