The work queueing operation relies on the atomic test-and-set operation
that sets the PENDING bit behaving as a full barrier, even when it fails
(i.e. when the bit was already set).
Otherwise, the PENDING state may be observed before memory writes
pertaining to the work complete, as they are allowed to be reordered.
That can lead to work being processed before all prior writes are
observable, and no new work being queued to ensure they are observed at
some point.
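For illustration, the failure mode looks roughly like this (a simplified
sketch, not the actual workqueue code; "data" stands for whatever the
caller publishes before queueing):

    CPU#0 (queueing side)                 CPU#1 (executing @work)

    WRITE_ONCE(data, 1);                  r1 = READ_ONCE(data);  /* speculated early, reads 0 */
    test_and_set_bit(PENDING);            /* ...then clears PENDING with the
      /* fails: bit already set, no          proper barriers and finishes */
         barrier implied, so the store
         above may reach CPU#1 too late */

    Result: CPU#0 queues nothing new (it saw PENDING set), and the
    execution on CPU#1 completes with r1 == 0.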
This has been broken since the dawn of time, and it was incompletely
fixed by 346c09f80459, which added the necessary barriers in the work
execution path but failed to account for the missing barrier in the
test_and_set_bit() failure case. Fix it by switching to
atomic_long_fetch_or(), which does have unconditional barrier semantics
regardless of whether the bit was already set or not (this is actually
just test_and_set_bit() minus the early exit path).
Discovered [1] on Apple M1 platforms, which are ridiculously
out-of-order and managed to trigger this in the TTY core, of all places.
Easily reproducible by running this m1n1 client script on one M1 machine
connected to another one running the m1n1 bootloader in proxy mode:
=============
from m1n1.setup import *
i = 0
while True:
    a = iface.readmem(u.base, 1170)
    print(i)
    i += 1
=============
The script will hang when the TTY layer fails to push a buffer of data
into the ldisc in a timely manner in tty_flip_buffer_push(), which
writes a buffer pointer and then queue_work()s the ldisc push.
(Note: reproducibility depends on .config options)
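The producer-side pattern is roughly the following (a simplified sketch;
the exact field names and workqueue used by the TTY code don't matter
here):

    buf->tail = b;                               /* publish the buffer pointer */
    queue_work(system_unbound_wq, &buf->work);   /* kick the ldisc push        */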
Additionally, properly document that queue_work() has guarantees even
when work is already queued (it doesn't make any sense for it not to,
the comment in set_work_pool_and_clear_pending() already implies it
does, and the TTY core and probably quite a few other places rely on
this).
[1] https://lore.kernel.org/lkml/[email protected]/#t
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 346c09f80459 ("workqueue: fix ghost PENDING flag while doing MQ IO")
Cc: [email protected]
Signed-off-by: Hector Martin <[email protected]>
---
include/linux/workqueue.h | 15 ++++++++++-----
kernel/workqueue.c | 39 +++++++++++++++++++++++++++++++--------
2 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a0143dd24430..d9ea73813a3c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -484,18 +484,23 @@ extern void wq_worker_comm(char *buf, size_t size, struct task_struct *task);
* We queue the work to the CPU on which it was submitted, but if the CPU dies
* it can be processed by another CPU.
*
- * Memory-ordering properties: If it returns %true, guarantees that all stores
- * preceding the call to queue_work() in the program order will be visible from
- * the CPU which will execute @work by the time such work executes, e.g.,
+ * Memory-ordering properties: Guarantees that all stores preceding the call to
+ * queue_work() in the program order will be visible from the CPU which will
+ * execute @work by the time such work executes, e.g.,
*
* { x is initially 0 }
*
* CPU0 CPU1
*
* WRITE_ONCE(x, 1); [ @work is being executed ]
- * r0 = queue_work(wq, work); r1 = READ_ONCE(x);
+ * queue_work(wq, work); r0 = READ_ONCE(x);
*
- * Forbids: r0 == true && r1 == 0
+ * Forbids: r0 == 0 for the currently pending execution of @work after
+ * queue_work() completes.
+ *
+ * If @work was already pending (ret == false), that execution is guaranteed
+ * to observe x == 1. If @work was newly queued (ret == true), the newly
+ * queued execution is guaranteed to observe x == 1.
*/
static inline bool queue_work(struct workqueue_struct *wq,
struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeea9731ef80..01bc03eed649 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -655,7 +655,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
{
/*
* The following wmb is paired with the implied mb in
- * test_and_set_bit(PENDING) and ensures all updates to @work made
+ * atomic_long_fetch_or(PENDING) and ensures all updates to @work made
* here are visible to and precede any updates by the next PENDING
* owner.
*/
@@ -673,7 +673,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
*
* 1 STORE event_indicated
* 2 queue_work_on() {
- * 3 test_and_set_bit(PENDING)
+ * 3 atomic_long_fetch_or(PENDING)
* 4 } set_..._and_clear_pending() {
* 5 set_work_data() # clear bit
* 6 smp_mb()
@@ -688,6 +688,15 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
* finish the queued @work. Meanwhile CPU#1 does not see
* event_indicated is set, because speculative LOAD was executed
* before actual STORE.
+ *
+ * Line 3 requires barrier semantics, even on failure. If it were
+ * implemented with test_and_set_bit() (which does not have
+ * barrier semantics on failure), that would allow the STORE to
+ * be reordered after it, and it could be observed by CPU#1 after
+ * it has executed all the way through to line 8 (and cleared the
+ * PENDING bit in the process). At this point, CPU#0 would not have
+ * queued new work (having observed PENDING set), and CPU#1 would not
+ * have observed the event_indicated store in the last work execution.
*/
smp_mb();
}
@@ -1276,8 +1285,9 @@ static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
return 1;
}
- /* try to claim PENDING the normal way */
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
+ /* try to claim PENDING the normal way, see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING))
return 0;
rcu_read_lock();
@@ -1541,7 +1551,14 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
local_irq_save(flags);
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /*
+ * We need unconditional barrier semantics, even on failure,
+ * to avoid racing set_work_pool_and_clear_pending(). Hence,
+ * this has to be atomic_long_fetch_or(), not test_and_set_bit()
+ * which elides the barrier on failure.
+ */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
__queue_work(cpu, wq, work);
ret = true;
}
@@ -1623,7 +1640,9 @@ bool queue_work_node(int node, struct workqueue_struct *wq,
local_irq_save(flags);
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /* see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
int cpu = workqueue_select_cpu_near(node);
__queue_work(cpu, wq, work);
@@ -1697,7 +1716,9 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
/* read the comment in __queue_work() */
local_irq_save(flags);
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /* see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
__queue_delayed_work(cpu, wq, dwork, delay);
ret = true;
}
@@ -1769,7 +1790,9 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
{
struct work_struct *work = &rwork->work;
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /* see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
rwork->wq = wq;
call_rcu(&rwork->rcu, rcu_work_rcufn);
return true;
--
2.35.1
On Tue, Aug 16, 2022 at 02:58:10AM +0900, Hector Martin wrote:
> This has been broken since the dawn of time, and it was incompletely
> fixed by 346c09f80459, which added the necessary barriers in the work
> execution path but failed to account for the missing barrier in the
> test_and_set_bit() failure case. Fix it by switching to
> atomic_long_fetch_or(), which does have unconditional barrier semantics
> regardless of whether the bit was already set or not (this is actually
> just test_and_set_bit() minus the early exit path).
...
Oh, tricky one and yeah you're absolutely right that it makes no sense to
not guarantee barrier semantics when already pending. I didn't even know
test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting
down and fixing this. Applied to wq/for-6.0-fixes.
Thanks.
--
tejun
Hector Martin <[email protected]> wrote:
>
> This has been broken since the dawn of time, and it was incompletely
> fixed by 346c09f80459, which added the necessary barriers in the work
> execution path but failed to account for the missing barrier in the
> test_and_set_bit() failure case. Fix it by switching to
> atomic_long_fetch_or(), which does have unconditional barrier semantics
> regardless of whether the bit was already set or not (this is actually
> just test_and_set_bit() minus the early exit path).
test_and_set_bit is supposed to contain a full memory barrier.
If it doesn't then your arch is broken and needs to be fixed.
Changing this one spot is pointless because such assumptions
are all over the kernel.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Tejun Heo <[email protected]> wrote:
>
> Oh, tricky one and yeah you're absolutely right that it makes no sense to
> not guarantee barrier semantics when already pending. I didn't even know
> test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting
> down and fixing this. Applied to wq/for-6.0-fixes.
Please revert this as test_and_set_bit was always supposed to be
a full memory barrier. This is an arch bug.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On 2022/08/16 14:52, Linus Torvalds wrote:
> I think I understand *why* it's broken - it looks like a "harmless"
> optimization. After all, if the bitop doesn't do anything, there's
> nothing to order it with.
>
> It makes a certain amount of sense - as long as you don't think about
> it too hard.
>
> The reason it is completely and utterly broken is that it's not
> actually just "the bitop doesn't do anything". Even when it doesn't
> change the bit value, just the ordering of the read of the old bit
> value can be meaningful, exactly for that case of "I added more work
> to the queue, I need to set the bit to tell the consumers, and if I'm
> the first person to set the bit I may need to wake the consumer up".
This is the same reason I argued queue_work() itself needs to have a
similar guarantee, even when it doesn't queue work (and I updated the
doc to match). If test_and_set_bit() is used in this kind of context
often in the kernel, clearly the current implementation/doc clashes with
that.
As I said, I don't have any particular beef in this fight, but this is
horribly broken on M1/2 right now, so I'll send a patch to change the
bitops instead and you all can fight it out over which way is correct :)
--
Hector Martin ([email protected])
Public Key: https://mrcn.st/pub
On 16/08/2022 14.27, Linus Torvalds wrote:
> On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
>>
>> Please revert this as test_and_set_bit was always supposed to be
>> a full memory barrier. This is an arch bug.
>
<snip>
> From looking at it, the asm-generic implementation is a bit
> questionable, though. In particular, it does
>
> if (READ_ONCE(*p) & mask)
> return 1;
>
> so it's *entirely* unordered for the "bit was already set" case.
These ops are documented in Documentation/atomic_bitops.txt as being
unordered in the failure ("bit was already set" case), and that matches
the generic implementation (which arm64 uses).
On the other hand, Twitter just pointed out that contradicting
documentation exists (I believe this was the source of the third party
doc I found that claimed it's always a barrier):
include/asm-generic/bitops/instrumented-atomic.h
So either one doc and the implementation are wrong, or the other doc is
wrong.
> --- a/include/asm-generic/bitops/atomic.h
> +++ b/include/asm-generic/bitops/atomic.h
> @@ -39,9 +39,6 @@ arch_test_and_set_bit(
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (READ_ONCE(*p) & mask)
> - return 1;
> -
> old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
> @@ -53,9 +50,6 @@ arch_test_and_clear_bit
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (!(READ_ONCE(*p) & mask))
> - return 0;
> -
> old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
>
> but the above is not just whitespace-damaged, it's entirely untested
> and based purely on me looking at that code.
That does work, I in fact tried that fix first to prove that the
early-exit was the problem.
I don't have a particularly strong opinion on which way is right here; I
just saw that the implementation matched the docs (that I found), and Will
commented on the other thread implying this was all deliberate.
- Hector
On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
>
> Please revert this as test_and_set_bit was always supposed to be
> a full memory barrier. This is an arch bug.
Yes, the bitops are kind of strange for various legacy reasons:
- set/clear_bit is atomic, but without a memory barrier, and need a
"smp_mb__before_atomic()" or similar for barriers
- test_and_set/clear_bit() are atomic, _and_ are memory barriers
- test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
_weaker_ than full memory barriers, but sufficient for locking (ie
acquire/release)
Does any of this make any sense at all? No. But those are the
documented semantics exactly because people were worried about
test_and_set_bit being used for locking, since on x86 all the atomics
are also memory barriers.
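As a rough sketch of what those documented rules mean for callers (obj,
the FLAG_* bits and wake_consumer() are made-up names):

    /* unordered atomic: the caller supplies the barrier explicitly */
    smp_mb__before_atomic();
    set_bit(FLAG_DIRTY, &obj->flags);

    /* documented as a full memory barrier on both sides */
    if (!test_and_set_bit(FLAG_QUEUED, &obj->flags))
        wake_consumer(obj);

    /* acquire semantics only - sufficient for a trylock-style lock */
    if (!test_and_set_bit_lock(FLAG_LOCKED, &obj->flags)) {
        /* we own the "lock" */
    }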
From looking at it, the asm-generic implementation is a bit
questionable, though. In particular, it does
    if (READ_ONCE(*p) & mask)
        return 1;
so it's *entirely* unordered for the "bit was already set" case.
That looks very wrong to me, since it basically means that the
test_and_set_bit() can return "bit was already set" based on an old
value - not a memory barrier at all.
So if you use "test_and_set_bit()" as some kind of "I've done my work,
now I am going to set the bit to tell people to pick it up", then that
early "bit was already set" code completely breaks it.
Now, arguably our atomic bitop semantics are very very odd, and it
might be time to revisit them. But that code looks very very buggy to
me.
The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
fix looks to be as simple as just removing that early READ_ONCE return
case (test_and_clear has the same bug).
Will?
IOW, the proper fix for this should, I think, look something like this
(whitespace mangled) thing:
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@ arch_test_and_set_bit(
unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (READ_ONCE(*p) & mask)
- return 1;
-
old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
return !!(old & mask);
}
@@ -53,9 +50,6 @@ arch_test_and_clear_bit
unsigned long mask = BIT_MASK(nr);
p += BIT_WORD(nr);
- if (!(READ_ONCE(*p) & mask))
- return 0;
-
old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
return !!(old & mask);
}
but the above is not just whitespace-damaged, it's entirely untested
and based purely on me looking at that code.
Linus
On Mon, Aug 15, 2022 at 10:36 PM Hector Martin <[email protected]> wrote:
>
> These ops are documented in Documentation/atomic_bitops.txt as being
> unordered in the failure ("bit was already set" case), and that matches
> the generic implementation (which arm64 uses).
Yeah, that documentation is pure garbage. We need to fix it.
I think that "unordered on failure" was added at the same time that
the generic implementation was rewritten.
IOW, the documentation was changed to match that broken
implementation, but it's clearly completely broken.
I think I understand *why* it's broken - it looks like a "harmless"
optimization. After all, if the bitop doesn't do anything, there's
nothing to order it with.
It makes a certain amount of sense - as long as you don't think about
it too hard.
The reason it is completely and utterly broken is that it's not
actually just "the bitop doesn't do anything". Even when it doesn't
change the bit value, just the ordering of the read of the old bit
value can be meaningful, exactly for that case of "I added more work
to the queue, I need to set the bit to tell the consumers, and if I'm
the first person to set the bit I may need to wake the consumer up".
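In other words, the pattern looks something like this (made-up names,
not actual kernel code):

    /* producer */
    WRITE_ONCE(q->new_data, 1);              /* add more work               */
    if (!test_and_set_bit(0, &q->pending))   /* the read of the old bit     */
        wake_consumer(q);                    /* value matters even when it  */
                                             /* was already set: the store  */
                                             /* above must be visible to    */
                                             /* whoever owns the bit        */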
> On the other hand, Twitter just pointed out that contradicting
> documentation exists (I believe this was the source of the third party
> doc I found that claimed it's always a barrier):
It's not just that other documentation exists - it's literally that
the unordered semantics don't even make sense, and don't match reality
and history.
And nobody thought about it or caught it at the time.
The Xen people seem to have noticed at some point, and tried to
introduce a "sync_set_set()"
> So either one doc and the implementation are wrong, or the other doc is
> wrong.
That doc and the generic implementation is clearly wrong.
Linus
On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
>
> The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> fix looks to be as simple as just removing that early READ_ONCE return
> case (test_and_clear has the same bug).
>
> Will?
I think this is the source of all this:
commit 61e02392d3c7ecac1f91c0a90a8043d67e081846
Author: Will Deacon <[email protected]>
Date: Tue Feb 13 13:30:19 2018 +0000
locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()
Unfortunately it doesn't work because lots of kernel code relies on
the memory barrier semantics of test_and_set_bit.
If ARM really wants this change, then either create a new API
for it or audit every single existing use in the kernel.
Patching the documentation and then relying on it is magical thinking.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Aug 15, 2022 at 10:49 PM Herbert Xu <[email protected]> wrote:
>
> I think this is the source of all this: [ commit 61e02392d3c7 ]
Yes, that doc update seems to actually predate the broken code.
But you can actually see how that commit clearly was only meant to
change the "test_and_set_bit_lock()" case. IOW, in that commit, the
test_and_set_bit() comments clearly say
* test_and_set_bit - Set a bit and return its old value
* This operation is atomic and cannot be reordered.
* It may be reordered on other architectures than x86.
* It also implies a memory barrier.
(that "It may be reordered on other architectures than x86" does show
clear confusion, since it also says "It also implies a memory
barrier")
And the very commit that adds that
- RMW operations that are conditional are unordered on FAILURE,
language then updates the comment only above test_and_set_bit_lock().
And yes, on failure to lock, ordering doesn't matter. You didn't get
the bit lock, there is no ordering for you.
But then I think that mistake in the documentation (the relaxed
semantics were only for the "lock" version of the bitops, not for the
rest of it) later led to the mistake in the code.
End result: test_and_set_bit_lock() does indeed only imply an ordering
if it got the lock (ie it was successful).
But test_and_set_bit() itself is always "successful". If the bit was
already set, that's not any indication of any "failure". It's just an
indication that it was set. Nothing more, nothing less, and the memory
barrier is still required regardless.
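In code terms (sketch only, made-up names):

    /* _lock variant: a failed "trylock" has nothing to order */
    if (test_and_set_bit_lock(MY_LOCK_BIT, &flags))
        return -EBUSY;              /* didn't get the bit, no ordering needed */
    /* ... critical section, covered by acquire/release ... */
    clear_bit_unlock(MY_LOCK_BIT, &flags);

    /* plain test_and_set_bit(): both outcomes carry meaning, so both
       need to be ordered */
    if (test_and_set_bit(MY_DONE_BIT, &flags))
        /* somebody else owns the bit and must see our earlier stores */;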
Linus
On 16/08/2022 13.14, Herbert Xu wrote:
> Hector Martin <[email protected]> wrote:
>>
>> This has been broken since the dawn of time, and it was incompletely
>> fixed by 346c09f80459, which added the necessary barriers in the work
>> execution path but failed to account for the missing barrier in the
>> test_and_set_bit() failure case. Fix it by switching to
>> atomic_long_fetch_or(), which does have unconditional barrier semantics
>> regardless of whether the bit was already set or not (this is actually
>> just test_and_set_bit() minus the early exit path).
>
> test_and_set_bit is supposed to contain a full memory barrier.
> If it doesn't then your arch is broken and needs to be fixed.
>
> Changing this one spot is pointless because such assumptions
> are all over the kernel.
Documentation/atomic_bitops.txt and the asm-generic implementation
disagree with you, so this isn't quite as simple as "your arch is
broken" :-)
- Hector
On Tue, Aug 16, 2022 at 03:28:50PM +0900, Hector Martin wrote:
>
> This is the same reason I argued queue_work() itself needs to have a
> similar guarantee, even when it doesn't queue work (and I updated the
> doc to match). If test_and_set_bit() is used in this kind of context
> often in the kernel, clearly the current implementation/doc clashes with
> that.
Kernel code all over the place relies on the fact that test_and_set_bit
provides a memory barrier. So this bug that you've discovered is
not at all isolated to the workqueue system. It'll break the kernel
in lots of places in exactly the same way.
> As I said, I don't have any particular beef in this fight, but this is
> horribly broken on M1/2 right now, so I'll send a patch to change the
> bitops instead and you all can fight it out over which way is correct :)
Please do.
Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On 2022/08/16 16:48, Herbert Xu wrote:
> On Tue, Aug 16, 2022 at 03:28:50PM +0900, Hector Martin wrote:
>>
>> This is the same reason I argued queue_work() itself needs to have a
>> similar guarantee, even when it doesn't queue work (and I updated the
>> doc to match). If test_and_set_bit() is used in this kind of context
>> often in the kernel, clearly the current implementation/doc clashes with
>> that.
>
> Kernel code all over the place relies on the fact that test_and_set_bit
> provides a memory barrier. So this bug that you've discovered is
> not at all isolated to the workqueue system. It'll break the kernel
> in lots of places in exactly the same way.
Now I'm surprised this isn't failing all over the place, given that...
these things are annoyingly subtle.
Still would want Will & Peter to chime in, of course.
>> As I said, I don't have any particular beef in this fight, but this is
>> horribly broken on M1/2 right now, so I'll send a patch to change the
>> bitops instead and you all can fight it out over which way is correct :)
>
> Please do.
Already did, but I just realized I forgot to Cc you. Sorry about that,
hope you can pick it up through the MLs:
https://lore.kernel.org/asahi/[email protected]/T/#u
- Hector
On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
> >
> > Please revert this as test_and_set_bit was always supposed to be
> > a full memory barrier. This is an arch bug.
>
> Yes, the bitops are kind of strange for various legacy reasons:
>
> - set/clear_bit is atomic, but without a memory barrier, and need a
> "smp_mb__before_atomic()" or similar for barriers
>
> - test_and_set/clear_bit() are atomic, _and_ are memory barriers
>
> - test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
> _weaker_ than full memory barriers, but sufficient for locking (ie
> acquire/release)
>
> Does any of this make any sense at all? No. But those are the
> documented semantics exactly because people were worried about
> test_and_set_bit being used for locking, since on x86 all the atomics
> are also memory barriers.
>
> From looking at it, the asm-generic implementation is a bit
> questionable, though. In particular, it does
>
> if (READ_ONCE(*p) & mask)
> return 1;
>
> so it's *entirely* unordered for the "bit was already set" case.
>
> That looks very wrong to me, since it basically means that the
> test_and_set_bit() can return "bit was already set" based on an old
> value - not a memory barrier at all.
>
> So if you use "test_and_set_bit()" as some kind of "I've done my work,
> now I am going to set the bit to tell people to pick it up", then that
> early "bit was already set" code completely breaks it.
>
> Now, arguably our atomic bitop semantics are very very odd, and it
> might be time to revisit them. But that code looks very very buggy to
> me.
>
> The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> fix looks to be as simple as just removing that early READ_ONCE return
> case (test_and_clear has the same bug).
>
> Will?
Right, this looks like it's all my fault, so sorry about that.
In an effort to replace the spinlock-based atomic bitops with a version
based on atomic instructions in e986a0d6cb36, I inadvertently added this
READ_ONCE() shortcut to test_and_set_bit() because at the time that's
what we had (incorrectly) documented in our attempts at cleaning things
up in this area. I confess that I have never been comfortable with the
comment for test_and_set_bit() prior to my problematic patch:
/**
* test_and_set_bit - Set a bit and return its old value
* @nr: Bit to set
* @addr: Address to count from
*
* This operation is atomic and cannot be reordered.
* It may be reordered on other architectures than x86.
* It also implies a memory barrier.
*/
so while Peter and I were trying to improve the documentation for
atomics and memory barriers we clearly ended up making the wrong call
trying to treat this like e.g. a cmpxchg() (which has the
unordered-on-failure semantics).
It's worth noting that with the spinlock-based implementation (i.e.
prior to e986a0d6cb36) then we would have the same problem on
architectures that implement spinlocks with acquire/release semantics;
accesses from outside of the critical section can drift in and reorder
with each other there, so the conversion looked legitimate to me in
isolation and I vaguely remember going through callers looking for
potential issues. Alas, I obviously missed this case.
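Roughly, with acquire/release ordering on the lock, the old
implementation allowed something like this (sketch):

    WRITE_ONCE(x, 1);            /* can sink below the acquire...          */
    spin_lock(&lock);            /* acquire                                */
    old = *p;
    *p = old | mask;
    spin_unlock(&lock);          /* release                                */
    r = READ_ONCE(y);            /* ...and this can hoist above the        */
                                 /* release, so the x and y accesses can   */
                                 /* reorder around the bitop               */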
So it looks to me like we need to:
1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
regardless of the value which is read from memory. The lock/unlock
flavours can remain as-is.
2. Fix the documentation
3. Figure out what to do about architectures building atomics out of
spinlocks (probably ok as lock+unlock == full barrier there?)
4. Accept my sincerest apologies for the mess!
> IOW, the proper fix for this should, I think, look something like this
> (whitespace mangled) thing:
>
> --- a/include/asm-generic/bitops/atomic.h
> +++ b/include/asm-generic/bitops/atomic.h
> @@ -39,9 +39,6 @@ arch_test_and_set_bit(
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (READ_ONCE(*p) & mask)
> - return 1;
> -
> old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
> @@ -53,9 +50,6 @@ arch_test_and_clear_bit
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> - if (!(READ_ONCE(*p) & mask))
> - return 0;
> -
> old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
>
> but the above is not just whitespace-damaged, it's entirely untested
> and based purely on me looking at that code.
Yes, I think that's step 1, thanks! I'm a bit worried about the perf
numbers on the other thread, but we can get to the bottom of that
separately.
Will
On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
> > On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <[email protected]> wrote:
> > >
> > > Please revert this as test_and_set_bit was always supposed to be
> > > a full memory barrier. This is an arch bug.
> >
> > Yes, the bitops are kind of strange for various legacy reasons:
> >
> > - set/clear_bit is atomic, but without a memory barrier, and need a
> > "smp_mb__before_atomic()" or similar for barriers
> >
> > - test_and_set/clear_bit() are atomic, _and_ are memory barriers
> >
> > - test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
> > _weaker_ than full memory barriers, but sufficient for locking (ie
> > acquire/release)
> >
> > Does any of this make any sense at all? No. But those are the
> > documented semantics exactly because people were worried about
> > test_and_set_bit being used for locking, since on x86 all the atomics
> > are also memory barriers.
> >
> > From looking at it, the asm-generic implementation is a bit
> > questionable, though. In particular, it does
> >
> > if (READ_ONCE(*p) & mask)
> > return 1;
> >
> > so it's *entirely* unordered for the "bit was already set" case.
> >
> > That looks very wrong to me, since it basically means that the
> > test_and_set_bit() can return "bit was already set" based on an old
> > value - not a memory barrier at all.
> >
> > So if you use "test_and_set_bit()" as some kind of "I've done my work,
> > now I am going to set the bit to tell people to pick it up", then that
> > early "bit was already set" code completely breaks it.
> >
> > Now, arguably our atomic bitop semantics are very very odd, and it
> > might be time to revisit them. But that code looks very very buggy to
> > me.
> >
> > The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> > asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> > fix looks to be as simple as just removing that early READ_ONCE return
> > case (test_and_clear has the same bug).
> >
> > Will?
>
> Right, this looks like it's all my fault, so sorry about that.
>
> In an effort to replace the spinlock-based atomic bitops with a version
> based on atomic instructions in e986a0d6cb36, I inadvertently added this
> READ_ONCE() shortcut to test_and_set_bit() because at the time that's
> what we had (incorrectly) documented in our attempts at cleaning things
> up in this area. I confess that I have never been comfortable with the
> comment for test_and_set_bit() prior to my problematic patch:
>
> /**
> * test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> *
> * This operation is atomic and cannot be reordered.
> * It may be reordered on other architectures than x86.
> * It also implies a memory barrier.
> */
>
> so while Peter and I were trying to improve the documentation for
> atomics and memory barriers we clearly ended up making the wrong call
> trying to treat this like e.g. a cmpxchg() (which has the
> unordered-on-failure semantics).
>
> It's worth noting that with the spinlock-based implementation (i.e.
> prior to e986a0d6cb36) then we would have the same problem on
> architectures that implement spinlocks with acquire/release semantics;
> accesses from outside of the critical section can drift in and reorder
> with each other there, so the conversion looked legitimate to me in
> isolation and I vaguely remember going through callers looking for
> potential issues. Alas, I obviously missed this case.
>
I just want to mention that although spinlock-based atomic bitops
don't provide the full barrier in test_and_set_bit(), they don't
have the problem spotted by Hector, because test_and_set_bit() and
clear_bit() sync with each other via locks:
    test_and_set_bit():
      lock(..);
      old = *p; // mask is already set by other test_and_set_bit()
      *p = old | mask;
      unlock(...);

    clear_bit():
      lock(..);
      *p &= ~mask;
      unlock(..);
so "having a full barrier before test_and_set_bit()" may not be the
exact thing we need here, as long as a test_and_set_bit() can sync with
a clear_bit() uncondiontally, then the world is safe. For example, we
can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:
test_and_set_bit():
atomic_long_fetch_or_release(..); // pair with clear_bit()
// guarantee everything is
// observed.
clear_bit():
atomic_long_fetch_andnot_acquire(..);
, maybe that's somewhat cheaper than a full barrier implementation.
Thoughts? Just to find the exact ordering requirement for bitops.
Regards,
Boqun
> So it looks to me like we need to:
>
> 1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
> regardless of the value which is read from memory. The lock/unlock
> flavours can remain as-is.
>
> 2. Fix the documentation
>
> 3. Figure out what to do about architectures building atomics out of
> spinlocks (probably ok as lock+unlock == full barrier there?)
>
> 4. Accept my sincerest apologies for the mess!
>
> > IOW, the proper fix for this should, I think, look something like this
> > (whitespace mangled) thing:
> >
> > --- a/include/asm-generic/bitops/atomic.h
> > +++ b/include/asm-generic/bitops/atomic.h
> > @@ -39,9 +39,6 @@ arch_test_and_set_bit(
> > unsigned long mask = BIT_MASK(nr);
> >
> > p += BIT_WORD(nr);
> > - if (READ_ONCE(*p) & mask)
> > - return 1;
> > -
> > old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> > return !!(old & mask);
> > }
> > @@ -53,9 +50,6 @@ arch_test_and_clear_bit
> > unsigned long mask = BIT_MASK(nr);
> >
> > p += BIT_WORD(nr);
> > - if (!(READ_ONCE(*p) & mask))
> > - return 0;
> > -
> > old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> > return !!(old & mask);
> > }
> >
> > but the above is not just whitespace-damaged, it's entirely untested
> > and based purely on me looking at that code.
>
> Yes, I think that's step 1, thanks! I'm a bit worried about the perf
> numbers on the other thread, but we can get to the bottom of that
> separately.
>
> Will
Hello,
On Tue, Aug 16, 2022 at 12:15:38PM +0800, Herbert Xu wrote:
> Tejun Heo <[email protected]> wrote:
> >
> > Oh, tricky one and yeah you're absolutely right that it makes no sense to
> > not guarantee barrier semantics when already pending. I didn't even know
> > test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting
> > down and fixing this. Applied to wq/for-6.0-fixes.
>
> Please revert this as test_and_set_bit was always supposed to be
> a full memory barrier. This is an arch bug.
Alright, reverting.
Thanks.
--
tejun
On Tue, Aug 16, 2022 at 6:42 AM Will Deacon <[email protected]> wrote:
>
> > Will?
>
> Right, this looks like it's all my fault, so sorry about that.
It's interesting how this bug has existed for basically four years.
I suspect that the thing is fairly hard to hit in practice,
particularly on the kinds of devices most common (ie limited OoO
windows on most phone SoCs).
Together with phone loads probably not generally being all that
exciting from a kernel standpoint (a lot of driver work, probably not
a lot of workqueue stuff).
> 1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
> regardless of the value which is read from memory. The lock/unlock
> flavours can remain as-is.
I've applied Hector's "locking/atomic: Make test_and_*_bit() ordered
on failure".
> 2. Fix the documentation
That patch had a bit of a fix, but it is probably worth looking at more.
> 3. Figure out what to do about architectures building atomics out of
> spinlocks (probably ok as lock+unlock == full barrier there?)
Yeah, I wonder how we should describe the memory ordering here. We've
always punted on it, saying it's a "memory barrier", but that has been
partly a "look, that's the traditional x86 model".
And the traditional x86 model really sees the locked RMW operations as
being one single operation - in ways that most other architectures
don't.
So on x86, it's more than "this implies a memory barrier" - it's also
that there is no visible load-modify-store sequence where you can
start asking "what about the ordering of the _load_ wrt the preceding
memory operations".
That makes the x86 behavior very easy to think about, but means that
when you have bitops implemented other ways, you have questions that
are much harder to answer.
So in a Lock+read+op+write+unlock sequence, you can have preceding
operations move into the locked region, and mix with the read+op+write
side.
> 4. Accept my sincerest apologies for the mess!
I don't think you were the source of the mess. The *source* of the
mess is that we've always had very messy rules about the bitops in
particular.
I think the *code* is fixed (at least wrt the generic implementation,
I think the other models are up for discussion), but I think the real
issue is how we should describe the requirements.
So I think we have at least three cases we need to deal with:
(a) the people who just want atomics, and don't care about any
ordering. They are bound to exist.
(b) the people who want "acquire"-like semantics. I think the
existing test_and_set_bit_lock() is fine
(c) the people who want "release"-like semantics, where the
"test-and-set-bit" is for announcing "I'm done, was I the first one?".
(d) the full barrier case
I think we currently actively have (b) and (d), but it's possible that
the workqueue case really is only (c).
And I think the spinlock implementation really most naturally has that
(c) semantics - you don't necessarily get some theoretical full memory
barrier, but you *do* get those "handover" semantics.
Maybe we never really want or need (d) at all?
So I get the feeling that we really shouldn't specify
"test_and_set_bit()" in terms of memory barriers at all. I *think* it
would be more natural to describe them in terms of "handover" (ie
acquire/release), and then the spinlock semantics are obviously fine.
So I think the code problem is easy, I think the real problem here has
always been bad documentation, and it would be really good to clarify
that.
Comments?
Linus
On 16/08/2022 23.55, Boqun Feng wrote:
> On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
>> It's worth noting that with the spinlock-based implementation (i.e.
>> prior to e986a0d6cb36) then we would have the same problem on
>> architectures that implement spinlocks with acquire/release semantics;
>> accesses from outside of the critical section can drift in and reorder
>> with each other there, so the conversion looked legitimate to me in
>> isolation and I vaguely remember going through callers looking for
>> potential issues. Alas, I obviously missed this case.
>>
>
> I just want to mention that although spinlock-based atomic bitops
> don't provide the full barrier in test_and_set_bit(), but they don't
> have the problem spotted by Hector, because test_and_set_bit() and
> clear_bit() sync with each other via locks:
>
> test_and_set_bit():
> lock(..);
> old = *p; // mask is already set by other test_and_set_bit()
> *p = old | mask;
> unlock(...);
> clear_bit():
> lock(..);
> *p &= ~mask;
> unlock(..);
>
> so "having a full barrier before test_and_set_bit()" may not be the
> exact thing we need here, as long as a test_and_set_bit() can sync with
> a clear_bit() unconditionally, then the world is safe. For example, we
> can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:
>
> test_and_set_bit():
> atomic_long_fetch_or_release(..); // pair with clear_bit()
> // guarantee everything is
> // observed.
> clear_bit():
> atomic_long_fetch_andnot_acquire(..);
>
> , maybe that's somewhat cheaper than a full barrier implementation.
>
> Thoughts? Just to find the exact ordering requirement for bitops.
It's worth pointing out that the workqueue code does *not* pair
test_and_set_bit() with clear_bit(). It does an atomic_long_set()
instead (and then there are explicit barriers around it, which are
expected to pair with the implicit barrier in test_and_set_bit()). If we
define test_and_set_bit() to only sync with clear_bit() and not
necessarily be a true barrier, that breaks the usage of the workqueue code.
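For reference, the sequence on the execution side is roughly (see
set_work_pool_and_clear_pending() in the patch earlier in this thread):

    smp_wmb();                   /* pairs with the implied mb of the
                                    PENDING test-and-set on the queueing
                                    side                                  */
    set_work_data(work, ...);    /* clears PENDING via a plain
                                    atomic_long_set()                     */
    smp_mb();                    /* orders the clear before the loads and
                                    stores of the work function           */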
- Hector
On Tue, Aug 16, 2022 at 9:22 AM Hector Martin <[email protected]> wrote:
>
> It's worth pointing out that the workqueue code does *not* pair
> test_and_set_bit() with clear_bit(). It does an atomic_long_set()
> instead
Yes. That code is much too subtle.
And yes, I think those barriers are
(a) misleading
(b) don't work with the "serialize bits using spinlock" model at all
It's a good example of "we need to really have a better model for this".
Linus
On Wed, Aug 17, 2022 at 01:22:09AM +0900, Hector Martin wrote:
> On 16/08/2022 23.55, Boqun Feng wrote:
> > On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> >> It's worth noting that with the spinlock-based implementation (i.e.
> >> prior to e986a0d6cb36) then we would have the same problem on
> >> architectures that implement spinlocks with acquire/release semantics;
> >> accesses from outside of the critical section can drift in and reorder
> >> with each other there, so the conversion looked legitimate to me in
> >> isolation and I vaguely remember going through callers looking for
> >> potential issues. Alas, I obviously missed this case.
> >>
> >
> > I just want to mention that although spinlock-based atomic bitops
> > don't provide the full barrier in test_and_set_bit(), but they don't
> > have the problem spotted by Hector, because test_and_set_bit() and
> > clear_bit() sync with each other via locks:
> >
> > test_and_set_bit():
> > lock(..);
> > old = *p; // mask is already set by other test_and_set_bit()
> > *p = old | mask;
> > unlock(...);
> > clear_bit():
> > lock(..);
> > *p &= ~mask;
> > unlock(..);
> >
> > so "having a full barrier before test_and_set_bit()" may not be the
> > exact thing we need here, as long as a test_and_set_bit() can sync with
> > a clear_bit() uncondiontally, then the world is safe. For example, we
> > can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:
> >
> > test_and_set_bit():
> > atomic_long_fetch_or_release(..); // pair with clear_bit()
> > // guarantee everything is
> > // observed.
> > clear_bit():
> > atomic_long_fetch_andnot_acquire(..);
> >
> > , maybe that's somewhat cheaper than a full barrier implementation.
> >
> > Thoughts? Just to find the exact ordering requirement for bitops.
>
> It's worth pointing out that the workqueue code does *not* pair
> test_and_set_bit() with clear_bit(). It does an atomic_long_set()
> instead (and then there are explicit barriers around it, which are
> expected to pair with the implicit barrier in test_and_set_bit()). If we
> define test_and_set_bit() to only sync with clear_bit() and not
> necessarily be a true barrier, that breaks the usage of the workqueue code.
>
Ah, I missed that, but that means the old spinlock-based atomics are
totally broken unless spinlocks imply full barriers on these archs.
But still, if we define test_and_set_bit() as RELEASE atomic instead of
a full barrier + atomic, it should work for workqueue, right? Do we
actually need extra ordering here?
    WRITE_ONCE(*x, 1);     // A
    test_and_set_bit(..);  // a full barrier will order A & B
    WRITE_ONCE(*y, 1);     // B
That's something I want to figure out.
Regards,
Boqun
> - Hector
Hello, Will.
On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> /**
> * test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> *
> * This operation is atomic and cannot be reordered.
> * It may be reordered on other architectures than x86.
> * It also implies a memory barrier.
> */
>
> so while Peter and I were trying to improve the documentation for
> atomics and memory barriers we clearly ended up making the wrong call
> trying to treat this like e.g. a cmpxchg() (which has the
> unordered-on-failure semantics).
I think the doc can be improved here. atomic_t.txt says under ORDERING:
- RMW operations that have a return value are fully ordered;
- RMW operations that are conditional are unordered on FAILURE,
otherwise the above rules apply.
But nothing spells out what's conditional. Maybe it's okay to expect people
to read this doc and extrapolate how it applies, but I think it'd be better
if we spell out clearly per operation so that readers can search for a
specific operation and then follow what the rules are from there.
It bothers me that there doesn't seem to be a comprehensive
operation-indexed doc on the subject. memory-barriers.txt doesn't cover which
operations imply which barriers. atomic_t.txt and atomic_bitops.txt cover the
atomic_t operations but not in a comprehensive or searchable manner (e.g.
does test_and_set_bit() return 0 or 1 on success?) and it's not clear where
to look for non-atomic_t atomic operations. I guess it's implied that they
follow the same rules as atomic_t counterparts but I can't seem to find that
spelled out anywhere. The source code docs are layered and dispersed across
generic and arch implementations, making them difficult to follow.
It'd be awesome if the documentation situation could be improved.
Thanks.
--
tejun
On 17/08/2022 01.26, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 16, 2022 at 12:15:38PM +0800, Herbert Xu wrote:
>> Tejun Heo <[email protected]> wrote:
>>>
>>> Oh, tricky one and yeah you're absolutely right that it makes no sense to
>>> not guarantee barrier semantics when already pending. I didn't even know
>>> test_and_set_bit() wasn't a barrier when it failed. Thanks a lot for hunting
>>> down and fixing this. Applied to wq/for-6.0-fixes.
>>
>> Please revert this as test_and_set_bit was always supposed to be
>> a full memory barrier. This is an arch bug.
>
> Alright, reverting.
FYI, Linus applied the alternate fix directly to his tree already, so
this is fixed on M1 either way. However, you may want to pay attention
to the discussion about the subtlety of the workqueue code pairing
test_and_set_bit() not with clear_bit(), but rather atomic_long_set(),
since it *could* imply it is still broken or might be broken in the
future, on other architectures.
- Hector
On Tue, Aug 16, 2022 at 09:41:52AM -0700, Linus Torvalds wrote:
> So I think the code problem is easy, I think the real problem here has
> always been bad documentation, and it would be really good to clarify
> that.
>
> Comments?
The problem is that test_and_set_bit has been unambiguously
documented to have memory barriers since 2005:
commit 3085f02b869d980c5588f3e8fb136b0b465a2759
Author: David S. Miller <[email protected]>
Date: Fri Feb 4 23:39:15 2005 -0800
[DOC]: Add asm/atomic.h asm/bitops.h implementation specification.
And this is what it says:
+ int test_and_set_bit(unsigned long nr, volatils unsigned long *addr);
+ int test_and_clear_bit(unsigned long nr, volatils unsigned long *addr);
+ int test_and_change_bit(unsigned long nr, volatils unsigned long *addr);
...snip...
+These routines, like the atomic_t counter operations returning values,
+require explicit memory barrier semantics around their execution. All
+memory operations before the atomic bit operation call must be made
+visible globally before the atomic bit operation is made visible.
+Likewise, the atomic bit operation must be visible globally before any
+subsequent memory operation is made visible. For example:
+
+ obj->dead = 1;
+ if (test_and_set_bit(0, &obj->flags))
+ /* ... */;
+ obj->killed = 1;
This file wasn't removed until 16/11/2020 by f0400a77ebdc.
In that time, people who wrote code using test_and_set_bit could have
legitimately relied on the memory barrier as documented. Changing
this retrospectively is dangerous.
I'm fine with introducing new primitives that have different
properties, and then converting the existing users of test_and_set_bit
over on a case-by-case basis.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt