2022-08-15 11:29:17

by Hector Martin

Subject: Debugging a TTY race condition on M1 (memory ordering dragons)

Hi all,

I'm trying to debug a TTY layer race on Apple M1 machines, and I'm deep
enough into the rabbit hole that I could use some suggestions/help
figuring out how to continue approaching this.

The issue is that when the cdc-acm driver pushes some data into the TTY
in quick succession (two USB packets), with an active userspace reader,
sometimes the second packet ends up in limbo and never gets delivered to
the ldisc (until more data arrives). This quickly reproduces (within a
second or two) by running this [1] script on an M1 Pro or M1 Max with
another M1 machine connected running the m1n1 proxy, using the Asahi
Linux kernels we ship to end users right now. It does not reproduce with
a random monolithic test kernel I have, so clearly there is some
difference there, but I haven't tried to bisect that yet.

I've already gone through blaming my m1n1 code, our dwc3 driver, my AIC
IRQ controller code, the CDC-ACM driver, etc., but I'll skip those
parts. The rabbit hole goes deeper.

Code context:

static inline void tty_flip_buffer_commit(struct tty_buffer *tail)
{
	smp_store_release(&tail->commit, tail->used);
}

void tty_flip_buffer_push(struct tty_port *port)
{
	struct tty_bufhead *buf = &port->buf;

	tty_flip_buffer_commit(buf->tail);
	queue_work(system_unbound_wq, &buf->work);
}

/* This is the work function */
static void flush_to_ldisc(struct work_struct *work)
{
	[...]
	while (1) {
		[...]
		count = smp_load_acquire(&head->commit) - head->read;
		if (!count) {
			if (next == NULL)
				break;
			[...]
		}
		// do stuff with the buffer
		[...]
	}
	[...]
}

Commit is accessed with acquire/release ops, which compile down to LDAR
and STLR.
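
i.e. the intended pairing is the classic message-passing pattern. In
LKMM litmus form it's roughly this (a sketch; variable names are
illustrative, not the real field names):

C tty-mp

{}

P0(int *data, int *commit)
{
	WRITE_ONCE(*data, 1);			/* fill the flip buffer */
	smp_store_release(commit, 1);		/* tty_flip_buffer_commit() */
}

P1(int *data, int *commit)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(commit);		/* flush_to_ldisc() */
	r1 = READ_ONCE(*data);
}

exists (1:r0=1 /\ 1:r1=0)

As expected, herd7 reports that exists clause as forbidden, which is the
ordering the TTY code is relying on.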

With the test case, two packets end up getting pushed to the TTY: 1024
bytes and 186 bytes (total 1210). What I observe is that sometimes,
after commit is set to 1210, queue_work() returns false (indicating that
work is already queued and hasn't begun executing yet), yet the last
work function execution only sees commit==1024. When this happens,
flush_to_ldisc and queue_work were both executing within a microsecond
or two of each other:

[ 5.408251] queue_work() ret false (commit=1210)
[ 5.410367] flush_to_ldisc() commit=1210
644
[ 5.410984] flush_to_ldisc() commit=1024
[ 5.411570] flush_to_ldisc() commit=1210
645
[ 5.412202] flush_to_ldisc() commit=1024
[ 5.412785] flush_to_ldisc() commit=1210
646
[ 5.413630] flush_to_ldisc() commit=1024
[ 5.413631] queue_work() ret false (commit=1210)
<hang because userspace never gets the data>

The queue_work() documentation says:

> Memory-ordering properties: If it returns true, guarantees that all
> stores preceding the call to queue_work() in the program order will
> be visible from the CPU which will execute work by the time such work
> executes

That would seem to imply that if it returns false, there are no memory
ordering guarantees. If that were the case, the TTY layer's usage of
queue_work() would be buggy. However, that would also make the memory
ordering guarantees essentially useless, since you *need* to know that
if the work is already queued and hasn't started running yet, it will
see all writes you've made until the call (otherwise you'd have to add
manual barriers before queue_work(), which defeats the purpose of
ordering guarantees in the normal case). So, I posit that the
documentation is incomplete and queue_work() does indeed provide such
guarantees.

Behind the scenes, the work pending flag is atomically set with
test_and_set_bit() by queue_work_on(). That compiles down to my old
friend LDSETAL, which I already showed [2] does not provide the
guarantees test_and_set_bit() claims to have (== full memory barrier).
However, I can't get that litmus test to fail on real hardware, so that
may be a red herring as far as this bug goes.

On the other end of things, we have this lovely comment:

static void set_work_pool_and_clear_pending(struct work_struct *work,
					    int pool_id)
{
	/*
	 * The following wmb is paired with the implied mb in
	 * test_and_set_bit(PENDING) and ensures all updates to @work made
	 * here are visible to and precede any updates by the next PENDING
	 * owner.
	 */
	smp_wmb();
	/* note: this is an atomic_long_set */
	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
	/*
	 * The following mb guarantees that previous clear of a PENDING bit
	 * will not be reordered with any speculative LOADS or STORES from
	 * work->current_func, which is executed afterwards. This possible
	 * reordering can lead to a missed execution on attempt to queue
	 * the same @work. E.g. consider this case:
	 *
	 *   CPU#0                         CPU#1
	 *   ----------------------------  --------------------------------
	 *
	 * 1  STORE event_indicated
	 * 2  queue_work_on() {
	 * 3    test_and_set_bit(PENDING)
	 * 4 }                             set_..._and_clear_pending() {
	 * 5                                 set_work_data() # clear bit
	 * 6                                 smp_mb()
	 * 7                               work->current_func() {
	 * 8                                 LOAD event_indicated
	 *                                 }
	 *
	 * Without an explicit full barrier speculative LOAD on line 8 can
	 * be executed before CPU#0 does STORE on line 1. If that happens,
	 * CPU#0 observes the PENDING bit is still set and new execution of
	 * a @work is not queued in a hope, that CPU#1 will eventually
	 * finish the queued @work. Meanwhile CPU#1 does not see
	 * event_indicated is set, because speculative LOAD was executed
	 * before actual STORE.
	 */
	smp_mb();
}

That would indeed agree with the idea that queue_work() does offer
guarantees in the ret==false case. And yet, it doesn't work.

Effectively we have the sequence:

CPU#1:
STLR A
LDSETAL B

CPU#2:
DMB ISHST
STR B
LDAR A

I tried writing a litmus test for this, but I wasn't able to make it
fail on either herd7 or real hardware. And yet I can trivially
reproduce the problem in vivo with the TTY layer. Perhaps there are
other preconditions to this particular sequence failing on real hardware
(related to caches, speculation, etc)...

I did stumble upon something a bit odd with the LSE atomic/STR pairing.
I also couldn't get a litmus test for this to fail in either sim or hardware,
so I probably just missed something about the memory model, but:
according to the ARM, the load/store halves of an atomic op are
Atomic-ordered-before among themselves, and the write is
Atomic-ordered-before a subsequent read with acquire semantics. However,
there is no mention of non-atomic writes. So, by my reading, this is
possible:

(flag is 1)
LDSETAL LOAD 1
STR 0
LDSETAL STORE 1

Which would leave you with LDSETAL claiming the old/new values are both
1, while the store completed and vanished. This seems pretty insane, and
would mean atomic_long_set() isn't actually atomic wrt other atomic ops,
but I couldn't figure out what forbids it, at least in the textual
description of the memory model. If this is indeed not forbidden, then
this could happen:

LDSETAL Read-Acquire (1)
STR pending=0
LDAR commit=old
STLR commit=new // can move below read-acquire
LDSETAL Write-Release (1)
(Returned 1, so work is not requeued)

But as I said, I couldn't prove that this is possible with herd7 or litmus7.

On the "trying random things" front, this fixes the bug:

tty_flip_buffer_commit(buf->tail);
smp_mb();
queue_work(system_unbound_wq, &buf->work);

But this does not:

tty_flip_buffer_commit(buf->tail);
smp_wmb();
queue_work(system_unbound_wq, &buf->work);

Which does kind of point towards the "load side moving up" theory.

Any ideas? Am I onto something with the STR/LDSETAL thing? Memory model
confusion? CPU bug? Did I miss an obvious bug? :)

[1]

from m1n1.setup import *

i = 0
while True:
    a = iface.readmem(0x900000000, 1170)
    print(i)
    i += 1

[2]
https://lore.kernel.org/all/[email protected]/

Interestingly, the litmus test in that link started passing with
aarch64-v07 recently, even though that claimed to be a relaxation of the
memory model. I started that discussion at herd/herdtools7#418.

- Hector


2022-08-15 13:50:01

by Will Deacon

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

Hi Hector,

On Mon, Aug 15, 2022 at 08:16:56PM +0900, Hector Martin wrote:
> I'm trying to debug a TTY layer race on Apple M1 machines, and I'm deep
> enough into the rabbit hole that I could use some suggestions/help
> figuring out how to continue approaching this.

I'm really not familiar with the TTY layer, but here's hoping I can help
a bit on the memory ordering side of things...

> With the test case, two packets end up getting pushed to the TTY: 1024
> bytes and 186 bytes (total 1210). What I observe is that sometimes,
> after commit is set to 1210, queue_work() returns false (indicating that
> work is already queued and hasn't begun executing yet), yet the last
> work function execution only sees commit==1024. When this happens,
> flush_to_ldisc and queue_work were both executing within a microsecond
> or two of each other:
>
> [ 5.408251] queue_work() ret false (commit=1210)
> [ 5.410367] flush_to_ldisc() commit=1210
> 644
> [ 5.410984] flush_to_ldisc() commit=1024
> [ 5.411570] flush_to_ldisc() commit=1210
> 645
> [ 5.412202] flush_to_ldisc() commit=1024
> [ 5.412785] flush_to_ldisc() commit=1210
> 646
> [ 5.413630] flush_to_ldisc() commit=1024
> [ 5.413631] queue_work() ret false (commit=1210)
> <hang because userspace never gets the data>
>
> The queue_work() documentation says:
>
> > Memory-ordering properties: If it returns true, guarantees that all
> > stores preceding the call to queue_work() in the program order will
> > be visible from the CPU which will execute work by the time such work
> > executes
>
> That would seem to imply that if it returns false, there are no memory
> ordering guarantees. If that were the case, the TTY layer's usage of
> queue_work() would be buggy. However, that would also make the memory
> ordering guarantees essentially useless, since you *need* to know that
> if the work is already queued and hasn't started running yet, it will
> see all writes you've made until the call (otherwise you'd have to add
> manual barriers before queue_work(), which defeats the purpose of
> ordering guarantees in the normal case). So, I posit that the
> documentation is incomplete and queue_work() does indeed provide such
> guarantees.
>
> Behind the scenes, the work pending flag is atomically set with
> test_and_set_bit() by queue_work_on(). That compiles down to my old
> friend LDSETAL, which I already showed [2] does not provide the
> guarantees test_and_set_bit() claims to have (== full memory barrier).
> However, I can't get that litmus test to fail on real hardware, so that
> may be a red herring as far as this bug goes.

As I mentioned in the thread you linked to, the architecture was undergoing
review in this area. I should've followed back up, but in the end it was
tightened retrospectively to provide the behaviour you wanted. This was
achieved by augmenting the barrier-ordered-before relation with:

* RW1 is a memory write effect W1 and is generated by an atomic instruction
with both Acquire and Release semantics.

You can see this in the latest Arm ARM.

However, test_and_set_bit() is unordered on failure (i.e. when the bit is
unchanged) and uses READ_ONCE() as a quick check before the RmW. See the
"ORDERING" section of Documentation/atomic_bitops.txt.

> On the other end of things, we have this lovely comment:
>
> static void set_work_pool_and_clear_pending(struct work_struct *work,
> int pool_id)
> {
> /*
> * The following wmb is paired with the implied mb in
> * test_and_set_bit(PENDING) and ensures all updates to @work made
> * here are visible to and precede any updates by the next PENDING
> * owner.
> */
> smp_wmb();
> /* note: this is an atomic_long_set */
> set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> /*
> * The following mb guarantees that previous clear of a PENDING bit
> * will not be reordered with any speculative LOADS or STORES from
> * work->current_func, which is executed afterwards. This possible
> * reordering can lead to a missed execution on attempt to queue
> * the same @work. E.g. consider this case:
> *
> * CPU#0 CPU#1
> * ---------------------------- --------------------------------
> *
> * 1 STORE event_indicated
> * 2 queue_work_on() {
> * 3 test_and_set_bit(PENDING)
> * 4 } set_..._and_clear_pending() {
> * 5 set_work_data() # clear bit
> * 6 smp_mb()
> * 7 work->current_func() {
> * 8 LOAD event_indicated
> * }
> *
> * Without an explicit full barrier speculative LOAD on line 8 can
> * be executed before CPU#0 does STORE on line 1. If that happens,
> * CPU#0 observes the PENDING bit is still set and new execution of
> * a @work is not queued in a hope, that CPU#1 will eventually
> * finish the queued @work. Meanwhile CPU#1 does not see
> * event_indicated is set, because speculative LOAD was executed
> * before actual STORE.
> */
> smp_mb();
> }
>
> That would indeed agree with the idea that queue_work() does offer
> guarantees in the ret==false case. And yet, it doesn't work.
>
> Effectively we have the sequence:
>
> CPU#1:
> STLR A
> LDSETAL B

I think you're missing the "shortcut" in test_and_set_bit():

	if (READ_ONCE(*p) & mask)
		return 1;

	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);

so if the bit is already set (which I think is the 'ret == false' case)
then you've only got a control dependency here and we elide writing to
B.
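
A minimal LKMM-style sketch of that fast path (variable names are mine,
not the actual code):

C queue_work-fastpath

{
	pending=1;
}

P0(int *commit, int *pending)
{
	int r0;

	smp_store_release(commit, 1);	/* tty_flip_buffer_commit() */
	r0 = READ_ONCE(*pending);	/* test_and_set_bit() fast path */
}

P1(int *commit, int *pending)
{
	int r1;

	smp_wmb();
	WRITE_ONCE(*pending, 0);	/* set_work_data(): clear PENDING */
	smp_mb();
	r1 = smp_load_acquire(commit);	/* flush_to_ldisc() */
}

exists (0:r0=1 /\ 1:r1=0)

As far as I can tell, that exists clause is reachable, because nothing
orders the READ_ONCE() after your store-release, and that would match
the reordering you're seeing.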

>
> CPU#2:
> DMB ISHST
> STR B

Missing DMB ISH here from the smp_mb()?

> LDAR A
>
> I tried writing a litmus test for this, but I wasn't able to make it
> fail on either herd7 or real hardware. And yet I can trivially
> reproduce the problem in vivo with the TTY layer. Perhaps there are
> other preconditions to this particular sequence failing on real hardware
> (related to caches, speculation, etc)...

Hmm, without the full DMB on CPU2, the store to B and the load-acquire from
A are not ordered so it really should fail.

> I did stumble upon something a bit odd with the LSE atomic/STR pairing.
> I also couldn't get a litmus test for this to fail in either sim or hardware,
> so I probably just missed something about the memory model, but:
> according to the ARM, the load/store halves of an atomic op are
> Atomic-ordered-before among themselves, and the write is
> Atomic-ordered-before a subsequent read with acquire semantics. However,
> there is no mention of non-atomic writes. So, by my reading, this is
> possible:
>
> (flag is 1)
> LDSETAL LOAD 1
> STR 0
> LDSETAL STORE 1
>
> Which would leave you with LDSETAL claiming the old/new values are both
> 1, while the store completed and vanished. This seems pretty insane, and
> would mean atomic_long_set() isn't actually atomic wrt other atomic ops,
> but I couldn't figure out what forbids it, at least in the textual
> description of the memory model. If this is indeed not forbidden, then
> this could happen:
>
> LDSETAL Read-Acquire (1)
> STR pending=0
> LDAR commit=old
> STLR commit=new // can move below read-acquire
> LDSETAL Write-Release (1)
> (Returned 1, so work is not requeued)
>
> But as I said, I couldn't prove that this is possible with herd7 or litmus7.

If that non-atomic store is hitting the same variable, then it cannot land
in the middle of the atomic RmW. The architecture says:

| The atomic instructions perform atomic read and write operations on a memory
| location such that the architecture guarantees that no modification of that
| memory location by another observer can occur between the read and the write
| defined by that instruction.

and the .cat file used by herd has a separate constraint for this (see the
"empty rmw & (fre; coe) as atomic" line).

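As a sanity check, the "vanishing store" scenario renders as something
like this (a sketch, with xchg() standing in for the atomic RmW):

C rmw-atomicity

{
	x=1;
}

P0(int *x)
{
	int r0;

	r0 = xchg(x, 2);		/* the atomic RmW */
}

P1(int *x)
{
	WRITE_ONCE(*x, 0);		/* the plain store */
}

exists (0:r0=1 /\ x=2)

and that exists clause should be forbidden: the RmW cannot read the old
value while the plain store lands between its read and its write.
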
> On the "trying random things" front, this fixes the bug:
>
> tty_flip_buffer_commit(buf->tail);
> smp_mb();
> queue_work(system_unbound_wq, &buf->work);
>
> But this does not:
>
> tty_flip_buffer_commit(buf->tail);
> smp_wmb();
> queue_work(system_unbound_wq, &buf->work);
>
> Which does kind of point towards the "load side moving up" theory.
>
> Any ideas? Am I onto something with the STR/LDSETAL thing? Memory model
> confusion? CPU bug? Did I miss an obvious bug? :)

There's never anything obvious when you're working with this sort of stuff,
but my suggestion is that we work towards a litmus test that we both agree
represents the code and then take it from there. At the moment there's an
awful lot of information, but you can see from my comments that I'm not
up to speed with you yet!

Cheers,

Will

2022-08-15 14:30:50

by Peter Zijlstra

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

On Mon, Aug 15, 2022 at 02:47:11PM +0100, Will Deacon wrote:

> > Behind the scenes, the work pending flag is atomically set with
> > test_and_set_bit() by queue_work_on(). That compiles down to my old
> > friend LDSETAL, which I already showed [2] does not provide the
> > guarantees test_and_set_bit() claims to have (== full memory barrier).
> > However, I can't get that litmus test to fail on real hardware, so that
> > may be a red herring as far as this bug goes.
>
> As I mentioned in the thread you linked to, the architecture was undergoing
> review in this area. I should've followed back up, but in the end it was
> tightened retrospectively to provide the behaviour you wanted. This was
> achieved by augmenting the barrier-ordered-before relation with:
>
> * RW1 is a memory write effect W1 and is generated by an atomic instruction
> with both Acquire and Release semantics.
>
> You can see this in the latest Arm ARM.
>
> However, test_and_set_bit() is unordered on failure (i.e. when the bit is
> unchanged) and uses READ_ONCE() as a quick check before the RmW. See the
> "ORDERING" section of Documentation/atomic_bitops.txt.

Damn, I forgot that too... :/

> I think you're missing the "shortcut" in test_and_set_bit():
>
> if (READ_ONCE(*p) & mask)
> return 1;
>
> old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
>
> so if the bit is already set (which I think is the 'ret == false' case)
> then you've only got a control dependency here and we elide writing to
> B.

Given all that, I think workqueue wants to be fixed; it really does seem
to rely on full ordering for its test_and_set_bit() usage.

2022-08-15 16:16:25

by Hector Martin

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

(Resend, because I still can't use mail clients properly it seems...)

On 15/08/2022 22.47, Will Deacon wrote:
> As I mentioned in the thread you linked to, the architecture was undergoing
> review in this area. I should've followed back up, but in the end it was
> tightened retrospectively to provide the behaviour you wanted. This was
> achieved by augmenting the barrier-ordered-before relation with:
>
> * RW1 is a memory write effect W1 and is generated by an atomic instruction
> with both Acquire and Release semantics.
>
> You can see this in the latest Arm ARM.
>
> However, test_and_set_bit() is unordered on failure (i.e. when the bit is
> unchanged) and uses READ_ONCE() as a quick check before the RmW. See the
> "ORDERING" section of Documentation/atomic_bitops.txt.

Argh, I'd completely missed that early exit (and had stumbled on an
unofficial doc that said it was always ordered, which confused me).
Indeed, getting rid of the early exit fixes the problem.

> I think you're missing the "shortcut" in test_and_set_bit():
>
> if (READ_ONCE(*p) & mask)
> return 1;
>
> old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
>
> so if the bit is already set (which I think is the 'ret == false' case)
> then you've only got a control dependency here and we elide writing to
> B.

Completely missed it. Ouch.

>
>>
>> CPU#2:
>> DMB ISHST
>> STR B
>
> Missing DMB ISH here from the smp_mb()?

Yup, my apologies, that was a brain fart while writing the email. I did
have it in the litmus tests (and they indeed completely fail without it).
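
For reference, the sequence renders roughly like this in LKMM form (a
sketch; xchg() standing in for the fully-ordered RmW):

C queue_work-full-rmw

{
	pending=1;
}

P0(int *commit, int *pending)
{
	int r0;

	smp_store_release(commit, 1);	/* STLR */
	r0 = xchg(pending, 1);		/* LDSETAL stand-in */
}

P1(int *commit, int *pending)
{
	int r1;

	smp_wmb();			/* DMB ISHST */
	WRITE_ONCE(*pending, 0);	/* STR */
	smp_mb();			/* DMB ISH */
	r1 = smp_load_acquire(commit);	/* LDAR */
}

exists (0:r0=1 /\ 1:r1=0)

With the smp_mb() in place the exists clause is, as far as I can tell,
forbidden; drop it and it becomes reachable.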

> If that non-atomic store is hitting the same variable, then it cannot land
> in the middle of the atomic RmW. The architecture says:
>
> | The atomic instructions perform atomic read and write operations on a memory
> | location such that the architecture guarantees that no modification of that
> | memory location by another observer can occur between the read and the write
> | defined by that instruction.
>
> and the .cat file used by herd has a separate constraint for this (see the
> "empty rmw & (fre; coe) as atomic" line).

Ha, I was using G.a from Jan 2021 (back when I started working on this
M1 stuff), and it looks like that wording was added as an issue after
that (D17572) :-)

> There's never anything obvious when you're working with this sort of stuff,
> but my suggestion is that we work towards a litmus test that we both agree
> represents the code and then take it from there. At the moment there's an
> awful lot of information, but you can see from my comments that I'm not
> up to speed with you yet!

I think you nailed it with the early exit, I'd completely missed that. I
think I can fairly confidently say that's the issue now. As for the
litmus test, indeed with the revised definitions of the memory model /
ARM my concerns no longer apply, hence why I couldn't reproduce them
(and the hardware, thankfully, seems to agree here).

Workqueues are broken. Yay! I'll send a patch.

- Hector

2022-08-15 20:03:15

by Boqun Feng

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

On Tue, Aug 16, 2022 at 01:01:17AM +0900, Hector Martin wrote:
> (Resend, because I still can't use mail clients properly it seems...)
>
> On 15/08/2022 22.47, Will Deacon wrote:
> > As I mentioned in the thread you linked to, the architecture was undergoing
> > review in this area. I should've followed back up, but in the end it was
> > tightened retrospectively to provide the behaviour you wanted. This was
> > achieved by augmenting the barrier-ordered-before relation with:
> >
> > * RW1 is a memory write effect W1 and is generated by an atomic instruction
> > with both Acquire and Release semantics.
> >
> > You can see this in the latest Arm ARM.
> >
> > However, test_and_set_bit() is unordered on failure (i.e. when the bit is
> > unchanged) and uses READ_ONCE() as a quick check before the RmW. See the
> > "ORDERING" section of Documentation/atomic_bitops.txt.
>
> Argh, I'd completely missed that early exit (and had stumbled on an
> unofficial doc that said it was always ordered, which confused me).
> Indeed, getting rid of the early exit fixes the problem.
>
> > I think you're missing the "shortcut" in test_and_set_bit():
> >
> > if (READ_ONCE(*p) & mask)
> > return 1;
> >
> > old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> >
> > so if the bit is already set (which I think is the 'ret == false' case)
> > then you've only got a control dependency here and we elide writing to
> > B.
>
> Completely missed it. Ouch.
>
> >
> >>
> >> CPU#2:
> >> DMB ISHST
> >> STR B
> >
> > Missing DMB ISH here from the smp_mb()?
>
> Yup, my apologies, that was a brain fart while writing the email. I did
> have it in the litmus tests (and they indeed completely fail without it).
>
> > If that non-atomic store is hitting the same variable, then it cannot land
> > in the middle of the atomic RmW. The architecture says:
> >
> > | The atomic instructions perform atomic read and write operations on a memory
> > | location such that the architecture guarantees that no modification of that
> > | memory location by another observer can occur between the read and the write
> > | defined by that instruction.
> >
> > and the .cat file used by herd has a separate constraint for this (see the
> > "empty rmw & (fre; coe) as atomic" line).
>
> Ha, I was using G.a from Jan 2021 (back when I started working on this
> M1 stuff), and it looks like that wording was added as an issue after
> that (D17572) :-)
>
> > There's never anything obvious when you're working with this sort of stuff,
> > but my suggestion is that we work towards a litmus test that we both agree
> > represents the code and then take it from there. At the moment there's an
> > awful lot of information, but you can see from my comments that I'm not
> > up to speed with you yet!
>
> I think you nailed it with the early exit, I'd completely missed that. I
> think I can fairly confidently say that's the issue now. As for the
> litmus test, indeed with the revised definitions of the memory model /
> ARM my concerns no longer apply, hence why I couldn't reproduce them
> (and the hardware, thankfully, seems to agree here).
>
> Workqueues are broken. Yay! I'll send a patch.
>

Hmm.. but doesn't your (and Will's) finding actually show why
queue_work() only guarantees ordering if queuing succeeds? In other
words, if you want extra ordering, use smp_mb() before queue_work()
like:

	smp_mb();	// pairs with smp_mb() in set_work_pool_and_clear_pending()
	queue_work();	// if queue_work() return false, it means the work
			// is pending, and someone will eventually clear
			// the pending bit, with the smp_mb() above it's
			// guaranteed that work function will see the
			// memory accesses above.

Of course, I shall defer this to workqueue folks. Just saying that it
may not be broken. We have a few similar guarantees, for example,
wake_up_process() only provides ordering if it really wakes up a
process.

Regards,
Boqun

> - Hector

2022-08-15 21:04:34

by Hector Martin

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

On 16/08/2022 03.04, Boqun Feng wrote:
> On Tue, Aug 16, 2022 at 01:01:17AM +0900, Hector Martin wrote:
> Hmm.. but doesn't your (and Will's) finding actually show why
> queue_work() only guarantees ordering if queuing succeeds? In other
> words, if you want extra ordering, use smp_mb() before queue_work()
> like:
>
> smp_mb(); // pairs with smp_mb() in set_work_pool_and_clear_pending()
> queue_work(); // if queue_work() return false, it means the work
> // is pending, and someone will eventually clear
> // the pending bit, with the smp_mb() above it's
> // guaranteed that work function will see the
> // memory accesses above.
>
> Of course, I shall defer this to workqueue folks. Just saying that it
> may not be broken. We have a few similar guarantees, for example,
> wake_up_process() only provides ordering if it really wakes up a
> process.

Technically yes, but that doesn't actually make a lot of sense, and in
fact the comments inside the workqueue code imply that it does actually
provide order even in the failure case (and there are other barriers to
try to make that happen, just not enough). Note that the ordering
documentation was added post-facto, and I don't think the person who
wrote it necessarily considered whether it *actually* provides
guarantees in the failure case, and whether it should.

wake_up_process() is different because it doesn't actually guarantee
anything if the process is already awake. However, under this
definition, queue_work() guarantees that *some* work execution will
observe every preceding write before queue_work(), regardless of the
current state, and that is a very useful property. That is something
that wake_up_process() semantics can't do.

Without this guarantee, basically every queue_work() user that's using
some kind of producer/consumer pattern would need the explicit barrier.
I imagine that pattern is very common.

- Hector

2022-08-15 21:10:06

by Boqun Feng

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

On Tue, Aug 16, 2022 at 03:26:21AM +0900, Hector Martin wrote:
> On 16/08/2022 03.04, Boqun Feng wrote:
> > On Tue, Aug 16, 2022 at 01:01:17AM +0900, Hector Martin wrote:
> > Hmm.. but doesn't your (and Will's) finding actually show why
> > queue_work() only guarantees ordering if queuing succeeds? In other
> > words, if you want extra ordering, use smp_mb() before queue_work()
> > like:
> >
> > smp_mb(); // pairs with smp_mb() in set_work_pool_and_clear_pending()
> > queue_work(); // if queue_work() return false, it means the work
> > // is pending, and someone will eventually clear
> > // the pending bit, with the smp_mb() above it's
> > // guaranteed that work function will see the
> > // memory accesses above.
> >
> > Of course, I shall defer this to workqueue folks. Just saying that it
> > may not be broken. We have a few similar guarantees, for example,
> > wake_up_process() only provides ordering if it really wakes up a
> > process.
>
> Technically yes, but that doesn't actually make a lot of sense, and in
> fact the comments inside the workqueue code imply that it does actually
> provide order even in the failure case (and there are other barriers to
> try to make that happen, just not enough). Note that the ordering
> documentation was added post-facto, and I don't think the person who
> wrote it necessarily considered whether it *actually* provides
> guarantees in the failure case, and whether it should.
>
> wake_up_process() is different because it doesn't actually guarantee
> anything if the process is already awake. However, under this
> definition, queue_work() guarantees that *some* work execution will
> observe every preceding write before queue_work(), regardless of the
> current state, and that is a very useful property. That is something
> that wake_up_process() semantics can't do.
>
> Without this guarantee, basically every queue_work() user that's using
> some kind of producer/consumer pattern would need the explicit barrier.
> I imagine that pattern is very common.
>

I agree this is handy, but an unconditional full barrier may be costly to
some users, and probably unnecessary if the users periodically queue
the work. In that case, some successful enqueue will eventually make all
memory accesses observable. Also if workqueue users use their own
locking in the work function, then the barrier is also unnecessary.

The document part of course needs some help to clear things up. But I'm
not sure "strengthen"ing the ordering guarantee of queue_work() is a
good idea. Maybe a dedicated API, like:

// More work is needed for the @work, it has the same semantics as
// queue_work() if the @work is not pending. If the @work is pending,
// this ensures the work function observes all memory access before
// this.
void queue_more_work(struct work_struct *work)
{
	smp_mb();
	queue_work(work);
}

Regards,
Boqun

> - Hector

2022-08-15 21:14:49

by Hector Martin

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

On 16/08/2022 03.58, Boqun Feng wrote:
> I agree this is handy, but an unconditional full barrier may be costly to
> some users, and probably unnecessary if the users periodically queue
> the work. In that case, some successful enqueue will eventually make all
> memory accesses observable. Also if workqueue users use their own
> locking in the work function, then the barrier is also unnecessary.
>
> The document part of course needs some help to clear things up. But I'm
> not sure "strengthen"ing the ordering guarantee of queue_work() is a
> good idea. Maybe a dedicated API, like:
>
> // More work is needed for the @work, it has the same semantics as
> // queue_work() if the @work is not pending. If the @work is pending,
> // this ensures the work function observes all memory access before
> // this.
> void queue_more_work(struct work_struct *work)
> {
> smp_mb();
> queue_work(work);
> }
>
> Regards,
> Boqun

FWIW, I didn't actually use a full barrier in my patch. I just replaced
the test_and_set_bit() with the underlying atomic op, sans early exit path.
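
Roughly this shape, inside queue_work_on() (an illustrative sketch, not
the actual patch):

	/*
	 * Do the fully-ordered RmW unconditionally, instead of
	 * test_and_set_bit(), whose READ_ONCE() fast path skips the
	 * barrier when the bit is already set.
	 */
	unsigned long mask = 1UL << WORK_STRUCT_PENDING_BIT;

	if (!(atomic_long_fetch_or(mask, &work->data) & mask)) {
		__queue_work(cpu, wq, work);
		ret = true;
	}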

Personally though, I think it makes more sense to have the default
function provide the guarantees, and if someone *really* needs the
performance gain from eliding the implicit barrier, they could use an
alternate API for that (after they show useful gains). This stuff is too
subtle to expect every caller to wrap their head around memory ordering,
and having queue_work() always provide order with prior stores *feels*
intuitive.

But let's see what the workqueue folks say :)

- Hector

2022-08-15 23:18:40

by Peter Zijlstra

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

On Tue, Aug 16, 2022 at 04:15:00AM +0900, Hector Martin wrote:

> FWIW, I didn't actually use a full barrier in my patch. I just replaced
> the test_and_set_bit() with the underlying atomic op, sans early exit path.

That isn't quite true; you used atomic_long_fetch_or() which is used in
the generic implementation, but x86 would end up using "LOCK BTS" for
arch_test_and_set_bit(), while atomic_*fetch_or() ends up being a "LOCK
CMPXCHG" loop (which is significantly worse, performance wise).
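
For reference, a fetch_or built on cmpxchg is roughly this shape
(illustrative sketch only):

	static inline long sketch_fetch_or(long mask, atomic_long_t *v)
	{
		long old = atomic_long_read(v);

		/* retry until no other CPU changed *v under us */
		while (!atomic_long_try_cmpxchg(v, &old, old | mask))
			;

		return old;
	}

so it can spin under contention, whereas LOCK BTS is a single
instruction.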

That said; I don't have a nice solution that works well across the
various architectures.

(in one previous instance of this problem we ended up using a
cmpxchg_relaxed() coupled with an smp_mb__before_atomic(), but there is
no relaxed version of test_and_set_bit())

2022-08-16 00:50:15

by Boqun Feng

Subject: Re: Debugging a TTY race condition on M1 (memory ordering dragons)

On Tue, Aug 16, 2022 at 04:15:00AM +0900, Hector Martin wrote:
> On 16/08/2022 03.58, Boqun Feng wrote:
> > I agree this is handy, but an unconditional full barrier may be costly to
> > some users, and probably unnecessary if the users periodically queue
> > the work. In that case, some successful enqueue will eventually make all
> > memory accesses observable. Also if workqueue users use their own
> > locking in the work function, then the barrier is also unnecessary.
> >
> > The document part of course needs some help to clear things up. But I'm
> > not sure "strengthen"ing the ordering guarantee of queue_work() is a
> > good idea. Maybe a dedicated API, like:
> >
> > // More work is needed for the @work, it has the same semantics as
> > // queue_work() if the @work is not pending. If the @work is pending,
> > // this ensures the work function observes all memory access before
> > // this.
> > void queue_more_work(struct work_struct *work)
> > {
> > smp_mb();
> > queue_work(work);
> > }
> >
> > Regards,
> > Boqun
>
> FWIW, I didn't actually use a full barrier in my patch. I just replaced
> the test_and_set_bit() with the underlying atomic op, sans early exit path.
>
> Personally though, I think it makes more sense to have the default
> function provide the guarantees, and if someone *really* needs the
> performance gain from eliding the implicit barrier, they could use an
> alternate API for that (after they show useful gains). This stuff is too
> subtle to expect every caller to wrap their head around memory ordering,
> and having queue_work() always provide order with prior stores *feels*
> intuitive.
>

Fair enough. It's just that when you know something about memory
ordering and atomics, you kinda want to play the game to save as many
barriers as you can ;-) It's a curse of knowledge.

> But let's see what the workqueue folks say :)
>

Looks like they agree with you ;-) Nice work!

Regards,
Boqun

> - Hector