LinuxLists.cc - WARNING in tracepoint_probe_register

[permalink] [raw]

Subject: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

Reading the sched_cmdline_ref and sched_tgid_ref initial state within
tracing_start_sched_switch without holding the sched_register_mutex is
racy against concurrent updates, which can lead to tracepoint probes
being registered more than once (and thus trigger warnings within
tracepoint.c).

Also, write and read to/from those variables should be done with
WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
probes without holding the sched_register_mutex.

[ Compile-tested only. I suspect it might fix the following syzbot
report:

[email protected] ]

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Joel Fernandes (Google) <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Steven Rostedt (VMware) <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul E. McKenney <[email protected]>
---
kernel/trace/trace_sched_switch.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e288168661e1..902e8bf59aeb 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -26,8 +26,8 @@ probe_sched_switch(void *ignore, bool preempt,
{
int flags;

- flags = (RECORD_TGID * !!sched_tgid_ref) +
- (RECORD_CMDLINE * !!sched_cmdline_ref);
+ flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
+ (RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));

if (!flags)
return;
@@ -39,8 +39,8 @@ probe_sched_wakeup(void *ignore, struct task_struct *wakee)
{
int flags;

- flags = (RECORD_TGID * !!sched_tgid_ref) +
- (RECORD_CMDLINE * !!sched_cmdline_ref);
+ flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
+ (RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));

if (!flags)
return;
@@ -89,21 +89,28 @@ static void tracing_sched_unregister(void)

static void tracing_start_sched_switch(int ops)
{
- bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
+ bool sched_register;
+
mutex_lock(&sched_register_mutex);
+ sched_register = (!sched_cmdline_ref && !sched_tgid_ref);

switch (ops) {
case RECORD_CMDLINE:
- sched_cmdline_ref++;
+ WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref + 1);
break;

case RECORD_TGID:
- sched_tgid_ref++;
+ WRITE_ONCE(sched_tgid_ref, sched_tgid_ref + 1);
break;
+
+ default:
+ WARN_ONCE(1, "Unsupported tracing op: %d", ops);
+ goto end;
}

- if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
+ if (sched_register)
tracing_sched_register();
+end:
mutex_unlock(&sched_register_mutex);
}

@@ -113,16 +120,21 @@ static void tracing_stop_sched_switch(int ops)

switch (ops) {
case RECORD_CMDLINE:
- sched_cmdline_ref--;
+ WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref - 1);
break;

case RECORD_TGID:
- sched_tgid_ref--;
+ WRITE_ONCE(sched_tgid_ref, sched_tgid_ref - 1);
break;
+
+ default:
+ WARN_ONCE(1, "Unsupported tracing op: %d", ops);
+ goto end;
}

if (!sched_cmdline_ref && !sched_tgid_ref)
tracing_sched_unregister();
+end:
mutex_unlock(&sched_register_mutex);
}

--
2.11.0

2019-08-16 16:26:41

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019 10:26:43 -0400
Mathieu Desnoyers <[email protected]> wrote:

> Reading the sched_cmdline_ref and sched_tgid_ref initial state within
> tracing_start_sched_switch without holding the sched_register_mutex is
> racy against concurrent updates, which can lead to tracepoint probes
> being registered more than once (and thus trigger warnings within
> tracepoint.c).
>
> Also, write and read to/from those variables should be done with
> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> probes without holding the sched_register_mutex.
>

I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
It's done while holding the mutex. It's not that critical of a path,
and makes the code look ugly.

-- Steve

> [ Compile-tested only. I suspect it might fix the following syzbot
> report:
>
> [email protected] ]
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> CC: Joel Fernandes (Google) <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Steven Rostedt (VMware) <[email protected]>
> CC: Thomas Gleixner <[email protected]>
> CC: Paul E. McKenney <[email protected]>
> ---
> kernel/trace/trace_sched_switch.c | 32 ++++++++++++++++++++++----------
> 1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> index e288168661e1..902e8bf59aeb 100644
> --- a/kernel/trace/trace_sched_switch.c
> +++ b/kernel/trace/trace_sched_switch.c
> @@ -26,8 +26,8 @@ probe_sched_switch(void *ignore, bool preempt,
> {
> int flags;
>
> - flags = (RECORD_TGID * !!sched_tgid_ref) +
> - (RECORD_CMDLINE * !!sched_cmdline_ref);
> + flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
> + (RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>
> if (!flags)
> return;
> @@ -39,8 +39,8 @@ probe_sched_wakeup(void *ignore, struct task_struct *wakee)
> {
> int flags;
>
> - flags = (RECORD_TGID * !!sched_tgid_ref) +
> - (RECORD_CMDLINE * !!sched_cmdline_ref);
> + flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
> + (RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>
> if (!flags)
> return;
> @@ -89,21 +89,28 @@ static void tracing_sched_unregister(void)
>
> static void tracing_start_sched_switch(int ops)
> {
> - bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
> + bool sched_register;
> +
> mutex_lock(&sched_register_mutex);
> + sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>
> switch (ops) {
> case RECORD_CMDLINE:
> - sched_cmdline_ref++;
> + WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref + 1);
> break;
>
> case RECORD_TGID:
> - sched_tgid_ref++;
> + WRITE_ONCE(sched_tgid_ref, sched_tgid_ref + 1);
> break;
> +
> + default:
> + WARN_ONCE(1, "Unsupported tracing op: %d", ops);
> + goto end;
> }
>
> - if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
> + if (sched_register)
> tracing_sched_register();
> +end:
> mutex_unlock(&sched_register_mutex);
> }
>
> @@ -113,16 +120,21 @@ static void tracing_stop_sched_switch(int ops)
>
> switch (ops) {
> case RECORD_CMDLINE:
> - sched_cmdline_ref--;
> + WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref - 1);
> break;
>
> case RECORD_TGID:
> - sched_tgid_ref--;
> + WRITE_ONCE(sched_tgid_ref, sched_tgid_ref - 1);
> break;
> +
> + default:
> + WARN_ONCE(1, "Unsupported tracing op: %d", ops);
> + goto end;
> }
>
> if (!sched_cmdline_ref && !sched_tgid_ref)
> tracing_sched_unregister();
> +end:
> mutex_unlock(&sched_register_mutex);
> }
>

2019-08-16 16:49:51

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On 16/08/2019 17:25, Steven Rostedt wrote:
>> Also, write and read to/from those variables should be done with
>> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
>> probes without holding the sched_register_mutex.
>>
>
> I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> It's done while holding the mutex. It's not that critical of a path,
> and makes the code look ugly.
>

I seem to recall something like locking primitives don't protect you from
store tearing / invented stores, so if you can have concurrent readers
using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
if it's done in a critical section.

2019-08-16 17:07:36

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019 17:48:59 +0100
Valentin Schneider <[email protected]> wrote:

> On 16/08/2019 17:25, Steven Rostedt wrote:
> >> Also, write and read to/from those variables should be done with
> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> >> probes without holding the sched_register_mutex.
> >>
> >
> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> > It's done while holding the mutex. It's not that critical of a path,
> > and makes the code look ugly.
> >
>
> I seem to recall something like locking primitives don't protect you from
> store tearing / invented stores, so if you can have concurrent readers
> using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
> if it's done in a critical section.

But for this, it really doesn't matter. The READ_ONCE() is for going
from 0->1 or 1->0 any other change is the same as 1.

When we enable trace events, we start recording the tasks comms such
that we can possibly map them to the pids. When we disable trace
events, we stop recording the comms so that we don't overwrite the
cache when not needed. Note, if more than the max cache of tasks are
recorded during a session, we are likely to miss comms anyway.

Thinking about this more, the READ_ONCE() and WRTIE_ONCE() are not even
needed, because this is just a best effort anyway.

The only real fix was to move the check into the mutex protect area,
because that can cause a real bug if there was a race.

{
- bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
+ bool sched_register;
+
mutex_lock(&sched_register_mutex);
+ sched_register = (!sched_cmdline_ref && !sched_tgid_ref);

Thus, I'd like to see a v2 of this patch without the READ_ONCE() or
WRITE_ONCE() added.

-- Steve

2019-08-16 17:20:36

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 16, 2019, at 12:25 PM, rostedt [email protected] wrote:

> On Fri, 16 Aug 2019 10:26:43 -0400 Mathieu Desnoyers <[email protected]> wrote:
>
[...]
>>
>> Also, write and read to/from those variables should be done with
>> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
>> probes without holding the sched_register_mutex.
>>
>
> I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> It's done while holding the mutex. It's not that critical of a path,
> and makes the code look ugly.

The update is done while holding the mutex, but the read-side does not
hold that mutex, so it can observe the intermediate state caused by
store-tearing or invented stores which can be generated by the compiler
on the update-side.

Please refer to the following LWN article:

https://lwn.net/Articles/793253/

Sections:
- "Store tearing"
- "Invented stores"

Arguably, based on that article, store tearing is only observed in the
wild for constants (which is not the case here), and invented stores
seem to require specific code patterns. But I wonder why we would ever want to
pair a fragile non-volatile store with a READ_ONCE() ? Considering the pain
associated to reproduce and hunt down this kind of issue in the wild, I would
be tempted to enforce that any READ_ONCE() operating on a variable would either
need to be paired with WRITE_ONCE() or with atomic operations, so those can
eventually be validated by static code checkers and code sanitizers.

If coding style is your only concern here, we may want to consider
introducing new macros in compiler.h:

WRITE_ONCE_INC(v) /* v++ */
WRITE_ONCE_DEC(v) /* v-- */
WRITE_ONCE_ADD(v, count) /* v += count */
WRITE_ONCE_SUB(v, count) /* v -= count */

Thanks,

Mathieu

>
> -- Steve
>
>
>
>> [ Compile-tested only. I suspect it might fix the following syzbot
>> report:
>>
>> [email protected] ]
>>
>> Signed-off-by: Mathieu Desnoyers <[email protected]>
>> CC: Joel Fernandes (Google) <[email protected]>
>> CC: Peter Zijlstra <[email protected]>
>> CC: Steven Rostedt (VMware) <[email protected]>
>> CC: Thomas Gleixner <[email protected]>
>> CC: Paul E. McKenney <[email protected]>
>> ---
>> kernel/trace/trace_sched_switch.c | 32 ++++++++++++++++++++++----------
>> 1 file changed, 22 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/trace/trace_sched_switch.c
>> b/kernel/trace/trace_sched_switch.c
>> index e288168661e1..902e8bf59aeb 100644
>> --- a/kernel/trace/trace_sched_switch.c
>> +++ b/kernel/trace/trace_sched_switch.c
>> @@ -26,8 +26,8 @@ probe_sched_switch(void *ignore, bool preempt,
>> {
>> int flags;
>>
>> - flags = (RECORD_TGID * !!sched_tgid_ref) +
>> - (RECORD_CMDLINE * !!sched_cmdline_ref);
>> + flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
>> + (RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>>
>> if (!flags)
>> return;
>> @@ -39,8 +39,8 @@ probe_sched_wakeup(void *ignore, struct task_struct *wakee)
>> {
>> int flags;
>>
>> - flags = (RECORD_TGID * !!sched_tgid_ref) +
>> - (RECORD_CMDLINE * !!sched_cmdline_ref);
>> + flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
>> + (RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>>
>> if (!flags)
>> return;
>> @@ -89,21 +89,28 @@ static void tracing_sched_unregister(void)
>>
>> static void tracing_start_sched_switch(int ops)
>> {
>> - bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>> + bool sched_register;
>> +
>> mutex_lock(&sched_register_mutex);
>> + sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>>
>> switch (ops) {
>> case RECORD_CMDLINE:
>> - sched_cmdline_ref++;
>> + WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref + 1);
>> break;
>>
>> case RECORD_TGID:
>> - sched_tgid_ref++;
>> + WRITE_ONCE(sched_tgid_ref, sched_tgid_ref + 1);
>> break;
>> +
>> + default:
>> + WARN_ONCE(1, "Unsupported tracing op: %d", ops);
>> + goto end;
>> }
>>
>> - if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
>> + if (sched_register)
>> tracing_sched_register();
>> +end:
>> mutex_unlock(&sched_register_mutex);
>> }
>>
>> @@ -113,16 +120,21 @@ static void tracing_stop_sched_switch(int ops)
>>
>> switch (ops) {
>> case RECORD_CMDLINE:
>> - sched_cmdline_ref--;
>> + WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref - 1);
>> break;
>>
>> case RECORD_TGID:
>> - sched_tgid_ref--;
>> + WRITE_ONCE(sched_tgid_ref, sched_tgid_ref - 1);
>> break;
>> +
>> + default:
>> + WARN_ONCE(1, "Unsupported tracing op: %d", ops);
>> + goto end;
>> }
>>
>> if (!sched_cmdline_ref && !sched_tgid_ref)
>> tracing_sched_unregister();
>> +end:
>> mutex_unlock(&sched_register_mutex);
>> }

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-16 17:43:16

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 16, 2019, at 1:04 PM, rostedt [email protected] wrote:

> On Fri, 16 Aug 2019 17:48:59 +0100
> Valentin Schneider <[email protected]> wrote:
>
>> On 16/08/2019 17:25, Steven Rostedt wrote:
>> >> Also, write and read to/from those variables should be done with
>> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
>> >> probes without holding the sched_register_mutex.
>> >>
>> >
>> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
>> > It's done while holding the mutex. It's not that critical of a path,
>> > and makes the code look ugly.
>> >
>>
>> I seem to recall something like locking primitives don't protect you from
>> store tearing / invented stores, so if you can have concurrent readers
>> using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
>> if it's done in a critical section.
>
> But for this, it really doesn't matter. The READ_ONCE() is for going
> from 0->1 or 1->0 any other change is the same as 1.

Let's consider this "invented store" scenario (even though as I noted in my
earlier email, I suspect this particular instance of the code does not appear
to fit the requirements to generate this in the wild with current compilers):

intial state:

sched_tgid_ref = 10;

Thread A Thread B

registration tracepoint probe
sched_tgid_ref++
- compiler decides to invent a
store: sched_tgid_ref = 0
READ_ONCE(sched_tgid_ref) -> observes 0,
but should really see either 10 or 11.
- compiler stores 11 into
sched_tgid_ref

This kind of scenario could cause spurious missing data in the middle of a
trace caused by another user trying to increment the reference count, which
is really unexpected.

A similar scenario can happen for "store tearing" if the compiler decides
to break the store into many stores close to the 16-bit overflow value when
updating a 32-bit reference count. Spurious 1 -> 0 -> 1 transitions could be
observed by readers.

> When we enable trace events, we start recording the tasks comms such
> that we can possibly map them to the pids. When we disable trace
> events, we stop recording the comms so that we don't overwrite the
> cache when not needed. Note, if more than the max cache of tasks are
> recorded during a session, we are likely to miss comms anyway.
>
> Thinking about this more, the READ_ONCE() and WRTIE_ONCE() are not even
> needed, because this is just a best effort anyway.

If you choose not to use READ_ONCE(), then the "load tearing" issue can
cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
overflow as described above. The "Invented load" also becomes an issue,
because the compiler could use the loaded value for a branch, and re-load
that value between two branches which are expected to use the same value,
effectively generating a corrupted state.

I think we need a statement about whether READ_ONCE/WRITE_ONCE should
be used in this kind of situation, or if we are fine dealing with the
awkward compiler side-effects when they will occur.

Thanks,

Mathieu

>
> The only real fix was to move the check into the mutex protect area,
> because that can cause a real bug if there was a race.
>
> {
> - bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
> + bool sched_register;
> +
> mutex_lock(&sched_register_mutex);
> + sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>
> Thus, I'd like to see a v2 of this patch without the READ_ONCE() or
> WRITE_ONCE() added.
>
> -- Steve

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-16 19:19:12

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019 13:19:20 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Aug 16, 2019, at 12:25 PM, rostedt [email protected] wrote:
>
> > On Fri, 16 Aug 2019 10:26:43 -0400 Mathieu Desnoyers <[email protected]> wrote:
> >
> [...]
> >>
> >> Also, write and read to/from those variables should be done with
> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> >> probes without holding the sched_register_mutex.
> >>
> >
> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> > It's done while holding the mutex. It's not that critical of a path,
> > and makes the code look ugly.
>
> The update is done while holding the mutex, but the read-side does not
> hold that mutex, so it can observe the intermediate state caused by
> store-tearing or invented stores which can be generated by the compiler
> on the update-side.
>
> Please refer to the following LWN article:
>
> https://lwn.net/Articles/793253/
>
> Sections:
> - "Store tearing"
> - "Invented stores"
>
> Arguably, based on that article, store tearing is only observed in the
> wild for constants (which is not the case here), and invented stores
> seem to require specific code patterns. But I wonder why we would ever want to
> pair a fragile non-volatile store with a READ_ONCE() ? Considering the pain
> associated to reproduce and hunt down this kind of issue in the wild, I would
> be tempted to enforce that any READ_ONCE() operating on a variable would either
> need to be paired with WRITE_ONCE() or with atomic operations, so those can
> eventually be validated by static code checkers and code sanitizers.

My issue is that this is just a case to decide if we should cache a
comm or not. It's a helper, nothing more. There's no guarantee that
something will get cached.

-- Steve

>
> If coding style is your only concern here, we may want to consider
> introducing new macros in compiler.h:
>
> WRITE_ONCE_INC(v) /* v++ */
> WRITE_ONCE_DEC(v) /* v-- */
> WRITE_ONCE_ADD(v, count) /* v += count */
> WRITE_ONCE_SUB(v, count) /* v -= count */
>

2019-08-16 19:19:49

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019 13:41:59 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Aug 16, 2019, at 1:04 PM, rostedt [email protected] wrote:
>
> > On Fri, 16 Aug 2019 17:48:59 +0100
> > Valentin Schneider <[email protected]> wrote:
> >
> >> On 16/08/2019 17:25, Steven Rostedt wrote:
> >> >> Also, write and read to/from those variables should be done with
> >> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> >> >> probes without holding the sched_register_mutex.
> >> >>
> >> >
> >> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> >> > It's done while holding the mutex. It's not that critical of a path,
> >> > and makes the code look ugly.
> >> >
> >>
> >> I seem to recall something like locking primitives don't protect you from
> >> store tearing / invented stores, so if you can have concurrent readers
> >> using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
> >> if it's done in a critical section.
> >
> > But for this, it really doesn't matter. The READ_ONCE() is for going
> > from 0->1 or 1->0 any other change is the same as 1.
>
> Let's consider this "invented store" scenario (even though as I noted in my
> earlier email, I suspect this particular instance of the code does not appear
> to fit the requirements to generate this in the wild with current compilers):
>
> intial state:
>
> sched_tgid_ref = 10;
>
> Thread A Thread B
>
> registration tracepoint probe
> sched_tgid_ref++
> - compiler decides to invent a
> store: sched_tgid_ref = 0

This looks to me that this would cause more issues in other parts of
the code if a compiler ever decided to do this.

But I still don't care for this case. It's a cache, nothing more. No
guarantee that anything will be recorded.

> READ_ONCE(sched_tgid_ref) ->
> observes 0, but should really see either 10 or 11.
> - compiler stores 11 into
> sched_tgid_ref
>
> This kind of scenario could cause spurious missing data in the middle
> of a trace caused by another user trying to increment the reference
> count, which is really unexpected.
>
> A similar scenario can happen for "store tearing" if the compiler
> decides to break the store into many stores close to the 16-bit
> overflow value when updating a 32-bit reference count. Spurious 1 ->
> 0 -> 1 transitions could be observed by readers.
>
> > When we enable trace events, we start recording the tasks comms such
> > that we can possibly map them to the pids. When we disable trace
> > events, we stop recording the comms so that we don't overwrite the
> > cache when not needed. Note, if more than the max cache of tasks are
> > recorded during a session, we are likely to miss comms anyway.
> >
> > Thinking about this more, the READ_ONCE() and WRTIE_ONCE() are not
> > even needed, because this is just a best effort anyway.
>
> If you choose not to use READ_ONCE(), then the "load tearing" issue
> can cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> overflow as described above. The "Invented load" also becomes an
> issue, because the compiler could use the loaded value for a branch,
> and re-load that value between two branches which are expected to use
> the same value, effectively generating a corrupted state.
>
> I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> be used in this kind of situation, or if we are fine dealing with the
> awkward compiler side-effects when they will occur.
>

Let me put it this way. My biggest issue with this, is that the
READ/WRITE_ONCE() has *nothing* to do with the bug you are trying to
fix.

That bug is that we did the decision of starting the probes outside the
mutex. That is fixed my moving the decision into the mutex. The
READ/WRITE_ONCE() is just added noise.

-- Steve

2019-08-16 19:22:15

by Alan Stern

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:

> If you choose not to use READ_ONCE(), then the "load tearing" issue can
> cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> overflow as described above. The "Invented load" also becomes an issue,
> because the compiler could use the loaded value for a branch, and re-load
> that value between two branches which are expected to use the same value,
> effectively generating a corrupted state.
>
> I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> be used in this kind of situation, or if we are fine dealing with the
> awkward compiler side-effects when they will occur.

The only real downside (apart from readability) of READ_ONCE and
WRITE_ONCE is that they prevent the compiler from optimizing accesses
to the location being read or written. But if you're just doing a
single access in each place, not multiple accesses, then there's
nothing to optimize anyway. So there's no real reason not to use
READ_ONCE or WRITE_ONCE.

Alan Stern

2019-08-16 20:45:14

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 3:19 PM Alan Stern <[email protected]> wrote:
>
> On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:
>
> > If you choose not to use READ_ONCE(), then the "load tearing" issue can
> > cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> > overflow as described above. The "Invented load" also becomes an issue,
> > because the compiler could use the loaded value for a branch, and re-load
> > that value between two branches which are expected to use the same value,
> > effectively generating a corrupted state.
> >
> > I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> > be used in this kind of situation, or if we are fine dealing with the
> > awkward compiler side-effects when they will occur.
>
> The only real downside (apart from readability) of READ_ONCE and
> WRITE_ONCE is that they prevent the compiler from optimizing accesses
> to the location being read or written. But if you're just doing a
> single access in each place, not multiple accesses, then there's
> nothing to optimize anyway. So there's no real reason not to use
> READ_ONCE or WRITE_ONCE.

I am also more on the side of using *_ONCE. To me, by principal, I
would be willing to convert any concurrent plain access using _ONCE,
just so we don't have to worry about it now or in the future and also
documents the access.

Perhaps the commit message can be reworded to mention that the _ONCE
is an additional clean up for safe access.

thanks,

- Joel

2019-08-16 20:50:13

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019 16:44:10 -0400
Joel Fernandes <[email protected]> wrote:

> I am also more on the side of using *_ONCE. To me, by principal, I
> would be willing to convert any concurrent plain access using _ONCE,
> just so we don't have to worry about it now or in the future and also
> documents the access.
>
> Perhaps the commit message can be reworded to mention that the _ONCE
> is an additional clean up for safe access.

The most I'll take is two separate patches. One is going to be marked
for stable as it fixes a real bug. The other is more for cosmetic or
theoretical issues, that I will state clearly "NOT FOR STABLE", such
that the autosel doesn't take them.

-- Steve

2019-08-16 20:50:38

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019, Joel Fernandes wrote:
> On Fri, Aug 16, 2019 at 3:19 PM Alan Stern <[email protected]> wrote:
> > On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:
> >
> > > If you choose not to use READ_ONCE(), then the "load tearing" issue can
> > > cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> > > overflow as described above. The "Invented load" also becomes an issue,
> > > because the compiler could use the loaded value for a branch, and re-load
> > > that value between two branches which are expected to use the same value,
> > > effectively generating a corrupted state.
> > >
> > > I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> > > be used in this kind of situation, or if we are fine dealing with the
> > > awkward compiler side-effects when they will occur.
> >
> > The only real downside (apart from readability) of READ_ONCE and
> > WRITE_ONCE is that they prevent the compiler from optimizing accesses
> > to the location being read or written. But if you're just doing a
> > single access in each place, not multiple accesses, then there's
> > nothing to optimize anyway. So there's no real reason not to use
> > READ_ONCE or WRITE_ONCE.
>
> I am also more on the side of using *_ONCE. To me, by principal, I
> would be willing to convert any concurrent plain access using _ONCE,
> just so we don't have to worry about it now or in the future and also
> documents the access.

By that argumentation we need to plaster half of the kernel with _ONCE()
and I'm so not looking forward to the insane amount of script kiddies
patches to do that.

Can we finally put a foot down and tell compiler and standard committee
people to stop this insanity?

Thanks,

tglx

2019-08-16 20:59:09

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 10:49:04PM +0200, Thomas Gleixner wrote:
> On Fri, 16 Aug 2019, Joel Fernandes wrote:
> > On Fri, Aug 16, 2019 at 3:19 PM Alan Stern <[email protected]> wrote:
> > > On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:
> > >
> > > > If you choose not to use READ_ONCE(), then the "load tearing" issue can
> > > > cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> > > > overflow as described above. The "Invented load" also becomes an issue,
> > > > because the compiler could use the loaded value for a branch, and re-load
> > > > that value between two branches which are expected to use the same value,
> > > > effectively generating a corrupted state.
> > > >
> > > > I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> > > > be used in this kind of situation, or if we are fine dealing with the
> > > > awkward compiler side-effects when they will occur.
> > >
> > > The only real downside (apart from readability) of READ_ONCE and
> > > WRITE_ONCE is that they prevent the compiler from optimizing accesses
> > > to the location being read or written. But if you're just doing a
> > > single access in each place, not multiple accesses, then there's
> > > nothing to optimize anyway. So there's no real reason not to use
> > > READ_ONCE or WRITE_ONCE.
> >
> > I am also more on the side of using *_ONCE. To me, by principal, I
> > would be willing to convert any concurrent plain access using _ONCE,
> > just so we don't have to worry about it now or in the future and also
> > documents the access.
>
> By that argumentation we need to plaster half of the kernel with _ONCE()
> and I'm so not looking forward to the insane amount of script kiddies
> patches to do that.

Really? That is quite scary that you are saying half of the kernel has issues
with concurrent access or compiler optimizations. It scares me that a
concurrent access can tear down a store/load and existing code can just fail,
if that is the case.

> Can we finally put a foot down and tell compiler and standard committee
> people to stop this insanity?

Sure, or could the compilers provide flags which prevent such optimization
similar to -O* flags?

thanks,

- Joel

2019-08-16 21:02:46

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 04:49:12PM -0400, Steven Rostedt wrote:
> On Fri, 16 Aug 2019 16:44:10 -0400
> Joel Fernandes <[email protected]> wrote:
>
>
> > I am also more on the side of using *_ONCE. To me, by principal, I
> > would be willing to convert any concurrent plain access using _ONCE,
> > just so we don't have to worry about it now or in the future and also
> > documents the access.
> >
> > Perhaps the commit message can be reworded to mention that the _ONCE
> > is an additional clean up for safe access.
>
> The most I'll take is two separate patches. One is going to be marked
> for stable as it fixes a real bug. The other is more for cosmetic or
> theoretical issues, that I will state clearly "NOT FOR STABLE", such
> that the autosel doesn't take them.

Makes sense to me!

thanks,

- Joel

2019-08-16 21:06:26

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <[email protected]> wrote:
>
> Can we finally put a foot down and tell compiler and standard committee
> people to stop this insanity?

It's already effectively done.

Yes, values can be read from memory multiple times if they need
reloading. So "READ_ONCE()" when the value can change is a damn good
idea.

But it should only be used if the value *can* change. Inside a locked
region it is actively pointless and misleading.

Similarly, WRITE_ONCE() should only be used if you have a _reason_ for
using it (notably if you're not holding a lock).

If people use READ_ONCE/WRITE_ONCE when there are locks that prevent
the values from changing, they are only making the code illegible.
Don't do it.

But in the *absence* of locking, READ_ONCE/WRITE_ONCE is usually a
good thing. The READ_ONCE actually tends to matter, because even if
the value is used only once at a source level, the compiler *could*
decide to do something else.

The WRITE_ONCE() may or may not matter (afaik, thanks to concurrency,
modern C standard does not allow optimistic writes anyway, and we
wouldn't really accept such a compiler option if it did).

But if the write is done without locking, it's good practice just to
show you are aware of the whole "done without locking" part.

Linus

2019-08-16 22:29:04

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On 16/08/2019 21:57, Joel Fernandes wrote:
>> Can we finally put a foot down and tell compiler and standard committee
>> people to stop this insanity?
>
> Sure, or could the compilers provide flags which prevent such optimization
> similar to -O* flags?
>

How would you differentiate optimizations you want from those you don't with
just a flag? There's a reason we use volatile casts instead of declaring
everything volatile: we actually *want* those optimizations. It just so
happens that we don't want them *in some places*, and we have tools to tag
them as such.

The alternative is having a compiler that can magically correlate e.g. locked
writes with lock-free reads and properly handle them, but I don't think
there's a foolproof way of doing that.

> thanks,
>
> - Joel
>

2019-08-16 22:59:31

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 3:27 PM Valentin Schneider
<[email protected]> wrote:
>
> How would you differentiate optimizations you want from those you don't with
> just a flag? There's a reason we use volatile casts instead of declaring
> everything volatile: we actually *want* those optimizations. It just so
> happens that we don't want them *in some places*, and we have tools to tag
> them as such.

We actually disable lots of "valid" (read: the standard allows them,
but they are completely wrong for the kernel) optimizations because
they are wrong.

The whole type-based alias thing is just wrong. The C standards body
was incompetent to allow that garbage. So we disable it.

If you can *prove* that no aliasing exists, go ahead and re-order
accesses. But no guesses based on random types.

Similarly, if some compiler decides that it's ok to make speculative
writes (knowing it will over-write it with the right data later) to
data that is possibly visible to other threads, then such an
"optimization" needs to just be disabled. It might help some
benchmark, and if you read the standard just the right way it might be
allowed - but that doesn't make it valid.

We already had situations like that, where compiler people thought it
would be ok (for example) to turns a narrow write into a wider
read-modify-write because it had already done the wider read for other
reasons.

Again, the original C standard "allows" that in theory, because the
original C standard doesn't take threading into account. In fact, the
alpha architecture made actively bad design decisions based on that
(incorrect) assumption.

It turns out that in that case, even non-kernel people rebelled, and
it's apparently thankfully not allowed in newer versions of the
standard, exactly because threading has become a thing. You can't
magically write back unrelated variables just because they might be
next-door neighbors and share a word.

So no, we do *not* in general just say that we want any random
optimizations. A compiler that turns a single write into something
else is almost certainly something that shouldn't be allowed near the
kernel.

We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
because of some theoretical "compiler is free to do garbage"
arguments. If such garbage happens, we need to fix the compiler, the
same way we already do with

-fno-strict-aliasing
-fno-delete-null-pointer-checks
-fno-strict-overflow

because all those "optimizations" are just fundamentally unsafe and wrong.

I really wish the compiler would never take advantage of "I can prove
this is undefined behavior" kind of things when it comes to the kernel
(or any other projects I am involved with, for that matter). If you
can prove that, then you shouldn't decide to generate random code
without a big warning. But that's what those optimizations that we
disable effectively all do.

I'd love to have a flag that says "all undefined behavior is treated
as implementation-defined". There's a somewhat subtle - but very
important - difference there.

And that's what some hypothetical speculative write optimizations do
too. I do not believe they are valid for the kernel. If the code says

if (a)
global_var = 1
else
global_var = 0

then the compiler had better not turn that into

global_var = 0
if (a)
global_var = 1

even if there isn't a volatile there. But yes, we've had compiler
writers that say "if you read the specs, that's ok".

No, it's not ok. Because reality trumps any weasel-spec-reading.

And happily, I don't think we've ever really seen a compiler that we
use that actually does the above kind of speculative write thing (but
doing it for your own local variables that can't be seen by other
threads of execution - go wild).

So in general, we very much expect the compiler to do sane code
generation, and not (for example) do store tearing on normal
word-sized things or add writes that weren't there originally etc.

And yes, reads are different from writes. Reads don't have the same
kind of "other threads of execution can see them" effects, so a
compiler turning a single read into multiple reads is much more
realistic and not the same kind of "we need to expect a certain kind
of sanity from the compiler" issue.

Linus

2019-08-17 01:27:19

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 16, 2019, at 4:49 PM, rostedt [email protected] wrote:

> On Fri, 16 Aug 2019 16:44:10 -0400
> Joel Fernandes <[email protected]> wrote:
>
>
>> I am also more on the side of using *_ONCE. To me, by principal, I
>> would be willing to convert any concurrent plain access using _ONCE,
>> just so we don't have to worry about it now or in the future and also
>> documents the access.
>>
>> Perhaps the commit message can be reworded to mention that the _ONCE
>> is an additional clean up for safe access.
>
> The most I'll take is two separate patches. One is going to be marked
> for stable as it fixes a real bug. The other is more for cosmetic or
> theoretical issues, that I will state clearly "NOT FOR STABLE", such
> that the autosel doesn't take them.

Splitting this into two separate patches makes perfect sense.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 01:37:57

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 16, 2019, at 5:04 PM, Linus Torvalds [email protected] wrote:

> On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <[email protected]> wrote:
>>
>> Can we finally put a foot down and tell compiler and standard committee
>> people to stop this insanity?
>
> It's already effectively done.
>
> Yes, values can be read from memory multiple times if they need
> reloading. So "READ_ONCE()" when the value can change is a damn good
> idea.
>
> But it should only be used if the value *can* change. Inside a locked
> region it is actively pointless and misleading.
>
> Similarly, WRITE_ONCE() should only be used if you have a _reason_ for
> using it (notably if you're not holding a lock).
>
> If people use READ_ONCE/WRITE_ONCE when there are locks that prevent
> the values from changing, they are only making the code illegible.
> Don't do it.

I agree with your argument in the case where both read-side and write-side
are protected by the same lock: READ/WRITE_ONCE are useless then. However,
in the scenario we have here, only the write-side is protected by the lock
against concurrent updates, but reads don't take any lock.

If WRITE_ONCE has any use at all (protecting against store tearing and
invented stores), it should be used even with a lock held in this
scenario, because the lock does not prevent READ_ONCE() from observing
transient states caused by lack of WRITE_ONCE() for the update.

So why does WRITE_ONCE exist in the first place ? Is it for documentation
purposes only or are there actual situations where omitting it can cause
bugs with real-life compilers ?

In terms of code change, should we favor only introducing WRITE_ONCE
in new code, or should existing code matching those conditions be
moved to WRITE_ONCE as bug fixes ?

Thanks,

Mathieu

>
> But in the *absence* of locking, READ_ONCE/WRITE_ONCE is usually a
> good thing. The READ_ONCE actually tends to matter, because even if
> the value is used only once at a source level, the compiler *could*
> decide to do something else.
>
> The WRITE_ONCE() may or may not matter (afaik, thanks to concurrency,
> modern C standard does not allow optimistic writes anyway, and we
> wouldn't really accept such a compiler option if it did).
>
> But if the write is done without locking, it's good practice just to
> show you are aware of the whole "done without locking" part.
>
> Linus

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 01:43:03

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 16, 2019, at 6:57 PM, Linus Torvalds [email protected] wrote:

> So in general, we very much expect the compiler to do sane code
> generation, and not (for example) do store tearing on normal
> word-sized things or add writes that weren't there originally etc.

My understanding of https://lwn.net/Articles/793253/ section "Store tearing"
which points at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028 seems
to contradict your expectation at least when writing constants to a
64-bit word without a volatile access.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 02:14:31

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, 16 Aug 2019 21:36:49 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Aug 16, 2019, at 5:04 PM, Linus Torvalds [email protected] wrote:
>
> > On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <[email protected]> wrote:
> >>
> >> Can we finally put a foot down and tell compiler and standard committee
> >> people to stop this insanity?
> >
> > It's already effectively done.
> >
> > Yes, values can be read from memory multiple times if they need
> > reloading. So "READ_ONCE()" when the value can change is a damn good
> > idea.
> >
> > But it should only be used if the value *can* change. Inside a locked
> > region it is actively pointless and misleading.
> >
> > Similarly, WRITE_ONCE() should only be used if you have a _reason_ for
> > using it (notably if you're not holding a lock).
> >
> > If people use READ_ONCE/WRITE_ONCE when there are locks that prevent
> > the values from changing, they are only making the code illegible.
> > Don't do it.
>
> I agree with your argument in the case where both read-side and write-side
> are protected by the same lock: READ/WRITE_ONCE are useless then. However,
> in the scenario we have here, only the write-side is protected by the lock
> against concurrent updates, but reads don't take any lock.

And because reads are not protected by any lock or memory barrier,
using READ_ONCE() is pointless. The CPU could be doing a lot of hidden
manipulation of that variable too.

Again, this is just to enable caching of the comm. Nothing more. It's a
"best effort" algorithm. We get it, we then can map a pid to a name. If
not, we just have a pid and we map "<...>".

Next you'll be asking for the memory barriers to guarantee a real hit.
And honestly, this information is not worth any overhead.

And most often we enable this before we enable the tracepoint we want
this information from, which requires modification of the text area and
will do a bunch of syncs across CPUs. That alone will most likely keep
any race from happening.

The only real bug is the check to know if we should add the probe or
not was done outside the lock, and when we hit the race, we could add a
probe twice, causing the kernel to spit out a warning. You fixed that,
and that's all that needs to be done. I'm now even more against adding
the READ_ONCE() or WRITE_ONCE().

-- Steve

>
> If WRITE_ONCE has any use at all (protecting against store tearing and
> invented stores), it should be used even with a lock held in this
> scenario, because the lock does not prevent READ_ONCE() from observing
> transient states caused by lack of WRITE_ONCE() for the update.
>
> So why does WRITE_ONCE exist in the first place ? Is it for documentation
> purposes only or are there actual situations where omitting it can cause
> bugs with real-life compilers ?
>
> In terms of code change, should we favor only introducing WRITE_ONCE
> in new code, or should existing code matching those conditions be
> moved to WRITE_ONCE as bug fixes ?
>
>

2019-08-17 04:53:54

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 03:57:43PM -0700, Linus Torvalds wrote:

[ . . . ]

> We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
> because of some theoretical "compiler is free to do garbage"
> arguments. If such garbage happens, we need to fix the compiler, the
> same way we already do with
>
> -fno-strict-aliasing

Yeah, the compete-with-FORTRAN stuff. :-/

There is some work going on in the C committee on this, where the
theorists would like to restrict strict-alias based optimizations to
enable better analysis tooling. And no, although the theorists are
pushing in the direction we would like them to, as far as I can see
they are not pushing as far as we would like. But it might be that
-fno-strict-aliasing needs some upgrades as well. I expect to learn
more at the next meeting in a few months.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf

> -fno-delete-null-pointer-checks
> -fno-strict-overflow
>
> because all those "optimizations" are just fundamentally unsafe and wrong.

I was hoping that -fno-strict-overflow might go into the C++ (not C)
standard, and even thought that it had at one point, but what went into
the standard was that signed integers are twos complement, not that
overflowing them is well defined.

We are pushing to hopefully end but at least to restrict the current
pointer lifetime-end zap behavior in both C and C++, which is similar
to the pointer provenance/alias analysis that -fno-strict-aliasing at
least partially suppresses. The zapping invalidates loading, storing,
comparing, or doing arithmetic on a pointer to an object whose lifetime
has ended. (The WG14 PDF linked to below gives a non-exhaustive list
of problems that this causes.)

The Linux kernel used to avoid this by refusing to tell the compiler about
kmalloc() and friends, but that changed a few years ago. This zapping
rules out a non-trivial class of concurrent algorithms, but for once
RCU is unaffected. Some committee members complained that zapping has
been part of the standard since about 1990, but Maged Michael found a
writeup of one of the algorithms dating back to 1973. ;-)

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2369.pdf
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1726r0.pdf

> I really wish the compiler would never take advantage of "I can prove
> this is undefined behavior" kind of things when it comes to the kernel
> (or any other projects I am involved with, for that matter). If you
> can prove that, then you shouldn't decide to generate random code
> without a big warning. But that's what those optimizations that we
> disable effectively all do.
>
> I'd love to have a flag that says "all undefined behavior is treated
> as implementation-defined". There's a somewhat subtle - but very
> important - difference there.

It would also be nice to downgrade some of the undefined behavior in
the standard itself. Compiler writers usually hate this because it
would require them to document what their compiler does in each case.
So they would prefer "unspecified" if the could not have "undefined".

> And that's what some hypothetical speculative write optimizations do
> too. I do not believe they are valid for the kernel. If the code says
>
> if (a)
> global_var = 1
> else
> global_var = 0
>
> then the compiler had better not turn that into
>
> global_var = 0
> if (a)
> global_var = 1
>
> even if there isn't a volatile there. But yes, we've had compiler
> writers that say "if you read the specs, that's ok".
>
> No, it's not ok. Because reality trumps any weasel-spec-reading.
>
> And happily, I don't think we've ever really seen a compiler that we
> use that actually does the above kind of speculative write thing (but
> doing it for your own local variables that can't be seen by other
> threads of execution - go wild).

Me, I would still want to use WRITE_ONCE() in this case.

> So in general, we very much expect the compiler to do sane code
> generation, and not (for example) do store tearing on normal
> word-sized things or add writes that weren't there originally etc.
>
> And yes, reads are different from writes. Reads don't have the same
> kind of "other threads of execution can see them" effects, so a
> compiler turning a single read into multiple reads is much more
> realistic and not the same kind of "we need to expect a certain kind
> of sanity from the compiler" issue.

Though each of those compiler-generated multiple reads might return a
different value, right?

Thanx, Paul

2019-08-17 08:10:07

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019, 18:36 Mathieu Desnoyers
<[email protected]> wrote:
>
> If WRITE_ONCE has any use at all (protecting against store tearing and
> invented stores), it should be used even with a lock held in this
> scenario, because the lock does not prevent READ_ONCE() from observing
> transient states caused by lack of WRITE_ONCE() for the update.

The thing is, we really haven't requred WRITE_ONCE() to protect
against store tearing.

We have lots of traditional code that does stuff along the lines of

.. set of data structure ..
smp_wmb();
*ptr = newdata;

and we simply *depend* on the compiler doing the "*ptr" as a single
write. We've simply always done that. Even on UP we've had the
"interrupts will see old value or new value but not some random
half-way value", going all the way back to the original kernel
sources.

The data tearing issue is almost a non-issue. We're not going to add
WRITE_ONCE() to these kinds of places for no good reason.

> So why does WRITE_ONCE exist in the first place ? Is it for documentation
> purposes only or are there actual situations where omitting it can cause
> bugs with real-life compilers ?

WRITE_ONCE should be seen mainly as (a) documentation and (b) for new code.

Although I suspect often you'd be better off using smb_load_acquire()
and smp_store_release() when you have code sequences where you have
unlocked READ_ONCE/WRITE_ONCE and memory ordering.

WRITE_ONCE() doesn't even protect against data tearing. If you do a
"WRITE_ONCE()" on a type larger than 8 bytes, it will fall back to
__builtin_memcpy().

So honestly, WRITE_ONCE() is often more documentation than protecting
against something, but overdoing it doesn't help document anything, it
just obscures the point.

Yeah, yeah, it will use a "volatile" access for the proper normal
types, but even then that won't protect against data tearing on 64-bit
writes on a 32-bit machine, for example. It doesn't even give ordering
guarantees for the sub-words.

So you still end up with the almost the same basic rule: a natural
byte/word write will be atomic. But really, it's going to be so even
without the WRITE_ONCE(), so...

It does ostensibly protect against the compiler re-ordering the write
against other writes (note: *compiler*, not CPU), and it does make
sure the write only happens once. But it's really hard to see a valid
compiler optimization that wouldn't do that anyway.

Because of the compiler ordering of WRITE_ONCE against other
WRITE_ONCE cases, it could be used for irq-safe ordering on the local
cpu, for example. If that matters. Although I suspect any such case is
practically going to use per-cpu variables anyway.

End result: it's *mostly* about documentation.

Just do a grep for WRITE_ONCE() vs READ_ONCE(). You'll find a lot more
users of the latter. And quite frankly, I looked at some of the
WRITE_ONCE() users and a lot of them were kind of pointless.

Somebody tell me just exactly how they expect the WRITE_ONCE() cases
in net/tls/tls_sw.c to matter, for example (and that was literally a
random case I happened to look at). It's not clear what they do, since
they certainly don't imply any memory ordering. They do imply a
certain amount of compiler re-ordering due to the volatile, but that's
pretty limited too. Only wrt _other_ things with side effects, not the
normal code around them anyway.

In contrast, READ_ONCE() has very practical examples of mattering,
because unlike writes, compilers really can reasonably split reads.
For example, if you're testing multiple bits in the same word, and you
want to make sure they are coherent, you should very much do

val = READ_ONCE(mem);
.. test different bits in val ..

because without the READ_ONCE(), the compiler could end up just doing
the read multiple times.

Similarly, even if you only use a value once, this is actually
something a compiler can do:

if (a) {
lock();
B()
unlock();
} else
B();

and a compiler might end up generating that as

if (a) lock();
B();
if (a) unlock();

instead. So doing "if (READ_ONCE(a))" actually makes a difference - it
guarantees that 'a' is only read once, even if that value was
_literally_ only used on a source level, the compiler really could
have decided "let's read it twice".

See how duplicating a read is fundamentally different from duplicating
a write? Duplicating or splitting a read is not visible to other
threads. Notice how nothiing like the above is reasonable for a write.

That said, the most common case really is none of the above half-way
subtle cases. It's literally the whole "write pointer once". Making
existing code that does that use WRITE_ONCE() is completely pointless.
It's not going to really change or fix anything at all.

Linus

2019-08-17 08:31:25

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 9:52 PM Paul E. McKenney <[email protected]> wrote:
>
> > I'd love to have a flag that says "all undefined behavior is treated
> > as implementation-defined". There's a somewhat subtle - but very
> > important - difference there.
>
> It would also be nice to downgrade some of the undefined behavior in
> the standard itself. Compiler writers usually hate this because it
> would require them to document what their compiler does in each case.
> So they would prefer "unspecified" if the could not have "undefined".

That certainly would sound very good to me.

It's not the "documented behavior" that is important to me (since it
is still going to be potentially different across different
platforms), it's the "at least it's *some* consistent behavior for
that platform".

That's just _so_ much better than "undefined behavior" which basically
says the compiler writer can do absolutely anything, whether it makes
sense or not, and whether it might open a small bug up to absolutely
catastrophic end results.

> > if (a)
> > global_var = 1
> > else
> > global_var = 0
>
> Me, I would still want to use WRITE_ONCE() in this case.

I actually suspect that if we had this kind of code, and a real reason
why readers would see it locklessly, then yes, WRITE_ONCE() makes
sense.

But most of the cases where people add WRITE_ONCE() aren't even the
above kind of half-questionable cases. They are just literally

WRITE_ONCE(flag, value);

and since there is no real memory ordering implied by this, what's the
actual value of that statement? What problem does it really solve wrt
just writing it as

flag = value;

which is generally a lot easier to read.

If the flag has semantic behavior wrt other things, and the write can
race with a read (whether it's the read or the write that is unlocked
is irrelevant), is still doesn't tend to make a lot of real
difference.

In the end, the most common reason for a WRITE_ONCE() is mostly just
"to visually pair up with the non-synchronized read that uses
READ_ONCE()"

Put another way: a WRITE_ONCE() without a paired READ_ONCE() is almost
certainly pointless.

But the reverse is not really true. All a READ_ONCE() says is "I want
either the old or the new value", and it can get that _without_ being
paired with a WRITE_ONCE().

See? They just aren't equally important.

> > And yes, reads are different from writes. Reads don't have the same
> > kind of "other threads of execution can see them" effects, so a
> > compiler turning a single read into multiple reads is much more
> > realistic and not the same kind of "we need to expect a certain kind
> > of sanity from the compiler" issue.
>
> Though each of those compiler-generated multiple reads might return a
> different value, right?

Right. See the examples I have in the email to Mathieu:

unsigned int bits = some_global_value;
...test different bits in in 'bits' ...

can easily cause multiple reads (particularly on a CPU that has a
"test bits in memory" instruction and a lack of registers.

So then doing it as

unsigned int bits = READ_ONCE(some_global_value);
.. test different bits in 'bits'...

makes a real and obvious semantic difference. In ways that changing a one-time

ptr->flag = true;

to

WRITE_ONCE(ptr->flag, true);

does _not_ make any obvious semantic difference what-so-ever.

Caching reads is also something that makes sense and is common, in
ways that caching writes does not. So doing

while (in_progress_global) /* twiddle your thumbs */;

obviously trivially translates to an infinite loop with a single
conditional in front of it, in ways that

while (READ_ONCE(in_progress_global)) /* twiddle */;

does not.

So there are often _obvious_ reasons to use READ_ONCE().

I really do not find the same to be true of WRITE_ONCE(). There are
valid uses, but they are definitely much less common, and much less
obvious.

Linus

2019-08-17 08:45:46

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, Aug 17, 2019 at 1:28 AM Linus Torvalds
<[email protected]> wrote:
>
> unsigned int bits = some_global_value;
> ...test different bits in in 'bits' ...
>
> can easily cause multiple reads (particularly on a CPU that has a
> "test bits in memory" instruction and a lack of registers.
>
> So then doing it as
>
> unsigned int bits = READ_ONCE(some_global_value);
> .. test different bits in 'bits'...

Side note: this is likely the best example of actual WRITE_ONCE() use
too: if you have that global value with multiple bits that actually
have some interdependencies, then doing

some_global_value = some_complex_expression();

might be reasonably compiled to do several rmw instructions to update
'some_global_value'

So then

WRITE_ONCE(some_global_value, some_complex_expression());

really can be a good thing - it clearly just writes things once, and
it also documents the whole "write one or the other" value, not some
mid-way one, when you then look at the READ_ONCE() thing.

But I'm seeing a lot of WRITE_ONCE(x, constantvalue) kind of things
and don't seem to find a lot of reason to think that they are any
inherently better than "x = constantvalue".

(In contrast, using "smp_store_release(flag, true)" has inherent
value, because it actually implies a memory barrier wrt previous
writes, in ways that WRITE_ONCE() or a direct assignment does not.)

Ok, enough blathering. I think I've made my point.

Linus

2019-08-17 14:29:04

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 16, 2019, at 3:15 PM, rostedt [email protected] wrote:

> On Fri, 16 Aug 2019 13:19:20 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> ----- On Aug 16, 2019, at 12:25 PM, rostedt [email protected] wrote:
>>
>> > On Fri, 16 Aug 2019 10:26:43 -0400 Mathieu Desnoyers
>> > <[email protected]> wrote:
>> >
>> [...]
>> >>
>> >> Also, write and read to/from those variables should be done with
>> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
>> >> probes without holding the sched_register_mutex.
>> >>
>> >
>> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
>> > It's done while holding the mutex. It's not that critical of a path,
>> > and makes the code look ugly.
>>
>> The update is done while holding the mutex, but the read-side does not
>> hold that mutex, so it can observe the intermediate state caused by
>> store-tearing or invented stores which can be generated by the compiler
>> on the update-side.
>>
>> Please refer to the following LWN article:
>>
>> https://lwn.net/Articles/793253/
>>
>> Sections:
>> - "Store tearing"
>> - "Invented stores"
>>
>> Arguably, based on that article, store tearing is only observed in the
>> wild for constants (which is not the case here), and invented stores
>> seem to require specific code patterns. But I wonder why we would ever want to
>> pair a fragile non-volatile store with a READ_ONCE() ? Considering the pain
>> associated to reproduce and hunt down this kind of issue in the wild, I would
>> be tempted to enforce that any READ_ONCE() operating on a variable would either
>> need to be paired with WRITE_ONCE() or with atomic operations, so those can
>> eventually be validated by static code checkers and code sanitizers.
>
> My issue is that this is just a case to decide if we should cache a
> comm or not. It's a helper, nothing more. There's no guarantee that
> something will get cached.

I get your point wrt WRITE_ONCE(): since it's a cache it should not have
user-visible effects if a temporary incorrect value is observed. Well in
reality, it's not a cache: if the lookup fails, it returns "<...>" instead,
so cache lookup failure ends up not providing any useful data in the trace.
Let's assume this is a known and documented tracer limitation.

However, wrt READ_ONCE(), things are different. The variable read ends up
being used to control various branches in the code, and the compiler could
decide to re-fetch the variable (with a different state), and therefore
cause _some_ of the branches to be inconsistent. See
tracing_record_taskinfo_sched_switch() and tracing_record_taskinfo() @flags
parameter.

AFAIU the current code should not generate any out-of-bound writes in case of
re-fetch, but no comment in there documents how fragile this is.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 14:45:27

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 16, 2019, at 10:13 PM, rostedt [email protected] wrote:

> On Fri, 16 Aug 2019 21:36:49 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> ----- On Aug 16, 2019, at 5:04 PM, Linus Torvalds [email protected]
>> wrote:
>>
>> > On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <[email protected]> wrote:
>> >>
>> >> Can we finally put a foot down and tell compiler and standard committee
>> >> people to stop this insanity?
>> >
>> > It's already effectively done.
>> >
>> > Yes, values can be read from memory multiple times if they need
>> > reloading. So "READ_ONCE()" when the value can change is a damn good
>> > idea.
>> >
>> > But it should only be used if the value *can* change. Inside a locked
>> > region it is actively pointless and misleading.
>> >
>> > Similarly, WRITE_ONCE() should only be used if you have a _reason_ for
>> > using it (notably if you're not holding a lock).
>> >
>> > If people use READ_ONCE/WRITE_ONCE when there are locks that prevent
>> > the values from changing, they are only making the code illegible.
>> > Don't do it.
>>
>> I agree with your argument in the case where both read-side and write-side
>> are protected by the same lock: READ/WRITE_ONCE are useless then. However,
>> in the scenario we have here, only the write-side is protected by the lock
>> against concurrent updates, but reads don't take any lock.
>
> And because reads are not protected by any lock or memory barrier,
> using READ_ONCE() is pointless. The CPU could be doing a lot of hidden
> manipulation of that variable too.

Please enlighten me by providing some details on what the CPU could do to
this word-aligned, word-sized variable in the absence of lock and barrier
that is relevant to this specific use-case ?

I suspect most of the barriers you refer to here are taken care of by the
tracepoint code which uses RCU to synchronize probe registration wrt
probe callback execution.

>
> Again, this is just to enable caching of the comm. Nothing more. It's a
> "best effort" algorithm. We get it, we then can map a pid to a name. If
> not, we just have a pid and we map "<...>".
>
> Next you'll be asking for the memory barriers to guarantee a real hit.
> And honestly, this information is not worth any overhead.

No, that's not my intent to add overhead to guarantee trace data
availability near trace beginning and end. However, considering that
READ_ONCE() and WRITE_ONCE() can provide additional data availability
guarantees in the middle of traces at no runtime cost, it seems like a
good trade off.

It's easier for an analysis to disregard missing information at the
beginning and end of trace without generating false-positive reports
than when it happens spuriously in the middle of traces.

>
> And most often we enable this before we enable the tracepoint we want
> this information from, which requires modification of the text area and
> will do a bunch of syncs across CPUs. That alone will most likely keep
> any race from happening.

Indeed the tracepoint use of RCU to synchronize registration vs probes
should take care of those barriers.

>
> The only real bug is the check to know if we should add the probe or
> not was done outside the lock, and when we hit the race, we could add a
> probe twice, causing the kernel to spit out a warning. You fixed that,
> and that's all that needs to be done.

I just sent that fix in a separate patch.

> I'm now even more against adding the READ_ONCE() or WRITE_ONCE().

I'm not convinced by your arguments.

Thanks,

Mathieu

>
>
> -- Steve
>
>
>
>>
>> If WRITE_ONCE has any use at all (protecting against store tearing and
>> invented stores), it should be used even with a lock held in this
>> scenario, because the lock does not prevent READ_ONCE() from observing
>> transient states caused by lack of WRITE_ONCE() for the update.
>>
>> So why does WRITE_ONCE exist in the first place ? Is it for documentation
>> purposes only or are there actual situations where omitting it can cause
>> bugs with real-life compilers ?
>>
>> In terms of code change, should we favor only introducing WRITE_ONCE
>> in new code, or should existing code matching those conditions be
>> moved to WRITE_ONCE as bug fixes ?
>>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 15:04:18

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 17, 2019, at 4:44 AM, Linus Torvalds [email protected] wrote:

>
> But I'm seeing a lot of WRITE_ONCE(x, constantvalue) kind of things
> and don't seem to find a lot of reason to think that they are any
> inherently better than "x = constantvalue".

If the only states that "x" can take is 1 or 0, then indeed there seems
to be no point in using a WRITE_ONCE() when paired with a READ_ONCE()
other than for documentation purposes.

However, if the state of "x" can be any pointer value, or a reference
count value, then not using "WRITE_ONCE()" to store a constant leaves
the compiler free to perform that store in more than one memory access.
Based on [1], section "Store tearing", there are situations where this
happens on x86 in the wild today when storing 64-bit constants: the
compiler is then free to decide to use two 32-bit immediate store
instructions.

Thanks,

Mathieu

[1] https://lwn.net/Articles/793253/

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 15:30:38

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, 17 Aug 2019 10:40:31 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE().
>
> I'm not convinced by your arguments.

Prove to me that there's an issue here beyond theoretical analysis,
then I'll consider that patch.

Show me a compiler used to compile the kernel that zeros out the
increment. Show me were the race actually occurs.

I think the READ/WRITE_ONCE() is more confusing than helpful. And
unneeded churn to the code. And really not needed for something that's
not critical to execution.

-- Steve

2019-08-17 15:45:15

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, 17 Aug 2019 10:27:39 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> I get your point wrt WRITE_ONCE(): since it's a cache it should not have
> user-visible effects if a temporary incorrect value is observed. Well in
> reality, it's not a cache: if the lookup fails, it returns "<...>" instead,
> so cache lookup failure ends up not providing any useful data in the trace.
> Let's assume this is a known and documented tracer limitation.

Note, this is done at every sched switch, for both next and prev tasks.
And the update is only done at the enabling of a tracepoint (very rare
occurrence) If it missed it scheduling in, it has a really good chance
of getting it while scheduling out.

And 99.999% of my tracing that I do, the tasks scheduling in when
enabling a tracepoint is not what I even care about, as I enable
tracing then start what I want to trace.

>
> However, wrt READ_ONCE(), things are different. The variable read ends up
> being used to control various branches in the code, and the compiler could
> decide to re-fetch the variable (with a different state), and therefore
> cause _some_ of the branches to be inconsistent. See
> tracing_record_taskinfo_sched_switch() and tracing_record_taskinfo() @flags
> parameter.

I'm more OK with using a READ_ONCE() on the flags so it is consistent.
But the WRITE_ONCE() is going a bit overboard.

>
> AFAIU the current code should not generate any out-of-bound writes in case of
> re-fetch, but no comment in there documents how fragile this is.

Which part of the code are you talking about here?

-- Steve

2019-08-17 15:54:53

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 17, 2019, at 11:42 AM, rostedt [email protected] wrote:

> On Sat, 17 Aug 2019 10:27:39 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> I get your point wrt WRITE_ONCE(): since it's a cache it should not have
>> user-visible effects if a temporary incorrect value is observed. Well in
>> reality, it's not a cache: if the lookup fails, it returns "<...>" instead,
>> so cache lookup failure ends up not providing any useful data in the trace.
>> Let's assume this is a known and documented tracer limitation.
>
> Note, this is done at every sched switch, for both next and prev tasks.
> And the update is only done at the enabling of a tracepoint (very rare
> occurrence) If it missed it scheduling in, it has a really good chance
> of getting it while scheduling out.
>
> And 99.999% of my tracing that I do, the tasks scheduling in when
> enabling a tracepoint is not what I even care about, as I enable
> tracing then start what I want to trace.

Since it's refcount based, my concern is about the side-effect of
incrementing or decrementing that reference count without WRITE_ONCE
which would lead to a transient corrupted value observed by _another_
active tracing user.

For you use-case, it would lead to a missing comm when you are actively
tracing what you want to trace, caused by another user of that refcount
incrementing or decrementing it.

I agree with you that missing tracing data at the beginning or end of a
trace is not important.

>>
>> However, wrt READ_ONCE(), things are different. The variable read ends up
>> being used to control various branches in the code, and the compiler could
>> decide to re-fetch the variable (with a different state), and therefore
>> cause _some_ of the branches to be inconsistent. See
>> tracing_record_taskinfo_sched_switch() and tracing_record_taskinfo() @flags
>> parameter.
>
> I'm more OK with using a READ_ONCE() on the flags so it is consistent.
> But the WRITE_ONCE() is going a bit overboard.

Hence my request for additional guidance on the usefulness of WRITE_ONCE(),
whether it's mainly there for documentation purposes, or if we should consider
that it takes care of real-life problems introduced by compiler optimizations
in the wild. The LWN article seems to imply that it's not just a theoretical
issue, but I'll have to let the article authors justify their conclusions,
because I have limited time to investigate this myself.

>
>>
>> AFAIU the current code should not generate any out-of-bound writes in case of
>> re-fetch, but no comment in there documents how fragile this is.
>
> Which part of the code are you talking about here?

kernel/trace/trace.c:tracing_record_taskinfo_sched_switch()
kernel/trace/trace.c:tracing_record_taskinfo()

where @flags is used to control a few branches. I don't think any of those
would end up causing corruption if the flags is re-fetched between two
branches, but it seems rather fragile.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 15:56:37

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 17, 2019, at 11:26 AM, rostedt [email protected] wrote:

> On Sat, 17 Aug 2019 10:40:31 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE().
>>
>> I'm not convinced by your arguments.
>
> Prove to me that there's an issue here beyond theoretical analysis,
> then I'll consider that patch.
>
> Show me a compiler used to compile the kernel that zeros out the
> increment. Show me were the race actually occurs.
>
> I think the READ/WRITE_ONCE() is more confusing than helpful. And
> unneeded churn to the code. And really not needed for something that's
> not critical to execution.

I'll have to let the authors of the LWN article speak up on this, because
I have limited time to replicate this investigation myself.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-17 16:42:05

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, 17 Aug 2019 11:55:17 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Aug 17, 2019, at 11:26 AM, rostedt [email protected] wrote:
>
> > On Sat, 17 Aug 2019 10:40:31 -0400 (EDT)
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> >> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE().
> >>
> >> I'm not convinced by your arguments.
> >
> > Prove to me that there's an issue here beyond theoretical analysis,
> > then I'll consider that patch.
> >
> > Show me a compiler used to compile the kernel that zeros out the
> > increment. Show me were the race actually occurs.
> >
> > I think the READ/WRITE_ONCE() is more confusing than helpful. And
> > unneeded churn to the code. And really not needed for something that's
> > not critical to execution.
>
> I'll have to let the authors of the LWN article speak up on this, because
> I have limited time to replicate this investigation myself.

I'll let Paul McKenney convince me then, if he has any spare cycles ;-)

The one instance in that article is from a 2013 bug, which talks about
storing a 64 bit value on a 32 bit machine. But the ref count is an int
(32 bit), and I highly doubt any compiler will split it into 16 bit
stores for a simple increment. And I don't believe Linux even supports
any architecture that requires 16 bit stores anymore.

-- Steve

2019-08-17 16:45:03

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, 17 Aug 2019 11:53:41 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> kernel/trace/trace.c:tracing_record_taskinfo_sched_switch()
> kernel/trace/trace.c:tracing_record_taskinfo()
>
> where @flags is used to control a few branches. I don't think any of those
> would end up causing corruption if the flags is re-fetched between two
> branches, but it seems rather fragile.

There shouldn't be any corruption, as each flag test is not dependent
on any of the other tests. But I do agree, a READ_ONCE() would be
appropriate for just making a consistent state among the tests, even
though they are independent.

-- Steve

2019-08-17 20:05:42

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

Apologies to Steve for continuing this thread when all he wanted was moving
an operation inside a mutex...

On 17/08/2019 16:02, Mathieu Desnoyers wrote:
[...]
> However, if the state of "x" can be any pointer value, or a reference
> count value, then not using "WRITE_ONCE()" to store a constant leaves
> the compiler free to perform that store in more than one memory access.
> Based on [1], section "Store tearing", there are situations where this
> happens on x86 in the wild today when storing 64-bit constants: the
> compiler is then free to decide to use two 32-bit immediate store
> instructions.
>

That's also how I understand things, and it's also one of the points raised
in the compiler barrier section of memory-barriers.txt

Taking this store tearing, or the invented stores - e.g. the branch
optimization pointed out by Linus:

> if (a)
> global_var = 1
> else
> global_var = 0
>
> then the compiler had better not turn that into
>
> global_var = 0
> if (a)
> global_var = 1

AFAICT nothing prevents this from happening inside a critical section (where
the locking primitives provide the right barriers, but that's it). That's
all fine when data is never accessed locklessly, but in the case of locked
writes vs lockless reads, couldn't there be "leaks" of these transient
states? In those cases we would want WRITE_ONCE() for the writes.

So going back to:

> But the reverse is not really true. All a READ_ONCE() says is "I want
> either the old or the new value", and it can get that _without_ being
> paired with a WRITE_ONCE().

AFAIU it's not always the case, since a lone READ_ONCE() could get transient
values.

I'll be honest, it's not 100% clear to me when those optimizations can
actually be done (maybe the branch thingy but the others are dubious), and
it's even less clear when compilers *actually* do it - only that they have
been reported to do it (so it's not made up).

> Thanks,
>
> Mathieu
>
> [1] https://lwn.net/Articles/793253/
>

2019-08-17 22:10:31

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, Aug 17, 2019 at 12:40:40PM -0400, Steven Rostedt wrote:
> On Sat, 17 Aug 2019 11:55:17 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
> > ----- On Aug 17, 2019, at 11:26 AM, rostedt [email protected] wrote:
> >
> > > On Sat, 17 Aug 2019 10:40:31 -0400 (EDT)
> > > Mathieu Desnoyers <[email protected]> wrote:
> > >
> > >> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE().
> > >>
> > >> I'm not convinced by your arguments.
> > >
> > > Prove to me that there's an issue here beyond theoretical analysis,
> > > then I'll consider that patch.
> > >
> > > Show me a compiler used to compile the kernel that zeros out the
> > > increment. Show me were the race actually occurs.
> > >
> > > I think the READ/WRITE_ONCE() is more confusing than helpful. And
> > > unneeded churn to the code. And really not needed for something that's
> > > not critical to execution.
> >
> > I'll have to let the authors of the LWN article speak up on this, because
> > I have limited time to replicate this investigation myself.
>
> I'll let Paul McKenney convince me then, if he has any spare cycles ;-)

You guys do manage to time these things sometimes. ;-)

> The one instance in that article is from a 2013 bug, which talks about
> storing a 64 bit value on a 32 bit machine. But the ref count is an int
> (32 bit), and I highly doubt any compiler will split it into 16 bit
> stores for a simple increment. And I don't believe Linux even supports
> any architecture that requires 16 bit stores anymore.

For a machine-sized and aligned increment, it is indeed hard to imagine,
even for me. I would be more worried about stores of constants with
lots of zero bits between non-zero bits on systems with small-sized
store-immediate instructions.

Thanx, Paul

2019-08-17 22:30:33

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, Aug 17, 2019 at 01:28:48AM -0700, Linus Torvalds wrote:

[ . . . ]

> Put another way: a WRITE_ONCE() without a paired READ_ONCE() is almost
> certainly pointless.

"Your honor, I have no further questions at this time, but I reserve
the right to recall this witness."

Outside of things like MMIO (where one could argue that the corresponding
READ_ONCE() is in the device firmware), the use cases I can imagine
for WRITE_ONCE() with no READ_ONCE() are quite strange. For example,
doing the WRITE_ONCE()s while read-holding a given lock and doing plain
reads while write-holding that same lock. While at the same time being
worried about store tearing and similar.

Perhaps I am suffering a failure of imagination, but I am not seeing
a reasonable use for such things at the moment.

> But the reverse is not really true. All a READ_ONCE() says is "I want
> either the old or the new value", and it can get that _without_ being
> paired with a WRITE_ONCE().
>
> See? They just aren't equally important.
>
> > > And yes, reads are different from writes. Reads don't have the same
> > > kind of "other threads of execution can see them" effects, so a
> > > compiler turning a single read into multiple reads is much more
> > > realistic and not the same kind of "we need to expect a certain kind
> > > of sanity from the compiler" issue.
> >
> > Though each of those compiler-generated multiple reads might return a
> > different value, right?
>
> Right. See the examples I have in the email to Mathieu:
>
> unsigned int bits = some_global_value;
> ...test different bits in in 'bits' ...
>
> can easily cause multiple reads (particularly on a CPU that has a
> "test bits in memory" instruction and a lack of registers.
>
> So then doing it as
>
> unsigned int bits = READ_ONCE(some_global_value);
> .. test different bits in 'bits'...
>
> makes a real and obvious semantic difference. In ways that changing a one-time
>
> ptr->flag = true;
>
> to
>
> WRITE_ONCE(ptr->flag, true);
>
> does _not_ make any obvious semantic difference what-so-ever.

Agreed, especially given that only one bit of ->flag is most likely
ever changing.

> Caching reads is also something that makes sense and is common, in
> ways that caching writes does not. So doing
>
> while (in_progress_global) /* twiddle your thumbs */;
>
> obviously trivially translates to an infinite loop with a single
> conditional in front of it, in ways that
>
> while (READ_ONCE(in_progress_global)) /* twiddle */;
>
> does not.
>
> So there are often _obvious_ reasons to use READ_ONCE().
>
> I really do not find the same to be true of WRITE_ONCE(). There are
> valid uses, but they are definitely much less common, and much less
> obvious.

Agreed, and I expect READ_ONCE() to continue to be used more heavily than
is WRITE_ONCE(), even including the documentation-only WRITE_ONCE() usage.

Thanx, Paul

2019-08-17 23:02:32

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, Aug 17, 2019 at 09:03:30PM +0100, Valentin Schneider wrote:
> Apologies to Steve for continuing this thread when all he wanted was moving
> an operation inside a mutex...
>
> On 17/08/2019 16:02, Mathieu Desnoyers wrote:
> [...]
> > However, if the state of "x" can be any pointer value, or a reference
> > count value, then not using "WRITE_ONCE()" to store a constant leaves
> > the compiler free to perform that store in more than one memory access.
> > Based on [1], section "Store tearing", there are situations where this
> > happens on x86 in the wild today when storing 64-bit constants: the
> > compiler is then free to decide to use two 32-bit immediate store
> > instructions.
> >
>
> That's also how I understand things, and it's also one of the points raised
> in the compiler barrier section of memory-barriers.txt
>
> Taking this store tearing, or the invented stores - e.g. the branch
> optimization pointed out by Linus:
>
> > if (a)
> > global_var = 1
> > else
> > global_var = 0
> >
> > then the compiler had better not turn that into
> >
> > global_var = 0
> > if (a)
> > global_var = 1
>
> AFAICT nothing prevents this from happening inside a critical section (where
> the locking primitives provide the right barriers, but that's it). That's
> all fine when data is never accessed locklessly, but in the case of locked
> writes vs lockless reads, couldn't there be "leaks" of these transient
> states? In those cases we would want WRITE_ONCE() for the writes.
>
> So going back to:
>
> > But the reverse is not really true. All a READ_ONCE() says is "I want
> > either the old or the new value", and it can get that _without_ being
> > paired with a WRITE_ONCE().
>
> AFAIU it's not always the case, since a lone READ_ONCE() could get transient
> values.

Linus noted that he believes that compilers for architectures supporting
Linux can be trusted to avoid store-to-load transformations, invented
stores, and unnecessary store tearing. Should these appear, Linus would
report a bug against the compiler and expect it to be fixed.

> I'll be honest, it's not 100% clear to me when those optimizations can
> actually be done (maybe the branch thingy but the others are dubious), and
> it's even less clear when compilers *actually* do it - only that they have
> been reported to do it (so it's not made up).

There is significant unclarity inherent in the situation. The standard
says one thing, different compilers do other things, and developers
often expect yet a third thing. And sometimes things change over time,
for example, the ca. 2011 dictim against compilers inventing data races.

Hey, they didn't teach me this aspect of software development in school,
either. ;-)

Thanx, Paul

2019-08-18 09:17:57

by Pavel Machek

[permalink] [raw]

Subject: stable markup was Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

Hi!

> The most I'll take is two separate patches. One is going to be marked
> for stable as it fixes a real bug. The other is more for cosmetic or
> theoretical issues, that I will state clearly "NOT FOR STABLE", such
> that the autosel doesn't take them.

Do we have standartized way to mark "this is not for stable"? Because
I often think "I'd really hate to see this in stable"...

On a related note, it would be nice to have standartized way to
marking corresponding upstream commit. (Currently three methods are in
use).

Upstream: XXXX

in the signoff area would be nice, clearly marking who touched the
patch before mainline and who after.

Best regards,

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Attachments:

(No filename) (840.00 B)
signature.asc (188.00 B)
Digital signature Download all attachments

2019-08-19 10:36:27

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On 18/08/2019 00:00, Paul E. McKenney wrote:
[...]
> Linus noted that he believes that compilers for architectures supporting
> Linux can be trusted to avoid store-to-load transformations, invented
> stores, and unnecessary store tearing. Should these appear, Linus would
> report a bug against the compiler and expect it to be fixed.
>
>> I'll be honest, it's not 100% clear to me when those optimizations can
>> actually be done (maybe the branch thingy but the others are dubious), and
>> it's even less clear when compilers *actually* do it - only that they have
>> been reported to do it (so it's not made up).
>
> There is significant unclarity inherent in the situation. The standard
> says one thing, different compilers do other things, and developers
> often expect yet a third thing. And sometimes things change over time,
> for example, the ca. 2011 dictim against compilers inventing data races.
>
> Hey, they didn't teach me this aspect of software development in school,
> either. ;-)
>

Gotta keeps things "interesting" somehow, eh...

Thanks for the clarifications.

> Thanx, Paul
>

2019-08-20 13:58:14

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:

> The data tearing issue is almost a non-issue. We're not going to add
> WRITE_ONCE() to these kinds of places for no good reason.

Paulmck actually has an example of that somewhere; ISTR that particular
case actually got fixed by GCC, but I'd really _love_ for some compiler
people (both GCC and LLVM) to state that their respective compilers will
not do load/store tearing for machine word sized load/stores.

Without this written guarantee (which supposedly was in older GCC
manuals but has since gone missing), I'm loathe to rely on it.

Yes, it is very rare, but it is a massive royal pain to debug if/when it
does do happen.

2019-08-20 14:03:28

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Fri, Aug 16, 2019 at 09:52:17PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 16, 2019 at 03:57:43PM -0700, Linus Torvalds wrote:
>
> [ . . . ]
>
> > We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
> > because of some theoretical "compiler is free to do garbage"
> > arguments. If such garbage happens, we need to fix the compiler, the
> > same way we already do with
> >
> > -fno-strict-aliasing
>
> Yeah, the compete-with-FORTRAN stuff. :-/
>
> There is some work going on in the C committee on this, where the
> theorists would like to restrict strict-alias based optimizations to
> enable better analysis tooling. And no, although the theorists are
> pushing in the direction we would like them to, as far as I can see
> they are not pushing as far as we would like. But it might be that
> -fno-strict-aliasing needs some upgrades as well. I expect to learn
> more at the next meeting in a few months.
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf

We really should get the compiler folks to give us a
-fno-pointer-provenance. Waiting on the standards committee to get their
act together seems unlikely, esp. given that some people actually seem
to _want_ this nonsense :/

2019-08-20 20:31:44

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Tue, Aug 20, 2019 at 03:56:12PM +0200, Peter Zijlstra wrote:
> On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:
>
> > The data tearing issue is almost a non-issue. We're not going to add
> > WRITE_ONCE() to these kinds of places for no good reason.
>
> Paulmck actually has an example of that somewhere; ISTR that particular
> case actually got fixed by GCC, but I'd really _love_ for some compiler
> people (both GCC and LLVM) to state that their respective compilers will
> not do load/store tearing for machine word sized load/stores.

I do very much recall such an example, but I am now unable to either
find it or reproduce it. :-/

If I cannot turn it up in a few days, I will ask the LWN editors to
make appropriate changes to the "Who is afraid" article.

> Without this written guarantee (which supposedly was in older GCC
> manuals but has since gone missing), I'm loathe to rely on it.
>
> Yes, it is very rare, but it is a massive royal pain to debug if/when it
> does do happen.

But from what I can see, Linus is OK with use of WRITE_ONCE() for data
races on any variable for which there is at least one READ_ONCE().
So we can still use WRITE_ONCE() as we would like in our own code.
Yes, you or I might be hit by someone else's omission of WRITE_ONCE(),
it is better than the proverbial kick in the teeth.

Of course, if anyone knows of a compiler/architecture combination that
really does tear stores of 32-bit constants, please do not keep it
a secret! After all, it would be good to get that addressed easily
starting now rather than after a difficult and painful series of
debugging sessions.

Thanx, Paul

2019-08-20 20:33:30

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Tue, Aug 20, 2019 at 04:01:16PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 16, 2019 at 09:52:17PM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 16, 2019 at 03:57:43PM -0700, Linus Torvalds wrote:
> >
> > [ . . . ]
> >
> > > We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
> > > because of some theoretical "compiler is free to do garbage"
> > > arguments. If such garbage happens, we need to fix the compiler, the
> > > same way we already do with
> > >
> > > -fno-strict-aliasing
> >
> > Yeah, the compete-with-FORTRAN stuff. :-/
> >
> > There is some work going on in the C committee on this, where the
> > theorists would like to restrict strict-alias based optimizations to
> > enable better analysis tooling. And no, although the theorists are
> > pushing in the direction we would like them to, as far as I can see
> > they are not pushing as far as we would like. But it might be that
> > -fno-strict-aliasing needs some upgrades as well. I expect to learn
> > more at the next meeting in a few months.
> >
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf
>
> We really should get the compiler folks to give us a
> -fno-pointer-provenance. Waiting on the standards committee to get their
> act together seems unlikely, esp. given that some people actually seem
> to _want_ this nonsense :/

The reason that they want it is to enable some significant optimizations
in numerical code on the one hand and in heavily templated C++ code on
the other. Neither of which has much bearing on kernel code.

Interested in coming to the next C standards committee meeting in October
to help me push for this? ;-)

Thanx, Paul

2019-08-20 20:41:37

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Tue, Aug 20, 2019 at 01:31:35PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 20, 2019 at 04:01:16PM +0200, Peter Zijlstra wrote:

> > We really should get the compiler folks to give us a
> > -fno-pointer-provenance. Waiting on the standards committee to get their
> > act together seems unlikely, esp. given that some people actually seem
> > to _want_ this nonsense :/
>
> The reason that they want it is to enable some significant optimizations
> in numerical code on the one hand and in heavily templated C++ code on
> the other. Neither of which has much bearing on kernel code.
>
> Interested in coming to the next C standards committee meeting in October
> to help me push for this? ;-)

How about we try and get some compiler folks together at plumbers and
bribe them with beer? Once we have our compiler knob, we happy :-)

2019-08-20 20:54:58

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Tue, Aug 20, 2019 at 10:39:39PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 20, 2019 at 01:31:35PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 20, 2019 at 04:01:16PM +0200, Peter Zijlstra wrote:
>
> > > We really should get the compiler folks to give us a
> > > -fno-pointer-provenance. Waiting on the standards committee to get their
> > > act together seems unlikely, esp. given that some people actually seem
> > > to _want_ this nonsense :/
> >
> > The reason that they want it is to enable some significant optimizations
> > in numerical code on the one hand and in heavily templated C++ code on
> > the other. Neither of which has much bearing on kernel code.
> >
> > Interested in coming to the next C standards committee meeting in October
> > to help me push for this? ;-)
>
> How about we try and get some compiler folks together at plumbers and
> bribe them with beer? Once we have our compiler knob, we happy :-)

C'mon, Peter! Where is your sense of self-destruction??? ;-)

But yes, if nothing else there is a Toolchains MC [1]. Which happens to
have a topic entitled "Potential impact/benefit/detriment of recently
developed GCC optimizations on the kernel", now that you mention it.
Looking forward to seeing you in Lisbon!

Thanx, Paul

[1] https://linuxplumbersconf.org/event/4/sessions/45/#20190909

2019-08-21 10:47:00

by Will Deacon

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Tue, Aug 20, 2019 at 01:29:32PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 20, 2019 at 03:56:12PM +0200, Peter Zijlstra wrote:
> > On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:
> >
> > > The data tearing issue is almost a non-issue. We're not going to add
> > > WRITE_ONCE() to these kinds of places for no good reason.
> >
> > Paulmck actually has an example of that somewhere; ISTR that particular
> > case actually got fixed by GCC, but I'd really _love_ for some compiler
> > people (both GCC and LLVM) to state that their respective compilers will
> > not do load/store tearing for machine word sized load/stores.
>
> I do very much recall such an example, but I am now unable to either
> find it or reproduce it. :-/
>
> If I cannot turn it up in a few days, I will ask the LWN editors to
> make appropriate changes to the "Who is afraid" article.
>
> > Without this written guarantee (which supposedly was in older GCC
> > manuals but has since gone missing), I'm loathe to rely on it.
> >
> > Yes, it is very rare, but it is a massive royal pain to debug if/when it
> > does do happen.
>
> But from what I can see, Linus is OK with use of WRITE_ONCE() for data
> races on any variable for which there is at least one READ_ONCE().
> So we can still use WRITE_ONCE() as we would like in our own code.
> Yes, you or I might be hit by someone else's omission of WRITE_ONCE(),
> it is better than the proverbial kick in the teeth.
>
> Of course, if anyone knows of a compiler/architecture combination that
> really does tear stores of 32-bit constants, please do not keep it
> a secret! After all, it would be good to get that addressed easily
> starting now rather than after a difficult and painful series of
> debugging sessions.

It's not quite what you asked for, but if you look at the following
silly code:

typedef unsigned long long u64;

struct data {
u64 arr[1023];
u64 flag;
};

void foo(struct data *x)
{
int i;

for (i = 0; i < 1023; ++i)
x->arr[i] = 0;

x->flag = 0;
}

void bar(u64 *x)
{
*x = 0xabcdef10abcdef10;
}

Then arm64 clang (-O2) generates the following for foo:

foo: // @foo
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
orr w2, wzr, #0x2000
mov w1, wzr
mov x29, sp
bl memset
ldp x29, x30, [sp], #16 // 16-byte Folded Reload
ret

and so the store to 'flag' has become part of the memset, which could
easily be bytewise in terms of atomicity (and this isn't unlikely given
we have a DC ZVA instruction which only guaratees bytewise atomicity).

GCC (also -O2) generates the following for bar:

bar:
mov w1, 61200
movk w1, 0xabcd, lsl 16
stp w1, w1, [x0]
ret

and so it is using a store-pair instruction to reduce the complexity in
the immediate generation. Thus, the 64-bit store will only have 32-bit
atomicity. In fact, this is scary because if I change bar to:

void bar(u64 *x)
{
*(volatile u64 *)x = 0xabcdef10abcdef10;
}

then I get:

bar:
mov w1, 61200
movk w1, 0xabcd, lsl 16
str w1, [x0]
str w1, [x0, 4]
ret

so I'm not sure that WRITE_ONCE would even help :/

It's worth noting that:

void baz(atomic_long *x)
{
atomic_store_explicit(x, 0xabcdef10abcdef10, memory_order_relaxed)
}

does the right thing:

baz:
mov x1, 61200
movk x1, 0xabcd, lsl 16
movk x1, 0xef10, lsl 32
movk x1, 0xabcd, lsl 48
str x1, [x0]
ret

Whilst these examples may be contrived, I do thing they illustrate that
we can't simply say "stores to aligned, word-sized pointers are atomic".

Will

2019-08-21 13:24:51

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> On Tue, Aug 20, 2019 at 01:29:32PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 20, 2019 at 03:56:12PM +0200, Peter Zijlstra wrote:
> > > On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:
> > >
> > > > The data tearing issue is almost a non-issue. We're not going to add
> > > > WRITE_ONCE() to these kinds of places for no good reason.
> > >
> > > Paulmck actually has an example of that somewhere; ISTR that particular
> > > case actually got fixed by GCC, but I'd really _love_ for some compiler
> > > people (both GCC and LLVM) to state that their respective compilers will
> > > not do load/store tearing for machine word sized load/stores.
> >
> > I do very much recall such an example, but I am now unable to either
> > find it or reproduce it. :-/
> >
> > If I cannot turn it up in a few days, I will ask the LWN editors to
> > make appropriate changes to the "Who is afraid" article.
> >
> > > Without this written guarantee (which supposedly was in older GCC
> > > manuals but has since gone missing), I'm loathe to rely on it.
> > >
> > > Yes, it is very rare, but it is a massive royal pain to debug if/when it
> > > does do happen.
> >
> > But from what I can see, Linus is OK with use of WRITE_ONCE() for data
> > races on any variable for which there is at least one READ_ONCE().
> > So we can still use WRITE_ONCE() as we would like in our own code.
> > Yes, you or I might be hit by someone else's omission of WRITE_ONCE(),
> > it is better than the proverbial kick in the teeth.
> >
> > Of course, if anyone knows of a compiler/architecture combination that
> > really does tear stores of 32-bit constants, please do not keep it
> > a secret! After all, it would be good to get that addressed easily
> > starting now rather than after a difficult and painful series of
> > debugging sessions.
>
> It's not quite what you asked for, but if you look at the following
> silly code:
>
> typedef unsigned long long u64;
>
> struct data {
> u64 arr[1023];
> u64 flag;
> };
>
> void foo(struct data *x)
> {
> int i;
>
> for (i = 0; i < 1023; ++i)
> x->arr[i] = 0;
>
> x->flag = 0;
> }
>
> void bar(u64 *x)
> {
> *x = 0xabcdef10abcdef10;
> }
>
> Then arm64 clang (-O2) generates the following for foo:
>
> foo: // @foo
> stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
> orr w2, wzr, #0x2000
> mov w1, wzr
> mov x29, sp
> bl memset
> ldp x29, x30, [sp], #16 // 16-byte Folded Reload
> ret
>
> and so the store to 'flag' has become part of the memset, which could
> easily be bytewise in terms of atomicity (and this isn't unlikely given
> we have a DC ZVA instruction which only guaratees bytewise atomicity).
>
> GCC (also -O2) generates the following for bar:
>
> bar:
> mov w1, 61200
> movk w1, 0xabcd, lsl 16
> stp w1, w1, [x0]
> ret
>
> and so it is using a store-pair instruction to reduce the complexity in
> the immediate generation. Thus, the 64-bit store will only have 32-bit
> atomicity. In fact, this is scary because if I change bar to:
>
> void bar(u64 *x)
> {
> *(volatile u64 *)x = 0xabcdef10abcdef10;
> }
>
> then I get:
>
> bar:
> mov w1, 61200
> movk w1, 0xabcd, lsl 16
> str w1, [x0]
> str w1, [x0, 4]
> ret
>
> so I'm not sure that WRITE_ONCE would even help :/

Well, I can have the LWN article cite your email, then. So thank you
very much!

Is generation of this code for a 64-bit volatile store considered a bug?
Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
would guess that Thomas and Linus would ask a similar bugginess question
for normal stores. ;-)

> It's worth noting that:
>
> void baz(atomic_long *x)
> {
> atomic_store_explicit(x, 0xabcdef10abcdef10, memory_order_relaxed)
> }
>
> does the right thing:
>
> baz:
> mov x1, 61200
> movk x1, 0xabcd, lsl 16
> movk x1, 0xef10, lsl 32
> movk x1, 0xabcd, lsl 48
> str x1, [x0]
> ret

OK, the C11 and C++11 guys should be happy with this.

> Whilst these examples may be contrived, I do thing they illustrate that
> we can't simply say "stores to aligned, word-sized pointers are atomic".

And thank you again!

Thanx, Paul

2019-08-21 13:35:48

by Will Deacon

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> > void bar(u64 *x)
> > {
> > *(volatile u64 *)x = 0xabcdef10abcdef10;
> > }
> >
> > then I get:
> >
> > bar:
> > mov w1, 61200
> > movk w1, 0xabcd, lsl 16
> > str w1, [x0]
> > str w1, [x0, 4]
> > ret
> >
> > so I'm not sure that WRITE_ONCE would even help :/
>
> Well, I can have the LWN article cite your email, then. So thank you
> very much!
>
> Is generation of this code for a 64-bit volatile store considered a bug?

I consider it a bug for the volatile case, and the one compiler person I've
spoken to also seems to reckon it's a bug, so hopefully it will be fixed.
I'm led to believe it's an optimisation in the AArch64 backend of GCC.

> Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> would guess that Thomas and Linus would ask a similar bugginess question
> for normal stores. ;-)

We use inline asm for MMIO, fwiw.

Will

2019-08-21 13:57:43

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Wed, Aug 21, 2019 at 02:32:48PM +0100, Will Deacon wrote:
> On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> > > void bar(u64 *x)
> > > {
> > > *(volatile u64 *)x = 0xabcdef10abcdef10;
> > > }
> > >
> > > then I get:
> > >
> > > bar:
> > > mov w1, 61200
> > > movk w1, 0xabcd, lsl 16
> > > str w1, [x0]
> > > str w1, [x0, 4]
> > > ret
> > >
> > > so I'm not sure that WRITE_ONCE would even help :/
> >
> > Well, I can have the LWN article cite your email, then. So thank you
> > very much!
> >
> > Is generation of this code for a 64-bit volatile store considered a bug?
>
> I consider it a bug for the volatile case, and the one compiler person I've
> spoken to also seems to reckon it's a bug, so hopefully it will be fixed.
> I'm led to believe it's an optimisation in the AArch64 backend of GCC.

Here is hoping for the fix!

> > Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> > would guess that Thomas and Linus would ask a similar bugginess question
> > for normal stores. ;-)
>
> We use inline asm for MMIO, fwiw.

I should have remembered that, shouldn't I have? ;-)

Is that also common practice across other embedded kernels these days?

Thanx, Paul

2019-08-21 15:47:03

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:

> > and so it is using a store-pair instruction to reduce the complexity in
> > the immediate generation. Thus, the 64-bit store will only have 32-bit
> > atomicity. In fact, this is scary because if I change bar to:
> >
> > void bar(u64 *x)
> > {
> > *(volatile u64 *)x = 0xabcdef10abcdef10;
> > }
> >
> > then I get:
> >
> > bar:
> > mov w1, 61200
> > movk w1, 0xabcd, lsl 16
> > str w1, [x0]
> > str w1, [x0, 4]
> > ret
> >
> > so I'm not sure that WRITE_ONCE would even help :/
>
> Well, I can have the LWN article cite your email, then. So thank you
> very much!
>
> Is generation of this code for a 64-bit volatile store considered a bug?
> Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> would guess that Thomas and Linus would ask a similar bugginess question
> for normal stores. ;-)

I'm calling this a compiler bug; the way I understand volatile this is
very much against the intentended use case. That is, this is buggy even
on UP vs signals or MMIO.

2019-08-21 16:16:45

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Wed, Aug 21, 2019 at 11:48:43AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 21, 2019, at 8:33 AM, Peter Zijlstra [email protected] wrote:
>
> > On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> >> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> >
> >> > and so it is using a store-pair instruction to reduce the complexity in
> >> > the immediate generation. Thus, the 64-bit store will only have 32-bit
> >> > atomicity. In fact, this is scary because if I change bar to:
> >> >
> >> > void bar(u64 *x)
> >> > {
> >> > *(volatile u64 *)x = 0xabcdef10abcdef10;
> >> > }
> >> >
> >> > then I get:
> >> >
> >> > bar:
> >> > mov w1, 61200
> >> > movk w1, 0xabcd, lsl 16
> >> > str w1, [x0]
> >> > str w1, [x0, 4]
> >> > ret
> >> >
> >> > so I'm not sure that WRITE_ONCE would even help :/
> >>
> >> Well, I can have the LWN article cite your email, then. So thank you
> >> very much!
> >>
> >> Is generation of this code for a 64-bit volatile store considered a bug?
> >> Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> >> would guess that Thomas and Linus would ask a similar bugginess question
> >> for normal stores. ;-)
> >
> > I'm calling this a compiler bug; the way I understand volatile this is
> > very much against the intentended use case. That is, this is buggy even
> > on UP vs signals or MMIO.
>
> And here is a simpler reproducer on my gcc-8.3.0 (aarch64) compiled with O2:
>
> volatile unsigned long a;
>
> void fct(void)
> {
> a = 0x1234567812345678ULL;
> }
>
> void fct(void)
> {
> a = 0x1234567812345678ULL;
> 0: 90000000 adrp x0, 8 <fct+0x8>
> 4: 528acf01 mov w1, #0x5678 // #22136
> 8: 72a24681 movk w1, #0x1234, lsl #16
> c: f9400000 ldr x0, [x0]
> 10: b9000001 str w1, [x0]
> 14: b9000401 str w1, [x0, #4]
> }
> 18: d65f03c0 ret
>
> And the non-volatile case uses stp (is it a single store to memory ?):
>
> unsigned long a;
>
> void fct(void)
> {
> a = 0x1234567812345678ULL;
> }
>
> void fct(void)
> {
> a = 0x1234567812345678ULL;
> 0: 90000000 adrp x0, 8 <fct+0x8>
> 4: 528acf01 mov w1, #0x5678 // #22136
> 8: 72a24681 movk w1, #0x1234, lsl #16
> c: f9400000 ldr x0, [x0]
> 10: 29000401 stp w1, w1, [x0]
> }
> 14: d65f03c0 ret
>
> It would probably be a good idea to audit other architectures, since this
> is done by the compiler backend.

That does seem like a very good idea!

Thanx, Paul

2019-08-21 16:25:40

by Will Deacon

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

On Wed, Aug 21, 2019 at 06:56:10AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 02:32:48PM +0100, Will Deacon wrote:
> > On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> > > On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> > > > void bar(u64 *x)
> > > > {
> > > > *(volatile u64 *)x = 0xabcdef10abcdef10;
> > > > }
> > > >
> > > > then I get:
> > > >
> > > > bar:
> > > > mov w1, 61200
> > > > movk w1, 0xabcd, lsl 16
> > > > str w1, [x0]
> > > > str w1, [x0, 4]
> > > > ret
> > > >
> > > > so I'm not sure that WRITE_ONCE would even help :/
> > >
> > > Well, I can have the LWN article cite your email, then. So thank you
> > > very much!
> > >
> > > Is generation of this code for a 64-bit volatile store considered a bug?
> >
> > I consider it a bug for the volatile case, and the one compiler person I've
> > spoken to also seems to reckon it's a bug, so hopefully it will be fixed.
> > I'm led to believe it's an optimisation in the AArch64 backend of GCC.
>
> Here is hoping for the fix!
>
> > > Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> > > would guess that Thomas and Linus would ask a similar bugginess question
> > > for normal stores. ;-)
> >
> > We use inline asm for MMIO, fwiw.
>
> I should have remembered that, shouldn't I have? ;-)
>
> Is that also common practice across other embedded kernels these days?

I think so. Sometimes you care about things like the addressing mode being
used, so it's easier to roll it by hand.

Will

2019-08-21 18:24:58

[permalink] [raw]

Subject: Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates

----- On Aug 21, 2019, at 8:33 AM, Peter Zijlstra [email protected] wrote:

> On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
>> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
>
>> > and so it is using a store-pair instruction to reduce the complexity in
>> > the immediate generation. Thus, the 64-bit store will only have 32-bit
>> > atomicity. In fact, this is scary because if I change bar to:
>> >
>> > void bar(u64 *x)
>> > {
>> > *(volatile u64 *)x = 0xabcdef10abcdef10;
>> > }
>> >
>> > then I get:
>> >
>> > bar:
>> > mov w1, 61200
>> > movk w1, 0xabcd, lsl 16
>> > str w1, [x0]
>> > str w1, [x0, 4]
>> > ret
>> >
>> > so I'm not sure that WRITE_ONCE would even help :/
>>
>> Well, I can have the LWN article cite your email, then. So thank you
>> very much!
>>
>> Is generation of this code for a 64-bit volatile store considered a bug?
>> Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
>> would guess that Thomas and Linus would ask a similar bugginess question
>> for normal stores. ;-)
>
> I'm calling this a compiler bug; the way I understand volatile this is
> very much against the intentended use case. That is, this is buggy even
> on UP vs signals or MMIO.

And here is a simpler reproducer on my gcc-8.3.0 (aarch64) compiled with O2:

volatile unsigned long a;

void fct(void)
{
a = 0x1234567812345678ULL;
}

void fct(void)
{
a = 0x1234567812345678ULL;
0: 90000000 adrp x0, 8 <fct+0x8>
4: 528acf01 mov w1, #0x5678 // #22136
8: 72a24681 movk w1, #0x1234, lsl #16
c: f9400000 ldr x0, [x0]
10: b9000001 str w1, [x0]
14: b9000401 str w1, [x0, #4]
}
18: d65f03c0 ret

And the non-volatile case uses stp (is it a single store to memory ?):

unsigned long a;

void fct(void)
{
a = 0x1234567812345678ULL;
}

void fct(void)
{
a = 0x1234567812345678ULL;
0: 90000000 adrp x0, 8 <fct+0x8>
4: 528acf01 mov w1, #0x5678 // #22136
8: 72a24681 movk w1, #0x1234, lsl #16
c: f9400000 ldr x0, [x0]
10: 29000401 stp w1, w1, [x0]
}
14: d65f03c0 ret

It would probably be a good idea to audit other architectures, since this
is done by the compiler backend.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-08-21 19:04:54