2018-12-08 01:11:59

by Nadav Amit

Subject: Should this_cpu_read() be volatile?

[Resend, changing title & adding lkml and some others ]

On Dec 7, 2018, at 3:12 PM, Nadav Amit <[email protected]> wrote:

[ We can start a new thread, since I have the tendency to hijack threads. ]

> On Dec 7, 2018, at 12:45 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote:
>>> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
>>>> [ +Peter ]
>>>>

[snip]

>>>>
>>>> *But* there is one thing that may require some attention - patch
>>>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()") set ordering constraints
>>>> on the VM_ARGS() evaluation. And it appears that this patch also imposes
>>>> (unnecessary) constraints on other pieces of code.
>>>>
>>>> These constraints are due to the addition of the volatile keyword for
>>>> this_cpu_read() by the patch. This affects at least 68 functions in my
>>>> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
>>>> smp_x86_platform_ipi() and select_idle_sibling().
>>>>
>>>> Peter, perhaps the solution was too big of a hammer? Is it possible instead
>>>> to create a separate "this_cpu_read_once()" with the volatile keyword? Such
>>>> a function can be used for native_sched_clock() and other seqlocks, etc.
>>>
>>> No. Like the commit message says, this_cpu_read() _must_ imply READ_ONCE(). If
>>> you want something else, use something else, there's plenty other
>>> options available.
>>>
>>> There's this_cpu_op_stable(), but also __this_cpu_read() and
>>> raw_this_cpu_read() (which currently don't differ from this_cpu_read()
>>> but could).
>>
>> Would setting the inline assembly memory operand as both input and output be
>> better than using the "volatile"?
>
> I don't know.. I'm forever befuddled by the exact semantics of gcc
> inline asm.
>
>> I think that if you do that, the compiler would treat this_cpu_read()
>> as something that changes the per-cpu variable, which would make it invalid
>> to re-read the value. At the same time, it would not prevent reordering the
>> read with other stuff.
>
> So the thing is; as I wrote, the generic version of this_cpu_*() is:
>
> local_irq_save();
> __this_cpu_*();
> local_irq_restore();
>
> And local_irq_{save,restore}() include compiler barriers, which nothing
> can be reordered around either.
>
> And per the principle of least surprise, I think our primitives should
> have similar semantics.

I guess so, but as you’ll see below, the end result is ugly.

> I'm actually having difficulty finding the this_cpu_read() in any of the
> functions you mention, so I cannot make any concrete suggestions other
> than pointing at the alternative functions available.


So I dug deeper into the code to understand a couple of differences. In the
case of select_idle_sibling(), the patch (Peter's) increases the function
code size by 123 bytes (over the baseline of 986). The per-cpu variable is
read through the following call chain:

select_idle_sibling()
=> select_idle_cpu()
=> local_clock()
=> raw_smp_processor_id()

This results in two more calls to sched_clock_cpu(), as the compiler assumes
the processor id may change in between (which obviously wouldn't happen).
There may be more changes around, which I didn't fully analyze. But at the
very least, reading the processor id should not get "volatile".

As for finish_task_switch(), the impact is only a few bytes, but still
unnecessary. It appears that with your patch, preempt_count() causes multiple
reads of __preempt_count in this code:

    if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
                  "corrupted preempt_count: %s/%d/0x%x\n",
                  current->comm, current->pid, preempt_count()))
        preempt_count_set(FORK_PREEMPT_COUNT);

Again, this is unwarranted, as the preemption count should not be changed by
any interrupt.




2018-12-08 10:53:55

by Peter Zijlstra

Subject: Re: Should this_cpu_read() be volatile?

On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote:

> > I'm actually having difficulty finding the this_cpu_read() in any of the
> > functions you mention, so I cannot make any concrete suggestions other
> > than pointing at the alternative functions available.
>
>
> So I dug deeper into the code to understand a couple of differences. In the
> case of select_idle_sibling(), the patch (Peter's) increases the function
> code size by 123 bytes (over the baseline of 986). The per-cpu variable is
> read through the following call chain:
>
> select_idle_sibling()
> => select_idle_cpu()
> => local_clock()
> => raw_smp_processor_id()
>
> This results in two more calls to sched_clock_cpu(), as the compiler assumes
> the processor id may change in between (which obviously wouldn't happen).

That is the thing with raw_smp_processor_id(): it is allowed to be used
in preemptible context, and there it _obviously_ can change between
subsequent invocations.

So again, this change is actually good.

If we want to fix select_idle_cpu(), we should maybe not use
local_clock() there but use sched_clock_cpu() with a stable argument,
this code runs with IRQs disabled and therefore the CPU number is stable
for us here.

> There may be more changes around, which I didn't fully analyze. But
> at the very least, reading the processor id should not get "volatile".
>
> As for finish_task_switch(), the impact is only a few bytes, but still
> unnecessary. It appears that with your patch, preempt_count() causes multiple
> reads of __preempt_count in this code:
>
>     if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
>                   "corrupted preempt_count: %s/%d/0x%x\n",
>                   current->comm, current->pid, preempt_count()))
>         preempt_count_set(FORK_PREEMPT_COUNT);

My patch proposed here:

https://marc.info/?l=linux-mm&m=154409548410209

would actually fix that one, I think: preempt_count() uses
raw_cpu_read_4(), which will lose the volatile with that patch.

2018-12-10 00:59:27

by Nadav Amit

Subject: Re: Should this_cpu_read() be volatile?

> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote:
>
>>> I'm actually having difficulty finding the this_cpu_read() in any of the
>>> functions you mention, so I cannot make any concrete suggestions other
>>> than pointing at the alternative functions available.
>>
>>
>> So I dug deeper into the code to understand a couple of differences. In the
>> case of select_idle_sibling(), the patch (Peter's) increases the function
>> code size by 123 bytes (over the baseline of 986). The per-cpu variable is
>> read through the following call chain:
>>
>> select_idle_sibling()
>> => select_idle_cpu()
>> => local_clock()
>> => raw_smp_processor_id()
>>
>> This results in two more calls to sched_clock_cpu(), as the compiler assumes
>> the processor id may change in between (which obviously wouldn't happen).
>
> That is the thing with raw_smp_processor_id(): it is allowed to be used
> in preemptible context, and there it _obviously_ can change between
> subsequent invocations.
>
> So again, this change is actually good.
>
> If we want to fix select_idle_cpu(), we should maybe not use
> local_clock() there but use sched_clock_cpu() with a stable argument,
> this code runs with IRQs disabled and therefore the CPU number is stable
> for us here.
>
>> There may be more changes around, which I didn't fully analyze. But
>> at the very least, reading the processor id should not get "volatile".
>>
>> As for finish_task_switch(), the impact is only a few bytes, but still
>> unnecessary. It appears that with your patch, preempt_count() causes multiple
>> reads of __preempt_count in this code:
>>
>>     if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
>>                   "corrupted preempt_count: %s/%d/0x%x\n",
>>                   current->comm, current->pid, preempt_count()))
>>         preempt_count_set(FORK_PREEMPT_COUNT);
>
> My patch proposed here:
>
> https://marc.info/?l=linux-mm&m=154409548410209
>
> would actually fix that one, I think: preempt_count() uses
> raw_cpu_read_4(), which will lose the volatile with that patch.

Sorry for the spam from yesterday. That's what happens when I try to write
emails on my phone while I'm distracted.

I tested the patch you referenced, and it certainly improves the situation
for reads, but there are still small and big issues lying around.

The biggest one is that (I think) smp_processor_id() should apparently use
__this_cpu_read(). After all, when preemption checks are on, it is validated
that smp_processor_id() is not used while preemption is enabled, and IRQs
are not supposed to change its value. Otherwise, the generated code of many
functions is affected.

There are all kinds of other smaller issues, such as set_irq_regs() and
get_irq_regs(), which should run with interrupts disabled. They affect the
generated code in do_IRQ() and others.

But beyond that, there are so many places in the code that use
this_cpu_read() while IRQs are guaranteed to be disabled. For example,
arch/x86/mm/tlb.c is full of this_cpu_read/write(), and almost(?) all of it
should be running with interrupts disabled. Having said that, in my build
only flush_tlb_func_common() was affected.


2018-12-10 09:41:07

by Peter Zijlstra

Subject: Re: Should this_cpu_read() be volatile?

On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote:
> > On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <[email protected]> wrote:

> > My patch proposed here:
> >
> > https://marc.info/?l=linux-mm&m=154409548410209
> >
> > would actually fix that one, I think: preempt_count() uses
> > raw_cpu_read_4(), which will lose the volatile with that patch.

> I tested the patch you referenced, and it certainly improves the situation
> for reads, but there are still small and big issues lying around.

I'm sure :-(, this has been 'festering' for a long while, it seems. And
especially in x86-specific code, where for a long time we all assumed the
various per-cpu APIs were in fact the same (which turns out to very much
not be true).

> The biggest one is that (I think) smp_processor_id() should apparently use
> __this_cpu_read().

Agreed, and note that this will also improve code generation on !x86.

However, I'm not sure the current !debug definition:

#define smp_processor_id() raw_smp_processor_id()

is actually correct. Whereas raw_smp_processor_id() must be
this_cpu_read() to avoid CSE, we actually want to allow CSE on
smp_processor_id() etc.

> There are all kind of other smaller issues, such as set_irq_regs() and
> get_irq_regs(), which should run with disabled interrupts. They affect the
> generated code in do_IRQ() and others.
>
> But beyond that, there are so many places in the code that use
> this_cpu_read() while IRQs are guaranteed to be disabled. For example
> arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all
> should be running with interrupts disabled. Having said that, in my build
> only flush_tlb_func_common() was affected.

This all feels like something static analysis could help with; such
tools would also make sense for !x86 where the difference between the
various per-cpu accessors is even bigger.


2018-12-11 17:41:31

by Nadav Amit

Subject: Re: Should this_cpu_read() be volatile?

> On Dec 10, 2018, at 12:55 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote:
>>> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <[email protected]> wrote:
>
>>> My patch proposed here:
>>>
>>> https://marc.info/?l=linux-mm&m=154409548410209
>>>
>>> would actually fix that one, I think: preempt_count() uses
>>> raw_cpu_read_4(), which will lose the volatile with that patch.
>
>> I tested the patch you referenced, and it certainly improves the situation
>> for reads, but there are still small and big issues lying around.
>
> I'm sure :-(, this has been 'festering' for a long while it seems. And
> esp. on x86 specific code, where for a long time we all assumed the
> various per-cpu APIs were in fact the same (which turns out to very much
> not be true).
>
>> The biggest one is that (I think) smp_processor_id() should apparently use
>> __this_cpu_read().
>
> Agreed, and note that this will also improve code generation on !x86.
>
> However, I'm not sure the current !debug definition:
>
> #define smp_processor_id() raw_smp_processor_id()
>
> is actually correct. Whereas raw_smp_processor_id() must be
> this_cpu_read() to avoid CSE, we actually want to allow CSE on
> smp_processor_id() etc.

Yes. That makes sense.

>
>> There are all kind of other smaller issues, such as set_irq_regs() and
>> get_irq_regs(), which should run with disabled interrupts. They affect the
>> generated code in do_IRQ() and others.
>>
>> But beyond that, there are so many places in the code that use
>> this_cpu_read() while IRQs are guaranteed to be disabled. For example
>> arch/x86/mm/tlb.c is full with this_cpu_read/write() and almost(?) all
>> should be running with interrupts disabled. Having said that, in my build
>> only flush_tlb_func_common() was affected.
>
> This all feels like something static analysis could help with; such
> tools would also make sense for !x86 where the difference between the
> various per-cpu accessors is even bigger.

If something like that existed, it could also allow us to get rid of
local_irq_save() (and use local_irq_disable() instead).