2023-10-10 16:43:16

by Uros Bizjak

[permalink] [raw]
Subject: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
better optimizations, such as using the per-CPU offset directly as the
base register when computing the address, instead of emitting a
separate add instruction.

E.g. the address calculation in amd_pmu_enable_virt() improves from:

48 c7 c0 00 00 00 00 mov $0x0,%rax
87b7: R_X86_64_32S cpu_hw_events

65 48 03 05 00 00 00 add %gs:0x0(%rip),%rax
00
87bf: R_X86_64_PC32 this_cpu_off-0x4

48 c7 80 28 13 00 00 movq $0x0,0x1328(%rax)
00 00 00 00

to:

65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
00
8798: R_X86_64_PC32 this_cpu_off-0x4
48 c7 80 00 00 00 00 movq $0x0,0x0(%rax)
00 00 00 00
87a6: R_X86_64_32S cpu_hw_events+0x1328
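
(For context, the code being compiled is essentially the following per-CPU
pointer access; a rough sketch of the relevant lines, not the exact kernel
source. The 0x1328 displacement above is the offset of perf_ctr_virt_mask
inside struct cpu_hw_events:)

void amd_pmu_enable_virt(void)
{
        struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

        cpuc->perf_ctr_virt_mask = 0;
}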

Co-developed-by: Nadav Amit <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
---
arch/x86/include/asm/percpu.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..cdc188279c5a 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,6 +49,19 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+/*
+ * Efficient implementation for cases in which the compiler supports
+ * named address spaces. Allows the compiler to perform additional
+ * optimizations that can save more instructions.
+ */
+#define arch_raw_cpu_ptr(ptr)						\
+({									\
+	unsigned long tcp_ptr__;					\
+	tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;			\
+})
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
@@ -61,6 +74,8 @@
 	    : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr));		\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;			\
 })
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""
--
2.41.0


2023-10-10 17:33:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <[email protected]> wrote:
>
> Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
> better optimizations, such as setting an appropriate base to compute
> the address instead of an add instruction.

Hmm. I wonder..

> + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \

Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
Threadripper machine), 'rdgsbase' seems to be basically two cycles.

I wonder if we'd be better off using that, rather than doing the load.

Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
in that sense. The rdgsbase instruction *probably* will end up only
decoding in the first decoder etc. But we're talking single-cycle kind
of effects, and the rdgsbase case should be much better from a cache
perspective and might use fewer memory pipeline resources to offset
the fact that it uses an unusual front end decoder resource...

Yes, yes, we'd have to make it an alternative, and do something like

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;
        asm(ALTERNATIVE(
                "movq %%gs:this_cpu_off,%0",
                "rdgsbase %0",
                X86_FEATURE_FSGSBASE)
                : "=r" (res));
        return res;
}

but it still seems fairly straightforward. Hmm?

Linus

2023-10-10 18:23:14

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 7:32 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <[email protected]> wrote:
> >
> > Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
> > better optimizations, such as setting an appropriate base to compute
> > the address instead of an add instruction.
>
> Hmm. I wonder..
>
> > + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
>
> Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

Please note that besides propagation of the addition into address, the
patch also exposes memory load to the compiler, with the anticipation
that the compiler CSEs the load from this_cpu_off from eventual
multiple addresses. For this to work, we have to get rid of the asms.
It is important that the compiler knows that this is a memory load, so
it can also apply other compiler magic to it.

BTW: A follow-up patch will also use __raw_cpu_read to implement
this_cpu_read_stable. We can then read a "const aliased" current_task to
CSE the load even more, something similar to [1].

[1] https://lore.kernel.org/lkml/[email protected]/

Uros.

> On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
> Threadripper machine), 'rdgsbase' seems to be basically two cycles.
>
> I wonder if we'd be better off using that, rather than doing the load.
>
> Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
> in that sense. The rdgsbase instruction *probably* will end up only
> decoding in the first decoder etc. But we're talking single-cycle kind
> of effects, and the rdgsbase case should be much better from a cache
> perspective and might use fewer memory pipeline resources to offset
> the fact that it uses an unusual front end decoder resource...
>
> Yes, yes, we'd have to make it an alternative, and do something like
>
> static __always_inline unsigned long new_cpu_offset(void)
> {
> unsigned long res;
> asm(ALTERNATIVE(
> "movq %%gs:this_cpu_off,%0",
> "rdgsbase %0",
> X86_FEATURE_FSGSBASE)
> : "=r" (res));
> return res;
> }
>
> but it still seems fairly straightforward. Hmm?
>
> Linus

2023-10-10 18:26:27

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 10, 2023, at 9:22 PM, Uros Bizjak <[email protected]> wrote:
>
> On Tue, Oct 10, 2023 at 7:32 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <[email protected]> wrote:
>>>
>>> Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
>>> better optimizations, such as setting an appropriate base to compute
>>> the address instead of an add instruction.
>>
>> Hmm. I wonder..
>>
>>> + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
>>
>> Do we really even want to use __raw_cpu_read(this_cpu_off) at all?
>
> Please note that besides propagation of the addition into address, the
> patch also exposes memory load to the compiler, with the anticipation
> that the compiler CSEs the load from this_cpu_off from eventual
> multiple addresses. For this to work, we have to get rid of the asms.
> It is important that the compiler knows that this is a memory load, so
> it can also apply other compiler magic to it.
>
> BTW: A follow-up patch will also use__raw_cpu_read to implement
> this_cpu_read_stable. We can then read "const aliased" current_task to
> CSE the load even more, something similar to [1].

I was just writing the same thing. :)

As a minor note the proposed assembly version seems to be missing
__FORCE_ORDER as an input argument to prevent reordering past preempt_enable
and preempt_disable. But that’s really not the main point.
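
(For illustration, a minimal sketch of what adding that dependency could
look like, assuming the kernel's __FORCE_ORDER dummy memory operand from
<asm/special_insns.h>; this is illustrative, not the posted patch:)

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;

        /* Sketch: the __FORCE_ORDER input gives the asm a (fake) memory
         * dependency, so the compiler cannot move it across the barrier()
         * in preempt_disable()/preempt_enable(). */
        asm(ALTERNATIVE("movq %%gs:this_cpu_off,%0",
                        "rdgsbase %0",
                        X86_FEATURE_FSGSBASE)
            : "=r" (res)
            : __FORCE_ORDER);
        return res;
}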

2023-10-10 18:38:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <[email protected]> wrote:
>
> Please note that besides propagation of the addition into address, the
> patch also exposes memory load to the compiler, with the anticipation
> that the compiler CSEs the load from this_cpu_off from eventual
> multiple addresses. For this to work, we have to get rid of the asms.

I actually checked that the inline asm gets combined, the same way the
this_cpu_read_stable cases do (which we use for 'current').

Linus

2023-10-10 18:41:51

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 8:38 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <[email protected]> wrote:
> >
> > Please note that besides propagation of the addition into address, the
> > patch also exposes memory load to the compiler, with the anticipation
> > that the compiler CSEs the load from this_cpu_off from eventual
> > multiple addresses. For this to work, we have to get rid of the asms.
>
> I actually checked that the inline asm gets combined, the same way the
> this_cpu_read_stable cases do (which we use for 'current').

Yes, but does it CSE the load from multiple addresses?

Uros.

2023-10-10 18:43:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 11:25, Nadav Amit <[email protected]> wrote:
>
> As a minor note the proposed assembly version seems to be missing
> __FORCE_ORDER as an input argument to prevent reordering past preempt_enable
> and preempt_disable. But that’s really not the main point.

Hmm. No, it probably *is* the main point - see my reply to Uros that
the CSE on the inline asm itself gets rid of duplication.

And yes, we currently rely on that asm CSE for doing 'current' and not
reloading the value all the time.

So yes, we'd like to have a barrier for not moving it across the
preemption barriers, and __FORCE_ORDER would seem to be a good way to
do that.

I really suspect that 'rdgsbase' is better than a memory load in
practice, but I have no numbers to back that up, apart from a "it's
not a slow instruction, and core CPU is generally better than memory".

Linus

2023-10-10 18:52:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <[email protected]> wrote:
>
> Yes, but does it CSE the load from multiple addresses?

Yes, it should do that just right, because the *asm* itself is
identical, just the offsets (that gcc then adds separately) would be
different.

This is not unlike how we depend on gcc CSE'ing the "current" part
when doing multiple accesses of different members off that:

static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(pcpu_hot.current_task);
}

with this_cpu_read_stable() being an inline asm that lacks the memory
component (the same way the fallback hides it by just using
"%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
a memory access to gcc).

Of course, I think that with the "__seg_gs" patches, we *could* expose
the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
the alias analysis on that side and see that it can CSE the thing
anyway.

That might be a better choice than __FORCE_ORDER, in fact.

IOW, something like

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;
        asm(ALTERNATIVE(
                "movq " __percpu_arg(1) ",%0",
                "rdgsbase %0",
                X86_FEATURE_FSGSBASE)
                : "=r" (res)
                : "m" (this_cpu_off));
        return res;
}

would presumably work together with your __seg_gs stuff.

UNTESTED!!

Linus

2023-10-11 07:28:32

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 8:52 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <[email protected]> wrote:
> >
> > Yes, but does it CSE the load from multiple addresses?
>
> Yes, it should do that just right, because the *asm* itself is
> identical, just the offsets (that gcc then adds separately) would be
> different.

Indeed. To illustrate the question with an example, foo() and bar()
should compile to the same assembly, and there should be only one read
from m and n, respectively:

--cut here--
__seg_gs int m;

int foo (void)
{
  return m + m;
}

int n;

static inline int get (int *m)
{
  int res;

  asm ("mov %%gs:%1, %0" : "=r"(res) : "m"(*m));
  return res;
}

int bar (void)
{
  return get (&n) + get (&n);
}
--cut here--

And they do:

0000000000000000 <foo>:
0: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 7 <foo+0x7>
7: 01 c0 add %eax,%eax
9: c3 retq

0000000000000010 <bar>:
10: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 17 <bar+0x7>
17: 01 c0 add %eax,%eax
19: c3 retq

>
> This is not unlike how we depend on gcc CSE'ing the "current" part
> when doing multiple accesses of different members off that:
>
> static __always_inline struct task_struct *get_current(void)
> {
> return this_cpu_read_stable(pcpu_hot.current_task);
> }
>
> with this_cpu_read_stable() being an inline asm that lacks the memory
> component (the same way the fallback hides it by just using
> "%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
> a memory access to gcc).
>
> Of course, I think that with the "__seg_gs" patches, we *could* expose
> the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
> the alias analysis on that side and see that it can CSE the thing
> anyway.
>
> That might be a better choice than __FORCE_ORDER, in fact.
>
> IOW, something like
>
> static __always_inline unsigned long new_cpu_offset(void)
> {
> unsigned long res;
> asm(ALTERNATIVE(
> "movq " __percpu_arg(1) ",%0",
> "rdgsbase %0",
> X86_FEATURE_FSGSBASE)
> : "=r" (res)
> : "m" (this_cpu_off));
> return res;
> }
>
> would presumably work together with your __seg_gs stuff.

I have zero experience with the rdgsbase insn, but the above is not
dependent on __seg_gs, so (the movq part at least) it would also work in
the current mainline. To work together with the __seg_gs stuff,
this_cpu_off should be enclosed in __my_cpu_var, roughly as sketched
below. Also, if rdgsbase is substituted with rdfsbase, it will also work
for 32-bit targets.
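
(A sketch of that adjustment, reusing the __my_cpu_var() helper that the
fallback arch_raw_cpu_ptr() above already uses; illustrative only, not a
posted patch:)

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;

        /* Same ALTERNATIVE, but the memory operand goes through
         * __my_cpu_var() so it has the right address space on
         * __seg_gs builds. */
        asm(ALTERNATIVE("movq " __percpu_arg(1) ",%0",
                        "rdgsbase %0",
                        X86_FEATURE_FSGSBASE)
            : "=r" (res)
            : "m" (__my_cpu_var(this_cpu_off)));
        return res;
}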

Uros.

> UNTESTED!!
>
> Linus

2023-10-11 07:42:21

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 10, 2023, at 9:37 PM, Linus Torvalds <[email protected]> wrote:
>
>
> On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <[email protected]> wrote:
>>
>> Please note that besides propagation of the addition into address, the
>> patch also exposes memory load to the compiler, with the anticipation
>> that the compiler CSEs the load from this_cpu_off from eventual
>> multiple addresses. For this to work, we have to get rid of the asms.
>
> I actually checked that the inline asm gets combined, the same way the
> this_cpu_read_stable cases do (which we use for 'current’)

You are correct. Having said that, for "current" we may be able to do
something better, as regardless of preemption "current" remains the same,
and this_cpu_read_stable() does miss some opportunities to avoid reloading
the value from memory. I proposed a solution before, but I am not sure it
would work properly with LTO. I guess Uros would know better.


2023-10-11 07:46:19

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 9:27 AM Uros Bizjak <[email protected]> wrote:

> > IOW, something like
> >
> > static __always_inline unsigned long new_cpu_offset(void)
> > {
> > unsigned long res;
> > asm(ALTERNATIVE(
> > "movq " __percpu_arg(1) ",%0",
> > "rdgsbase %0",
> > X86_FEATURE_FSGSBASE)
> > : "=r" (res)
> > : "m" (this_cpu_off));
> > return res;
> > }
> >
> > would presumably work together with your __seg_gs stuff.
>
> I have zero experience with rdgsbase insn, but the above is not
> dependent on __seg_gs, so (the movq part at least) would also work in
> the current mainline. To work together with __seg_gs stuff,
> this_cpu_offset should be enclosed in __my_cpu_var. Also, if rdgsbase
> is substituted with rdfsbase, it will also work for 32-bit targets.

In fact, rdgsbase is available only for 64-bit targets.

Uros.

2023-10-11 18:42:50

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 8:52 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <[email protected]> wrote:
> >
> > Yes, but does it CSE the load from multiple addresses?
>
> Yes, it should do that just right, because the *asm* itself is
> identical, just the offsets (that gcc then adds separately) would be
> different.
>
> This is not unlike how we depend on gcc CSE'ing the "current" part
> when doing multiple accesses of different members off that:
>
> static __always_inline struct task_struct *get_current(void)
> {
> return this_cpu_read_stable(pcpu_hot.current_task);
> }
>
> with this_cpu_read_stable() being an inline asm that lacks the memory
> component (the same way the fallback hides it by just using
> "%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
> a memory access to gcc).
>
> Of course, I think that with the "__seg_gs" patches, we *could* expose
> the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
> the alias analysis on that side and see that it can CSE the thing
> anyway.
>
> That might be a better choice than __FORCE_ORDER, in fact.
>
> IOW, something like
>
> static __always_inline unsigned long new_cpu_offset(void)
> {
> unsigned long res;
> asm(ALTERNATIVE(
> "movq " __percpu_arg(1) ",%0",
> "rdgsbase %0",
> X86_FEATURE_FSGSBASE)
> : "=r" (res)
> : "m" (this_cpu_off));
> return res;
> }
>
> would presumably work together with your __seg_gs stuff.
>
> UNTESTED!!

The attached patch was tested on a target with fsgsbase CPUID and
without it. It works!

The patch improves amd_pmu_enable_virt() in the same way as reported
in the original patch submission and also reduces the number of percpu
offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
1571.

The only drawback is a larger binary size:

text data bss dec hex filename
25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o

that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.

I'll prepare and submit a patch for tip/percpu branch.

Uros.


>
> Linus


Attachments:
cpu_ptr-mainline.diff.txt (1.46 kB)

2023-10-11 19:37:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 00:42, Nadav Amit <[email protected]> wrote:
>
> You are correct. Having said that, for “current" we may be able to do something
> better, as regardless to preemption “current" remains the same, and
> this_cpu_read_stable() does miss some opportunities to avoid reloading the
> value from memory.

It would be lovely to generate even better code, but that
this_cpu_read_stable() thing is the best we've come up with. It
intentionally has *no* memory inputs or anything else that might make
gcc think "I need to re-do this".

For example, instead of using "m" as a memory input, it very
intentionally uses "p", to make it clear that that it just uses the
_pointer_, not the memory location itself.
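
(A simplified sketch of that pattern; the real macro is percpu_stable_op()
in arch/x86/include/asm/percpu.h, and the name below is made up:)

/* Sketch: only the *address* of the per-CPU variable is passed in, via
 * the "p" constraint, so gcc sees no memory operand that a store or
 * clobber could invalidate and is free to reuse the asm's result. */
#define this_cpu_read_stable_sketch(var)                                \
({                                                                      \
        typeof(var) val__;                                              \
        asm("movq %%gs:%P1, %0" : "=r" (val__) : "p" (&(var)));         \
        val__;                                                          \
})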

That's obviously a lie - it actually does access memory - but it's a
lie exactly because of the reason you mention: even when the memory
location changes due to preemption (or explicit scheduling), it always
changes back to the value we care about.

So gcc _should_ be able to CSE it in all situations, but it's entirely
possible that gcc then decides to re-generate the value for whatever
reason. It's a cheap op, so it's ok to regen, of course, but the
intent is basically to let the compiler re-use the value as much as
possible.

But it *is* probably better to regenerate the value than it would be
to spill and re-load it, and from the cases I've seen, this all tends
to work fairly well.

Linus

2023-10-11 19:41:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 00:45, Uros Bizjak <[email protected]> wrote:
>
> In fact, rdgsbase is available only for 64-bit targets.

Not even all 64-bit targets. That's why I did that ALTERNATIVE() thing
with X86_FEATURE_FSGSBASE, which uses the kernel instruction
re-writing.

So that suggested asm of mine defaults to loading the value from
memory through %gs, but with X86_FEATURE_FSGSBASE it gets rewritten to
use rdgsbase.

And again - I'm not sure it's any faster, but it's _potentially_
certainly better in that it doesn't use the cache-line.

Linus

2023-10-11 19:52:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 11:42, Uros Bizjak <[email protected]> wrote:
>
> The attached patch was tested on a target with fsgsbase CPUID and
> without it. It works!

.. I should clearly read all my emails before answering some of them.

Yes, that patch looks good to me, and I'm happy to hear that you
actually tested it unlike my "maybe something like this".

> The patch improves amd_pmu_enable_virt() in the same way as reported
> in the original patch submission and also reduces the number of percpu
> offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
> 1571.

Dio y ou have any actka performance numbers? The patch looks good to
me, and I *think* rdgsbase ends up being faster in practice due to
avoiding a memory access, but that's very much a gut feel.

> The only drawback is a larger binary size:
>
> text data bss dec hex filename
> 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
>
> that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.

I'm actually surprised that it increases the text size. The 'rdgsbase'
instruction should be smaller than a 'mov %gs', so I would have
expected the *data* size to increase due to the alternatives tables,
but not the text size.

[ Looks around ]

Oh. It's because we put the altinstructions into the text section.
That's kind of silly, but whatever.

So I think that increase in text-size is not "real" - yes, it
increases our binary size because we obviously have two instructions,
but the actual *executable* part likely stays the same, and it's just
that we grow the altinstruction metadata.

Linus

2023-10-11 19:53:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 12:51, Linus Torvalds
<[email protected]> wrote:
>
> Dio y ou have any actka performance numbers?

And before anybody worries - no, I didn't have a stroke in the middle
of writing that. I'm just not a great typist ;)

Linus

2023-10-11 20:01:18

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 9:52 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 11 Oct 2023 at 11:42, Uros Bizjak <[email protected]> wrote:
> >
> > The attached patch was tested on a target with fsgsbase CPUID and
> > without it. It works!
>
> .. I should clearly read all my emails before answering some of them.
>
> Yes, that patch looks good to me, and I'm happy to hear that you
> actually tested it unlike my "maybe something like this".
>
> > The patch improves amd_pmu_enable_virt() in the same way as reported
> > in the original patch submission and also reduces the number of percpu
> > offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
> > 1571.
>
> Dio y ou have any actka performance numbers? The patch looks good to
> me, and I *think* rdgsbase ends up being faster in practice due to
> avoiding a memory access, but that's very much a gut feel.

Unfortunately, I don't have any perf numbers, only those from Agner's
instruction tables. The memory access performance has so many
parameters that gut feeling is the only thing besides real
case-by-case measurements. The rule of thumb in the compiler world is
also that memory access should be avoided.

Uros.

>
> > The only drawback is a larger binary size:
> >
> > text data bss dec hex filename
> > 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> > 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
> >
> > that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
>
> I'm actually surprised that it increases the text size. The 'rdgsbase'
> instruction should be smaller than a 'mov %gs', so I would have
> expected the *data* size to increase due to the alternatives tables,
> but not the text size.
>
> [ Looks around ]
>
> Oh. It's because we put the altinstructions into the text section.
> That's kind of silly, but whatever.
>
> So I think that increase in text-size is not "real" - yes, it
> increases our binary size because we obviously have two instructions,
> but the actual *executable* part likely stays the same, and it's just
> that we grow the altinstruction metadata.
>
> Linus

2023-10-11 21:33:23

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 9:37 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 11 Oct 2023 at 00:42, Nadav Amit <[email protected]> wrote:
> >
> > You are correct. Having said that, for “current" we may be able to do something
> > better, as regardless to preemption “current" remains the same, and
> > this_cpu_read_stable() does miss some opportunities to avoid reloading the
> > value from memory.
>
> It would be lovely to generate even better code, but that
> this_cpu_read_stable() thing is the best we've come up with. It
> intentionally has *no* memory inputs or anything else that might make
> gcc think "I need to re-do this".

The attached patch makes this_cpu_read_stable a bit better by using
rip-relative addressing. It brings an immediate reduction of the text
section by 4kB and also makes the kernel a bit more PIE friendly.

> For example, instead of using "m" as a memory input, it very
> intentionally uses "p", to make it clear that that it just uses the
> _pointer_, not the memory location itself.
>
> That's obviously a lie - it actually does access memory - but it's a
> lie exactly because of the reason you mention: even when the memory
> location changes due to preemption (or explicit scheduling), it always
> changes back to the the value we care about.
>
> So gcc _should_ be able to CSE it in all situations, but it's entirely
> possible that gcc then decides to re-generate the value for whatever
> reason. It's a cheap op, so it's ok to regen, of course, but the
> intent is basically to let the compiler re-use the value as much as
> possible.
>
> But it *is* probably better to regenerate the value than it would be
> to spill and re-load it, and from the cases I've seen, this all tends
> to work fairly well.

Reading the above, it looks to me that we don't want to play games
with "const aliased" versions of current_task [1], as proposed by
Nadav in his patch series. The current version of
this_cpu_read_stable() (plus the attached trivial patch) is as good as
it can get.

[1] https://lore.kernel.org/lkml/[email protected]/

Uros.


Attachments:
p.diff.txt (765.00 B)

2023-10-11 21:56:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
>
> Reading the above, it looks to me that we don't want to play games
> with "const aliased" versions of current_task [1], as proposed by
> Nadav in his patch series.

Well, maybe I'd like it if I saw what the effect of it was, but that
patch mentions "sync_mm_rss()" which doesn't actually exist
(SPLIT_RSS_COUNTING is never defined, the split version is gone and
hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
into percpu_counter")

I'm not sure why gcc used to get code generation wrong there, and I
don't see what could be improved in the current implementation of
'current_task', but I guess there could be things that could be done
to make gcc more likely to CSE the cases..

IOW, I don't understand why Nadav's patch would improve gcc code gen.
It presumably depends on some gcc internal issue (ie "gcc just happens
to be better at optimization X").

Which is obviously a valid thing in general.

Linus

2023-10-11 22:38:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Linus Torvalds <[email protected]> wrote:

> > The only drawback is a larger binary size:
> >
> > text data bss dec hex filename
> > 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> > 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
> >
> > that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
>
> I'm actually surprised that it increases the text size. The 'rdgsbase'
> instruction should be smaller than a 'mov %gs', so I would have
> expected the *data* size to increase due to the alternatives tables,
> but not the text size.
>
> [ Looks around ]
>
> Oh. It's because we put the altinstructions into the text section.
> That's kind of silly, but whatever.

Yeah, we should probably move .altinstructions from init-text to .init.data
or so? The text area contains a bunch of other sections too that don't get
executed directly ... and in fact has some non-code data structures too,
such as ... ".apicdrivers". :-/

I suspect people put all that into .text because it was the easiest place
to modify in the x86 linker script, and linker scripts are arguably scary.

Will check all this.

Thanks,

Ingo

2023-10-11 23:16:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On 10/11/23 15:37, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
>>> The only drawback is a larger binary size:
>>>
>>> text data bss dec hex filename
>>> 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
>>> 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
>>>
>>> that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
>>
>> I'm actually surprised that it increases the text size. The 'rdgsbase'
>> instruction should be smaller than a 'mov %gs', so I would have
>> expected the *data* size to increase due to the alternatives tables,
>> but not the text size.
>>
>> [ Looks around ]
>>
>> Oh. It's because we put the altinstructions into the text section.
>> That's kind of silly, but whatever.
>
> Yeah, we should probably move .altinstructions from init-text to .init.data
> or so? Contains a bunch of other sections too that don't get executed
> directly ... and in fact has some non-code data structures too, such as ...
> ".apicdrivers". :-/
>
> I suspect people put all that into .text because it was the easiest place
> to modify in the x86 linker script, and linker scripts are arguably scary.
>

Well, it's more than that; "size" considers all non-writable sections to
be "text".

-hpa

2023-10-12 01:35:22

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 04:15:15PM -0700, H. Peter Anvin wrote:
> On 10/11/23 15:37, Ingo Molnar wrote:
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > > The only drawback is a larger binary size:
> > > >
> > > > text data bss dec hex filename
> > > > 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> > > > 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
> > > >
> > > > that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
> > >
> > > I'm actually surprised that it increases the text size. The 'rdgsbase'
> > > instruction should be smaller than a 'mov %gs', so I would have
> > > expected the *data* size to increase due to the alternatives tables,
> > > but not the text size.
> > >
> > > [ Looks around ]
> > >
> > > Oh. It's because we put the altinstructions into the text section.
> > > That's kind of silly, but whatever.
> >
> > Yeah, we should probably move .altinstructions from init-text to .init.data
> > or so? Contains a bunch of other sections too that don't get executed
> > directly ... and in fact has some non-code data structures too, such as ...
> > ".apicdrivers". :-/
> >
> > I suspect people put all that into .text because it was the easiest place
> > to modify in the x86 linker script, and linker scripts are arguably scary.
> >
>
> Well, it's more than that; "size" considers all non-writable sections to be
> "text".

Indeed, I added a printf to "size", it shows that all the following
sections are "text":

.text
.pci_fixup
.tracedata
__ksymtab
__ksymtab_gpl
__ksymtab_strings
__init_rodata
__param
__ex_table
.notes
.orc_header
.orc_unwind_ip
.orc_unwind
.init.text
.altinstr_aux
.x86_cpu_dev.init
.parainstructions
.retpoline_sites
.return_sites
.call_sites
.altinstructions
.altinstr_replacement
.exit.text
.smp_locks

I can't fathom why it doesn't just filter based on the EXECINSTR section
flag.

"size" is probably worse than useless, as many of these sections can
change size rather arbitrarily, especially .orc_* and .*_sites.

I can't help but wonder how many hasty optimizations have been made over
the years based on the sketchy output of this tool.

It should be trivial to replace the use of "size" with our own
"text_size" script which does what we want, e.g., filter on EXECINSTR.

Here are the current EXECINSTR sections:

~/git/binutils-gdb/binutils $ readelf -WS /tmp/vmlinux |grep X
[ 1] .text PROGBITS ffffffff81000000 200000 1200000 00 AX 0 0 4096
[21] .init.text PROGBITS ffffffff833b7000 27b7000 091b50 00 AX 0 0 16
[22] .altinstr_aux PROGBITS ffffffff83448b50 2848b50 00176a 00 AX 0 0 1
[30] .altinstr_replacement PROGBITS ffffffff8372661a 2b2661a 0028b9 00 AX 0 0 1
[32] .exit.text PROGBITS ffffffff83728f10 2b28f10 0030c7 00 AX 0 0 16

As Ingo mentioned, we could make .altinstr_replacement non-executable.
That confuses objtool, but I think we could remedy that pretty easily.

Though, another problem is that .text has a crazy amount of padding
which makes it always the same size, due to the SRSO alias mitigation
alignment linker magic. We should fix that somehow.

--
Josh

2023-10-12 06:19:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Josh Poimboeuf <[email protected]> wrote:

> Though, another problem is that .text has a crazy amount of padding
> which makes it always the same size, due to the SRSO alias mitigation
> alignment linker magic. We should fix that somehow.

We could emit a non-aligned end-of-text symbol (we might have it already),
and have a script or small .c program in scripts/ or tools/ that looks
at vmlinux and displays a user-friendly and accurate list of text and
data sizes in the kernel?

And since objtool is technically an 'object files tool', and it already
looks at sections & symbols, it could also grow a:

objtool size <objfile>

command that does the sane thing ... I'd definitely start using that, instead of 'size'.

/me runs :-)

Thanks,

Ingo

2023-10-12 15:19:22

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
>
>
> On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
>>
>> Reading the above, it looks to me that we don't want to play games
>> with "const aliased" versions of current_task [1], as proposed by
>> Nadav in his patch series.
>
> Well, maybe I'd like it if I saw what the effect of it was, but that
> patch mentions "sync_mm_rss()" which doesn't actually exist
> (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> into percpu_counter")

So I added a new version of the current aliasing (well, actually pcpu_hot
in the new version) on top of Uros’s patches, and the effect can be seen
in many functions. I don’t want to bother with many examples so here is
a common and simple one:

Currently syscall_exit_work() starts with:

0xffffffff8111e120 <+0>: push %rbp
0xffffffff8111e121 <+1>: mov %rdi,%rbp
0xffffffff8111e124 <+4>: push %rbx
0xffffffff8111e125 <+5>: mov %rsi,%rbx
0xffffffff8111e128 <+8>: and $0x20,%esi
0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)

Using the const-alias changes the beginning of syscall_exit_work to:

0xffffffff8111cb80 <+0>: push %r12
0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
0xffffffff8111cb8a <+10>: push %rbp
0xffffffff8111cb8b <+11>: mov %rdi,%rbp
0xffffffff8111cb8e <+14>: push %rbx
0xffffffff8111cb8f <+15>: mov %rsi,%rbx
0xffffffff8111cb92 <+18>: and $0x20,%esi
0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)

So we see both that RIP-relative addressing is being used (hence the
instruction is one byte shorter) and that the reload goes away.

Now, I am not a compiler expert as far as the rationale goes, but googling
around I can see Nick explaining it [1] - if you use "p" you read memory.
BTW: it is related to a discussion you had [2], in which you encountered an
issue I also encountered before [3]. My bad for pushing it in.

Anyhow, I created similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 )
to show this behavior - how compiler barriers cause a reload. The behavior
shows up on various versions of both GCC and clang.
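
(A reduced sketch of that effect, assuming a compiler with named address
space (__seg_gs) support; the variable names here are made up:)

/* Sketch: a const-qualified __seg_gs object may be assumed unmodified, so
 * the compiler can keep its value in a register across a compiler barrier,
 * while the plain __seg_gs object has to be reloaded after the barrier. */
extern const __seg_gs unsigned long cur_task;
extern __seg_gs unsigned long some_counter;

unsigned long demo(void)
{
        unsigned long a = cur_task + some_counter;

        asm volatile("" ::: "memory");          /* barrier() */
        return a + cur_task + some_counter;     /* some_counter is reloaded */
}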

The idea behind the patch is that it communicates - at compilation-unit
granularity - that current is fixed. There is an open question of whether
it works with LTO, which I have never checked.


[1] https://reviews.llvm.org/D145416
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/

--

Here's the updated patch - but I didn't really boot a machine with it, so
new issues might have crept in since my last patch set:

-- >8 --

Date: Thu, 12 Oct 2023 06:02:03 -0700
Subject: [PATCH] Const current

---
arch/x86/include/asm/current.h | 17 ++++++++++++++++-
arch/x86/kernel/cpu/common.c | 4 ++++
include/linux/compiler.h | 2 +-
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index a1168e7b69e5..d05fbb6a8bd7 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -36,9 +36,24 @@ static_assert(sizeof(struct pcpu_hot) == 64);
 
 DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);
 
+/*
+ * Hold a constant alias for current_task, which would allow to avoid caching of
+ * current task.
+ *
+ * We must mark const_current_task with the segment qualifiers, as otherwise gcc
+ * would do redundant reads of const_current_task.
+ */
+DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot);
+
 static __always_inline struct task_struct *get_current(void)
 {
-	return this_cpu_read_stable(pcpu_hot.current_task);
+
+	/*
+	 * GCC is missing functionality of removing segment qualifiers, which
+	 * messes with per-cpu infrastructure that holds local copies. Use
+	 * __raw_cpu_read to avoid holding any copy.
+	 */
+	return __raw_cpu_read(, const_pcpu_hot.current_task);
 }
 
 #define current get_current()
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 382d4e6b848d..94590af11388 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2052,6 +2052,10 @@ DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = {
 };
 EXPORT_PER_CPU_SYMBOL(pcpu_hot);
 
+DECLARE_PER_CPU_ALIGNED(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot)
+	__attribute__((alias("pcpu_hot")));
+EXPORT_PER_CPU_SYMBOL(const_pcpu_hot);
+
 #ifdef CONFIG_X86_64
 DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
 		     fixed_percpu_data) __aligned(PAGE_SIZE) __visible;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index d7779a18b24f..e7059292085e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -212,7 +212,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  */
 #define ___ADDRESSABLE(sym, __attrs)					\
 	static void * __used __attrs					\
-	__UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)&sym;
+	__UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)(uintptr_t)&sym;
 #define __ADDRESSABLE(sym)						\
 	___ADDRESSABLE(sym, __section(".discard.addressable"))
 
--
2.25.1

2023-10-12 16:08:28

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote:
>
> * Josh Poimboeuf <[email protected]> wrote:
>
> > Though, another problem is that .text has a crazy amount of padding
> > which makes it always the same size, due to the SRSO alias mitigation
> > alignment linker magic. We should fix that somehow.
>
> We could emit a non-aligned end-of-text symbol (we might have it already),
> and have a script or small .c program in scripts/ or tools/ that looks
> at vmlinux and displays a user-friendly and accurate list of text and
> data sizes in the kernel?
>
> And since objtool is technically an 'object files tool', and it already
> looks at sections & symbols, it could also grow a:
>
> objtool size <objfile>
>
> command that does the sane thing ... I'd definitely start using that, instead of 'size'.
>
> /me runs :-)

Yeah, that's actually not a bad idea.

I had been thinking a "simple" script would be fine, but I'm realizing
the scope of this thing could grow over time. In which case a script is
less than ideal. And objtool already has the ability to do this pretty
easily.

--
Josh

2023-10-12 16:33:55

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
>
>
> > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> >
> >
> > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> >>
> >> Reading the above, it looks to me that we don't want to play games
> >> with "const aliased" versions of current_task [1], as proposed by
> >> Nadav in his patch series.
> >
> > Well, maybe I'd like it if I saw what the effect of it was, but that
> > patch mentions "sync_mm_rss()" which doesn't actually exist
> > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > into percpu_counter")
>
> So I added a new version of the current aliasing (well, actually pcpu_hot
> in the new version) on top of Uros’s patches, and the effect can be seen
> in many functions. I don’t want to bother with many examples so here is
> a common and simple one:
>
> Currently syscall_exit_work() that starts with:
>
> 0xffffffff8111e120 <+0>: push %rbp
> 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> 0xffffffff8111e124 <+4>: push %rbx
> 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> 0xffffffff8111e128 <+8>: and $0x20,%esi
> 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
>
> Using the const-alias changes the beginning of syscall_exit_work to:
>
> 0xffffffff8111cb80 <+0>: push %r12
> 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> 0xffffffff8111cb8a <+10>: push %rbp
> 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> 0xffffffff8111cb8e <+14>: push %rbx
> 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> 0xffffffff8111cb92 <+18>: and $0x20,%esi
> 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
>
> So we both see RIP-relative addressing is being used (hence the instruction is
> one byte shorter) and the reload going away.

Just a quick remark here:

For some reason the existing percpu_stable_op asm uses the %P operand
modifier. This will drop all syntax-specific prefixes and issue the
bare constant. It will also remove the (%rip) suffix. What we want
here is the generic %a modifier (see 6.47.2.8 Generic Operand Modifiers
[1]), which substitutes a memory reference, with the actual operand
treated as the address. In combination with the "p" constraint it will
DTRT and emit the symbol with the (%rip) suffix when available, also
when -fpie is in effect.

[1] https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc.pdf

Uros.

2023-10-12 16:55:37

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 6:33 PM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
> >
> >
> > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> > >
> > >
> > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> > >>
> > >> Reading the above, it looks to me that we don't want to play games
> > >> with "const aliased" versions of current_task [1], as proposed by
> > >> Nadav in his patch series.
> > >
> > > Well, maybe I'd like it if I saw what the effect of it was, but that
> > > patch mentions "sync_mm_rss()" which doesn't actually exist
> > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > > into percpu_counter")
> >
> > So I added a new version of the current aliasing (well, actually pcpu_hot
> > in the new version) on top of Uros’s patches, and the effect can be seen
> > in many functions. I don’t want to bother with many examples so here is
> > a common and simple one:
> >
> > Currently syscall_exit_work() that starts with:
> >
> > 0xffffffff8111e120 <+0>: push %rbp
> > 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> > 0xffffffff8111e124 <+4>: push %rbx
> > 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> > 0xffffffff8111e128 <+8>: and $0x20,%esi
> > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
> >
> > Using the const-alias changes the beginning of syscall_exit_work to:
> >
> > 0xffffffff8111cb80 <+0>: push %r12
> > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> > 0xffffffff8111cb8a <+10>: push %rbp
> > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> > 0xffffffff8111cb8e <+14>: push %rbx
> > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> > 0xffffffff8111cb92 <+18>: and $0x20,%esi
> > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
> >
> > So we both see RIP-relative addressing is being used (hence the instruction is
> > one byte shorter) and the reload going away.
>
> Just a quick remark here:
>
> For some reason existing percpu_stable_op asm uses %P operand
> modifier. This will drop all syntax-specific prefixes and issue the
> bare constant. It will also remove the (%rip) suffix. What we want
> here is a generic %a modifier (See 6.47.2.8 Generic Operand Modifiers
> [1]) that will substitute a memory reference, with the actual operand
> treated as the address. In combination with "p" constraint will DTRT
> and will emit symbol with the (%rip) suffix when available, also when
> -fpie is in effect.

An example:

--cut here--
int m;

int foo (void)
{
asm ("# %0 %P0 %a0" :: "p" (&m));
}
--cut here--

gcc -O2 -S:

# $m m m(%rip)

gcc -O2 -fpie -S:

# $m m m(%rip)

gcc -O2 -m32 -S:

# $m m m

Uros.

2023-10-12 16:57:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 09:33, Uros Bizjak <[email protected]> wrote:
>
> For some reason existing percpu_stable_op asm uses %P operand
> modifier. This will drop all syntax-specific prefixes and issue the
> bare constant. It will also remove the (%rip) suffix. What we want
> here is a generic %a modifier (See 6.47.2.8 Generic Operand Modifiers
> [1]) that will substitute a memory reference, with the actual operand
> treated as the address. In combination with "p" constraint will DTRT
> and will emit symbol with the (%rip) suffix when available, also when
> -fpie is in effect.

Well, I have to admit that I think the main reason we used "p" as a
constraint was simply that we knew about it, and I don't think I've
ever even realized "a" existed.

In fact, we historically didn't use a lot of operand modifiers, and
the only common ones (at least for x86) tend to be the "register size"
modifiers (b/w/q).

I just did

git grep 'asm.*[^%]%[a-z][0-9]' arch/x86

and while that will only catch one-liner inline asm cases, all it
finds is indeed just the size ones.

A slightly smarter grep finds a couple of uses of '%c' for bare constants.

So I think the "we used P for percpu_stable_op" is really mostly a "we
are not actually very familiar with the operand modifiers". We use
the constraints fairly wildly, but the operand modifiers are a
relative rarity.

I suspect - but it's much too long ago - that some gcc person just
told us that "solve this using 'P'" and we never went any further.

So changing it to use 'a' sounds like the right thing to do, and we
can only plead ignorance, not wilful stupidity.

Linus

2023-10-12 17:10:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <[email protected]> wrote:
>
> An example:

Oh, I'm convinced.

The fix seems to be a simple one-liner, ie just

- asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
+ asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \

and it turns out that we have other places where I think we could use that '%a',

For example, we have things like this:

asm ("lea sme_cmdline_arg(%%rip), %0"
: "=r" (cmdline_arg)
: "p" (sme_cmdline_arg));

and I think the only reason we do that ridiculous asm is that the code
in question really does want that (%rip) encoding. It sounds like this
could just do

asm ("lea %a1, %0"
: "=r" (cmdline_arg)
: "p" (sme_cmdline_arg));

instead. Once again, I claim ignorance of the operand modifiers as the
reason for these kinds of things.

But coming back to the stable op thing, I do wonder if there is some
way we could avoid the unnecessary reload.

I don't hate Nadav's patch, so that part is fine, but I'd like to
understand what it is that makes gcc think it needs to reload. We have
other cases (like the ALTERNATIVE() uses) where we *have* to use
inline asm, so it would be good to know...

Is it just that "p" (in the constraint, not "P" in the modifier) ends
up always being seen as a memory access, even when we only use the
address?

That part has never really been something we've been entirely clear
on. We *are* passing in just the address, so the hope in *that* place
is that it's only an address dependency, not a memory one.

(Of course, we use 'p' constraints in other places, and may
expect/assume that those have memory dependency. Again - this has
never been entirely clear)

Linus

2023-10-12 17:17:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 08:19, Nadav Amit <[email protected]> wrote:
>
> +/*
> + * Hold a constant alias for current_task, which would allow to avoid caching of
> + * current task.
> + *
> + * We must mark const_current_task with the segment qualifiers, as otherwise gcc
> + * would do redundant reads of const_current_task.
> + */
> +DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot);

Hmm. The only things I'm not super-happy about with your patch are

(a) it looks like this depends on the alias analysis knowing that the
__seg_gs isn't affected by normal memory ops. That implies that this
will not work well with compiler versions that don't do that?

(b) This declaration doesn't match the other one. So now there are
two *different* declarations for const_pcpu_hot, which I really don't
like.

That second one would seem to be trivial to just fix (or maybe not,
and you do it that way for some horrible reason).

The first one sounds bad to me - basically making the *reason* for
this patch go away - but maybe the compilers that don't support
address spaces are so rare that we can ignore it.

Linus

2023-10-12 17:47:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
<[email protected]> wrote:
>
> The fix seems to be a simple one-liner, ie just
>
> - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \

Nope. That doesn't work at all.

It turns out that we're not the only ones that didn't know about the
'a' modifier.

clang has also never heard of it in this context, and the above
one-liner results in an endless sea of errors, with

error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'

Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
supposed to handle these things, and while it does have some handling
for 'a', the comment around it says

case 'a': // This is an address. Currently only 'i' and 'r' are expected.

and I think our use ends up just confusing the heck out of clang. Of
course, clang also does this:

case 'P': // This is the operand of a call, treat specially.
PrintPCRelImm(MI, OpNo, O);
return false;

so clang *already* generates those 'current' accesses as PC-relative, and I see

movq %gs:pcpu_hot(%rip), %r13

in the generated code.

End result: clang actually generates what we want just using 'P', and
the whole "P vs a" is only a gcc thing.

Why *does* gcc do that silly thing of dropping '(%rip)' from the address, btw?

Linus

2023-10-12 17:52:44

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 7:10 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <[email protected]> wrote:
> >
> > An example:
>
> Oh, I'm convinced.
>
> The fix seems to be a simple one-liner, ie just
>
> - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \

The effect of the change:

25542442 4387686 808452 30738580 1d50894 vmlinux-new.o
25546484 4387686 808452 30742622 1d5185e vmlinux-old.o

> and it turns out that we have other places where I think we could use that '%a',
>
> For example, we have things like this:
>
> asm ("lea sme_cmdline_arg(%%rip), %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> and I think the only reason we do that ridiculous asm is that the code
> in question really does want that (%rip) encoding. It sounds like this
> could just do
>
> asm ("lea %a1, %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> instead. Once again, I claim ignorance of the operand modifiers as the
> reason for these kinds of things.
>
> But coming back to the stable op thing, I do wonder if there is some
> way we could avoid the unnecessary reload.
>
> I don't hate Nadav's patch, so that part is fine, but I'd like to
> understand what it is that makes gcc think it needs to reload. We have
> other cases (like the ALTERNATIVE() uses) where we *have* to use
> inline asm, so it would be good to know...
>
> Is it just that "p" (in the constraint, not "P" in the modifier) ends
> up always being seen as a memory access, even when we only use the
> address?
>
> That part has never really been something we've been entirely clear
> on. We *are* passing in just the address, so the hope in *that* place
> is that it's only an address dependency, not a memory one.

Let's see the difference of:

--cut here--
int m;

void foo (void)
{
  asm ("# %a0" :: "p" (&m));
}

void bar (void)
{
  asm ("# %0" :: "m" (m));
}
--cut here--

The internal dump shows:

(insn:TI 5 2 15 2 (parallel [
(asm_operands/v ("# %a0") ("") 0 [
(symbol_ref:DI ("m") [flags 0x2] <var_decl
0x7f3175011bd0 m>)
]
[
(asm_input:DI ("p") rip.c:5)
]
[] rip.c:5)
(clobber (reg:CC 17 flags))
]) "rip.c":5:3 -1
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))

vs:

(insn:TI 5 2 13 2 (parallel [
(asm_operands/v ("# %0") ("") 0 [
(mem/c:SI (symbol_ref:DI ("m") [flags 0x2]
<var_decl 0x7f3175011bd0 m>) [1 m+0 S4 A32])
]
[
(asm_input:SI ("m") rip.c:10)
]
[] rip.c:10)
(clobber (reg:CC 17 flags))
]) "rip.c":10:3 -1
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))

The first argument is internally regarded as "constant":

-- Macro: CONSTANT_P (X)
'CONSTANT_P', which is defined by target-independent code, accepts
integer-valued expressions whose values are not explicitly known,
such as 'symbol_ref', 'label_ref', and 'high' expressions and
'const' arithmetic expressions, in addition to 'const_int' and
'const_double' expressions.

So, it should not have any dependency.

Perhaps a testcase should be created and posted to gcc-bugs for
further analysis.
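
Something along these lines could serve as a minimal testcase (only a
sketch with made-up names, not yet verified against current gcc),
checking whether a "p" input that carries nothing but an address still
forces the asm to be redone across a compiler barrier:

--cut here--
int m;

static inline int *addr_of_m (void)
{
  int *p;

  /* Only the address of m is passed in via "p"; ideally this should
     count as an address dependency, not a memory read.  */
  asm ("lea %a1, %0" : "=r" (p) : "p" (&m));
  return p;
}

int foo (void)
{
  int a = *addr_of_m ();
  asm volatile ("" ::: "memory");   /* compiler barrier */
  int b = *addr_of_m ();            /* is the lea emitted again here? */
  return a + b;
}
--cut here--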

Uros.

2023-10-12 18:01:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Josh Poimboeuf <[email protected]> wrote:

> On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote:
> >
> > * Josh Poimboeuf <[email protected]> wrote:
> >
> > > Though, another problem is that .text has a crazy amount of padding
> > > which makes it always the same size, due to the SRSO alias mitigation
> > > alignment linker magic. We should fix that somehow.
> >
> > We could emit a non-aligned end-of-text symbol (we might have it already),
> > and have a script or small .c program in scripts/ or tools/ that looks
> > at vmlinux and displays a user-friendly and accurate list of text and
> > data sizes in the kernel?
> >
> > And since objtool is technically an 'object files tool', and it already
> > looks at sections & symbols, it could also grow a:
> >
> > objtool size <objfile>
> >
> > command that does the sane thing ... I'd definitely start using that, instead of 'size'.
> >
> > /me runs :-)
>
> Yeah, that's actually not a bad idea.
>
> I had been thinking a "simple" script would be fine, but I'm realizing
> the scope of this thing could grow over time. In which case a script is
> less than ideal. And objtool already has the ability to do this pretty
> easily.

Yeah, and speed actually matters here: I have scripts that generate object
comparisons between commits, and every second of runtime counts - and a
script would be slower and more fragile for something like allmodconfig
builds or larger distro configs.

BTW., maybe the right objtool subcommand would be 'objtool sections', with
an 'objtool sections size' sub-sub-command. Because I think this discussion
shows that it would be good to have a bit of visibility into the sanity of
our sections setup, with 'objtool sections check' for example doing a
sanity check on whether there's anything extra in the text section that
shouldn't be there? Or so ...

Thanks,

Ingo

2023-10-12 18:02:12

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
> <[email protected]> wrote:
> >
> > The fix seems to be a simple one-liner, ie just
> >
> > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
>
> Nope. That doesn't work at all.
>
> It turns out that we're not the only ones that didn't know about the
> 'a' modifier.
>
> clang has also never heard of it in this context, and the above
> one-liner results in an endless sea of errors, with
>
> error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'
>
> Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
> supposed to handle these things, and while it does have some handling
> for 'a', the comment around it says
>
> case 'a': // This is an address. Currently only 'i' and 'r' are expected.
>
> and I think our use ends up just confusing the heck out of clang. Of
> course, clang also does this:
>
> case 'P': // This is the operand of a call, treat specially.
> PrintPCRelImm(MI, OpNo, O);
> return false;
>
> so clang *already* generates those 'current' accesses as PCrelative, and I see
>
> movq %gs:pcpu_hot(%rip), %r13
>
> in the generated code.
>
> End result: clang actually generates what we want just using 'P', and
> the whole "P vs a" is only a gcc thing.

Ugh, this isn't exactly following Clang's claim that "In general,
Clang is highly compatible with the GCC inline assembly extensions,
allowing the same set of constraints, modifiers and operands as GCC
inline assembly."

[1] https://clang.llvm.org/compatibility.html#inline-asm

> Why *does* gcc do that silly thing of dropping '(%rip)' from the address, btw?

The documentation says:

[p] Print raw symbol name (without syntax-specific prefixes).

[P] If used for a function, print the PLT suffix and generate PIC
code. For example, emit foo@PLT instead of ’foo’ for the function
foo(). If used for a constant, drop all syntax-specific prefixes and
issue the bare constant. See p above.

I'd say that "bare constant" is something without (%rip).

Uros.

2023-10-12 19:34:30

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 12, 2023, at 8:16 PM, Linus Torvalds <[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 08:19, Nadav Amit <[email protected]> wrote:
>>
>> +/*
>> + * Hold a constant alias for current_task, which would allow to avoid caching of
>> + * current task.
>> + *
>> + * We must mark const_current_task with the segment qualifiers, as otherwise gcc
>> + * would do redundant reads of const_current_task.
>> + */
>> +DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot);
>
> Hmm. The only things I'm not super-happy about with your patch is
>
> (a) it looks like this depends on the alias analysis knowing that the
> __seg_gs isn't affected by normal memory ops. That implies that this
> will not work well with compiler versions that don't do that?
>
> (b) This declaration doesn't match the other one. So now there are
> two *different* declarations for const_pcpu_hot, which I really don't
> like.
>
> That second one would seem to be trivial to just fix (or maybe not,
> and you do it that way for some horrible reason).

If you refer to the difference between DECLARE_PER_CPU_ALIGNED() and
DECLARE_PER_CPU() - that’s just a silly mistake that I made porting my
old patch (I also put “const” in the wrong place of the declaration, sorry).

>
> The first one sounds bad to me - basically making the *reason* for
> this patch go away - but maybe the compilers that don't support
> address spaces are so rare that we can ignore it.

As far as I understand it has nothing to do with the address spaces, and IIRC
the compiler does not regard gs/fs address spaces as independent from the main
one. That’s the reason a compiler barrier affects regular loads with __seg_gs.

The “trick” that the patch does is to expose a new const_pcpu_hot symbol that has
a “const” qualifier. For compilation units in which the symbol is effectively
constant, we use const_pcpu_hot. The compiler then knows that the value would not
change.

Later, when we actually define the const_pcpu_hot, we tell the compiler using
__attribute__((alias("pcpu_hot"))) that this symbol is actually an alias to pcpu_hot.
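
For illustration, a stripped-down sketch of the mechanism with plain,
made-up globals (leaving out the per-CPU segment details entirely):

--cut here--
struct task_struct;
struct pcpu_hot { struct task_struct *current_task; };

/* the real, writable object (defined in exactly one translation unit) */
struct pcpu_hot pcpu_hot;

/* a const-qualified second name for the same storage; readers that go
   through this name let the compiler assume the value never changes,
   so it can be cached across calls and compiler barriers */
extern const struct pcpu_hot const_pcpu_hot
        __attribute__((alias("pcpu_hot")));
--cut here--

Other compilation units only ever see the two extern declarations and
pick one name or the other depending on whether they modify the field.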

Although it is a bit of a trick that I have never seen elsewhere, I don’t see it
violating GCC specifications (“except for top-level qualifiers the alias target
must have the same type as the alias” [1]), and there is nothing that is specific
to the gs address-space. I still have the concern of its interaction with LTO
though, and perhaps using “-fno-lto” when compiling compilation units that
modify current (e.g., arch/x86/kernel/process_64.o) is necessary.

I hope it makes sense.

[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html


Attachments:
signature.asc (849.00 B)
Message signed with OpenPGP

2023-10-12 19:41:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 12:32, Nadav Amit <[email protected]> wrote:
>
> If you refer to the difference between DECLARE_PER_CPU_ALIGNED() and
> DECLARE_PER_CPU() - that’s just a silly mistake that I made porting my
> old patch (I also put “const” in the wrong place of the declaration, sorry).

Yes, the only difference is the ALIGNED and the 'const', but I think
also the alias attribute.

However, I'd be happier if we had just one place that declares them, not two.

Even if the two were identical, it seems wrong to have two
declarations for the same thing.

> The “trick” that the patch does is to expose a new const_pcpu_hot symbol that has
> a “const” qualifier. For compilation units from which the symbol is effectively
> constant, we use const_pcpu_hot. The compiler then knows that the value would not
> change.

Oh, I don't disagree with that part.

I just don't see why the 'asm' version would have any difference. For
that too, the compiler should see that the result of the asm doesn't
change.

So my confusion / worry is not about the const alias. I like that part.

My worry is literally "in other situations we _have_ to use asm(), and
it's not clear why gcc wouldn't do as well for it".

Linus

2023-10-12 21:31:15

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 07:59:43PM +0200, Ingo Molnar wrote:
>
> * Josh Poimboeuf <[email protected]> wrote:
>
> > On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote:
> > >
> > > * Josh Poimboeuf <[email protected]> wrote:
> > >
> > > > Though, another problem is that .text has a crazy amount of padding
> > > > which makes it always the same size, due to the SRSO alias mitigation
> > > > alignment linker magic. We should fix that somehow.
> > >
> > > We could emit a non-aligned end-of-text symbol (we might have it already),
> > > and have a script or small .c program in scripts/ or tools/ that looks
> > > at vmlinux and displays a user-friendly and accurate list of text and
> > > data sizes in the kernel?
> > >
> > > And since objtool is technically an 'object files tool', and it already
> > > looks at sections & symbols, it could also grow a:
> > >
> > > objtool size <objfile>
> > >
> > > command that does the sane thing ... I'd definitely start using that, instead of 'size'.
> > >
> > > /me runs :-)
> >
> > Yeah, that's actually not a bad idea.
> >
> > I had been thinking a "simple" script would be fine, but I'm realizing
> > the scope of this thing could grow over time. In which case a script is
> > less than ideal. And objtool already has the ability to do this pretty
> > easily.
>
> Yeah, and speed actually matters here: I have scripts that generate object
> comparisons between commits, and every second of runtime counts - and a
> script would be slower and more fragile for something like allmodconfig
> builds or larger disto configs.

Ah, good to know.

> BTW., maybe the right objtool subcommand would be 'objtool sections', with
> an 'objtool sections size' sub-sub-command. Because I think this discussion
> shows that it would be good to have a bit of visibility into the sanity of
> our sections setup, with 'objtool sections check' for example doing a
> sanity check on whether there's anything extra in the text section that
> shouldn't be there? Or so ...

What would be an example of something "extra"? A sanity check might fit
better alongside the other checks already being done by the main objtool
"subcommand" which gets run by the kernel build.

BTW, I actually removed subcommands a while ago when I overhauled
objtool's interface to make it easier to combine options. That said,
I'm not opposed to re-adding them if we can find a sane way to do so.

Here's the current interface:

Usage: objtool <actions> [<options>] file.o

Actions:
-h, --hacks[=<jump_label,noinstr,skylake>]
patch toolchain bugs/limitations
-i, --ibt validate and annotate IBT
-l, --sls validate straight-line-speculation mitigations
-m, --mcount annotate mcount/fentry calls for ftrace
-n, --noinstr validate noinstr rules
-o, --orc generate ORC metadata
-r, --retpoline validate and annotate retpoline usage
-s, --stackval validate frame pointer rules
-t, --static-call annotate static calls
-u, --uaccess validate uaccess rules for SMAP
--cfi annotate kernel control flow integrity (kCFI) function preambles
--dump[=<orc>] dump metadata
--prefix <n> generate prefix symbols
--rethunk validate and annotate rethunk usage
--unret validate entry unret placement

Options:
-v, --verbose verbose warnings
--backtrace unwind on error
--backup create .orig files before modification
--dry-run don't write modifications
--link object is a linked object
--mnop nop out mcount call sites
--module object is part of a kernel module
--no-unreachable skip 'unreachable instruction' warnings
--sec-address print section addresses in warnings
--stats print statistics


Note how all the actions can be easily combined in a single execution
instance.

If we re-added subcommands, most of the existing functionality would be
part of a single subcommand. It used to be called "check", but it's no
longer a read-only operation so that's misleading. I'll call it "run"
for now.

Right now my preference would be to leave the existing interface as-is,
and then graft optional subcommands on top. If no subcommand is
specified then it would default to the "run" subcommand. It's a little
funky, but it would work well for the common case, where ~99% of the
functionality lives. And it doesn't break existing setups and
backports.

For example:

# current interface (no changes)
objtool --mcount --orc --retpoline --uaccess vmlinux.o

# same, with optional explicit "run" subcommand
objtool run --mcount --orc --retpoline --uaccess vmlinux.o

# new "size" subcommand
objtool size [options] vmlinux.o.before vmlinux.o.after

--
Josh

2023-10-13 09:40:09

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 8:01 PM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > The fix seems to be a simple one-liner, ie just
> > >
> > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> > > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
> >
> > Nope. That doesn't work at all.
> >
> > It turns out that we're not the only ones that didn't know about the
> > 'a' modifier.
> >
> > clang has also never heard of it in this context, and the above
> > one-liner results in an endless sea of errors, with
> >
> > error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'
> >
> > Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
> > supposed to handle these things, and while it does have some handling
> > for 'a', the comment around it says
> >
> > case 'a': // This is an address. Currently only 'i' and 'r' are expected.
> >
> > and I think our use ends up just confusing the heck out of clang. Of
> > course, clang also does this:
> >
> > case 'P': // This is the operand of a call, treat specially.
> > PrintPCRelImm(MI, OpNo, O);
> > return false;
> >
> > so clang *already* generates those 'current' accesses as PCrelative, and I see
> >
> > movq %gs:pcpu_hot(%rip), %r13
> >
> > in the generated code.
> >
> > End result: clang actually generates what we want just using 'P', and
> > the whole "P vs a" is only a gcc thing.
>
> Ugh, this isn't exactly following Clang's claim that "In general,
> Clang is highly compatible with the GCC inline assembly extensions,
> allowing the same set of constraints, modifiers and operands as GCC
> inline assembly."

For added fun I obtained some old clang:

$ clang --version
clang version 11.0.0 (Fedora 11.0.0-3.fc33)

and tried to compile this:

int m;
__seg_gs int n;

void foo (void)
{
  asm ("# %a0 %a1" :: "p" (&m), "p" (&n));
  asm ("# %P0 %P1" :: "p" (&m), "p" (&n));
}

clang-11:

# m n
# m n

clang-11 -fpie:

# m(%rip) n(%rip)
# m n

clang-11 -m32:

# m n
# m n

gcc:

# m(%rip) n(%rip)
# m n

gcc -fpie:

# m(%rip) n(%rip)
# m n

gcc -m32:

# m n
# m n

Please find attached a patch that should bring some order to this
issue. The patch includes two demonstration sites: the generated code
for mem_encrypt_identity.c does not change, while the change in
percpu.h brings the expected 4kB code size reduction.

Uros.


Attachments:
memref.diff.txt (2.55 kB)

2023-10-13 10:52:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Josh Poimboeuf <[email protected]> wrote:

> Right now my preference would be to leave the existing interface as-is,
> and then graft optional subcommands on top. If no subcommand is
> specified then it would default to the "run" subcommand. It's a little
> funky, but it would work well for the common case, where ~99% of the
> functionality lives. And it doesn't break existing setups and
> backports.
>
> For example:
>
> # current interface (no changes)
> objtool --mcount --orc --retpoline --uaccess vmlinux.o
>
> # same, with optional explicit "run" subcommand
> objtool run --mcount --orc --retpoline --uaccess vmlinux.o
>
> # new "size" subcommand
> obtool size [options] vmlinux.o.before vmlinux.o.after

Yeah, sounds good!

Thanks,

Ingo

2023-10-13 11:54:00

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Fri, Oct 13, 2023 at 11:38 AM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 8:01 PM Uros Bizjak <[email protected]> wrote:
> >
> > On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > The fix seems to be a simple one-liner, ie just
> > > >
> > > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> > > > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
> > >
> > > Nope. That doesn't work at all.
> > >
> > > It turns out that we're not the only ones that didn't know about the
> > > 'a' modifier.
> > >
> > > clang has also never heard of it in this context, and the above
> > > one-liner results in an endless sea of errors, with
> > >
> > > error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'
> > >
> > > Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
> > > supposed to handle these things, and while it does have some handling
> > > for 'a', the comment around it says
> > >
> > > case 'a': // This is an address. Currently only 'i' and 'r' are expected.
> > >
> > > and I think our use ends up just confusing the heck out of clang. Of
> > > course, clang also does this:
> > >
> > > case 'P': // This is the operand of a call, treat specially.
> > > PrintPCRelImm(MI, OpNo, O);
> > > return false;
> > >
> > > so clang *already* generates those 'current' accesses as PCrelative, and I see
> > >
> > > movq %gs:pcpu_hot(%rip), %r13
> > >
> > > in the generated code.
> > >
> > > End result: clang actually generates what we want just using 'P', and
> > > the whole "P vs a" is only a gcc thing.

Maybe we should go with what Clang expects. %a with "i" constraint is
also what GCC handles, because

‘i’: An immediate integer operand (one with constant value) is
allowed. This includes symbolic constants whose values will be known
only at assembly time or later.

The attached patch covers both cases: the generated code for
mem_encrypt_identity.c does not change, while the change in percpu.h
brings the expected 4kB code size reduction. I think this is the
correct solution that will work for both compilers.
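
For reference, the pattern boils down to something like this minimal
sketch (the earlier testcase adjusted, not the attached patch itself):

--cut here--
int m;

void foo (void)
{
  /* "%a0" asks for operand 0 to be printed as an address; with an "i"
     (symbolic constant) input both gcc and clang accept the modifier,
     where the earlier "%P0" form was gcc-only.  */
  asm ("# %a0" :: "i" (&m));
}
--cut here--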

Uros.


Attachments:
memref-2.diff.txt (1.97 kB)

2023-10-13 16:39:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Fri, 13 Oct 2023 at 04:53, Uros Bizjak <[email protected]> wrote:
>
> Maybe we should go with what Clang expects. %a with "i" constraint is
> also what GCC handles, because
>
> ‘i’: An immediate integer operand (one with constant value) is
> allowed. This includes symbolic constants whose values will be known
> only at assembly time or later.

This looks fine to me, and would seem to be the simplest way to have
both gcc and clang be happy with things.

All these uses seem to be immediate addresses, as any actual dynamic
ones will use a proper "m" constraint (and no operand modifiers).

Linus

2023-10-16 18:53:19

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
>
>
> > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> >
> > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> >>
> >> Reading the above, it looks to me that we don't want to play games
> >> with "const aliased" versions of current_task [1], as proposed by
> >> Nadav in his patch series.
> >
> > Well, maybe I'd like it if I saw what the effect of it was, but that
> > patch mentions "sync_mm_rss()" which doesn't actually exist
> > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > into percpu_counter")
>
> So I added a new version of the current aliasing (well, actually pcpu_hot
> in the new version) on top of Uros’s patches, and the effect can be seen
> in many functions. I don’t want to bother with many examples so here is
> a common and simple one:
>
> Currently syscall_exit_work() that starts with:
>
> 0xffffffff8111e120 <+0>: push %rbp
> 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> 0xffffffff8111e124 <+4>: push %rbx
> 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> 0xffffffff8111e128 <+8>: and $0x20,%esi
> 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
>
> Using the const-alias changes the beginning of syscall_exit_work to:
>
> 0xffffffff8111cb80 <+0>: push %r12
> 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> 0xffffffff8111cb8a <+10>: push %rbp
> 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> 0xffffffff8111cb8e <+14>: push %rbx
> 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> 0xffffffff8111cb92 <+18>: and $0x20,%esi
> 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
>
> So we both see RIP-relative addressing is being used (hence the instruction is
> one byte shorter) and the reload going away.
>
> Now, I am not a compiler expert as for the rationale, but googling around
> I can see Nick explaining the rationale [1] - if you use “p” you read memory.
> BTW: It is related to discussion you had [2], in which you encountered an issue
> I also encountered before [3]. My bad for pushing it in.
>
> Anyhow, I created a similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 )
> to show this behavior - how compiler barriers cause reload. It seems that this
> behavior happens on GCC and CLANG on various versions.
>
> The idea behind the patch is that it communicates - in the compilation unit
> granularity - that current is fixed. There is an issue of whether it works with
> LTO, which I have never checked.
>
>
> [1] https://reviews.llvm.org/D145416
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://lore.kernel.org/all/[email protected]/
>
> --
>
> Here’s the updated patch - but I didn’t really boot a machine with it so new
> issues might have come since my last patch-set:

Unfortunately, it does not work and dies early in the boot with:

[ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 4.940090] #PF: supervisor write access in kernel mode
[ 4.940090] #PF: error_code(0x0002) - not-present page
[ 4.940090] PGD 0 P4D 0
[ 4.940090] Oops: 0002 [#1] PREEMPT SMP PTI
[ 4.940090] CPU: 1 PID: 52 Comm: kworker/u4:1 Not tainted
6.6.0-rc6-00365-g0c09c1d70838-dirty #7
[ 4.940090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.2-1.fc37 04/01/2014
[ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0> 41 ff 0c 24 0f
85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
[ 4.940090] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
[ 4.940090] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
[ 4.940090] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
[ 4.940090] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 4.940090] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 4.940090] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
[ 4.940090] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
knlGS:0000000000000000
[ 4.940090] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.940090] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
[ 4.940090] Call Trace:
[ 4.940090] <TASK>
[ 4.940090] ? __die+0x1e/0x60
[ 4.940090] ? page_fault_oops+0x17b/0x470
[ 4.940090] ? search_module_extables+0x14/0x50
[ 4.940090] ? exc_page_fault+0x66/0x140
[ 4.940090] ? asm_exc_page_fault+0x26/0x30
[ 4.940090] ? begin_new_exec+0x8f2/0xa30
[ 4.940090] ? begin_new_exec+0x3ce/0xa30
[ 4.940090] ? load_elf_phdrs+0x67/0xb0
[ 4.940090] load_elf_binary+0x2bb/0x1770
[ 4.940090] ? __kernel_read+0x136/0x2d0
[ 4.940090] bprm_execve+0x277/0x630
[ 4.940090] kernel_execve+0x145/0x1a0
[ 4.940090] call_usermodehelper_exec_async+0xcb/0x180
[ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.940090] ret_from_fork+0x2f/0x50
[ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.940090] ret_from_fork_asm+0x1b/0x30
[ 4.940090] </TASK>
[ 4.940090] Modules linked in:
[ 4.940090] CR2: 0000000000000000
[ 5.017606] ---[ end trace 0000000000000000 ]---
[ 5.018957] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 5.020299] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9
48 fb
[ 5.024765] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
[ 5.026150] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
[ 5.027916] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
[ 5.029714] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 5.031461] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 5.033186] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
[ 5.034908] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
knlGS:0000000000000000
[ 5.036907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.038341] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
[ 5.040044] Kernel panic - not syncing: Fatal exception
[ 5.040647] Kernel Offset: 0x22e00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)

It looks like aliasing a structure from another namespace is a no-go,
since the patch (attached, slightly changed your patch) without
__percpu_seg_override decorations bootstraps OK. The working patch
(without __percpu_seg_override) is not effective (no effect in
syscall_exit_work) and increases the number of current_task reads from
3841 to 4711.

Uros.


Attachments:
p.diff.txt (2.31 kB)

2023-10-16 19:25:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 11:53, Uros Bizjak <[email protected]> wrote:
>
> Unfortunately, it does not work and dies early in the boot with:

Side note: build the kernel with debug info (the limited form is
sufficient), and then run oopses through

./scripts/decode_stacktrace.sh

to get much nicer oops information that has line numbers and inlining
information in the backtrace.

> [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30
> [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
> ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
> fa ff ff <f0> 41 ff 0c 24 0f
> 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb

That decodes to

0: 31 f6 xor %esi,%esi
2: e8 c1 49 f9 ff call 0xfffffffffff949c8
7: e9 3c fa ff ff jmp 0xfffffffffffffa48
c: 31 f6 xor %esi,%esi
e: 4c 89 ef mov %r13,%rdi
11: e8 b2 4a f9 ff call 0xfffffffffff94ac8
16: e9 19 fa ff ff jmp 0xfffffffffffffa34
1b: 31 f6 xor %esi,%esi
1d: 4c 89 ef mov %r13,%rdi
20: e8 23 4a f9 ff call 0xfffffffffff94a48
25: e9 ea fa ff ff jmp 0xfffffffffffffb14
2a:* f0 41 ff 0c 24 lock decl (%r12) <-- trapping instruction
2f: 0f 85 55 fb ff ff jne 0xfffffffffffffb8a
35: 4c 89 e7 mov %r12,%rdi
38: e8 4b 02 df ff call 0xffffffffffdf0288

but without a nicer backtrace it's nasty to guess where this is.

The "lock decl ; jne" is a good hint, though - that sequence is most
definitely "atomic_dec_and_test()".

And that in turn means that it's almost certainly mmdrop(), which is

        if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                __mmdrop(mm);

where that

35: 4c 89 e7 mov %r12,%rdi
38: e8 4b 02 df ff call 0xffffffffffdf0288

is exactly the unlikely "__mmdrop(mm)" part (and gcc decided to make
the likely branch a branch-out for some reason - presumably with the
inlining the code around it meant that was the better layout - maybe
this was all inside another "unlikely()" branch).

And if I read that right, this has all been inlined from
begin_new_exec() -> exec_mmap() -> mmdrop_lazy_tlb().

Now, how and why 'mm' would be NULL in that path, and why any
'current' reloading optimization would matter in this all I very much
can't see. The call site in begin_new_exec() is

        /*
         * Release all of the old mmap stuff
         */
        acct_arg_size(bprm, 0);
        retval = exec_mmap(bprm->mm);
        if (retval)
                goto out;

        bprm->mm = NULL;

and "bprm->mm" is most definitely non-NULL there because we earlier did

So I suspect the problem happened much earlier, caused some nasty
internal corruption, and the odd 'mm is NULL' is just a symptom.

retval = set_mm_exe_file(bprm->mm, bprm->file);

using it, and that would have oopsed had bprm->mm been NULL then.

So there's some serious corruption there, but from the oops itself I
can't tell the source. I guess if we get 'current' wrong anywhere, all
bets are off.

Linus

2023-10-16 20:35:45

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 16, 2023, at 10:24 PM, Linus Torvalds <[email protected]> wrote:
>
> So there's some serious corruption there, but from the oops itself I
> can't tell the source. I guess if we get 'current' wrong anywhere, all
> bets are off.

I don’t think it means that the aliasing does not work; I think it all means that
it actually works *too well*.

I have encountered several such issues before [1], and while some have been fixed,
some have not (I looked at switch_fpu_finish()), and might under the right/wrong
circumstances use the wrongly-“cached” current. Moreover, perhaps new problems
have been added since my old patch.

Perhaps the whack-a-mole approach that I took in [1] is wrong. Instead, perhaps it
is better to use an uncached version of current when compiling
arch/x86/kernel/process_64.c and arch/x86/kernel/process_32.c , so they would not
use the const alias, but instead use the original (non-const) one. Some macro
magic and an additional "-D” build flag can do that.
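
A rough sketch of what I have in mind (flag name and exact plumbing
made up, untested):

--cut here--
/* arch/x86/kernel/Makefile (hypothetical flag):
 *      CFLAGS_process_64.o += -DARCH_MODIFIES_CURRENT
 *      CFLAGS_process_32.o += -DARCH_MODIFIES_CURRENT
 */

/* arch/x86/include/asm/current.h */
static __always_inline struct task_struct *get_current(void)
{
#ifdef ARCH_MODIFIES_CURRENT
        /* files that write pcpu_hot.current_task must not use the
           const alias, or the compiler may cache a stale value */
        return this_cpu_read_stable(pcpu_hot.current_task);
#else
        return const_pcpu_hot.current_task;
#endif
}
--cut here--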

But first, there is a need to confirm that’s actually the problem. I’ll try to do
it tomorrow.

Regards,
Nadav


[1] https://lore.kernel.org/all/[email protected]/

2023-10-16 21:00:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 13:35, Nadav Amit <[email protected]> wrote:
>
> I don’t think it means that it the aliasing does not work; I think it all means that
> it actually works *too well*.

Hmm. The *only* case I can think of is __switch_to() itself when doing that

raw_cpu_write(pcpu_hot.current_task, next_p);

and arguably that part should never have been done in C at all, but here we are.

I do think that from a sanity standpoint, it would be good to split
"__switch_to()" into two: a "before stack switch" and "after stack
switch", and change 'current' from within the asm section (ie do that
part in __switch_to_asm").

Or maybe we should just keep the single "__switch_to()" function, but
make it clear that it happens *after* current has been changed (and
maybe rename it to "finish_switch_to()" or something like that to make
it clearer).

Because the situation with __switch_to() right now is outright
confusing, where it has 'prevp' and 'nextp', but then implicitly uses
'current' in two different ways.

Ugh.

Linus

2023-10-16 21:11:03

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, Oct 16, 2023 at 8:52 PM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
> >
> >
> > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> > >
> > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> > >>
> > >> Reading the above, it looks to me that we don't want to play games
> > >> with "const aliased" versions of current_task [1], as proposed by
> > >> Nadav in his patch series.
> > >
> > > Well, maybe I'd like it if I saw what the effect of it was, but that
> > > patch mentions "sync_mm_rss()" which doesn't actually exist
> > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > > into percpu_counter")
> >
> > So I added a new version of the current aliasing (well, actually pcpu_hot
> > in the new version) on top of Uros’s patches, and the effect can be seen
> > in many functions. I don’t want to bother with many examples so here is
> > a common and simple one:
> >
> > Currently syscall_exit_work() that starts with:
> >
> > 0xffffffff8111e120 <+0>: push %rbp
> > 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> > 0xffffffff8111e124 <+4>: push %rbx
> > 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> > 0xffffffff8111e128 <+8>: and $0x20,%esi
> > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
> >
> > Using the const-alias changes the beginning of syscall_exit_work to:
> >
> > 0xffffffff8111cb80 <+0>: push %r12
> > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> > 0xffffffff8111cb8a <+10>: push %rbp
> > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> > 0xffffffff8111cb8e <+14>: push %rbx
> > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> > 0xffffffff8111cb92 <+18>: and $0x20,%esi
> > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
> >
> > So we both see RIP-relative addressing is being used (hence the instruction is
> > one byte shorter) and the reload going away.
> >
> > Now, I am not a compiler expert as for the rationale, but it googling around
> > I can see Nick explaining the rationale [1] - if you use “p” your read memory.
> > BTW: It is related to discussion you had [2], in which you encountered an issue
> > I also encountered before [3]. My bad for pushing it in.
> >
> > Anyhow, I created a similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 )
> > to show this behavior - how compiler barriers cause reload. It seems that this
> > behavior happens on GCC and CLANG on various versions.
> >
> > The idea behind the patch is that it communicates - in the compilation unit
> > granularity - that current is fixed. There is an issue of whether it works with
> > LTO, which I have never checked.
> >
> >
> > [1] https://reviews.llvm.org/D145416
> > [2] https://lore.kernel.org/lkml/[email protected]/
> > [3] https://lore.kernel.org/all/[email protected]/
> >
> > --
> >
> > Here’s the updated patch - but I didn’t really boot a machine with it so new
> > issues might have come since my last patch-set:
>
> Unfortunately, it does not work and dies early in the boot with:
>
> [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 4.940090] #PF: supervisor write access in kernel mode
> [ 4.940090] #PF: error_code(0x0002) - not-present page
> [ 4.940090] PGD 0 P4D 0
> [ 4.940090] Oops: 0002 [#1] PREEMPT SMP PTI
> [ 4.940090] CPU: 1 PID: 52 Comm: kworker/u4:1 Not tainted
> 6.6.0-rc6-00365-g0c09c1d70838-dirty #7
> [ 4.940090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.16.2-1.fc37 04/01/2014
> [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30
> [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
> ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
> fa ff ff <f0> 41 ff 0c 24 0f
> 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
> [ 4.940090] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
> [ 4.940090] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
> [ 4.940090] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
> [ 4.940090] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
> [ 4.940090] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
> [ 4.940090] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
> [ 4.940090] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
> knlGS:0000000000000000
> [ 4.940090] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4.940090] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
> [ 4.940090] Call Trace:
> [ 4.940090] <TASK>
> [ 4.940090] ? __die+0x1e/0x60
> [ 4.940090] ? page_fault_oops+0x17b/0x470
> [ 4.940090] ? search_module_extables+0x14/0x50
> [ 4.940090] ? exc_page_fault+0x66/0x140
> [ 4.940090] ? asm_exc_page_fault+0x26/0x30
> [ 4.940090] ? begin_new_exec+0x8f2/0xa30
> [ 4.940090] ? begin_new_exec+0x3ce/0xa30
> [ 4.940090] ? load_elf_phdrs+0x67/0xb0
> [ 4.940090] load_elf_binary+0x2bb/0x1770
> [ 4.940090] ? __kernel_read+0x136/0x2d0
> [ 4.940090] bprm_execve+0x277/0x630
> [ 4.940090] kernel_execve+0x145/0x1a0
> [ 4.940090] call_usermodehelper_exec_async+0xcb/0x180
> [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.940090] ret_from_fork+0x2f/0x50
> [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.940090] ret_from_fork_asm+0x1b/0x30
> [ 4.940090] </TASK>
> [ 4.940090] Modules linked in:
> [ 4.940090] CR2: 0000000000000000
> [ 5.017606] ---[ end trace 0000000000000000 ]---
> [ 5.018957] RIP: 0010:begin_new_exec+0x8f2/0xa30
> [ 5.020299] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
> ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
> fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9
> 48 fb
> [ 5.024765] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
> [ 5.026150] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
> [ 5.027916] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
> [ 5.029714] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
> [ 5.031461] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
> [ 5.033186] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
> [ 5.034908] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
> knlGS:0000000000000000
> [ 5.036907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5.038341] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
> [ 5.040044] Kernel panic - not syncing: Fatal exception
> [ 5.040647] Kernel Offset: 0x22e00000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> It looks that aliasing a structure from another namespace is a no-go,
> since the patch (attached, slightly changed your patch) without
> __percpu_seg_override decorations bootstraps OK. The working patch
> (without __percpu_seg_override) is not effective (no effect in
> syscall_exit_work) and increases the number of current_task reads from
> 3841 to 4711.

Forgot to say that the "nonworking" patch reduces the number of
current_task reads to 3221.

+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+static __always_inline struct task_struct *get_current(void)
+{
+ return this_cpu_read_const(const_pcpu_hot.current_task);
+}

FWIW, const_pcpu_hot is in __seg_gs space when decorated with
__percpu_seg_override, so plain "return const_pcpu_hot.current_task;"
would work here, too.

Uros.

2023-10-16 23:02:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 13:35, Nadav Amit <[email protected]> wrote:
>
> I have encountered several such issues before [1], and while some have been fixed,
> some have not (I looked at switch_fpu_finish()), and might under the right/wrong
> circumstances use the wrongly-“cached” current. Moreover, perhaps new problems
> have been added since my old patch.

Yeah, that fpu switching is disgusting and borderline buggy. And yes,
it would trigger problems when caching the value of 'current'.

I don't particularly love the patch you pointed at, because it seems
to have only fixed the switch_fpu_finish() case, which is the one that
presumably triggered issues, but that's not a very pretty fix.

switch_fpu_prepare() has the exact same problem, and in fact is likely
the *source* of the issue, because that's the original "load current"
that then ends up being cached incorrectly later in __switch_to().

The whole

struct fpu *prev_fpu = &prev->fpu;

thing in __switch_to() is pretty ugly. There's no reason why we should
look at that 'prev_fpu' pointer there, or pass it down.

And it only generates worse code, in how it loads 'current' when
__switch_to() has the right task pointers.

So the attached patch is, I think, the right thing to do. It may not
be the *complete* fix, but at least for the config I tested, this
makes all loads of 'current' go away in the resulting generated
assembly, and the only access to '%gs:pcpu_hot(%rip)' is the write to
update it:

movq %rbx, %gs:pcpu_hot(%rip)

from that

raw_cpu_write(pcpu_hot.current_task, next_p);

code.

Thomas, I think you've touched this code last, but even that isn't
very recent. The attached patch not only cleans this up, it actually
generates better code too:

(a) it removes one push/pop pair at entry/exit because there's one
less register used (no 'current')

(b) it removes that pointless load of 'current' because it just uses
the right argument:

- #APP
- movq %gs:pcpu_hot(%rip), %r12
- #NO_APP
- testq $16384, (%r12)
+ testq $16384, (%rdi)

so I think this is the right thing to do. I checked that the 32-bit
code builds and looks sane too.

I do think the 'old/new' naming in the FPU code should probably be
'prev/next' to match the switch_to() naming, but I didn't do that.

Comments?

Linus


Attachments:
patch.diff (3.60 kB)

2023-10-16 23:15:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 16:02, Linus Torvalds
<[email protected]> wrote:
>
> so I think this is the right thing to do. I checked that the 32-bit
> code builds and looks sane too.

Just to clarify: the 64-bit side I actually booted and am running.

The 32-bit side is pretty much identical, but I only checked that that
'process_32.c' file still builds. I didn't do any other testing.

Linus

2023-10-17 07:24:05

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 17, 2023, at 2:02 AM, Linus Torvalds <[email protected]> wrote:
>
> - #APP
> - movq %gs:pcpu_hot(%rip), %r12
> - #NO_APP
> - testq $16384, (%r12)
> + testq $16384, (%rdi)
>
> so I think this is the right thing to do. I checked that the 32-bit
> code builds and looks sane too.
>
> I do think the 'old/new' naming in the FPU code should probably be
> 'prev/next' to match the switch_to() naming, but I didn't do that.
>
> Comments?

Yes, the FPU issue is the one that caused me to crash before. I indeed missed
the switch_fpu_prepare(). The other issue that I encountered before, with
__resctrl_sched_in() has already been taken care of.

It would have been nice to somehow prevent such a thing from reoccurring.
Presumably objtool could have ensured it is so. But anyhow, I do not know of
any other currently open issues.

This whole thing (in addition to Uros’s analysis and objdump numbers) shows
that the const-alias allows much more aggressive optimizations than the
current this_cpu_read_stable().
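
A self-contained toy example of the difference (made-up names, gcc
with __seg_gs support assumed):

--cut here--
struct task_struct;
struct pcpu_hot { struct task_struct *current_task; };

extern struct pcpu_hot pcpu_hot;
extern __seg_gs const struct pcpu_hot const_pcpu_hot;

/* this_cpu_read_stable() style: an opaque asm that the compiler can
   only CSE within a basic block and has to redo after a "memory"
   clobber, because of the "m" input */
static inline struct task_struct *cur_asm (void)
{
  struct task_struct *p;

  asm ("movq %%gs:%1, %0" : "=r" (p) : "m" (pcpu_hot.current_task));
  return p;
}

/* const-alias style: an ordinary load from a const object, which the
   compiler is free to keep in a register across barriers and calls */
static inline struct task_struct *cur_const (void)
{
  return const_pcpu_hot.current_task;
}
--cut here--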

2023-10-17 19:01:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 17 Oct 2023 at 00:23, Nadav Amit <[email protected]> wrote:
>
> Yes, the FPU issue is the one that caused me to crash before.

Uros, can you verify whether that patch of mine resolves the issue you saw?

That patch is _technically_ an actual bug-fix, although right now our
existing 'current' caching that depends on just CSE'ing the inline asm
(and is apparently limited to only doing so within single basic
blocks) doesn't actually trigger the bug in our __switch_to() logic in
practice.

Linus

2023-10-17 19:12:23

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 17, 2023 at 9:00 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 17 Oct 2023 at 00:23, Nadav Amit <[email protected]> wrote:
> >
> > Yes, the FPU issue is the one that caused me to crash before.
>
> Uros, can you verify whether that patch of mine resolves the issue you saw?
>
> That patch is _technically_ an actual bug-fix, although right now our
> existing 'current' caching that depends on just CSE'ing the inline asm
> (and is apparently limited to only doing so within single basic
> blocks) doesn't actually trigger the bug in our __switch_to() logic in
> practice.

Unfortunately, it doesn't fix the oops :(

I'm testing your patch, together with the attached patch with the
current tip tree (that already has all necessary percpu stuff), and
get exactly the same oops in:

[ 4.969657] cfg80211: Loading compiled-in X.509 certificates for
regulatory database
[ 4.980712] modprobe (53) used greatest stack depth: 13480 bytes left
[ 4.981048] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 4.981830] #PF: supervisor write access in kernel mode
[ 4.981830] #PF: error_code(0x0002) - not-present page
[ 4.981830] PGD 0 P4D 0
[ 4.981830] Oops: 0002 [#1] PREEMPT SMP PTI
[ 4.981830] CPU: 1 PID: 54 Comm: kworker/u4:1 Not tainted
6.6.0-rc6-00406-g84ab57184ff4-dirty #2
[ 4.981830] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.2-1.fc37 04/01/2014
[ 4.981830] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 4.981830] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0>
41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
[ 4.981830] RSP: 0000:ffffa505401f3d68 EFLAGS: 00010246
[ 4.981830] RAX: 0000000000000000 RBX: ffff89ed809e9f00 RCX: 0000000000000000
[ 4.981830] RDX: 0000000000000000 RSI: ffff89ed80e6c000 RDI: ffff89ed809ea718
[ 4.981830] RBP: ffff89ed8039ee00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 4.981830] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 4.981830] R13: 0000000000000000 R14: ffff89ed809ea718 R15: ffff89ed80e6c000
[ 4.981830] FS: 0000000000000000(0000) GS:ffff89ee24900000(0000)
knlGS:0000000000000000
[ 4.981830] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.981830] CR2: 0000000000000000 CR3: 00000001003a0000 CR4: 00000000000406f0
[ 4.981830] Call Trace:
[ 4.981830] <TASK>
[ 4.981830] ? __die+0x1e/0x60
[ 4.981830] ? page_fault_oops+0x17b/0x470
[ 4.981830] ? search_module_extables+0x14/0x50
[ 4.981830] ? exc_page_fault+0x66/0x140
[ 4.981830] ? asm_exc_page_fault+0x26/0x30
[ 4.981830] ? begin_new_exec+0x8f2/0xa30
[ 4.981830] ? begin_new_exec+0x3ce/0xa30
[ 4.981830] ? load_elf_phdrs+0x67/0xb0
[ 4.981830] load_elf_binary+0x2bb/0x1770
[ 4.981830] ? __kernel_read+0x136/0x2d0
[ 4.981830] bprm_execve+0x277/0x630
[ 4.981830] kernel_execve+0x145/0x1a0
[ 4.981830] call_usermodehelper_exec_async+0xcb/0x180
[ 4.981830] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.981830] ret_from_fork+0x2f/0x50
[ 4.981830] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.981830] ret_from_fork_asm+0x1b/0x30
[ 4.981830] </TASK>
[ 4.981830] Modules linked in:
[ 4.981830] CR2: 0000000000000000
[ 5.052612] ---[ end trace 0000000000000000 ]---
[ 5.053833] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 5.055065] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0>
41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
[ 5.059476] RSP: 0000:ffffa505401f3d68 EFLAGS: 00010246
[ 5.060780] RAX: 0000000000000000 RBX: ffff89ed809e9f00 RCX: 0000000000000000
[ 5.062483] RDX: 0000000000000000 RSI: ffff89ed80e6c000 RDI: ffff89ed809ea718
[ 5.064190] RBP: ffff89ed8039ee00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 5.065908] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 5.067625] R13: 0000000000000000 R14: ffff89ed809ea718 R15: ffff89ed80e6c000
[ 5.069343] FS: 0000000000000000(0000) GS:ffff89ee24900000(0000)
knlGS:0000000000000000
[ 5.071313] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.072732] CR2: 0000000000000000 CR3: 00000001003a0000 CR4: 00000000000406f0
[ 5.074439] Kernel panic - not syncing: Fatal exception
[ 5.075028] Kernel Offset: 0xcc00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)



Uros.


Attachments:
current.diff.txt (1.83 kB)

2023-10-17 21:06:40

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 17, 2023 at 9:11 PM Uros Bizjak <[email protected]> wrote:
>
> On Tue, Oct 17, 2023 at 9:00 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, 17 Oct 2023 at 00:23, Nadav Amit <[email protected]> wrote:
> > >
> > > Yes, the FPU issue is the one that caused me to crash before.
> >
> > Uros, can you verify whether that patch of mine resolves the issue you saw?
> >
> > That patch is _technically_ an actual bug-fix, although right now our
> > existing 'current' caching that depends on just CSE'ing the inline asm
> > (and is apparently limited to only doing so within single basic
> > blocks) doesn't actually trigger the bug in our __switch_to() logic in
> > practice.
>
> Unfortunately, it doesn't fix the oops :(
>
> I'm testing your patch, together with the attached patch with the
> current tip tree (that already has all necessary percpu stuff), and
> get exactly the same oops in:

But adding the attached patch on top of both patches boots OK.

Uros.


Attachments:
p.diff.txt (302.00 B)

2023-10-17 21:54:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <[email protected]> wrote:
>
> But adding the attached patch on top of both patches boots OK.

Funky.

Mind adding a

WARN_ON_ONCE(!active_mm);

to there to give a nice backtrace for the odd NULL case.

That code *is* related to 'current', in how we do

tsk = current;
...
local_irq_disable();
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
...
activate_mm(active_mm, mm);
...
mmdrop_lazy_tlb(active_mm);

but I don't see how 'active_mm' could *possibly* be validly NULL
here, and why caching 'current' would matter and change it.

Strange.

Hmm. We do set

tsk->active_mm = NULL;

in copy_mm(), and then we have that odd kernel thread case:

        /*
         * Are we cloning a kernel thread?
         *
         * We need to steal a active VM for that..
         */
        oldmm = current->mm;
        if (!oldmm)
                return 0;

but none of this should even matter, because by the time we actually
*schedule* that thread, we'll set active_mm to the right thing.

Can anybody see what's up?

Linus

2023-10-17 22:07:08

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 12:53 AM, Linus Torvalds <[email protected]> wrote:
>
>
> but none of this should even matter, because by the time we actually
> *schedule* that thread, we'll set active_mm to the right thing.
>
> Can anybody see what's up?

Could it be related to exec_mmap() -> exec_mm_release() -> mm_release() -> deactivate_mm() ?

#define deactivate_mm(tsk, mm)                  \
do {                                            \
        if (!tsk->vfork_done)                   \
                shstk_free(tsk);                \
        load_gs_index(0);                       \
        loadsegment(fs, 0);                     \
} while (0)

We change gs_index(), so perhaps it affects later GS reads. There is also this
X86_BUG_NULL_SEG. Need to dive deeper; just initial thoughts though (i.e., I might be
completely off).

2023-10-17 22:29:42

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 1:06 AM, Nadav Amit <[email protected]> wrote:
>
>
>
>> On Oct 18, 2023, at 12:53 AM, Linus Torvalds <[email protected]> wrote:
>>
>>
>> but none of this should even matter, because by the time we actually
>> *schedule* that thread, we'll set active_mm to the right thing.
>>
>> Can anybody see what's up?
>
> Could it be related to exec_mmap() -> exec_mm_release() -> mm_release() -> deactivate_mm() ?

I am probably completely wrong. Sorry for the noise.

2023-10-18 07:46:54

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 17, 2023 at 11:53 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <[email protected]> wrote:
> >
> > But adding the attached patch on top of both patches boots OK.
>
> Funky.
>
> Mind adding a
>
> WARN_ON_ONCE(!active_mm);
>
> to there to give a nice backtrace for the odd NULL case.

[ 4.907840] Call Trace:
[ 4.908909] <TASK>
[ 4.909858] ? __warn+0x7b/0x120
[ 4.911108] ? begin_new_exec+0x90f/0xa30
[ 4.912602] ? report_bug+0x164/0x190
[ 4.913929] ? handle_bug+0x3c/0x70
[ 4.915179] ? exc_invalid_op+0x17/0x70
[ 4.916569] ? asm_exc_invalid_op+0x1a/0x20
[ 4.917969] ? begin_new_exec+0x90f/0xa30
[ 4.919303] ? begin_new_exec+0x3ce/0xa30
[ 4.920667] ? load_elf_phdrs+0x67/0xb0
[ 4.921935] load_elf_binary+0x2bb/0x1770
[ 4.923262] ? __kernel_read+0x136/0x2d0
[ 4.924563] bprm_execve+0x277/0x630
[ 4.925703] kernel_execve+0x145/0x1a0
[ 4.926890] call_usermodehelper_exec_async+0xcb/0x180
[ 4.928408] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.930515] ret_from_fork+0x2f/0x50
[ 4.931894] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.933941] ret_from_fork_asm+0x1b/0x30
[ 4.935371] </TASK>
[ 4.936212] ---[ end trace 0000000000000000 ]---

>
> That code *is* related to 'current', in how we do
>
> tsk = current;
> ...
> local_irq_disable();
> active_mm = tsk->active_mm;
> tsk->active_mm = mm;
> tsk->mm = mm;
> ...
> activate_mm(active_mm, mm);
> ...
> mmdrop_lazy_tlb(active_mm);
>
> but I don't see how 'active_mm' could *poossibly* be validly NULL
> here, and why caching 'current' would matter and change it.

I have also added "__attribute__((optimize(0)))" to exec_mmap() to
weed out compiler bugs. The result was the same oops in
mmdrop_lazy_tlb.

Also, when using WARN_ON instead of WARN_ON_ONCE, it triggers only
once during the whole boot, with the above trace.

Another observation: adding WARN_ON to the top of exec_mmap:

WARN_ON(!current->active_mm);
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;

also triggers WARN, suggesting that current does not have active_mm
set on the entry to the function.

Uros.

> Strange.
>
> Hmm. We do set
>
> tsk->active_mm = NULL;
>
> in copy_mm(), and then we have that odd kernel thread case:
>
> /*
> * Are we cloning a kernel thread?
> *
> * We need to steal a active VM for that..
> */
> oldmm = current->mm;
> if (!oldmm)
> return 0;
>
> but none of this should even matter, because by the time we actually
> *schedule* that thread, we'll set active_mm to the right thing.
>
> Can anybody see what's up?
>
> Linus

2023-10-18 09:05:25

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 9:46 AM Uros Bizjak <[email protected]> wrote:
>
> On Tue, Oct 17, 2023 at 11:53 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <[email protected]> wrote:
> > >
> > > But adding the attached patch on top of both patches boots OK.
> >
> > Funky.
> >
> > Mind adding a
> >
> > WARN_ON_ONCE(!active_mm);
> >
> > to there to give a nice backtrace for the odd NULL case.
>
> [ 4.907840] Call Trace:
> [ 4.908909] <TASK>
> [ 4.909858] ? __warn+0x7b/0x120
> [ 4.911108] ? begin_new_exec+0x90f/0xa30
> [ 4.912602] ? report_bug+0x164/0x190
> [ 4.913929] ? handle_bug+0x3c/0x70
> [ 4.915179] ? exc_invalid_op+0x17/0x70
> [ 4.916569] ? asm_exc_invalid_op+0x1a/0x20
> [ 4.917969] ? begin_new_exec+0x90f/0xa30
> [ 4.919303] ? begin_new_exec+0x3ce/0xa30
> [ 4.920667] ? load_elf_phdrs+0x67/0xb0
> [ 4.921935] load_elf_binary+0x2bb/0x1770
> [ 4.923262] ? __kernel_read+0x136/0x2d0
> [ 4.924563] bprm_execve+0x277/0x630
> [ 4.925703] kernel_execve+0x145/0x1a0
> [ 4.926890] call_usermodehelper_exec_async+0xcb/0x180
> [ 4.928408] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.930515] ret_from_fork+0x2f/0x50
> [ 4.931894] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.933941] ret_from_fork_asm+0x1b/0x30
> [ 4.935371] </TASK>
> [ 4.936212] ---[ end trace 0000000000000000 ]---
>
> >
> > That code *is* related to 'current', in how we do
> >
> > tsk = current;
> > ...
> > local_irq_disable();
> > active_mm = tsk->active_mm;
> > tsk->active_mm = mm;
> > tsk->mm = mm;
> > ...
> > activate_mm(active_mm, mm);
> > ...
> > mmdrop_lazy_tlb(active_mm);
> >
> > but I don't see how 'active_mm' could *poossibly* be validly NULL
> > here, and why caching 'current' would matter and change it.
>
> I have also added "__attribute__((optimize(0)))" to exec_mmap() to
> weed out compiler bugs. The result was the same oops in
> mmdrop_lazy_tlb.
>
> Also, when using WARN_ON instead of WARN_ON_ONCE, it triggers only
> once during the whole boot, with the above trace.
>
> Another observation: adding WARN_ON to the top of exec_mmap:
>
> WARN_ON(!current->active_mm);
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> old_mm = current->mm;
>
> also triggers WARN, suggesting that current does not have active_mm
> set on the entry to the function.

Solved.

All that is needed is to patch cpu_init() from
arch/x86/kernel/cpu/common.c with:

--cut here--
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b14fc8c1c953..61b6fcdf6937 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
*/
void cpu_init(void)
{
- struct task_struct *cur = current;
+ struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
int cpu = raw_smp_processor_id();

#ifdef CONFIG_NUMA
--cut here--

This is effectively the old get_current(). Since we declare and export

+DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
+ const_pcpu_hot) __attribute__((alias("pcpu_hot")));
+EXPORT_PER_CPU_SYMBOL(const_pcpu_hot);

in the same file, and the "new" current represents just

return const_pcpu_hot.current_task;

GCC assumes and over-optimizes something and seemingly doesn't fully
initialize the

cur->active_mm = &init_mm;

below.

Have to run now, but this will be easy to fix.

Uros.

2023-10-18 10:54:44

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 12:04 PM, Uros Bizjak <[email protected]> wrote:
>
> Solved.
>
> All that is needed is to patch cpu_init() from
> arch/x86/kernel/cpu/common.c with:
>
> --cut here--
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index b14fc8c1c953..61b6fcdf6937 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
> */
> void cpu_init(void)
> {
> - struct task_struct *cur = current;
> + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
> int cpu = raw_smp_processor_id();

Thanks for solving that, and sorry that I missed it.

The reason I didn’t encounter it before is that in my original patch I created
a new compilation unit which only defined the alias.

Since there might be additional problems (any “current” use in common.c is
dangerous, even in included files), I think that while there may be additional
solutions, defining the alias in a separate compilation unit - as I did before -
is the safest.

2023-10-18 12:15:48

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 12:54 PM Nadav Amit <[email protected]> wrote:
>
>
>
> > On Oct 18, 2023, at 12:04 PM, Uros Bizjak <[email protected]> wrote:
> >
> > Solved.
> >
> > All that is needed is to patch cpu_init() from
> > arch/x86/kernel/cpu/common.c with:
> >
> > --cut here--
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index b14fc8c1c953..61b6fcdf6937 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
> > */
> > void cpu_init(void)
> > {
> > - struct task_struct *cur = current;
> > + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
> > int cpu = raw_smp_processor_id();
>
> Thanks for solving that, and sorry that I missed it.
>
> The reason I didn’t encounter it before is that in my original patch I created
> a new compilation unit which only defined the alias.
>
> Since there might be additional problems (any “current” use in common.c is
> dangerous, even in included files), I think that while there may be additional
> solutions, defining the alias in a separate compilation unit - as I did before -
> is the safest.

What happens here can be illustrated with the following testcase:

--cut here--
int init_mm;

struct task_struct
{
int *active_mm;
};

struct task_struct init_task;

struct pcpu_hot
{
struct task_struct *current_task;
};

struct pcpu_hot pcpu_hot = { .current_task = &init_task };

extern const struct pcpu_hot __seg_gs const_pcpu_hot
__attribute__((alias("pcpu_hot")));

void foo (void)
{
struct task_struct *cur = const_pcpu_hot.current_task;

cur->active_mm = &init_mm;
}
--cut here--

gcc -O2 -S:

foo:
movq $init_mm, init_task(%rip)
ret

Here, gcc optimizes the access into the generic address space, which it
is allowed to do, since *we set the alias to pcpu_hot*, which is in the
generic address space. The compiler doesn't care that we actually
want:

foo:
movq %gs:const_pcpu_hot(%rip), %rax
movq $init_mm, (%rax)

So yes, to prevent the optimization, we have to hide the alias in another TU.

BTW: Clang creates:

foo:
movq %gs:pcpu_hot(%rip), %rax
movq $init_mm, (%rax)
retq

It is a bit more conservative and retains the address space of the
aliasing symbol.

Looks like another case of underspecified functionality where both
compilers differ. Luckily, both DTRT when aliases are hidden in
another TU.

Uros.

2023-10-18 13:16:20

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 2:14 PM Uros Bizjak <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 12:54 PM Nadav Amit <[email protected]> wrote:
> >
> >
> >
> > > On Oct 18, 2023, at 12:04 PM, Uros Bizjak <[email protected]> wrote:
> > >
> > > Solved.
> > >
> > > All that is needed is to patch cpu_init() from
> > > arch/x86/kernel/cpu/common.c with:
> > >
> > > --cut here--
> > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > index b14fc8c1c953..61b6fcdf6937 100644
> > > --- a/arch/x86/kernel/cpu/common.c
> > > +++ b/arch/x86/kernel/cpu/common.c
> > > @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
> > > */
> > > void cpu_init(void)
> > > {
> > > - struct task_struct *cur = current;
> > > + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
> > > int cpu = raw_smp_processor_id();
> >
> > Thanks for solving that, and sorry that I missed it.
> >
> > The reason I didn’t encounter it before is that in my original patch I created
> > a new compilation unit which only defined the alias.
> >
> > Since there might be additional problems (any “current” use in common.c is
> > dangerous, even in included files), I think that while there may be additional
> > solutions, defining the alias in a separate compilation unit - as I did before -
> > is the safest.
>
> What happens here can be illustrated with the following testcase:
>
> --cut here--
> int init_mm;
>
> struct task_struct
> {
> int *active_mm;
> };
>
> struct task_struct init_task;
>
> struct pcpu_hot
> {
> struct task_struct *current_task;
> };
>
> struct pcpu_hot pcpu_hot = { .current_task = &init_task };
>
> extern const struct pcpu_hot __seg_gs const_pcpu_hot
> __attribute__((alias("pcpu_hot")));
>
> void foo (void)
> {
> struct task_struct *cur = const_pcpu_hot.current_task;
>
> cur->active_mm = &init_mm;
> }
> --cut here--
>
> gcc -O2 -S:
>
> foo:
> movq $init_mm, init_task(%rip)
> ret
>
> Here, gcc optimizes the access into the generic address space, which it
> is allowed to do, since *we set the alias to pcpu_hot*, which is in the
> generic address space. The compiler doesn't care that we actually
> want:
>
> foo:
> movq %gs:const_pcpu_hot(%rip), %rax
> movq $init_mm, (%rax)
>
> So yes, to prevent the optimization, we have to hide the alias in another TU.
>
> BTW: Clang creates:
>
> foo:
> movq %gs:pcpu_hot(%rip), %rax
> movq $init_mm, (%rax)
> retq
>
> It is a bit more conservative and retains the address space of the
> aliasing symbol.
>
> Looks like another case of underspecified functionality where both
> compilers differ. Luckily, both DTRT when aliases are hidden in
> another TU.

Attached is the prototype patch that works for me (together with
Linus' FPU switching patch).

Uros.


Attachments:
current.diff.txt (2.90 kB)

2023-10-18 14:46:41

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 18, 2023, at 4:15 PM, Uros Bizjak <[email protected]> wrote:
>
>>
>> Looks like another case of underspecified functionality where both
>> compilers differ. Luckily, both DTRT when aliases are hidden in
>> another TU.
>
> Attached is the prototype patch that works for me (together with
> Linus' FPU switching patch).

In general looks good. See some minor issues below.

> --- a/arch/x86/include/asm/current.h
> +++ b/arch/x86/include/asm/current.h
> @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64);
>
> DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);
>
> +/*
> + *
> + */

Obviously some further comments are needed to clarify why struct
pcpu_hot is defined in percpu-hot.c (the GCC manual says: "It is an
error if the alias target is not defined in the same translation unit
as the alias", which can be used as part of the explanation.)

> +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
> + const_pcpu_hot);
> +
> +#ifdef CONFIG_USE_X86_SEG_SUPPORT
> +static __always_inline struct task_struct *get_current(void)
> +{
> + return const_pcpu_hot.current_task;
> +}
> +#else
> static __always_inline struct task_struct *get_current(void)
> {
> return this_cpu_read_stable(pcpu_hot.current_task);
> }
> +#endif


Please consider using IS_ENABLED() to avoid the ifdef’ry.

So this would turn to be:

static __always_inline struct task_struct *get_current(void)
{
if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
return const_pcpu_hot.current_task;

return this_cpu_read_stable(pcpu_hot.current_task);
}


2023-10-18 15:18:21

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 4:46 PM Nadav Amit <[email protected]> wrote:

> >> Looks like another case of underspecified functionality where both
> >> compilers differ. Luckily, both DTRT when aliases are hidden in
> >> another TU.
> >
> > Attached is the prototype patch that works for me (together with
> > Linus' FPU switching patch).
>
> In general looks good. See some minor issues below.
>
> > --- a/arch/x86/include/asm/current.h
> > +++ b/arch/x86/include/asm/current.h
> > @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64);
> >
> > DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);
> >
> > +/*
> > + *
> > + */
>
> Obviously some further comments are needed to clarify why struct
> pcpu_hot is defined in percpu-hot.c (the GCC manual says: "It is an
> error if the alias target is not defined in the same translation unit
> as the alias", which can be used as part of the explanation.)

Sure.

>
> > +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
> > + const_pcpu_hot);
> > +
> > +#ifdef CONFIG_USE_X86_SEG_SUPPORT
> > +static __always_inline struct task_struct *get_current(void)
> > +{
> > + return const_pcpu_hot.current_task;
> > +}
> > +#else
> > static __always_inline struct task_struct *get_current(void)
> > {
> > return this_cpu_read_stable(pcpu_hot.current_task);
> > }
> > +#endif
>
>
> Please consider using IS_ENABLED() to avoid the ifdef’ry.
>
> So this would turn to be:
>
> static __always_inline struct task_struct *get_current(void)
> {
> if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
> return const_pcpu_hot.current_task;
>
> return this_cpu_read_stable(pcpu_hot.current_task);
> }

I am more thinking of moving the ifdeffery to percpu.h, something like
the attached part of the patch. This would handle all current and
future stable percpu variables.

Thanks,
Uros.


Attachments:
p.diff.txt (1.13 kB)

2023-10-18 16:03:26

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 18, 2023, at 6:17 PM, Uros Bizjak <[email protected]> wrote:
>
> I am more thinking of moving the ifdeffery to percpu.h, something like
> the attached part of the patch. This would handle all current and
> future stable percpu variables.

I think that for consistency this_cpu_read_stable() should always be an
rvalue, so instead of:

> #define this_cpu_read_stable(pcp) const_##pcp

You would use a statement expression:

#define this_cpu_read_stable(pcp) ({ const_##pcp; })

This would match the other (existing/fallback) definition of
this_cpu_read_stable.

Having said that, I am not sure what other usages you have in mind.
“current” is a pretty obvious straight forward case with considerable
impact on code generation. There may be additional variables, but it is
likely that there would be more functions/TU in which they would not be
constant and would require more refined techniques to avoid mistakes
such as the use of stale cached values.

2023-10-18 16:13:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 06:15, Uros Bizjak <[email protected]> wrote:
>
> Attached is the prototype patch that works for me (together with
> Linus' FPU switching patch).

That looks reasonable, but I think the separate compilation unit is
unnecessary, and I still absolutely hate how that const_pcpu_hot thing
is declared twice (your patch does it in both current.h and in
percpu-hot.c).

How about we just do the whole alias as a linker thing instead? So the
same way that we just do

jiffies = jiffies_64;

in our arch/x86/kernel/vmlinux.lds.S file, we could just do

const_pcpu_hot = pcpu_hot;

in there.

Then, as far as the compiler is concerned, we just have

DECLARE_PER_CPU_ALIGNED(
const struct pcpu_hot __percpu_seg_override,
const_pcpu_hot)

and the compiler doesn't know that it's aliased to anything else.
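
So, roughly - just a sketch restating those pieces (this is the
CONFIG_USE_X86_SEG_SUPPORT side only):

/* arch/x86/kernel/vmlinux.lds.S */
const_pcpu_hot = pcpu_hot;

/* arch/x86/include/asm/current.h - the one and only declaration */
DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
			const_pcpu_hot);

static __always_inline struct task_struct *get_current(void)
{
	return const_pcpu_hot.current_task;
}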

And please do that declaration in just *one* place.

Linus

2023-10-18 16:27:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 09:03, Nadav Amit <[email protected]> wrote:
>
> Having said that, I am not sure what other usages you have in mind.
> “current” is a pretty obvious straight forward case with considerable
> impact on code generation. There may be additional variables, but it is
> likely that there would be more functions/TU in which they would not be
> constant and would require more refined techniques to avoid mistakes
> such as the use of stale cached values.

Yeah, I don't think there really are other cases.

We do have things that could be considered stable (like
"smp_processor_id()" which is stable as long as preemption or
migration is disabled (or it's in an irq-off section)).

And it might be lovely to optimize those too, *BUT* that would require
that there be a barrier against that optimization that works.

And if there is anything that this thread has made clear, it's that
the whole 'load from a constant section' doesn't seem to have any sane
barriers.

So while the CSE for inline asm statements is a bit too weak with that
whole "only CSE within a basic block" thing, the CSE of "load a
constant value from memory" is too *strong*, in that we don't seem to
have _any_ sane way to say "now you need to reload".

The traditional way we've done that is with our "barrier()" macro,
which does the whole inline asm with a memory clobber, but even that
doesn't act as a barrier for gcc optimizing the constant load.

Which means that while we'd probably love for the compiler to optimize
smp_processor_id() a bit more, we can't use the 'stable memory
location' trick for it.

Because I can't think of anything but 'current' that would be _that_
stable as far as C code is concerned.

Linus

2023-10-18 17:24:31

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 6:13 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 06:15, Uros Bizjak <[email protected]> wrote:
> >
> > Attached is the prototype patch that works for me (together with
> > Linus' FPU switching patch).
>
> That looks reasonable, but I think the separate compilation unit is
> unnecessary, and I still absolutely hate how that const_pcpu_hot thing
> is declared twice (your patch does it in both current.h and in
> percpu-hot.c).
>
> How about we just do the whole alias as a linker thing instead? So the
> same way that we just do
>
> jiffies = jiffies_64;
>
> in our arch/x86/kernel/vmlinux.lds.S file, we could just do
>
> const_pcpu_hot = pcpu_hot;
>
> in there.
>
> Then, as far as the compiler is concerned, we just have
>
> DECLARE_PER_CPU_ALIGNED(
> const struct pcpu_hot __percpu_seg_override,
> const_pcpu_hot)
>
> and the compiler doesn't know that it's aliased to anything else.

Oh...

... this works, too! Please see the attached patch.

(Note to self: Leave linking stuff to linker) ;)

> And please do that declaration in just *one* place.

Sure. Now the patch looks quite slim, but works as expected, reducing
the number of current_task accesses from 3841 to 3220.

Thanks,
Uros.


Attachments:
current-2.diff.txt (1.47 kB)

2023-10-18 17:29:19

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 6:26 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 09:03, Nadav Amit <[email protected]> wrote:
> >
> > Having said that, I am not sure what other usages you have in mind.
> > “current” is a pretty obvious straight forward case with considerable
> > impact on code generation. There may be additional variables, but it is
> > likely that there would be more functions/TU in which they would not be
> > constant and would require more refined techniques to avoid mistakes
> > such as the use of stale cached values.
>
> Yeah, I don't think there really are other cases.

In processor.h, we have:

static __always_inline unsigned long current_top_of_stack(void)
{
/*
* We can't read directly from tss.sp0: sp0 on x86_32 is special in
* and around vm86 mode and sp0 on x86_64 is special because of the
* entry trampoline.
*/
return this_cpu_read_stable(pcpu_hot.top_of_stack);
}

But I don't know how much it is used.

Uros.

> We do have things that could be considered stable (like
> "smp_processor_id()" which is stable as long as preemption or
> migration is disabled (or it's in an irq-off section)).
>
> And it might be lovely to optimize those too, *BUT* that would require
> that there be a barrier against that optimization that works.
>
> And if there is anything that this thread has made clear, it's that
> the whole 'load from a constant section' doesn't seem to have any sane
> barriers.
>
> So while the CSE for inline asm statements is a bit too weak with that
> whole "only CSE within a basic block" thing, the CSE of "load a
> constant value from memory" is too *strong*, in that we don't seem to
> have _any_ sane way to say "now you need to reload".
>
> The traditional way we've done that is with our "barrier()" macro,
> which does the whole inline asm with a memory clobber, but even that
> doesn't act as a barrier for gcc optimizing the constant load.
>
> Which means that while we'd probably love for the compiler to optimize
> smp_processor_id() a bit more, we can't use the 'stable memory
> location' trick for it.
>
> Because I can't think of anything but 'current' that would be _that_
> stable as far as C code is concerned.
>
> Linus

2023-10-18 18:02:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 10:08, Uros Bizjak <[email protected]> wrote:
>
> Sure. Now the patch looks quite slim, but works as expected, reducing
> the number of current_task accesses from 3841 to 3220.

Thanks, that patch looks lovely to me.

Since you've done all the hard lifting and the testing, I'd suggest
you submit this all to the x86, including my fpu patch. Take my
sign-off, and the commit message might be something along the lines of

x86: clean up fpu switching in the middle of task switching

It happens to work, but it's very very wrong, because our 'current'
macro is magic that is supposedly loading a stable value.

It just happens to be not quite stable enough and the compilers
re-load the value enough for this code to work. But it's wrong.

It also generates worse code.

So fix it.

Signed-off-by: Linus Torvalds <[email protected]>

or add any verbiage you feel appropriate.

Linus

2023-10-18 18:09:00

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 6:26 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 09:03, Nadav Amit <[email protected]> wrote:
> >
> > Having said that, I am not sure what other usages you have in mind.
> > “current” is a pretty obvious straight forward case with considerable
> > impact on code generation. There may be additional variables, but it is
> > likely that there would be more functions/TU in which they would not be
> > constant and would require more refined techniques to avoid mistakes
> > such as the use of stale cached values.
>
> Yeah, I don't think there really are other cases.
>
> We do have things that could be considered stable (like
> "smp_processor_id()" which is stable as long as preemption or
> migration is disabled (or it's in an irq-off section)).
>
> And it might be lovely to optimize those too, *BUT* that would require
> that there be a barrier against that optimization that works.

But loads from non-const memory work like the above.

Please consider:

--cut here--
extern __seg_gs int m;

int foo (void)
{
int r;

r = m;
r += m;
asm volatile ("" ::: "memory");
r += m;

return r;
}

int bar (void)
{
int r;

r = m;
r += m;
r += m;

return r;
}
--cut here--

gcc -O2:

foo:
movl %gs:m(%rip), %eax
addl %eax, %eax
addl %gs:m(%rip), %eax
ret

bar:
movl %gs:m(%rip), %eax
leal (%rax,%rax,2), %eax
ret

Please note the __barrier(), implemented with asm volatile.

> And if there is anything that this thread has made clear, it's that
> the whole 'load from a constant section' doesn't seem to have any sane
> barriers.
>
> So while the CSE for inline asm statements is a bit too weak with that
> whole "only CSE within a basic block" thing, the CSE of "load a
> constant value from memory" is too *strong*, in that we don't seem to
> have _any_ sane way to say "now you need to reload".

We can use alias to __seg_gs non-const memory, so the value can be
accessed without asm. __barrier() will then force reload. Please note
that any memory clobber, hidden inside asm will also force reload.

> The traditional way we've done that is with our "barrier()" macro,
> which does the whole inline asm with a memory clobber, but even that
> doesn't act as a barrier for gcc optimizing the constant load.
>
> Which means that while we'd probably love for the compiler to optimize
> smp_processor_id() a bit more, we can't use the 'stable memory
> location' trick for it.

We should get rid of asm statements, and as shown above, __barrier()
will do the trick.

> Because I can't think of anything but 'current' that would be _that_
> stable as far as C code is concerned.

Uros.

2023-10-18 18:12:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 10:23, Uros Bizjak <[email protected]> wrote:
>
> In processor.h, we have:
>
> static __always_inline unsigned long current_top_of_stack(void)

Yeah, but that is never used multiple times afaik. I think it's purely
for things like

WARN_ON_ONCE(!on_thread_stack());

in the entry code, for example.

So I guess it can use the same infrastructure, but I doubt it matters
in any practical way.

Grepping around for it, it looks like the 32-bit code has some stale commentary:

* Reload esp0 and pcpu_hot.top_of_stack. This changes
* current_thread_info().

but that seems entirely bogus. We historically picked up
current_thread_info() from %esp, but that hasn't been true in ages,
afaik. Now it's all based on 'current'.

Linus

2023-10-18 18:16:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <[email protected]> wrote:
>
> But loads from non-const memory work like the above.

Yes, I'm certainly ok with the move to use plain loads from __seg_gs
for the percpu accesses. If they didn't honor the memory clobber, we
could never use it at all.

I was just saying that the 'const' alias trick isn't useful for
anything else than 'current', because everything else needs to at
least honor our existing barriers.

(And yes, there's the other user of this_cpu_read_stable() -
'top_of_stack', but as mentioned that doesn't really matter).

Linus

2023-10-18 18:27:17

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 8:16 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <[email protected]> wrote:
> >
> > But loads from non-const memory work like the above.
>
> Yes, I'm certainly ok with the move to use plain loads from __seg_gs
> for the percpu accesses. If they didn't honor the memory clobber, we
> could never use it at all.
>
> I was just saying that the 'const' alias trick isn't useful for
> anything else than 'current', because everything else needs to at
> least honor our existing barriers.

FYI, smp_processor_id() is implemented as:

#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)

where __this_* forces volatile access which disables CSE.

*If* the variable is really stable, then it should use __raw_cpu_read.
Both, __raw_* and __this_* were recently (tip/percpu branch)
implemented for SEG_SUPPORT as:

#define __raw_cpu_read(qual, pcp) \
({ \
*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
})

where "qual" can be volatile. To enable smp_processor_id()
optimization, it just needs to be moved from __this to __raw accessor.

> (And yes, there's the other user of this_cpu_read_stable() -
> 'top_of_stack', but as mentioned that doesn't really matter).

Uros.

2023-10-18 18:29:28

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 18, 2023, at 9:08 PM, Uros Bizjak <[email protected]> wrote:
>
> We can use alias to __seg_gs non-const memory, so the value can be
> accessed without asm. __barrier() will then force reload. Please note
> that any memory clobber, hidden inside asm will also force reload.

For the record, at the time I tried to find a creative solution to
have fine granularity barrier control. I looked into address-spaces,
pure and const function attributes, and use of the restrict keyword.

I did not find a better solution. The behavior of most of these
mechanisms was non-intuitive for me and inconsistent across compilers.

You may succeed where I have failed as you seem more familiar with
the compiler code, but be aware it is a rabbit hole...

2023-10-18 19:34:01

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 8:26 PM Uros Bizjak <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 8:16 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <[email protected]> wrote:
> > >
> > > But loads from non-const memory work like the above.
> >
> > Yes, I'm certainly ok with the move to use plain loads from __seg_gs
> > for the percpu accesses. If they didn't honor the memory clobber, we
> > could never use it at all.
> >
> > I was just saying that the 'const' alias trick isn't useful for
> > anything else than 'current', because everything else needs to at
> > least honor our existing barriers.
>
> FYI, smp_processor_id() is implemented as:
>
> #define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
>
> where __this_* forces volatile access which disables CSE.
>
> *If* the variable is really stable, then it should use __raw_cpu_read.
> Both, __raw_* and __this_* were recently (tip/percpu branch)
> implemented for SEG_SUPPORT as:

This patch works for me:

--cut here--
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4fab2ed454f3..6eda4748bf64 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -141,8 +141,7 @@ __visible void
smp_call_function_single_interrupt(struct pt_regs *r);
* This function is needed by all SMP systems. It must _always_ be valid
* from the initial startup.
*/
-#define raw_smp_processor_id() this_cpu_read(pcpu_hot.cpu_number)
-#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
+#define raw_smp_processor_id() raw_cpu_read(pcpu_hot.cpu_number)

#ifdef CONFIG_X86_32
extern int safe_smp_processor_id(void);
--cut here--

But it removes merely 10 reads out of 3219.

BTW: I also don't understand the comment from include/linux/smp.h:

/*
* Allow the architecture to differentiate between a stable and unstable read.
* For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
* regular asm read for the stable.
*/
#ifndef __smp_processor_id
#define __smp_processor_id(x) raw_smp_processor_id(x)
#endif

All reads up to word size on x86 are atomic, so IRQ safe. asm-volatile
is not some IRQ property, but prevents the compiler from CSEing the asm
and from scheduling (moving) the asm around too much.

Uros.

2023-10-18 20:17:51

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 10:33 PM, Uros Bizjak <[email protected]> wrote:
>
> This patch works for me:
>
> --cut here--
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 4fab2ed454f3..6eda4748bf64 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -141,8 +141,7 @@ __visible void
> smp_call_function_single_interrupt(struct pt_regs *r);
> * This function is needed by all SMP systems. It must _always_ be valid
> * from the initial startup.
> */
> -#define raw_smp_processor_id() this_cpu_read(pcpu_hot.cpu_number)
> -#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
> +#define raw_smp_processor_id() raw_cpu_read(pcpu_hot.cpu_number)

I don’t think that’s correct. IIUC, while changing __smp_processor_id()
to read pcpu_hot.cpu_number through raw_cpu_read may be fine,
raw_smp_processor_id() should not be changed in this manner.

raw_smp_processor_id() cannot assume that preemption is disabled or that
the task cannot migrate to another core. So the “volatile” keyword and
inline assembly are used to ensure the ordering, and that the value is
read without some compiler optimization reading it multiple times or
tearing the read.

In contrast raw_cpu_read() does not use the volatile keyword, so it
does not provide the same guarantees.

2023-10-18 20:22:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <[email protected]> wrote:
>
> This patch works for me:

Looks fine.

But you actually bring up another issue:

> BTW: I also don't understand the comment from include/linux/smp.h:
>
> /*
> * Allow the architecture to differentiate between a stable and unstable read.
> * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> * regular asm read for the stable.

I think the comment is badly worded, but I think the issue may actually be real.

One word: rematerialization.

The thing is, turning inline asm accesses to regular compiler loads
has a *very* bad semantic problem: the compiler may now feel like it
can not only combine the loads (ok), but also possibly rematerialize
values by re-doing the loads (NOT OK!).

IOW, the kernel often has very strict requirements of "at most once"
behavior, because doing two loads might give different results.

The cpu number is a good example of this.

And yes, sometimes we use actual volatile accesses for them
(READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
and are much too strict. Not only does gcc generally lose its mind
when it sees volatile (ie it stops doing various sane combinations
that would actually be perfectly valid), but it obviously also stops
doing CSE on the loads (as it has to).

So the "non-volatile asm" has been a great way to get the "at most
one" behavior: it's safe wrt interrupts changing the value, because
you will see *one* value, not two. As far as we know, gcc never
rematerializes the output of an inline asm. So when you use an inline
asm, you may have the result CSE'd, but you'll never see it generate
more than *one* copy of the inline asm.

(Of course, as with so much about inline asm, that "knowledge" is not
necessarily explicitly spelled out anywhere, and it's just "that's how
it has always worked").

IOW, look at code like the one in swiotlb_pool_find_slots(), which does this:

int start = raw_smp_processor_id() & (pool->nareas - 1);

and the use of 'start' really is meant to be just a good heuristic, in
that different concurrent CPU's will start looking in different pools.
So that code is basically "cpu-local by default", but it's purely
about locality, it's not some kind of correctness issue, and it's not
necessarily run when the code is *tied* to a particular CPU.

But what *is* important is that 'start' have *one* value, and one
value only. So look at that loop, which basically does

do {
.. use the 'i' based on 'start' ..
if (++i >= pool->nareas)
i = 0;
} while (i != start);

and it is very important indeed that the compiler does *not* think
"Oh, I can rematerialize the 'start' value".

See what I'm saying? Using 'volatile' for loading the current CPU
value would be bad for performance for no good reason. But loading it
multiple times would be a *bug*.

Using inline asm is basically perfect here: the compiler can *combine*
two inline asms into one, but once we have a value for 'start', it
won't change, because the compiler is not going to decide "I can drop
this value, and just re-do the inline asm to rematerialize it".

This all makes me worried about the __seg_fs thing.

For 'current', this is all perfect. Rematerializing current is
actually better than spilling and reloading the value.

But for something like raw_smp_processor_id(), rematerializing would
be a correctness problem, and a really horrible one (because in
practice, the code would work 99.9999% of the time, and then once in a
blue moon, it would rematerialize a different value).

See the problem?
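
To make that concrete, here's a minimal sketch of the shape of that
loop - 'area_is_free' and the pool details are just stand-ins, not the
real swiotlb code:

static int find_area(struct io_tlb_pool *pool)
{
	/* read the CPU number once; 'start' must keep that one value */
	int start = raw_smp_processor_id() & (pool->nareas - 1);
	int i = start;

	do {
		if (area_is_free(pool, i))	/* stand-in helper */
			return i;
		if (++i >= pool->nareas)
			i = 0;
	} while (i != start);

	/*
	 * If the compiler were to rematerialize 'start', the exit test
	 * above would effectively become
	 *
	 *	i != (raw_smp_processor_id() & (pool->nareas - 1))
	 *
	 * and a migration in the middle would make it compare against
	 * a different CPU's value.
	 */
	return -1;
}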

I guess we could use the stdatomics to try to explain these issues to
the compiler, but I don't even know what the C interfaces look like or
whether they are stable and usable across the range of compilers we
use.

Linus

2023-10-18 20:35:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 13:22, Linus Torvalds
<[email protected]> wrote:
>
> And yes, sometimes we use actual volatile accesses for them
> (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> and are much too strict. Not only does gcc generally lose its mind
> when it sees volatile (ie it stops doing various sane combinations
> that would actually be perfectly valid), but it obviously also stops
> doing CSE on the loads (as it has to).

Note, in case you wonder what I mean by "lose its mind", try this
(extremely stupid) test program:

void a(volatile int *i) { ++*i; }
void b(int *i) { ++*i; }

and note that the non-volatile version does

addl $1, (%rdi)

but the volatile version then refuses to combine the read+write into a
rmw instruction, and generates

movl (%rdi), %eax
addl $1, %eax
movl %eax, (%rdi)

instead.

Sure, it's correct, but it's an example of how 'volatile' ends up
disabling a lot of other optimizations than just the "don't remove the
access".

Doing the volatile as one rmw instruction would still have been very
obviously valid - it's still doing a read and a write. You don't need
two instructions for that.

I'm not complaining, and I understand *why* it happens - compiler
writers very understandably go "oh, I'm not touching that".

I'm just trying to point out that volatile really screws up code
generation even aside from the "access _exactly_ once" issue.

So using inline asm and relying on gcc doing (minimal) CSE will then
generate better code than volatile ever could, even when we just use a
simple 'mov" instruction. At least you get that basic combining
effect, even if it's not great.

And for memory ops, *not* using volatile is dangerous when they aren't stable.

Linus

2023-10-18 20:43:16

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:22 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <[email protected]> wrote:
> >
> > This patch works for me:
>
> Looks fine.
>
> But you actually bring up another issue:
>
> > BTW: I also don't understand the comment from include/linux/smp.h:
> >
> > /*
> > * Allow the architecture to differentiate between a stable and unstable read.
> > * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> > * regular asm read for the stable.
>
> I think the comment is badly worded, but I think the issue may actually be real.
>
> One word: rematerialization.
>
> The thing is, turning inline asm accesses to regular compiler loads
> has a *very* bad semantic problem: the compiler may now feel like it
> can not only combine the loads (ok), but also possibly rematerialize
> values by re-doing the loads (NOT OK!).
>
> IOW, the kernel often has very strict requirements of "at most once"
> behavior, because doing two loads might give different results.
>
> The cpu number is a good example of this.
>
> And yes, sometimes we use actual volatile accesses for them
> (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> and are much too strict. Not only does gcc generally lose its mind
> when it sees volatile (ie it stops doing various sane combinations
> that would actually be perfectly valid), but it obviously also stops
> doing CSE on the loads (as it has to).
>
> So the "non-volatile asm" has been a great way to get the "at most
> one" behavior: it's safe wrt interrupts changing the value, because
> you will see *one* value, not two. As far as we know, gcc never
> rematerializes the output of an inline asm. So when you use an inline
> asm, you may have the result CSE'd, but you'll never see it generate
> more than *one* copy of the inline asm.
>
> (Of course, as with so much about inline asm, that "knowledge" is not
> necessarily explicitly spelled out anywhere, and it's just "that's how
> it has always worked").
>
> IOW, look at code like the one in swiotlb_pool_find_slots(), which does this:
>
> int start = raw_smp_processor_id() & (pool->nareas - 1);
>
> and the use of 'start' really is meant to be just a good heuristic, in
> that different concurrent CPU's will start looking in different pools.
> So that code is basically "cpu-local by default", but it's purely
> about locality, it's not some kind of correctness issue, and it's not
> necessarily run when the code is *tied* to a particular CPU.
>
> But what *is* important is that 'start' have *one* value, and one
> value only. So look at that loop, which basically does
>
> do {
> .. use the 'i' based on 'start' ..
> if (++i >= pool->nareas)
> i = 0;
> } while (i != start);
>
> and it is very important indeed that the compiler does *not* think
> "Oh, I can rematerialize the 'start' value".
>
> See what I'm saying? Using 'volatile' for loading the current CPU
> value would be bad for performance for no good reason. But loading it
> multiple times would be a *bug*.
>
> Using inline asm is basically perfect here: the compiler can *combine*
> two inline asms into one, but once we have a value for 'start', it
> won't change, because the compiler is not going to decide "I can drop
> this value, and just re-do the inline asm to rematerialize it".
>
> This all makes me worried about the __seg_fs thing.

Please note that there is a difference between this_* and raw_*
accessors. this_* has "volatile" qualification, and for sure it won't
ever be rematerialized (this would defeat the purpose of "volatile").
Previously, this_* was defined as asm-volatile, and now it is defined
as a volatile memory access. GCC will disable almost all optimizations
when volatile memory is accessed. IIRC, volatile-asm also won't be
combined, since the compiler does not know about secondary effects of
asm.

Side note: The raw_smp_processor_id() uses this_, so it has volatile
qualification. Perhaps we can do

+#define __smp_processor_id() raw_cpu_read(pcpu_hot.cpu_number)

as this seems like the relaxed version of smp_processor_id().

So, guarantees of asm-volatile are the same as guarantees of volatile
memory access.

Uros.
>
> For 'current', this is all perfect. Rematerializing current is
> actually better than spilling and reloading the value.
>
> But for something like raw_smp_processor_id(), rematerializing would
> be a correctness problem, and a really horrible one (because in
> practice, the code would work 99.9999% of the time, and then once in a
> blue moon, it would rematerialize a different value).
>
> See the problem?
>
> I guess we could use the stdatomics to try to explain these issues to
> the compiler, but I don't even know what the C interfaces look like or
> whether they are stable and usable across the range of compilers we
> use.
>
> Linus

2023-10-18 20:52:39

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:34 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 13:22, Linus Torvalds
> <[email protected]> wrote:
> >
> > And yes, sometimes we use actual volatile accesses for them
> > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> > and are much too strict. Not only does gcc generally lose its mind
> > when it sees volatile (ie it stops doing various sane combinations
> > that would actually be perfectly valid), but it obviously also stops
> > doing CSE on the loads (as it has to).
>
> Note, in case you wonder what I mean by "lose its mind", try this
> (extremely stupid) test program:
>
> void a(volatile int *i) { ++*i; }
> void b(int *i) { ++*i; }
>
> and note that the non-volatile version does
>
> addl $1, (%rdi)
>
> but the volatile version then refuses to combine the read+write into a
> rmw instruction, and generates
>
> movl (%rdi), %eax
> addl $1, %eax
> movl %eax, (%rdi)
>
> instead.
>
> Sure, it's correct, but it's an example of how 'volatile' ends up
> disabling a lot of other optimizations than just the "don't remove the
> access".
>
> Doing the volatile as one rmw instruction would still have been very
> obviously valid - it's still doing a read and a write. You don't need
> two instructions for that.

FYI: This is the reason RMW instructions in percpu.h are not (blindly)
converted to C ops. They will remain in their (volatile or not) asm
form because of the above reason, and due to the fact that they don't
combine with anything.

> I'm not complaining, and I understand *why* it happens - compiler
> writers very understandably go "oh, I'm not touching that".
>
> I'm just trying to point out that volatile really screws up code
> generation even aside from the "access _exactly_ once" issue.
>
> So using inline asm and relying on gcc doing (minimal) CSE will then
> generate better code than volatile ever could, even when we just use a
> simple 'mov" instruction. At least you get that basic combining
> effect, even if it's not great.

Actually, RMW insns are better written in asm, while simple "mov"
should be converted to (volatile or not) memory access. On x86 "mov"s
from memory (reads) will combine nicely with almost all other
instructions.

> And for memory ops, *not* using volatile is dangerous when they aren't stable.

True.

Uros.

2023-10-18 21:09:47

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:51 PM Uros Bizjak <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 10:34 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Wed, 18 Oct 2023 at 13:22, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > And yes, sometimes we use actual volatile accesses for them
> > > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> > > and are much too strict. Not only does gcc generally lose its mind
> > > when it sees volatile (ie it stops doing various sane combinations
> > > that would actually be perfectly valid), but it obviously also stops
> > > doing CSE on the loads (as it has to).
> >
> > Note, in case you wonder what I mean by "lose its mind", try this
> > (extremely stupid) test program:
> >
> > void a(volatile int *i) { ++*i; }
> > void b(int *i) { ++*i; }
> >
> > and note that the non-volatile version does
> >
> > addl $1, (%rdi)
> >
> > but the volatile version then refuses to combine the read+write into a
> > rmw instruction, and generates
> >
> > movl (%rdi), %eax
> > addl $1, %eax
> > movl %eax, (%rdi)
> >
> > instead.
> >
> > Sure, it's correct, but it's an example of how 'volatile' ends up
> > disabling a lot of other optimizations than just the "don't remove the
> > access".
> >
> > Doing the volatile as one rmw instruction would still have been very
> > obviously valid - it's still doing a read and a write. You don't need
> > two instructions for that.
>
> FYI: This is the reason RMW instructions in percpu.h are not (blindly)
> converted to C ops. They will remain in their (volatile or not) asm
> form because of the above reason, and due to the fact that they don't
> combine with anything.
>
> > I'm not complaining, and I understand *why* it happens - compiler
> > writers very understandably go "oh, I'm not touching that".
> >
> > I'm just trying to point out that volatile really screws up code
> > generation even aside from the "access _exactly_ once" issue.
> >
> > So using inline asm and relying on gcc doing (minimal) CSE will then
> > generate better code than volatile ever could, even when we just use a
> > simple 'mov" instruction. At least you get that basic combining
> > effect, even if it's not great.
>
> Actually, RMW insns are better written in asm, while simple "mov"
> should be converted to (volatile or not) memory access. On x86 "mov"s
> from memory (reads) will combine nicely with almost all other
> instructions.

BTW: There was a discussion that GCC should construct RMW instructions
also when the memory location is marked volatile, but there was no
resolution reached. So, the "I'm not touching that" approach remains.
However, GCC *will* combine a volatile read with a follow-up
instruction.
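
E.g., a minimal user-space-style sketch (assuming a compiler with
__seg_gs support; 'v' and 'add_v' are made-up names):

extern __seg_gs volatile int v;

int add_v(int x)
{
	/* the volatile read happens exactly once, but - per the above -
	 * it can still be folded into the add as a memory operand
	 * instead of needing a separate mov */
	return x + v;
}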

Uros.

2023-10-18 21:13:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 13:52, Uros Bizjak <[email protected]> wrote:
>
> FYI: This is the reason RMW instructions in percpu.h are not (blindly)
> converted to C ops. They will remain in their (volatile or not) asm
> form because of the above reason, and due to the fact that they don't
> combine with anything.

Well, also because converting them to C would be HORRIBLY BUGGY.

They absolutely have to be a single instruction. We have architectures
that can't do rmw instructions, and then they have to do lots of extra
horrid crud (disable interrupts or whatever) to make a percpu 'add' be
a valid thing.


> Actually, RMW insns are better written in asm, while simple "mov"
> should be converted to (volatile or not) memory access.

No.

This remat issue has convinced me that the *only* thing that should be
converted to a memory access is the "stable" case (which in practice
is mainly just 'current').

Because if you make them 'volatile' memory instructions, then the
simple "mov" inline asm is simply better. It still allows CSE on the
asm (in the "raw" form).

And if you make them memory instructions _without_ the 'volatile', the
memory access is simply buggy until we have some 'nomaterialize'
model.

So the *only* situation where a memory access is better is that
'stable' case. In all other cases they are the same or strictly worse
than 'asm'.

Linus

2023-10-18 21:41:41

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 11:11 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 13:52, Uros Bizjak <[email protected]> wrote:
> >
> > FYI: This is the reason RMW instructions in percpu.h are not (blindly)
> > converted to C ops. They will remain in their (volatile or not) asm
> > form because of the above reason, and due to the fact that they don't
> > combine with anything.
>
> Well, also because converting them to C would be HORRIBLY BUGGY.
>
> They absolutely have to be a single instruction. We have architectures
> that can't do rmw instructions, and then they have to do lots of extra
> horrid crud (disable interrupts or whatever) to make a percpu 'add' be
> a valid thing.
>
>
> > Actually, RMW insns are better written in asm, while simple "mov"
> > should be converted to (volatile or not) memory access.
>
> No.
>
> This remat issue has convinced me that the *only* thing that should be
> converted to a memory access is the "stable" case (which in practice
> is mainly just 'current').
>
> Because if you make them 'volatile' memory instructions, then the
> simple "mov" inline asm is simply better. It still allows CSE on the
> asm (in the "raw" form).

The ones in "raw" form are not IRQ safe and these are implemented
without volatile qualifier.

The safe variants are the ones in "this" form. These were implemented as
volatile-asm and are now implemented as volatile reads. They do not
rematerialize, and the number of memory accesses stays the same. They do
not CSE (volatile-asm also doesn't), but they can propagate into
follow-up instructions.

> And if you make them memory instructions _without_ the 'volatile', the
> memory access is simply buggy until we have some 'nomaterialize'
> model.

This is the reason that almost all percpu access is implemented using
this_* accessors. raw_* is a relaxed version without IRQ guarantees
that should be (and is) used in a controlled manner in special
places:

https://elixir.bootlin.com/linux/latest/A/ident/this_cpu_read
https://elixir.bootlin.com/linux/latest/A/ident/raw_cpu_read

>
> So the *only* situation where a memory access is better is that
> 'stable' case. In all other cases they are the same or strictly worse
> than 'asm'.

No, argument propagation is non-existent with the "asm" version.

Uros.

2023-10-18 22:40:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
>
> The ones in "raw" form are not IRQ safe and these are implemented
> without volatile qualifier.

You are misreading it.

Both *are* irq safe - on x86.

The difference between "this_cpu_xyz()" and "raw_cpu_xyz()" is that on
*other* architectures, "raw_cpu_xyz():" can be a lot more efficient,
because other architectures may need to do extra work to make the
"this" version be atomic on a particular CPU.

See for example __count_vm_event() vs count_vm_event().

In fact, that particular use isn't even in an interrupt-safe context,
that's an example of literally "I'd rather be fast than correct for
certain statistics that aren't all that important".

The two versions generate the same code on x86, but on other
architectures, __count_vm_event() can be *much* simpler and faster
because it doesn't disable interrupts or do other special things.
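
Roughly, the generic fallbacks have this shape - a simplified sketch
loosely following include/linux/percpu-defs.h, not the literal macros:

/* plain RMW: preemption or an interrupt can land between load and store */
#define raw_cpu_add_sketch(pcp, val)					\
do {									\
	*raw_cpu_ptr(&(pcp)) += (val);					\
} while (0)

/* IRQ-protected RMW: atomic with respect to anything on this CPU */
#define this_cpu_add_sketch(pcp, val)					\
do {									\
	unsigned long __flags;						\
	raw_local_irq_save(__flags);					\
	*raw_cpu_ptr(&(pcp)) += (val);					\
	raw_local_irq_restore(__flags);					\
} while (0)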

But on x86, the whole "interrupt safety" is a complete red herring.
Both of them generate the exact same instruction.

On x86, the "volatile" is actually for a completely different reason:
to avoid too much CSE by the compiler.

See commit b59167ac7baf ("x86/percpu: Fix this_cpu_read()").

In fact, that commit went overboard, and just added "volatile" to
*every* percpu read.

So then people complained about *that*, and PeterZ did commit
0b9ccc0a9b14 ("x86/percpu: Differentiate this_cpu_{}() and
__this_cpu_{}()"), which basically made that "qual or not" be a macro
choice.

And in the process, it now got added to all the RMW ops, that didn't
actually need it or want it in the first place, since they won't be
CSE'd, since they depend on the input.

So that commit basically generalized the whole thing entirely
pointlessly, and caused your current confusion.

End result: we should remove 'volatile' from the RMW ops. It doesn't
do anything on x86. All it does is make us have two subtly different
versions whose difference we don't care about.

End result two: we should make it clear that "this_cpu_read()" vs
"raw_cpu_read()" are *NOT* about interrupts. Even on architectures
where the RMW ops need to have irq protection (so that they are atomic
wrt interrupts also modifying the value), the *READ* operation
obviously has no such issue.

For the raw_cpu_read() vs this_cpu_read() case, the only issue is
whether you can CSE the result.

And in 99% of all cases, you can - and want to - CSE it. But as that
commit b59167ac7baf shows, sometimes you cannot.

Side note: the code that caused that problem is this:

__always_inline void __cyc2ns_read(struct cyc2ns_data *data)
{
int seq, idx;

do {
seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
...
} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
}

where the issue is that the this_cpu_read() of that sequence number
needs to be ordered.

Honestly, that code is just buggy and bad. We should never have
"fixed" it by changing the semantics of this_cpu_read() in the first
place.

The problem is that it re-implements its own locking model, and as so
often happens when people do that, they do it completely wrongly.

Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice
how in raw_read_seqcount_begin() we have

unsigned _seq = __read_seqcount_begin(s);
smp_rmb();

because it actually does the proper barriers. Notice how the garbage
code in __cyc2ns_read() doesn't have them - and how it was buggy as a
result.

(Also notice how this all predates our "we should use load_acquire()
instead of smp_rmb()", but whatever).

IOW, all the "volatiles" in the x86 <asm/percpu.h> file are LITERAL
GARBAGE and should not exist, and are due to a historical mistake.

Linus

2023-10-18 23:07:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 15:40, Linus Torvalds
<[email protected]> wrote:
>
> See for example __count_vm_event() vs count_vm_event().
>
> In fact, that particular use isn't even in an interrupt-safe context,
> that's an example of literally "I'd rather be fast than correct for
> certain statistics that aren't all that important".

.. just to clarify - I don't think the VM statistics code is even
updated from interrupts, but it is still incorrect to do
"raw_cpu_add()" even in just process context, because on architectures
where it results in separate load-op-store instructions, you can get
preempted in the middle, and now your loaded value is some old stale
one. So when you get back, somebody else might have updated the count,
but you'll still end up doing the store using the stale value.

For VM statistics like the BALLOON_MIGRATE, nobody cares. The stats
may be incorrect, but they aren't a correctness issue, and they'll be
in the right ballpark because the race is not generally hit.

So "interrupt safe" here is not necessarily about actual interrupts
themselves directly. You *can* have that too, but it can also be about
just an interrupt causing preemption.

Anyway, again, none of this is relevant on x86, since the
single-instruction rmw percpu sequence is obviously non-interruptible.

The one oddity on x86 is that, because 'xchg' always has an implied
lock, there we *do* have a multi-instruction sequence.

And then - and *ONLY* then - the raw-vs-this matters even on x86:
"raw" just does a "load-store" pair, while "this" does a cmpxchg loop
(the latter of which is safe for both irq use and preemption because
the cmpxchg obviously re-checks the original value).
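
As a rough sketch (not the exact <asm/percpu.h> macros; the names here
are made up), the two flavours look something like this:

        /* "raw": plain read + write, fine only while the operand is stable */
        #define raw_pcpu_xchg(var, nval)                                \
        ({                                                              \
                typeof(var) old__ = raw_cpu_read(var);                  \
                raw_cpu_write(var, nval);                               \
                old__;                                                  \
        })

        /* "this": cmpxchg loop; a racing update makes the cmpxchg fail
         * and the loop retry, so it survives interrupts and preemption */
        #define this_pcpu_xchg(var, nval)                               \
        ({                                                              \
                typeof(var) old__, new__ = (nval);                      \
                do {                                                    \
                        old__ = this_cpu_read(var);                     \
                } while (this_cpu_cmpxchg(var, old__, new__) != old__); \
                old__;                                                  \
        })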

But even in that xchg case, the "volatile" part of the asm is a
complete red herring and shouldn't exist.

Linus

2023-10-19 07:05:02

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 12:40 AM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
> >
> > The ones in "raw" form are not IRQ safe and these are implemented
> > without volatile qualifier.
>
> You are misreading it.
>
> Both *are* irq safe - on x86.
>
> The difference between "this_cpu_xyz()" and "raw_cpu_xyz()" is that on
> *other* architectures, "raw_cpu_xyz()" can be a lot more efficient,
> because other architectures may need to do extra work to make the
> "this" version be atomic on a particular CPU.
>
> See for example __count_vm_event() vs count_vm_event().
>
> In fact, that particular use isn't even in an interrupt-safe context,
> that's an example of literally "I'd rather be fast than correct for
> certain statistics that aren't all that important".
>
> The two versions generate the same code on x86, but on other
> architectures, __count_vm_event() can be *much* simpler and faster
> because it doesn't disable interrupts or do other special things.
>
> But on x86, the whole "interrupt safety" is a complete red herring.
> Both of them generate the exact same instruction.
>
> On x86, the "volatile" is actually for a completely different reason:
> to avoid too much CSE by the compiler.

Let me explain how the compiler handles volatile. Basically, it
disables most of the optimizations when volatile is encountered, be it
on volatile asm or volatile memory access. It is important that
argument propagation is NOT disabled with volatile memory, so the
compiler can still propagate arguments into the following instruction,
as long as no memory access is crossed. This is all that [1]
implements, while keeping volatile around.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/percpu&id=ca4256348660cb2162668ec3d13d1f921d05374a
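
As a made-up example of what that propagation buys (pcpu_val is a
hypothetical per-CPU variable, and the asm in the comment is only
approximate):

        static DEFINE_PER_CPU(int, pcpu_val);  /* hypothetical per-CPU variable */

        int add_pcpu(int x)
        {
                /*
                 * With the named-address-space (C memory op) implementation the
                 * read can be folded straight into its user, roughly:
                 *      addl %gs:pcpu_val(%rip), %eax
                 * With the asm-based this_cpu_read() it must go through a
                 * register first:
                 *      movl %gs:pcpu_val(%rip), %edx
                 *      addl %edx, %eax
                 */
                return x + this_cpu_read(pcpu_val);
        }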

Remark: you raised a question about why "testl $0xf0000,%gs:0x0(%rip)"
was not optimized to testb - because of the volatile access on memory,
the compiler won't shrink the memory read.

The compiler won't CSE volatile-asm or volatile reads, so the number
of memory accesses stays the same. This is an important property,
promised by the this_* API, and this_* should be used almost everywhere
because of this property.

raw_* is a relaxed version that allows CSE of asm and memory accesses.
The number of memory accesses can change, and additionally, it *can*
rematerialize arguments from memory [2].

[2] If there is an instruction that uses all registers in between the
load from memory and the insn that uses the loaded value, then the
compiler will try to accommodate the instruction by pushing all active
registers to the stack (stack is owned by the compiler). Instead of
pushing and popping the register with the loaded value, it can just as
well re-read it from the original non-volatile location. If the location is
qualified as volatile, this optimization is off.

This is the rematerialization that you are worrying about. Like CSE,
it can be allowed only under special conditions: in a section where it
is *guaranteed* that the value in memory won't change for the whole
section. This can be guaranteed in a (kind of) critical section (e.g.
when interrupts are disabled); otherwise the rematerialized value can
differ, because some interrupt handler changed the value in memory.
The opposite also holds: we will miss a changed value when the read is
CSE'd.
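
A minimal sketch of that hazard (pcpu_var is a made-up stand-in for a
non-volatile per-CPU location; whether the compiler spills or re-reads
is entirely its choice):

        extern int pcpu_var;            /* stands in for a per-CPU variable */

        int remat_hazard(void)
        {
                int val = pcpu_var;     /* one load in the C source */

                /* register pressure, but no "memory" clobber */
                asm volatile ("" : : : "ax", "cx", "dx", "si", "di", "r8", "r9");

                return val;             /* the compiler may spill 'val', or it may
                                         * re-read (rematerialize) pcpu_var here and
                                         * observe a newer value if an interrupt
                                         * changed it in between */
        }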

That means that even when the instruction is the same (and IRQ safe),
"volatile" inhibits unwanted optimizations and should be there for the
compiler's sake, to guarantee the access to memory. Since RMW asms are
accessing memory, the non-relaxed versions should be volatile, too.

Continuing with the raw_* issue: these are relaxed versions that
require stable operands inside a section marked with some barriers
(so, a critical section). Those barriers (e.g. a volatile asm memory
clobber) prevent memory accesses from being scheduled outside this
critical section, and memory operand stability is achieved by e.g.
disabling interrupts. Since the user *guarantees* memory stability,
the compiler can relax its CSE and rematerialization restrictions *for
this section*. Accessors that are implemented without "volatile"
communicate this relaxation to the compiler.

The above also means that when raw_* versions are used on non-stable
operands, it is the user's fault for using them in the first place, not
the compiler's. raw_ versions should be used with great care.

To illustrate the above, let's look at the "xchg" RMW implementation.
The this_* one is interrupt safe in the sense that it is implemented
with a cmpxchg operation that is atomic w.r.t. interrupts (which can
change the contents of the memory) [note, this is not "locked", and
does not have the full atomic property], but raw_* is implemented simply by
raw_ reads and writes. In the raw_ case, we don't care what happens
with memory accesses, because they are required to be stable in the
section where raw_xchg is used.

Then we have this_cpu_read_stable, which has even stricter requirements
on operand stability. It can be considered constant memory, and
now we have the means to communicate this to the compiler.

[Side note: this_cpu_read_stable is a misnomer. It should be named
raw_cpu_read_stable to reflect the operand stability required by raw_
versions.]

From the compiler standpoint, "volatiles" are not optional; they
communicate the operand stability state to the compiler. I have reviewed
percpu.h many times, and the current state of operations is correct,
even if they result in the same instruction.

BTW: About that raw_smp_processor_id():

The raw_ version should stay defined as
this_cpu_read(pcpu_hot.cpu_number), while __smp_processor_id() can be
redefined to raw_cpu_read. Please note the usage in
include/linux/smp.h:

#define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
#define put_cpu() preempt_enable()

where preempt_disable and preempt_enable mark the boundaries of the
critical section.
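
A tiny usage sketch (process_on_cpu() is a made-up helper):

        extern void process_on_cpu(int cpu);    /* hypothetical helper */

        void example(void)
        {
                int cpu = get_cpu();    /* preempt_disable() + __smp_processor_id() */
                process_on_cpu(cpu);    /* cpu stays valid while preemption is off */
                put_cpu();              /* preempt_enable() */
        }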

Uros.

> See commit b59167ac7baf ("x86/percpu: Fix this_cpu_read()").
>
> In fact, that commit went overboard, and just added "volatile" to
> *every* percpu read.
>
> So then people complained about *that*, and PeterZ did commit
> 0b9ccc0a9b14 ("x86/percpu: Differentiate this_cpu_{}() and
> __this_cpu_{}()"), which basically made that "qual or not" be a macro
> choice.
>
> And in the process, it now got added to all the RMW ops, that didn't
> actually need it or want it in the first place, since they won't be
> CSE'd, since they depend on the input.
>
> So that commit basically generalized the whole thing entirely
> pointlessly, and caused your current confusion.
>
> End result: we should remove 'volatile' from the RMW ops. It doesn't
> do anything on x86. All it does is leave us with two subtly different
> versions whose difference we don't care about.
>
> End result two: we should make it clear that "this_cpu_read()" vs
> "raw_cpu_read()" are *NOT* about interrupts. Even on architectures
> where the RMW ops need to have irq protection (so that they are atomic
> wrt interrupts also modifying the value), the *READ* operation
> obviously has no such issue.
>
> For the raw_cpu_read() vs this_cpu_read() case, the only issue is
> whether you can CSE the result.
>
> And in 99% of all cases, you can - and want to - CSE it. But as that
> commit b59167ac7baf shows, sometimes you cannot.
>
> Side note: the code that caused that problem is this:
>
> __always_inline void __cyc2ns_read(struct cyc2ns_data *data)
> {
> int seq, idx;
>
> do {
> seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
> ...
> } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }
>
> where the issue is that the this_cpu_read() of that sequence number
> needs to be ordered.
>
> Honestly, that code is just buggy and bad. We should never have
> "fixed" it by changing the semantics of this_cpu_read() in the first
> place.
>
> The problem is that it re-implements its own locking model, and as so
> often happens when people do that, they do it completely wrongly.
>
> Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice
> how in raw_read_seqcount_begin() we have
>
> unsigned _seq = __read_seqcount_begin(s);
> smp_rmb();
>
> because it actually does the proper barriers. Notice how the garbage
> code in __cyc2ns_read() doesn't have them - and how it was buggy as a
> result.
>
> (Also notice how this all predates our "we should use load_acquire()
> instead of smp_rmb()", but whatever).
>
> IOW, all the "volatiles" in the x86 <asm/percpu.h> file are LITERAL
> GARBAGE and should not exist, and are due to a historical mistake.
>
> Linus

2023-10-19 08:47:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:

> Side note: the code that caused that problem is this:
>
> __always_inline void __cyc2ns_read(struct cyc2ns_data *data)
> {
> int seq, idx;
>
> do {
> seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
> ...
> } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }
>
> where the issue is that the this_cpu_read() of that sequence number
> needs to be ordered.

I have very vague memories of other code also relying on this_cpu_read()
implying READ_ONCE().

And that code really is only buggy if you do not have that. Since it is
cpu-local, smp_rmb() would be confusing, as would smp_load_acquire()
-- there is no cross-CPU data ordering.

The other option is of course adding an explicit barrier(), but that's
entirely superfluous when all the loads are READ_ONCE().


If you want to make this_cpu_read() not imply READ_ONCE(), then we
should go audit all users :/ Can be done ofc.

2023-10-19 08:55:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:

> Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice
> how in raw_read_seqcount_begin() we have
>
> unsigned _seq = __read_seqcount_begin(s);
> smp_rmb();
>
> because it actually does the proper barriers. Notice how the garbage
> code in __cyc2ns_read() doesn't have them - and how it was buggy as a
> result.
>
> (Also notice how this all predates our "we should use load_acquire()
> instead of smp_rmb()", but whatever).

seqlock actually wants rmb even today, the pattern is:

do {
seq = load-seq
rmb
load-data
rmb
} while (seq != re-load-seq)

we specifically only care about loads, and the data loads must be
between the sequence number loads.

As such, load-acquire is not a natural match.
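
Spelled out as code, a minimal reader sketch of that pattern (the
struct and field names are illustrative, and the real seqlock code
also retries while a writer holds the count odd):

        struct demo_seq {               /* illustrative, not the seqlock.h types */
                unsigned int sequence;
                unsigned long data;
        };

        unsigned long demo_read(struct demo_seq *s)
        {
                unsigned int seq;
                unsigned long snapshot;

                do {
                        seq = READ_ONCE(s->sequence);
                        smp_rmb();                      /* data loads stay after the seq load */
                        snapshot = READ_ONCE(s->data);
                        smp_rmb();                      /* re-check stays after the data loads */
                } while (seq != READ_ONCE(s->sequence));

                return snapshot;
        }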

2023-10-19 09:09:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:
> On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
> >
> > The ones in "raw" form are not IRQ safe and these are implemented
> > without volatile qualifier.
>
> You are misreading it.
>
> Both *are* irq safe - on x86.

Stronger, x86 arch code very much relies on them being NMI-safe. Which
makes the generic implementation insufficient.

They *must* be single RmW instructions on x86.

2023-10-19 09:24:09

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 11:09 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:
> > On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
> > >
> > > The ones in "raw" form are not IRQ safe and these are implemented
> > > without volatile qualifier.
> >
> > You are misreading it.
> >
> > Both *are* irq safe - on x86.
>
> Stronger, x86 arch code very much relies on them being NMI-safe. Which
> makes the generic implementation insufficient.
>
> They *must* be single RmW instructions on x86.

Maybe I should rephrase my quoted claim above:

"raw" versions are not needed to be IRQ safe [*].

[*] Memory arguments need to be stable, so IRQ and NMI handlers must
not change them while the "raw" version operates inside its critical
section. When memory arguments are stable, the compiler can omit
(cache) reads of the arguments, or re-read them (rematerialize) from
memory. The atomicity of the operation is irrelevant in the "raw"
context, so implementing raw_percpu_xchg_op using raw_cpu_read/write
is OK in this context.
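
A usage sketch of that rule (pcpu_count is a made-up per-CPU variable):

        static DEFINE_PER_CPU(int, pcpu_count);        /* made-up per-CPU variable */

        void mark_once(void)
        {
                unsigned long flags;

                local_irq_save(flags);                  /* memory operand is now stable */
                if (raw_cpu_read(pcpu_count) == 0)      /* relaxed: may be CSE'd/re-read */
                        raw_cpu_write(pcpu_count, 1);
                local_irq_restore(flags);               /* end of the critical section */
        }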

Uros.

2023-10-19 16:32:56

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:22 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <[email protected]> wrote:
> >
> > This pach works for me:
>
> Looks fine.
>
> But you actually bring up another issue:
>
> > BTW: I also don't understand the comment from include/linux/smp.h:
> >
> > /*
> > * Allow the architecture to differentiate between a stable and unstable read.
> > * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> > * regular asm read for the stable.
>
> I think the comment is badly worded, but I think the issue may actually be real.
>
> One word: rematerialization.
>
> The thing is, turning inline asm accesses to regular compiler loads
> has a *very* bad semantic problem: the compiler may now feel like it
> can not only combine the loads (ok), but also possibly rematerialize
> values by re-doing the loads (NOT OK!).
>
> IOW, the kernel often has very strict requirements of "at most once"
> behavior, because doing two loads might give different results.
>
> The cpu number is a good example of this.
>
> And yes, sometimes we use actual volatile accesses for them
> (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> and are much too strict. Not only does gcc generally lose its mind
> when it sees volatile (ie it stops doing various sane combinations
> that would actually be perfectly valid), but it obviously also stops
> doing CSE on the loads (as it has to).
>
> So the "non-volatile asm" has been a great way to get the "at most
> one" behavior: it's safe wrt interrupts changing the value, because
> you will see *one* value, not two. As far as we know, gcc never
> rematerializes the output of an inline asm. So when you use an inline
> asm, you may have the result CSE'd, but you'll never see it generate
> more than *one* copy of the inline asm.
>
> (Of course, as with so much about inline asm, that "knowledge" is not
> necessarily explicitly spelled out anywhere, and it's just "that's how
> it has always worked").

Perhaps you will be interested in chapter 6.47.2.1, "Volatile", of the
GCC manual, which says:

" Under certain circumstances, GCC may duplicate (or remove duplicates
of) your assembly code when optimizing."

The compiler may re-materialize non-volatile asm in the same way it
may re-materialize arguments from non-volatile memory. To avoid this,
volatile asm is necessary when unstable memory arguments are accessed
using this_* variants.

Uros.

2023-10-19 17:00:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 00:04, Uros Bizjak <[email protected]> wrote:
>
> Let me explain how the compiler handles volatile.

We're talking past each other.

You are talking about the volatile *memory* ops, and the
difference that "raw" vs "this" would cause with and without the
"volatile".

While *I* am now convinced that the memory ops aren't even an option,
because they will generate worse code, because pretty much all users
use the "this" version (which would have to use volatile),

Because if we just stick with inline asms, the need for "volatile"
simply goes away.

The existing volatile on those percpu inline asms is *wrong*. It's a
historical mistake.

And with just a plain non-volatile inline asm, the inline asm wins.

It doesn't have the (bad) read-once behavior of a volatile memory op.

And it also doesn't have the (horrible correctness issue)
rematerialization behavior of a non-volatile memory op.

A compiler that were to rematerialize an inline asm (instead of
spilling) would be a bad joke. That's not an optimization, that's just
a crazy bad compiler with a code generation bug.

Linus

2023-10-19 17:06:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 01:54, Peter Zijlstra <[email protected]> wrote:
>
> seqlock actually wants rmb even today, the pattern is:

That's the pattern today, and yes, it makes superficial sense, but no,
it's not a "really wants" issue.

> do {
> seq = load-seq
> rmb
> load-data
> rmb
> } while (seq != re-load-seq)
>
> we specifically only care about loads, and the data loads must be
> between the sequence number loads.
>
> As such, load-acquire is not a natural match.

You are correct that "rmb" _sounds_ logical. We do, after all, want to
order reads wrt each other.

So rmb is the obvious choice.

But the thing is, "read_acquire" actually does that too.

So if you do

seq = load_acquire(orig_seq);
load-data

then that acquire actually makes that first 'rmb' pointless. Acquire
already guarantees that all subsequent memory operations are ordered
wrt that read.

And 'acquire' is likely faster than 'rmb' on sane modern architectures.

On x86 it doesn't matter (rmb is a no-op, and all loads are acquires).

But on arm64, for example, you can do a 'ld.acq' in one instruction
and you're done - while a rmb then ends up being a barrier (ok, the
asm mnemonics are horrible: it's not "ld.acq", it's "ldar", but
whatever - I like arm64 as an architecture, but I think they made the
standard assembly syntax pointlessly and actively hostile to humans).

Of course then microarchitectures may end up doing basically the same
thing, but at least technically the 'load acquire' is likely more
targeted and more optimized.

The second rmb is then harder to change, and that is going to stay an
rmb (you could say "do an acquire on the last data load", but that
doesn't fit the sane locking semantics of a sequence lock).

But I do think our sequence counters would be better off using
"smp_load_acquire()" for that initial read.

Of course, then the percpu case doesn't care about the SMP ordering,
but it should still use an UP barrier to make sure things don't get
re-ordered. Relying on our "percpu_read()" ordering other reads around
it is *wrong*.

Linus

2023-10-19 17:09:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 09:32, Uros Bizjak <[email protected]> wrote:
>
> Perhaps you will be interested in chapter 6.47.2.1, "Volatile", of the
> GCC manual, which says:
>
> " Under certain circumstances, GCC may duplicate (or remove duplicates
> of) your assembly code when optimizing."
>
> The compiler may re-materialize non-volatile asm in the same way it
> may re-materialize arguments from non-volatile memory. To avoid this,
> volatile asm is necessary when unstable memory arguments are accessed
> using this_* variants.

That's disgusting. The whole (and only) point of rematerialization is
as an optimization. When gcc doesn't know how expensive an inline asm
is (and they can be very expensive indeed), doing remat on it would be
an obvious mistake.

I think what you say is "we're technically allowed to do it, but we'd
be crazy to *actually* do it".

The kernel does require a sane compiler. We will turn off options that
make it do insane things - even when said insane things are allowed by
a bad standard.

Do you actually have gcc code that rematerializes an inline asm?

Linus

2023-10-19 17:21:50

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 7:00 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 00:04, Uros Bizjak <[email protected]> wrote:
> >
> > Let me explain how the compiler handles volatile.
>
> We're talking past each other.
>
> You are talking about the volatile *memory* ops, and the
> difference that "raw" vs "this" would cause with and without the
> "volatile".
>
> While *I* am now convinced that the memory ops aren't even an option,
> because they will generate worse code, because pretty much all users
> use the "this" version (which would have to use volatile),

Please see [1]. Even with volatile access, with memory ops the
compiler can propagate operands, resulting in an ~8k code size reduction
and many hundreds (if not thousands) of MOVs propagated into subsequent
instructions. Please note the many code examples in [1]. This is not
possible with the asm variant.

[1] https://lore.kernel.org/lkml/[email protected]/

> Because if we just stick with inline asms, the need for "volatile"
> simply goes away.

No, the compiler is then free to remove or duplicate the asm (plus
other unwanted optimizations); please see the end of chapter 6.47.2.1
in [2].

[2] https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/Extended-Asm.html#Volatile-1

> The existing volatile on those percpu inline asms is *wrong*. It's a
> historical mistake.

Please see above.

> And with just a plain non-volatile inline asm, the inline asm wins.

Please see [1] for the code propagation argument.

> It doesn't have the (bad) read-once behavior of a volatile memory op.
>
> And it also doesn't have the (horrible correctness issue)
> rematerialization behavior of a non-volatile memory op.

Unfortunately, it does. Without volatile, an asm can be rematerialized in
the same way as it can be CSEd. OTOH, the memory op in the memory-ops
approach is cast to volatile in the this_* case, so it for sure won't
get rematerialized.

> A compiler that were to rematerialize an inline asm (instead of
> spilling) would be a bad joke. That's not an optimization, that's just
> a crazy bad compiler with a code generation bug.

But that is what the compiler does without volatile.

Thanks,
Uros.

2023-10-19 18:07:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 10:21, Uros Bizjak <[email protected]> wrote:
>
> > A compiler that were to rematerialize an inline asm (instead of
> > spilling) would be a bad joke. That's not an optimization, that's just
> > a crazy bad compiler with a code generation bug.
>
> But that is what the compiler does without volatile.

Do you actually have a real case of that, or are you basing it purely off
insane documentation?

Because remat of inline asm really _is_ insane.

Linus

2023-10-19 18:14:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 10:04:56AM -0700, Linus Torvalds wrote:

> So if you do
>
> seq = load_acquire(orig_seq);
> load-data
>
> then that acquire actually makes that first 'rmb' pointless. Acquire
> already guarantees that all subsequent memory operations are ordered
> wrt that read.
>
> And 'acquire' is likely faster than 'rmb' on sane modern architectures.
>
> On x86 it doesn't matter (rmb is a no-op, and all loads are acquires).
>
> But on arm64, for example, you can do a 'ld.acq' in one instruction
> and you're done - while a rmb then ends up being a barrier (ok, the
> asm mnemonics are horrible: it's not "ld.acq", it's "ldar", but
> whatever - I like arm64 as an architecture, but I think they made the
> standard assembly syntax pointlessly and actively hostile to humans).
>
> Of course then microarchitectures may end up doing basically the same
> thing, but at least technically the 'load acquire' is likely more
> targeted and more optimized.

Sure, acquire should work fine here.

> The second rmb is then harder to change, and that is going to stay an
> rmb ( you could say "do an acquire on the last data load, but that
> doesn't fit the sane locking semantics of a sequence lock).

Wouldn't even work, acquire allows an earlier load to pass it. It only
constrains later loads to not happen before it.

> Of course, then the percpu case doesn't care about the SMP ordering,
> but it should still use an UP barrier to make sure things don't get
> re-ordered. Relying on our "percpu_read()" ordering other reads around
> it is *wrong*.

I'm happy to put barrier() in there if it makes you feel better.

But are you really saying this_cpu_read() should not imply READ_ONCE()?

If so, we should probably go audit a ton of code :/

2023-10-19 18:17:29

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 8:06 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 10:21, Uros Bizjak <[email protected]> wrote:
> >
> > > A compiler that were to rematerialize an inline asm (instead of
> > > spilling) would be a bad joke. That's not an optimization, that's just
> > > a crazy bad compiler with a code generation bug.
> >
> > But that is what the compiler does without volatile.
>
> Do you actually have a real case of that, or are you basing it purely off
> insane documentation?
>
> Because remat of inline asm really _is_ insane.

The following testcase pushes the compiler to the limit:

--cut here--
extern void ex (int);

static int read (void)
{
        int ret;

        asm ("# -> %0" : "=r"(ret));
        return ret;
}

int foo (void)
{
        int ret = read ();

        ex (ret);
        asm volatile ("clobber" : : : "ax", "cx", "dx", "bx", "bp", "si", "di");

        return ret;
}

extern int m;

int bar (void)
{
        int ret = m;

        ex (ret);
        asm volatile ("clobber" : : : "ax", "cx", "dx", "bx", "bp", "si", "di");

        return ret;
}
--cut here--

Please compile the above with -S -O2 -m32 (so we don't have to list
all 16 x86_64 registers).

And NO (whee...), there is no rematerialization of the asm (foo()). OTOH,
there is also no rematerialization from non-volatile memory (bar()),
although it would be more optimal than a spill-to/fill-from-frame pair
of moves. I wonder what the "certain circumstances" are that the
documentation is referring to.

Uros.



2023-10-19 18:23:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:14, Peter Zijlstra <[email protected]> wrote:
>
> But are you really saying this_cpu_read() should not imply READ_ONCE()?

Well, Uros is saying that we may be *forced* to have that implication,
much as I really hate it (and wonder at the competence of a compiler
that forces the code-pessimizing 'volatile').

And the "it's not volatile" is actually our historical behavior. The
volatile really is new, and didn't exist before your commit
b59167ac7baf ("x86/percpu: Fix this_cpu_read()").

So the whole "implies READ_ONCE()" really seems to be due to that
*one* mistake in our percpu sequence locking code.

Yes, it's been that way for 5 years now, but it was the other way
around for the preceding decade....

Linus

2023-10-19 18:38:04

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 8:22 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 11:14, Peter Zijlstra <[email protected]> wrote:
> >
> > But are you really saying this_cpu_read() should not imply READ_ONCE()?
>
> Well, Uros is saying that we may be *forced* to have that implication,
> much as I really hate it (and wonder at the competence of a compiler
> that forces the code-pessimizing 'volatile').

Please note that my patch mitigates exactly this. The propagation of
volatile(!) arguments allows huge instruction and code savings. By
using non-volatile asm, a very limited BB CSE can perhaps remove a few
asms. However, if there is no READ_ONCE requirement, then we can
simply remove "volatile" qualification for this_cpu_read from the
memory-ops patch. It will be like a field trip for the compiler,
because *then* it will be able to optimize everything without
limitations.

Uros.

> And the "it's not volatile" is actually our historical behavior. The
> volatile really is new, and didn't exist before your commit
> b59167ac7baf ("x86/percpu: Fix this_cpu_read()").
>
> So the whole "implies READ_ONCE()" really seems to be due to that
> *one* mistake in our percpu sequence locking code.
>
> Yes, it's been that way for 5 years now, but it was the other way
> around for the preceding decade....
>
> Linus

2023-10-19 18:50:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:16, Uros Bizjak <[email protected]> wrote:
>
> And NO (whee...), there is no rematerialization of asm (foo() ). OTOH,
> there is also no rematerialization from non-volatile memory (bar() ),
> although it would be more optimal than spill to/fill from a frame pair
> of moves. I wonder what are "certain circumstances" that the
> documentation is referring to.

Honestly, I've actually never seen gcc rematerialize anything at all.

I really only started worrying about remat issues in a theoretical
sense, and because I feel it would be relatively *easy* to do for
something where the source is a load.

Linus

2023-10-19 19:08:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:49, Linus Torvalds
<[email protected]> wrote:
>
> Honestly, I've actually never seen gcc rematerialize anything at all.
>
> I really only started worrying about remat issues in a theoretical
> sense, and because I feel it would be relatively *easy* to do for
> something where the source is a load.

.. I started looking around, since I actually have gcc sources around.

At least lra-remat.cc explicitly says

o no any memory (as access to memory is non-profitable)

so if we could just *rely* on that, it would actually allow us to use
memory ops without the volatile.

That would be the best of all worlds, of course.

I do have clang sources too, but I've looked at gcc enough that I at
least can do the "grep and look for patterns" and tend to have an idea
of what the passes are. Clang, not so much.

From my "monkey see patterns" check, it does look like clang mainly
just rematerializes immediates (and some address generation), but also
memory accesses without a base register (allowing a "base register" of
(%rip)).

See X86InstrInfo::isReallyTriviallyReMaterializable() in case anybody cares.

So it does look like clang might actually rematerialize exactly percpu
loads with a constant address, and my "those are easy to
rematerialize" worry may have been correct.

HOWEVER, I'd like to once again say that I know so little about llvm
that my "monkey with 'grep' and some pattern matching ability" thing
really means that I'm just guessing.

Now, PeterZ is obviously worried about even just CSE and re-ordering,
so remat isn't the *only* issue. I do agree that we've had 'volatile'
on many of the asms possibly hiding any issues for the last five
years.

Linus

2023-10-19 21:07:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:16, Uros Bizjak <[email protected]> wrote:
>
> I wonder what are "certain circumstances" that the
> documentation is referring to.

Looking more at that "under certain circumstances" statement, I
actually think it refers even to the situation *with* "asm volatile".

In particular, when doing loop unrolling, gcc will obviously duplicate
the asm (both with and without volatile). That would obviously lead to
exactly the kinds of problems that snippet of documentation then talks
about:

"This can lead to unexpected duplicate symbol errors during
compilation if your asm code defines symbols or labels"

so that makes complete sense. It also matches up with the fact that
this is all actually documented very much under the "volatile" label -
ie this is a generic thing that happens even *with* volatile in place,
and we should not expect that "one asm statement" will generate
exactly one copy of the resulting assembler.

It also matches up with the earlier passage about "Note that
the compiler can move even volatile asm instructions relative to other
code, including across jump instructions". So I think what happened is
exactly that somebody was declaring a variable or local label inside
the asm, and then the docs were clarified to state that the asm can be
duplicated in the output.
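
A made-up example of that failure mode: if the loop below gets unrolled
(e.g. with -funroll-all-loops), the asm body - and with it the label -
is duplicated, and assembly then fails with something like "symbol
'my_marker' is already defined":

        void emit_marker(void)
        {
                for (int i = 0; i < 4; i++)
                        asm volatile ("my_marker: nop");
        }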

Of course, this is all just by me reading the docs and looking at gcc
output for way too many years. It's not based on any knowledge of the
original issue.

Linus

2023-10-19 22:39:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

Unrelated question to the gcc people (well, related in the way that
this discussion made me *test* this).

Lookie here:

int test(void)
{
        unsigned int sum = 0;
        for (int i = 0; i < 4; i++) {
                unsigned int val;
#if ONE
                asm("magic1 %0":"=r" (val): :"memory");
#else
                asm volatile("magic2 %0":"=r" (val));
#endif
                sum += val;
        }
        return sum;
}

and now build this with

gcc -O2 -S -DONE -funroll-all-loops t.c

and I get a *completely* nonsensical end result. What gcc generates is
literally insane.

What I *expected* to happen was that the two cases (with "-DONE" and
without) would generate the same code, since one has an "asm volatile",
and the other has a memory clobber.

IOW, neither really should be something that can be combined.

But no. The '-DONE" version is completely crazy with my gcc-13.2.1 setup.

First off, it does actually CSE all the asms despite the memory
clobber. Which I find quite debatable, but whatever.

But not only does it CSE them, it then does *not* just multiply the
result by four. No. It generates this insanity:

magic1 %eax
movl %eax, %edx
addl %eax, %eax
addl %edx, %eax
addl %edx, %eax
ret

so it has apparently done the CSE _after_ the other optimizations.

Very strange.

Honestly, the CSE part looks like an obvious bug to me. The gcc
documentation states:

The "memory" clobber tells the compiler that the assembly code
performs memory reads or writes to items other than those listed in
the input and output operands (for example, accessing the memory
pointed to by one of the input parameters).

so CSE'ing any inline asm with a memory clobber sounds *very* dubious.
The asm literally told the compiler that it has side effects in
unrelated memory locations!

I don't think we actually care in the kernel (and yes, I think it
would always be safer to use "asm volatile" if there are unrelated
memory locations that change), but since I was testing this and was
surprised, and since the obvious reading of the documented behavior of
a memory clobber really does scream "you can't combine those asms", I
thought I'd mention this.

Also, *without* the memory clobber, gcc obviously still does CSE the
asm, but also, gcc ends up doing just

magic1 %eax
sall $2, %eax
ret

so the memory clobber clearly does actually make a difference. Just
not a _sane_ one.

In testing, clang does *not* have this apparently buggy behavior (but
clang annoyingly actually checks the instruction mnemonics, so I had
to change "magic" into "strl" instead to make clang happy).

Hmm?

Linus

2023-10-20 07:59:11

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 9:07 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 11:49, Linus Torvalds
> <[email protected]> wrote:
> >
> > Honestly, I've actually never seen gcc rematerialize anything at all.
> >
> > I really only started worrying about remat issues in a theoretical
> > sense, and because I feel it would be relatively *easy* to do for
> > something where the source is a load.
>
> .. I started looking around, since I actually have gcc sources around.
>
> At least lra-remat.cc explicitly says
>
> o no any memory (as access to memory is non-profitable)
>
> so if we could just *rely* on that, it would actually allow us to use
> memory ops without the volatile.
>
> That would be the best of all worlds, of course.

I have made an experiment and changed:

#define __raw_cpu_read(qual, pcp) \
({ \
- *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
+ *(__my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
})

#define __raw_cpu_write(qual, pcp, val) \
do { \
- *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+ *(__my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
} while (0)

Basically, I removed "volatile" from the read/write accessors. With all
the new percpu patches in place, the difference in all percpu accesses is:

Reference: 15990 accesses
Patched: 15976 accesses.

So, the difference is 14 fewer accesses. Waaay too low of a gain for a
potential pain.

The code size savings are:

text data bss dec hex filename
25476129 4389468 808452 30674049 1d40c81 vmlinux-new.o
25476021 4389444 808452 30673917 1d40bfd vmlinux-ref.o

So, 108 bytes for the default build.

Uros.

2023-10-20 08:09:04

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Fri, Oct 20, 2023 at 12:39 AM Linus Torvalds
<[email protected]> wrote:
>
> Unrelated question to the gcc people (well, related in the way that
> this discussion made me *test* this).

Perhaps you should report this in the gcc bugzilla and move the
discussion there. This thread already has more than 100 messages...

Thanks,
Uros.

2023-11-20 09:39:27

by Uros Bizjak

[permalink] [raw]
Subject: Use %a asm operand modifier to obtain %rip-relative addressing

On Thu, Oct 12, 2023 at 7:10 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <[email protected]> wrote:
> >
> > An example:
>
> Oh, I'm convinced.
>
> The fix seems to be a simple one-liner, ie just
>
> - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
>
> and it turns out that we have other places where I think we could use that '%a',
>
> For example, we have things like this:
>
> asm ("lea sme_cmdline_arg(%%rip), %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> and I think the only reason we do that ridiculous asm is that the code
> in question really does want that (%rip) encoding. It sounds like this
> could just do
>
> asm ("lea %a1, %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> instead. Once again, I claim ignorance of the operand modifiers as the
> reason for these kinds of things.

I have looked a bit at the above code. From the compiler PoV, the
above can be written as:

asm ("lea %1, %0"
: "=r" (cmdline_arg)
: "m" (sme_cmdline_arg));

and it will always result in a %rip-relative asm address:

#APP
# 585 "arch/x86/mm/mem_encrypt_identity.c" 1
lea sme_cmdline_arg(%rip), %rsi
# 0 "" 2
#NO_APP

Uros.