2023-10-11 20:42:26

by Uros Bizjak

Subject: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
add the ptr value to the base. This way, the compiler can propagate
addend to the following instruction and simplify address calculation.

E.g.: address calculation in amd_pmu_enable_virt() improves from:

48 c7 c0 00 00 00 00 mov $0x0,%rax
87b7: R_X86_64_32S cpu_hw_events

65 48 03 05 00 00 00 add %gs:0x0(%rip),%rax
00
87bf: R_X86_64_PC32 this_cpu_off-0x4

48 c7 80 28 13 00 00 movq $0x0,0x1328(%rax)
00 00 00 00

to:

65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
00
8798: R_X86_64_PC32 this_cpu_off-0x4
48 c7 80 00 00 00 00 movq $0x0,0x0(%rax)
00 00 00 00
87a6: R_X86_64_32S cpu_hw_events+0x1328

The compiler can also eliminate redundant loads from this_cpu_off,
reducing the number of percpu offset reads (either from this_cpu_off
or with rdgsbase) from 1663 to 1571.

Additionally, the patch introduces an 'rdgsbase' alternative for CPUs with
X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
only decoding in the first decoder etc. But we're talking single-cycle
kind of effects, and the rdgsbase case should be much better from
a cache perspective and might use fewer memory pipeline resources to
offset the fact that it uses an unusual front end decoder resource...

The only drawback of the patch is larger binary size:

text data bss dec hex filename
25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o

The text size increases by 31k (0.123%), due to the 1578 rdgsbase
altinstructions placed in the text section. The increase in text size
is not "real", though - the 'rdgsbase' instruction should be smaller
than a 'mov %gs' one; binary size increases because we obviously carry
two instructions per site, but the actual *executable* part likely
stays the same, and it's just the altinstruction metadata that grows.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
---
arch/x86/include/asm/percpu.h | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..e047a0bc5554 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,18 +49,32 @@
#define __force_percpu_prefix "%%"__stringify(__percpu_seg)":"
#define __my_cpu_offset this_cpu_read(this_cpu_off)

-/*
- * Compared to the generic __my_cpu_offset version, the following
- * saves one instruction and avoids clobbering a temp register.
- */
+#ifdef CONFIG_X86_64
+#define arch_raw_cpu_ptr(ptr) \
+({ \
+ unsigned long tcp_ptr__; \
+ asm (ALTERNATIVE("movq " __percpu_arg(1) ", %0", \
+ "rdgsbase %0", \
+ X86_FEATURE_FSGSBASE) \
+ : "=r" (tcp_ptr__) \
+ : "m" (__my_cpu_var(this_cpu_off))); \
+ \
+ tcp_ptr__ += (unsigned long)(ptr); \
+ (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
+})
+#else /* CONFIG_X86_64 */
#define arch_raw_cpu_ptr(ptr) \
({ \
unsigned long tcp_ptr__; \
- asm ("add " __percpu_arg(1) ", %0" \
+ asm ("movl " __percpu_arg(1) ", %0" \
: "=r" (tcp_ptr__) \
- : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr)); \
+ : "m" (__my_cpu_var(this_cpu_off))); \
+ \
+ tcp_ptr__ += (unsigned long)(ptr); \
(typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
})
+#endif /* CONFIG_X86_64 */
+
#else /* CONFIG_SMP */
#define __percpu_seg_override
#define __percpu_prefix ""
--
2.41.0


2023-10-13 16:05:09

by Sean Christopherson

Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

On Wed, Oct 11, 2023, Uros Bizjak wrote:
> Additionally, the patch introduces an 'rdgsbase' alternative for CPUs with
> X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> only decoding in the first decoder etc. But we're talking single-cycle
> kind of effects, and the rdgsbase case should be much better from
> a cache perspective and might use fewer memory pipeline resources to
> offset the fact that it uses an unusual front end decoder resource...

The switch to RDGSBASE should be a separate patch, and should come with actual
performance numbers.

A significant percentage of data accesses in Intel's TDX-Module[*] use this
pattern, e.g. even global data is relative to GS.base in the module due to its rather
odd and restricted environment. Back in the early days of TDX, the module used
RD{FS,GS}BASE instead of prefixes to get pointers to per-CPU and global data
structures in the TDX-Module. It's been a few years so I forget the exact numbers,
but at the time a single transition between guest and host would have something
like ~100 reads of FS.base or GS.base. Switching from RD{FS,GS}BASE to prefixed
accesses reduced the latency for a guest<->host transition through the TDX-Module
by several thousand cycles, as every RD{FS,GS}BASE had a latency of ~18 cycles
(again, going off 3+ year old memories).

The TDX-Module code is pretty much a pathological worst-case scenario, but I
suspect its usage is very similar to most usage of raw_cpu_ptr(), e.g. get a
pointer to some data structure and then do multiple reads/writes from/to that
data structure.

The other wrinkle with RD{FS,GS}BASE is that they are trivially easy to emulate.
If a hypervisor/VMM is advertising FSGSBASE even when it's not supported by
hardware, e.g. to migrate VMs to older hardware, then every RDGSBASE will end up
taking a few thousand cycles (#UD -> VM-Exit -> emulate). I would be surprised
if any hypervisor actually does this as it would be easier/smarter to simply not
advertise FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
doesn't support emulating RD{FS,GS}BASE. But at the same time, the whole reason
I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was because I had
hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM TDX development on older
hardware. I.e. it's not impossible that this code could run on hardware where
RDGSBASE is emulated in software.

[*] https://www.intel.com/content/www/us/en/download/738875/intel-trust-domain-extension-intel-tdx-module.html

2023-10-13 19:31:24

by Uros Bizjak

Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

On Fri, Oct 13, 2023 at 6:04 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Oct 11, 2023, Uros Bizjak wrote:
> > Additionally, the patch introduces an 'rdgsbase' alternative for CPUs with
> > X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> > only decoding in the first decoder etc. But we're talking single-cycle
> > kind of effects, and the rdgsbase case should be much better from
> > a cache perspective and might use fewer memory pipeline resources to
> > offset the fact that it uses an unusual front end decoder resource...
>
> The switch to RDGSBASE should be a separate patch, and should come with actual
> performance numbers.

This *is* the patch to switch to RDGSBASE. The propagation of
arguments is a nice side-effect of the patch, due to the explicit
addition of the offset addend to the %gs base. This patch is an
alternative implementation of [1].

[1] x86/percpu: Use C for arch_raw_cpu_ptr(),
https://lore.kernel.org/lkml/[email protected]/

Unfortunately, I have no idea how to measure the impact of such a
low-level feature, so I'll at least need some guidance. The "gut
feeling" says that a special instruction, intended to support the
feature, is always better than emulating said feature with a memory
access.

> A significant percentage of data accesses in Intel's TDX-Module[*] use this
> pattern, e.g. even global data is relative to GS.base in the module due to its rather
> odd and restricted environment. Back in the early days of TDX, the module used
> RD{FS,GS}BASE instead of prefixes to get pointers to per-CPU and global data
> structures in the TDX-Module. It's been a few years so I forget the exact numbers,
> but at the time a single transition between guest and host would have something
> like ~100 reads of FS.base or GS.base. Switching from RD{FS,GS}BASE to prefixed
> accesses reduced the latency for a guest<->host transition through the TDX-Module
> by several thousand cycles, as every RD{FS,GS}BASE had a latency of ~18 cycles
> (again, going off 3+ year old memories).
>
> The TDX-Module code is pretty much a pathological worst-case scenario, but I
> suspect its usage is very similar to most usage of raw_cpu_ptr(), e.g. get a
> pointer to some data structure and then do multiple reads/writes from/to that
> data structure.
>
> The other wrinkle with RD{FS,GS}BASE is that they are trivially easy to emulate.
> If a hypervisor/VMM is advertising FSGSBASE even when it's not supported by
> hardware, e.g. to migrate VMs to older hardware, then every RDGSBASE will end up
> taking a few thousand cycles (#UD -> VM-Exit -> emulate). I would be surprised
> if any hypervisor actually does this as it would be easier/smarter to simply not
> advertise FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
> doesn't support emulating RD{FS,GS}BASE. But at the same time, the whole reason
> I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was because I had
> hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM TDX development on older
> hardware. I.e. it's not impossible that this code could run on hardware where
> RDGSBASE is emulated in software.

There are some other issues when memory access to the percpu area is
implemented with an asm. An ongoing analysis shows that compilers
can't CSE asm over basic-block boundaries; the CSE of asm is a very
simple pattern matching through the BB. So, taking into account all
the presented facts and the noticeable increase in binary size due to
the use of alternatives, I'm leaning towards using C for
arch_raw_cpu_ptr(). Compilers are extremely good at optimizing memory
accesses, so I guess the original patch [1] outweighs the trouble with
RDGSBASE - unless someone proves that RDGSBASE is noticeably *better*
than the current (or a future optimized) implementation.
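
To illustrate (a reduced, standalone sketch, not the actual kernel
macros - this_cpu_off here is just a stand-in extern, and via_c()
hand-waves what a __seg_gs-style access would look like):

extern unsigned long this_cpu_off;

unsigned long via_asm(int cond)
{
        unsigned long off;

        if (cond) {
                /* first copy of the load */
                asm ("movq %%gs:%1, %0" : "=r" (off) : "m" (this_cpu_off));
                return off + 8;
        }
        /* second copy: different basic block, so it is not merged */
        asm ("movq %%gs:%1, %0" : "=r" (off) : "m" (this_cpu_off));
        return off + 16;
}

unsigned long via_c(int cond)
{
        /* plain C load through the %gs address space: the compiler sees
           it and is free to emit it once and hoist it over the branch */
        unsigned long off = *(unsigned long __seg_gs *)(unsigned long)&this_cpu_off;

        return off + (cond ? 8 : 16);
}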

> [*] https://www.intel.com/content/www/us/en/download/738875/intel-trust-domain-extension-intel-tdx-module.html

Thanks,
Uros.

2023-10-13 20:08:29

by Linus Torvalds

Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

On Fri, 13 Oct 2023 at 12:30, Uros Bizjak <[email protected]> wrote:
>
> There are some other issues when memory access to the percpu area is
> implemented with an asm. An ongoing analysis shows that compilers
> can't CSE asm over basic-block boundaries; the CSE of asm is a very
> simple pattern matching through the BB.

Ahh. That explains the odd "partial CSE".

I was assuming that CSE was done on the SSA level with some dominance
analysis. Which presumably all the load simplification ends up doing.

Linus

2023-10-13 21:02:47

by Sean Christopherson

Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

On Fri, Oct 13, 2023, Uros Bizjak wrote:
> On Fri, Oct 13, 2023 at 6:04 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Oct 11, 2023, Uros Bizjak wrote:
> > > Additionally, the patch introduces an 'rdgsbase' alternative for CPUs with
> > > X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> > > only decoding in the first decoder etc. But we're talking single-cycle
> > > kind of effects, and the rdgsbase case should be much better from
> > > a cache perspective and might use fewer memory pipeline resources to
> > > offset the fact that it uses an unusual front end decoder resource...
> >
> > The switch to RDGSBASE should be a separate patch, and should come with actual
> > performance numbers.
>
> This *is* the patch to switch to RDGSBASE. The propagation of
> arguments is a nice side-effect of the patch, due to the explicit
> addition of the offset addend to the %gs base. This patch is an
> alternative implementation of [1].
>
> [1] x86/percpu: Use C for arch_raw_cpu_ptr(),
> https://lore.kernel.org/lkml/[email protected]/

Me confused, can't you first switch to MOV with tcp_ptr__ += (unsigned long)(ptr),
and then introduce the RDGSBASE alternative?
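
I.e. something like the below as patch 1 - rough sketch, it's just the
64-bit macro from this posting with the ALTERNATIVE() dropped - and then
patch 2 only swaps the MOV for the RDGSBASE alternative, with its own
performance numbers:

#define arch_raw_cpu_ptr(ptr)                                   \
({                                                              \
        unsigned long tcp_ptr__;                                \
        asm ("movq " __percpu_arg(1) ", %0"                     \
             : "=r" (tcp_ptr__)                                 \
             : "m" (__my_cpu_var(this_cpu_off)));               \
                                                                \
        tcp_ptr__ += (unsigned long)(ptr);                      \
        (typeof(*(ptr)) __kernel __force *)tcp_ptr__;           \
})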

> Unfortunately, I have no idea how to measure the impact of such a
> low-level feature, so I'll at least need some guidance. The "gut
> feeling" says that a special instruction, intended to support the
> feature, is always better than emulating said feature with a memory
> access.

AIUI, {RD,WR}{FS,GS}BASE were added as faster alternatives to {RD,WR}MSR, not to
accelerate actual accesses to per-CPU data, TLS, etc. E.g. loading a 64-bit base
via a MOV to FS/GS is impossible. And presumably saving a userspace-controlled
value by actually accessing FS/GS is dangerous for one reason or another.
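
Roughly, the instructions exist so the kernel can do this (a sketch
using the usual kernel helpers boot_cpu_has()/rdmsrl(), not code from
this patch) without going through the MSR:

static inline unsigned long read_gs_base(void)
{
        unsigned long base;

        if (boot_cpu_has(X86_FEATURE_FSGSBASE))
                /* direct read of GS.base, gated by CR4.FSGSBASE */
                asm volatile ("rdgsbase %0" : "=r" (base));
        else
                /* legacy path: read the base via the MSR */
                rdmsrl(MSR_GS_BASE, base);

        return base;
}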

The instructions are guarded by a CR4 bit, so the ucode cost just to check
CR4.FSGSBASE is probably non-trivial.

2023-10-14 10:12:16

by Ingo Molnar

Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()


* Uros Bizjak <[email protected]> wrote:

> Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
> add the ptr value to the base. This way, the compiler can propagate
> addend to the following instruction and simplify address calculation.
>
> E.g.: address calculation in amd_pmu_enable_virt() improves from:
>
> 48 c7 c0 00 00 00 00 mov $0x0,%rax
> 87b7: R_X86_64_32S cpu_hw_events
>
> 65 48 03 05 00 00 00 add %gs:0x0(%rip),%rax
> 00
> 87bf: R_X86_64_PC32 this_cpu_off-0x4
>
> 48 c7 80 28 13 00 00 movq $0x0,0x1328(%rax)
> 00 00 00 00
>
> to:
>
> 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
> 00
> 8798: R_X86_64_PC32 this_cpu_off-0x4
> 48 c7 80 00 00 00 00 movq $0x0,0x0(%rax)
> 00 00 00 00
> 87a6: R_X86_64_32S cpu_hw_events+0x1328
>
> The compiler can also eliminate redundant loads from this_cpu_off,
> reducing the number of percpu offset reads (either from this_cpu_off
> or with rdgsbase) from 1663 to 1571.
>
> Additionally, the patch introduces an 'rdgsbase' alternative for CPUs with
> X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> only decoding in the first decoder etc. But we're talking single-cycle
> kind of effects, and the rdgsbase case should be much better from
> a cache perspective and might use fewer memory pipeline resources to
> offset the fact that it uses an unusual front end decoder resource...

So the 'additionally' wording in the changelog should have been a big hint
already that the introduction of RDGSBASE usage needs to be a separate
patch. ;-)

Thanks,

Ingo

2023-10-14 10:34:43

by Uros Bizjak

Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

On Sat, Oct 14, 2023 at 12:04 PM Ingo Molnar <[email protected]> wrote:
>
>
> * Uros Bizjak <[email protected]> wrote:
>
> > Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
> > add the ptr value to the base. This way, the compiler can propagate
> > addend to the following instruction and simplify address calculation.
> >
> > E.g.: address calculation in amd_pmu_enable_virt() improves from:
> >
> > 48 c7 c0 00 00 00 00 mov $0x0,%rax
> > 87b7: R_X86_64_32S cpu_hw_events
> >
> > 65 48 03 05 00 00 00 add %gs:0x0(%rip),%rax
> > 00
> > 87bf: R_X86_64_PC32 this_cpu_off-0x4
> >
> > 48 c7 80 28 13 00 00 movq $0x0,0x1328(%rax)
> > 00 00 00 00
> >
> > to:
> >
> > 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
> > 00
> > 8798: R_X86_64_PC32 this_cpu_off-0x4
> > 48 c7 80 00 00 00 00 movq $0x0,0x0(%rax)
> > 00 00 00 00
> > 87a6: R_X86_64_32S cpu_hw_events+0x1328
> >
> > The compiler can also eliminate redundant loads from this_cpu_off,
> > reducing the number of percpu offset reads (either from this_cpu_off
> > or with rdgsbase) from 1663 to 1571.
> >
> > Additionally, the patch introduces an 'rdgsbase' alternative for CPUs with
> > X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> > only decoding in the first decoder etc. But we're talking single-cycle
> > kind of effects, and the rdgsbase case should be much better from
> > a cache perspective and might use fewer memory pipeline resources to
> > offset the fact that it uses an unusual front end decoder resource...
>
> So the 'additionally' wording in the changelog should have been a big hint
> already that the introduction of RDGSBASE usage needs to be a separate
> patch. ;-)

Indeed. I think that the first part should be universally beneficial,
as it converts

mov symbol, %rax
add %gs:this_cpu_off, %rax

to:

mov %gs:this_cpu_off, %rax
add symbol, %rax

and allows the compiler to propagate the addition into the address
calculation (the latter is also similar to the code generated by the
__seg_gs approach).

At this point, the "experimental" part could either

a) introduce RDGSBASE:

As discussed with Sean, this could be problematic, at least with KVM,
and has some other drawbacks (e.g. larger binary size, limited CSE of
asm).

b) move to the __seg_gs approach via _raw_cpu_read [1]:

This approach solves the "limited CSE with assembly" compiler issue,
since it exposes the load to the compiler, and has greater
optimization potential.

[1] https://lore.kernel.org/lkml/[email protected]/

Unfortunately, these two are mutually exclusive, since RDGSBASE is
implemented as asm.
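
Schematically (hand-waving the exact macros), the percpu base can be
obtained either as

        /* (a) asm: can carry the RDGSBASE alternative, but is opaque to CSE */
        asm (ALTERNATIVE("movq " __percpu_arg(1) ", %0",
                         "rdgsbase %0",
                         X86_FEATURE_FSGSBASE)
             : "=r" (tcp_ptr__)
             : "m" (__my_cpu_var(this_cpu_off)));

or as

        /* (b) a plain __seg_gs C load: freely CSEd by the compiler,
           but unable to express the RDGSBASE alternative */
        tcp_ptr__ = __my_cpu_var(this_cpu_off);

but not both at once.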

To move things forward, I propose to proceed conservatively with the
original patch [1], but this one should be split into two parts. The
first will introduce the switch to MOV with tcp_ptr__ += (unsigned
long)(ptr), and the second will add the __seg_gs part.

At this point, we can experiment with RDGSBASE, and compare it with
both approaches, with and without __seg_gs, by just changing the asm
template to:

+ asm (ALTERNATIVE("mov " __percpu_arg(1) ", %0", \
+ "rdgsbase %0", \
+ X86_FEATURE_FSGSBASE) \

Uros.