2023-10-04 14:51:59

by Uros Bizjak

Subject: [PATCH 0/4] x86/percpu: Use segment qualifiers

This patchset resurrects the work of Richard Henderson [1] and Nadav
Amit [2] to introduce the named address spaces compiler extension [3,4]
into the Linux kernel.

On the x86 target, variables may be declared as being relative to
the %fs or %gs segments.

__seg_fs
__seg_gs

The object is accessed with the respective segment override prefix.
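
As a minimal illustration (not taken from the patches; the function
name and argument are made up), an access through a __seg_gs qualified
pointer compiled with gcc -O2 on x86_64:

	int gs_load(unsigned long off)
	{
		/* the dereference below is emitted as: movl %gs:(%rdi),%eax */
		return *(int __seg_gs *)off;
	}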

The following patchset takes a more cautious approach and converts
only moves, currently implemented as asm, to generic moves to/from the
named address space. The compiler is then able to propagate memory
arguments into the instructions that use these memory references,
producing more compact assembly, in addition to avoiding the use of a
register as a temporary to hold the value from memory.
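
As a hypothetical caller-level illustration (the variable name is made
up; the pattern corresponds to the testl examples below), code such as

	if (this_cpu_read(some_flags) & 0xf0000)
		...

previously had to move the per-cpu value into a register and then test
that register; with a generic move the compiler can instead emit a
single testl against the %gs-based memory operand.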

The patchset enables propagation of hundreds of memory arguments,
resulting in a cumulative code size reduction of 7.94kB (please note
that the kernel is compiled with -O2, so code size is not an entirely
accurate measure; some parts of the code can now be duplicated for
better performance due to -O2, etc.).

Some examples of propagations:

a) into sign/zero extensions:

110b54: 65 0f b6 05 00 00 00 movzbl %gs:0x0(%rip),%eax
11ab90: 65 0f b6 15 00 00 00 movzbl %gs:0x0(%rip),%edx
14484a: 65 0f b7 35 00 00 00 movzwl %gs:0x0(%rip),%esi
1a08a9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
1a08f9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax

4ab29a: 65 48 63 15 00 00 00 movslq %gs:0x0(%rip),%rdx
4be128: 65 4c 63 25 00 00 00 movslq %gs:0x0(%rip),%r12
547468: 65 48 63 1f movslq %gs:(%rdi),%rbx
5474e7: 65 48 63 0a movslq %gs:(%rdx),%rcx
54d05d: 65 48 63 0d 00 00 00 movslq %gs:0x0(%rip),%rcx

b) into compares:

b40804: 65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
b487e8: 65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
b6f14c: 65 f6 05 00 00 00 00 testb $0x1,%gs:0x0(%rip)
bac1b8: 65 f6 05 00 00 00 00 testb $0x1,%gs:0x0(%rip)
df2244: 65 f7 05 00 00 00 00 testl $0xff00,%gs:0x0(%rip)

9a7517: 65 80 3d 00 00 00 00 cmpb $0x0,%gs:0x0(%rip)
b282ba: 65 44 3b 35 00 00 00 cmp %gs:0x0(%rip),%r14d
b48f61: 65 66 83 3d 00 00 00 cmpw $0x8,%gs:0x0(%rip)
b493fe: 65 80 38 00 cmpb $0x0,%gs:(%rax)
b73867: 65 66 83 3d 00 00 00 cmpw $0x8,%gs:0x0(%rip)

c) into other insns:

65ec02: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
6c98ac: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
9aafaf: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
b45868: 65 0f 48 35 00 00 00 cmovs %gs:0x0(%rip),%esi
d276f8: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx

The above propagations result in the following code size
improvements for the current mainline kernel (with the default config),
compiled with:

gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

text data bss dec hex filename
25508862 4386540 808388 30703790 1d480ae vmlinux-vanilla.o
25500922 4386532 808388 30695842 1d461a2 vmlinux-new.o

The conversion of other read-modify-write instructions does not bring
us any benefits; the compiler has some problems when constructing RMW
instructions from the generic code and easily misses some opportunities.
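
As an illustration of such a missed opportunity (hypothetical sketch,
not part of the patches): if an accessor like this_cpu_add() were
converted to plain C along the lines of

	*(typeof(var) __seg_gs *)&(var) += (val);

the compiler might still emit a separate load/add/store sequence
instead of a single addl to the %gs-based memory operand, which is why
the RMW accessors keep their asm implementations for now.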

There are other possible optimizations, involving arch_raw_cpu_ptr and
aggressive caching of current, that are implemented in the original
patch series. These can be added as follow-ups at some later time.

The patchset was tested on Fedora 38 with kernel 6.5.5 and gcc 13.2.1.
(In fact, I'm writing this message on the patched kernel.)

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html
[4] https://clang.llvm.org/docs/LanguageExtensions.html#target-specific-extensions

Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Josh Poimboeuf <[email protected]>

Uros Bizjak (4):
x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip
x86/percpu: Enable named address spaces with known compiler version
x86/percpu: Use compiler segment prefix qualifier
x86/percpu: Use C for percpu read/write accessors

arch/x86/Kconfig | 7 +
arch/x86/include/asm/percpu.h | 237 ++++++++++++++++++++++++++++-----
arch/x86/include/asm/preempt.h | 2 +-
3 files changed, 209 insertions(+), 37 deletions(-)

--
2.41.0


2023-10-04 14:52:05

by Uros Bizjak

Subject: [PATCH 1/4] x86/percpu: Update arch/x86/include/asm/percpu.h to the current tip

This is just a convenience patch that brings the current mainline version
of arch/x86/include/asm/percpu.h up to the version in the current tip tree.

Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
---
arch/x86/include/asm/percpu.h | 110 ++++++++++++++++++++++++++++++++--
1 file changed, 104 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 34734d730463..20624b80f890 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -210,6 +210,25 @@ do { \
(typeof(_var))(unsigned long) pco_old__; \
})

+#define percpu_try_cmpxchg_op(size, qual, _var, _ovalp, _nval) \
+({ \
+ bool success; \
+ __pcpu_type_##size *pco_oval__ = (__pcpu_type_##size *)(_ovalp); \
+ __pcpu_type_##size pco_old__ = *pco_oval__; \
+ __pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval); \
+ asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]", \
+ __percpu_arg([var])) \
+ CC_SET(z) \
+ : CC_OUT(z) (success), \
+ [oval] "+a" (pco_old__), \
+ [var] "+m" (_var) \
+ : [nval] __pcpu_reg_##size(, pco_new__) \
+ : "memory"); \
+ if (unlikely(!success)) \
+ *pco_oval__ = pco_old__; \
+ likely(success); \
+})
+
#if defined(CONFIG_X86_32) && !defined(CONFIG_UML)
#define percpu_cmpxchg64_op(size, qual, _var, _oval, _nval) \
({ \
@@ -223,26 +242,63 @@ do { \
old__.var = _oval; \
new__.var = _nval; \
\
- asm qual (ALTERNATIVE("leal %P[var], %%esi; call this_cpu_cmpxchg8b_emu", \
+ asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
"cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
: [var] "+m" (_var), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
- "c" (new__.high) \
- : "memory", "esi"); \
+ "c" (new__.high), \
+ "S" (&(_var)) \
+ : "memory"); \
\
old__.var; \
})

#define raw_cpu_cmpxchg64(pcp, oval, nval) percpu_cmpxchg64_op(8, , pcp, oval, nval)
#define this_cpu_cmpxchg64(pcp, oval, nval) percpu_cmpxchg64_op(8, volatile, pcp, oval, nval)
+
+#define percpu_try_cmpxchg64_op(size, qual, _var, _ovalp, _nval) \
+({ \
+ bool success; \
+ u64 *_oval = (u64 *)(_ovalp); \
+ union { \
+ u64 var; \
+ struct { \
+ u32 low, high; \
+ }; \
+ } old__, new__; \
+ \
+ old__.var = *_oval; \
+ new__.var = _nval; \
+ \
+ asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
+ "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+ CC_SET(z) \
+ : CC_OUT(z) (success), \
+ [var] "+m" (_var), \
+ "+a" (old__.low), \
+ "+d" (old__.high) \
+ : "b" (new__.low), \
+ "c" (new__.high), \
+ "S" (&(_var)) \
+ : "memory"); \
+ if (unlikely(!success)) \
+ *_oval = old__.var; \
+ likely(success); \
+})
+
+#define raw_cpu_try_cmpxchg64(pcp, ovalp, nval) percpu_try_cmpxchg64_op(8, , pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg64(pcp, ovalp, nval) percpu_try_cmpxchg64_op(8, volatile, pcp, ovalp, nval)
#endif

#ifdef CONFIG_X86_64
#define raw_cpu_cmpxchg64(pcp, oval, nval) percpu_cmpxchg_op(8, , pcp, oval, nval);
#define this_cpu_cmpxchg64(pcp, oval, nval) percpu_cmpxchg_op(8, volatile, pcp, oval, nval);

+#define raw_cpu_try_cmpxchg64(pcp, ovalp, nval) percpu_try_cmpxchg_op(8, , pcp, ovalp, nval);
+#define this_cpu_try_cmpxchg64(pcp, ovalp, nval) percpu_try_cmpxchg_op(8, volatile, pcp, ovalp, nval);
+
#define percpu_cmpxchg128_op(size, qual, _var, _oval, _nval) \
({ \
union { \
@@ -255,20 +311,54 @@ do { \
old__.var = _oval; \
new__.var = _nval; \
\
- asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call this_cpu_cmpxchg16b_emu", \
+ asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
"cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
: [var] "+m" (_var), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
- "c" (new__.high) \
- : "memory", "rsi"); \
+ "c" (new__.high), \
+ "S" (&(_var)) \
+ : "memory"); \
\
old__.var; \
})

#define raw_cpu_cmpxchg128(pcp, oval, nval) percpu_cmpxchg128_op(16, , pcp, oval, nval)
#define this_cpu_cmpxchg128(pcp, oval, nval) percpu_cmpxchg128_op(16, volatile, pcp, oval, nval)
+
+#define percpu_try_cmpxchg128_op(size, qual, _var, _ovalp, _nval) \
+({ \
+ bool success; \
+ u128 *_oval = (u128 *)(_ovalp); \
+ union { \
+ u128 var; \
+ struct { \
+ u64 low, high; \
+ }; \
+ } old__, new__; \
+ \
+ old__.var = *_oval; \
+ new__.var = _nval; \
+ \
+ asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
+ "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+ CC_SET(z) \
+ : CC_OUT(z) (success), \
+ [var] "+m" (_var), \
+ "+a" (old__.low), \
+ "+d" (old__.high) \
+ : "b" (new__.low), \
+ "c" (new__.high), \
+ "S" (&(_var)) \
+ : "memory"); \
+ if (unlikely(!success)) \
+ *_oval = old__.var; \
+ likely(success); \
+})
+
+#define raw_cpu_try_cmpxchg128(pcp, ovalp, nval) percpu_try_cmpxchg128_op(16, , pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg128(pcp, ovalp, nval) percpu_try_cmpxchg128_op(16, volatile, pcp, ovalp, nval)
#endif

/*
@@ -343,6 +433,9 @@ do { \
#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(1, , pcp, oval, nval)
#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(2, , pcp, oval, nval)
#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_try_cmpxchg_1(pcp, ovalp, nval) percpu_try_cmpxchg_op(1, , pcp, ovalp, nval)
+#define raw_cpu_try_cmpxchg_2(pcp, ovalp, nval) percpu_try_cmpxchg_op(2, , pcp, ovalp, nval)
+#define raw_cpu_try_cmpxchg_4(pcp, ovalp, nval) percpu_try_cmpxchg_op(4, , pcp, ovalp, nval)

#define this_cpu_add_return_1(pcp, val) percpu_add_return_op(1, volatile, pcp, val)
#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(2, volatile, pcp, val)
@@ -350,6 +443,9 @@ do { \
#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_try_cmpxchg_1(pcp, ovalp, nval) percpu_try_cmpxchg_op(1, volatile, pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg_2(pcp, ovalp, nval) percpu_try_cmpxchg_op(2, volatile, pcp, ovalp, nval)
+#define this_cpu_try_cmpxchg_4(pcp, ovalp, nval) percpu_try_cmpxchg_op(4, volatile, pcp, ovalp, nval)

/*
* Per cpu atomic 64 bit operations are only available under 64 bit.
@@ -364,6 +460,7 @@ do { \
#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(8, , pcp, val)
#define raw_cpu_xchg_8(pcp, nval) raw_percpu_xchg_op(pcp, nval)
#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval) percpu_try_cmpxchg_op(8, , pcp, ovalp, nval)

#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
@@ -373,6 +470,7 @@ do { \
#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(8, volatile, pcp, val)
#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(8, volatile, pcp, nval)
#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_try_cmpxchg_8(pcp, ovalp, nval) percpu_try_cmpxchg_op(8, volatile, pcp, ovalp, nval)
#endif

static __always_inline bool x86_this_cpu_constant_test_bit(unsigned int nr,
--
2.41.0

2023-10-04 14:52:12

by Uros Bizjak

Subject: [PATCH 3/4] x86/percpu: Use compiler segment prefix qualifier

From: Nadav Amit <[email protected]>

Using a segment prefix qualifier is cleaner than using a segment prefix
in the inline assembly, and provides the compiler with more information,
telling it that __seg_gs:[addr] is different than [addr] when it
analyzes data dependencies. It also enables various optimizations that
will be implemented in the next patches.

Use segment prefix qualifiers when they are supported. Unfortunately,
gcc does not provide a way to remove segment qualifiers, which is needed
to use typeof() to create local instances of the per-cpu variable. For
this reason, do not use the segment qualifier for per-cpu variables, and
do casting using the segment qualifier instead.
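
In other words, a local temporary created with typeof() would inherit
the segment qualifier, and gcc provides no way to strip it again. The
per-cpu variable therefore keeps its plain type, and the qualifier is
attached only at the point of access, via the casts wrapped in the
helpers added below:

	#define __my_cpu_type(var)	typeof(var) __percpu_seg_override
	#define __my_cpu_ptr(ptr)	(__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
	#define __my_cpu_var(var)	(*__my_cpu_ptr(&var))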

Uros: Improve compiler support detection and update the patch
to the current mainline.

Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
---
arch/x86/include/asm/percpu.h | 68 +++++++++++++++++++++++-----------
arch/x86/include/asm/preempt.h | 2 +-
2 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 20624b80f890..da451202a1b9 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -28,26 +28,50 @@
#include <linux/stringify.h>

#ifdef CONFIG_SMP
+
+#ifdef CONFIG_CC_HAS_NAMED_AS
+
+#ifdef CONFIG_X86_64
+#define __percpu_seg_override __seg_gs
+#else
+#define __percpu_seg_override __seg_fs
+#endif
+
+#define __percpu_prefix ""
+
+#else /* CONFIG_CC_HAS_NAMED_AS */
+
+#define __percpu_seg_override
#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
+
+#endif /* CONFIG_CC_HAS_NAMED_AS */
+
+#define __force_percpu_prefix "%%"__stringify(__percpu_seg)":"
#define __my_cpu_offset this_cpu_read(this_cpu_off)

/*
* Compared to the generic __my_cpu_offset version, the following
* saves one instruction and avoids clobbering a temp register.
*/
-#define arch_raw_cpu_ptr(ptr) \
-({ \
- unsigned long tcp_ptr__; \
- asm ("add " __percpu_arg(1) ", %0" \
- : "=r" (tcp_ptr__) \
- : "m" (this_cpu_off), "0" (ptr)); \
- (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
+#define arch_raw_cpu_ptr(ptr) \
+({ \
+ unsigned long tcp_ptr__; \
+ asm ("add " __percpu_arg(1) ", %0" \
+ : "=r" (tcp_ptr__) \
+ : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr)); \
+ (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
})
-#else
+#else /* CONFIG_SMP */
+#define __percpu_seg_override
#define __percpu_prefix ""
-#endif
+#define __force_percpu_prefix ""
+#endif /* CONFIG_SMP */

+#define __my_cpu_type(var) typeof(var) __percpu_seg_override
+#define __my_cpu_ptr(ptr) (__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
+#define __my_cpu_var(var) (*__my_cpu_ptr(&var))
#define __percpu_arg(x) __percpu_prefix "%" #x
+#define __force_percpu_arg(x) __force_percpu_prefix "%" #x

/*
* Initialized pointers to per-cpu variables needed for the boot
@@ -107,14 +131,14 @@ do { \
(void)pto_tmp__; \
} \
asm qual(__pcpu_op2_##size(op, "%[val]", __percpu_arg([var])) \
- : [var] "+m" (_var) \
+ : [var] "+m" (__my_cpu_var(_var)) \
: [val] __pcpu_reg_imm_##size(pto_val__)); \
} while (0)

#define percpu_unary_op(size, qual, op, _var) \
({ \
asm qual (__pcpu_op1_##size(op, __percpu_arg([var])) \
- : [var] "+m" (_var)); \
+ : [var] "+m" (__my_cpu_var(_var))); \
})

/*
@@ -144,14 +168,14 @@ do { \
__pcpu_type_##size pfo_val__; \
asm qual (__pcpu_op2_##size(op, __percpu_arg([var]), "%[val]") \
: [val] __pcpu_reg_##size("=", pfo_val__) \
- : [var] "m" (_var)); \
+ : [var] "m" (__my_cpu_var(_var))); \
(typeof(_var))(unsigned long) pfo_val__; \
})

#define percpu_stable_op(size, op, _var) \
({ \
__pcpu_type_##size pfo_val__; \
- asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
+ asm(__pcpu_op2_##size(op, __force_percpu_arg(P[var]), "%[val]") \
: [val] __pcpu_reg_##size("=", pfo_val__) \
: [var] "p" (&(_var))); \
(typeof(_var))(unsigned long) pfo_val__; \
@@ -166,7 +190,7 @@ do { \
asm qual (__pcpu_op2_##size("xadd", "%[tmp]", \
__percpu_arg([var])) \
: [tmp] __pcpu_reg_##size("+", paro_tmp__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: : "memory"); \
(typeof(_var))(unsigned long) (paro_tmp__ + _val); \
})
@@ -187,7 +211,7 @@ do { \
__percpu_arg([var])) \
"\n\tjnz 1b" \
: [oval] "=&a" (pxo_old__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: [nval] __pcpu_reg_##size(, pxo_new__) \
: "memory"); \
(typeof(_var))(unsigned long) pxo_old__; \
@@ -204,7 +228,7 @@ do { \
asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]", \
__percpu_arg([var])) \
: [oval] "+a" (pco_old__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: [nval] __pcpu_reg_##size(, pco_new__) \
: "memory"); \
(typeof(_var))(unsigned long) pco_old__; \
@@ -221,7 +245,7 @@ do { \
CC_SET(z) \
: CC_OUT(z) (success), \
[oval] "+a" (pco_old__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: [nval] __pcpu_reg_##size(, pco_new__) \
: "memory"); \
if (unlikely(!success)) \
@@ -244,7 +268,7 @@ do { \
\
asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
"cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
- : [var] "+m" (_var), \
+ : [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -276,7 +300,7 @@ do { \
"cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
CC_SET(z) \
: CC_OUT(z) (success), \
- [var] "+m" (_var), \
+ [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -313,7 +337,7 @@ do { \
\
asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
"cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
- : [var] "+m" (_var), \
+ : [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -345,7 +369,7 @@ do { \
"cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
CC_SET(z) \
: CC_OUT(z) (success), \
- [var] "+m" (_var), \
+ [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -494,7 +518,7 @@ static inline bool x86_this_cpu_variable_test_bit(int nr,
asm volatile("btl "__percpu_arg(2)",%1"
CC_SET(c)
: CC_OUT(c) (oldbit)
- : "m" (*(unsigned long __percpu *)addr), "Ir" (nr));
+ : "m" (*__my_cpu_ptr((unsigned long __percpu *)(addr))), "Ir" (nr));

return oldbit;
}
diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 2d13f25b1bd8..e25b95e7cf82 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -92,7 +92,7 @@ static __always_inline void __preempt_count_sub(int val)
*/
static __always_inline bool __preempt_count_dec_and_test(void)
{
- return GEN_UNARY_RMWcc("decl", pcpu_hot.preempt_count, e,
+ return GEN_UNARY_RMWcc("decl", __my_cpu_var(pcpu_hot.preempt_count), e,
__percpu_arg([var]));
}

--
2.41.0

2023-10-04 14:52:14

by Uros Bizjak

Subject: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

The percpu code mostly uses inline assembly. Using segment qualifiers
allows the use of C code instead, which enables the compiler to perform
various optimizations (e.g. propagation of memory arguments). Convert
percpu read and write accessors to C code, so the memory argument can
be propagated to the instruction that uses this argument.
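
A read accessor thus changes from a "mov" asm wrapper into a plain
dereference of a segment-qualified pointer, as in the __raw_cpu_read()
helper added below (quoted here for reference):

	#define __raw_cpu_read(qual, pcp) \
	({ \
		*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
	})

and the compiler is then free to fold the load into the instruction
that consumes the value, as the examples below show.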

Some examples of propagations:

a) into sign/zero extensions:

110b54: 65 0f b6 05 00 00 00 movzbl %gs:0x0(%rip),%eax
11ab90: 65 0f b6 15 00 00 00 movzbl %gs:0x0(%rip),%edx
14484a: 65 0f b7 35 00 00 00 movzwl %gs:0x0(%rip),%esi
1a08a9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
1a08f9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax

4ab29a: 65 48 63 15 00 00 00 movslq %gs:0x0(%rip),%rdx
4be128: 65 4c 63 25 00 00 00 movslq %gs:0x0(%rip),%r12
547468: 65 48 63 1f movslq %gs:(%rdi),%rbx
5474e7: 65 48 63 0a movslq %gs:(%rdx),%rcx
54d05d: 65 48 63 0d 00 00 00 movslq %gs:0x0(%rip),%rcx

b) into compares:

b40804: 65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
b487e8: 65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
b6f14c: 65 f6 05 00 00 00 00 testb $0x1,%gs:0x0(%rip)
bac1b8: 65 f6 05 00 00 00 00 testb $0x1,%gs:0x0(%rip)
df2244: 65 f7 05 00 00 00 00 testl $0xff00,%gs:0x0(%rip)

9a7517: 65 80 3d 00 00 00 00 cmpb $0x0,%gs:0x0(%rip)
b282ba: 65 44 3b 35 00 00 00 cmp %gs:0x0(%rip),%r14d
b48f61: 65 66 83 3d 00 00 00 cmpw $0x8,%gs:0x0(%rip)
b493fe: 65 80 38 00 cmpb $0x0,%gs:(%rax)
b73867: 65 66 83 3d 00 00 00 cmpw $0x8,%gs:0x0(%rip)

c) into other insns:

65ec02: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
6c98ac: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
9aafaf: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
b45868: 65 0f 48 35 00 00 00 cmovs %gs:0x0(%rip),%esi
d276f8: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx

The above propagations result in the following code size
improvements for the current mainline kernel (with the default config),
compiled with:

gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

text data bss dec hex filename
25508862 4386540 808388 30703790 1d480ae vmlinux-vanilla.o
25500922 4386532 808388 30695842 1d461a2 vmlinux-new.o

The conversion of other read-modify-write instructions does not bring us any
benefits; the compiler has some problems when constructing RMW instructions
from the generic code and easily misses some opportunities.

Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Co-developed-by: Nadav Amit <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
---
arch/x86/include/asm/percpu.h | 65 +++++++++++++++++++++++++++++------
1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index da451202a1b9..60ea7755c0fe 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -400,13 +400,66 @@ do { \
#define this_cpu_read_stable_8(pcp) percpu_stable_op(8, "mov", pcp)
#define this_cpu_read_stable(pcp) __pcpu_size_call_return(this_cpu_read_stable_, pcp)

+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp) \
+({ \
+ *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
+})
+
+#define __raw_cpu_write(qual, pcp, val) \
+do { \
+ *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+} while (0)
+
+#define raw_cpu_read_1(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_1(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_read_2(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_read_4(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_write_1(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_2(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_4(pcp, val) __raw_cpu_write(volatile, pcp, val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_write_8(pcp, val) __raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_8(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_write_8(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#endif
+
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
+
#define raw_cpu_read_1(pcp) percpu_from_op(1, , "mov", pcp)
#define raw_cpu_read_2(pcp) percpu_from_op(2, , "mov", pcp)
#define raw_cpu_read_4(pcp) percpu_from_op(4, , "mov", pcp)
-
#define raw_cpu_write_1(pcp, val) percpu_to_op(1, , "mov", (pcp), val)
#define raw_cpu_write_2(pcp, val) percpu_to_op(2, , "mov", (pcp), val)
#define raw_cpu_write_4(pcp, val) percpu_to_op(4, , "mov", (pcp), val)
+
+#define this_cpu_read_1(pcp) percpu_from_op(1, volatile, "mov", pcp)
+#define this_cpu_read_2(pcp) percpu_from_op(2, volatile, "mov", pcp)
+#define this_cpu_read_4(pcp) percpu_from_op(4, volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val) percpu_to_op(1, volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val) percpu_to_op(2, volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val) percpu_to_op(4, volatile, "mov", (pcp), val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp) percpu_from_op(8, , "mov", pcp)
+#define raw_cpu_write_8(pcp, val) percpu_to_op(8, , "mov", (pcp), val)
+
+#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
+#endif
+
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
#define raw_cpu_add_1(pcp, val) percpu_add_op(1, , (pcp), val)
#define raw_cpu_add_2(pcp, val) percpu_add_op(2, , (pcp), val)
#define raw_cpu_add_4(pcp, val) percpu_add_op(4, , (pcp), val)
@@ -432,12 +485,6 @@ do { \
#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val)
#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val)

-#define this_cpu_read_1(pcp) percpu_from_op(1, volatile, "mov", pcp)
-#define this_cpu_read_2(pcp) percpu_from_op(2, volatile, "mov", pcp)
-#define this_cpu_read_4(pcp) percpu_from_op(4, volatile, "mov", pcp)
-#define this_cpu_write_1(pcp, val) percpu_to_op(1, volatile, "mov", (pcp), val)
-#define this_cpu_write_2(pcp, val) percpu_to_op(2, volatile, "mov", (pcp), val)
-#define this_cpu_write_4(pcp, val) percpu_to_op(4, volatile, "mov", (pcp), val)
#define this_cpu_add_1(pcp, val) percpu_add_op(1, volatile, (pcp), val)
#define this_cpu_add_2(pcp, val) percpu_add_op(2, volatile, (pcp), val)
#define this_cpu_add_4(pcp, val) percpu_add_op(4, volatile, (pcp), val)
@@ -476,8 +523,6 @@ do { \
* 32 bit must fall back to generic operations.
*/
#ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp) percpu_from_op(8, , "mov", pcp)
-#define raw_cpu_write_8(pcp, val) percpu_to_op(8, , "mov", (pcp), val)
#define raw_cpu_add_8(pcp, val) percpu_add_op(8, , (pcp), val)
#define raw_cpu_and_8(pcp, val) percpu_to_op(8, , "and", (pcp), val)
#define raw_cpu_or_8(pcp, val) percpu_to_op(8, , "or", (pcp), val)
@@ -486,8 +531,6 @@ do { \
#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, , pcp, oval, nval)
#define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval) percpu_try_cmpxchg_op(8, , pcp, ovalp, nval)

-#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
#define this_cpu_add_8(pcp, val) percpu_add_op(8, volatile, (pcp), val)
#define this_cpu_and_8(pcp, val) percpu_to_op(8, volatile, "and", (pcp), val)
#define this_cpu_or_8(pcp, val) percpu_to_op(8, volatile, "or", (pcp), val)
--
2.41.0

2023-10-04 14:52:22

by Uros Bizjak

Subject: [PATCH 2/4] x86/percpu: Enable named address spaces with known compiler version

Enable named address spaces only with a known-good compiler version
(GCC 13.1 and later) in order to avoid possible issues with named
address spaces in older compilers. Set CC_HAS_NAMED_AS when the
compiler satisfies the version requirement and set USE_X86_SEG_SUPPORT
to signal when segment qualifiers can be used.

Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
---
v1: Enable support with known compiler version
---
arch/x86/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..3aa73f50dc05 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2388,6 +2388,13 @@ source "kernel/livepatch/Kconfig"

endmenu

+config CC_HAS_NAMED_AS
+ def_bool CC_IS_GCC && GCC_VERSION >= 130100
+
+config USE_X86_SEG_SUPPORT
+ def_bool y
+ depends on CC_HAS_NAMED_AS && SMP
+
config CC_HAS_SLS
def_bool $(cc-option,-mharden-sls=all)

--
2.41.0

2023-10-04 16:38:37

by Ingo Molnar

Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors


* Uros Bizjak <[email protected]> wrote:

> The percpu code mostly uses inline assembly. Using segment qualifiers
> allows to use C code instead, which enables the compiler to perform
> various optimizations (e.g. propagation of memory arguments). Convert
> percpu read and write accessors to C code, so the memory argument can
> be propagated to the instruction that uses this argument.
>
> Some examples of propagations:
>
> a) into sign/zero extensions:
>
> 110b54: 65 0f b6 05 00 00 00 movzbl %gs:0x0(%rip),%eax
> 11ab90: 65 0f b6 15 00 00 00 movzbl %gs:0x0(%rip),%edx
> 14484a: 65 0f b7 35 00 00 00 movzwl %gs:0x0(%rip),%esi
> 1a08a9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
> 1a08f9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
>
> 4ab29a: 65 48 63 15 00 00 00 movslq %gs:0x0(%rip),%rdx
> 4be128: 65 4c 63 25 00 00 00 movslq %gs:0x0(%rip),%r12
> 547468: 65 48 63 1f movslq %gs:(%rdi),%rbx
> 5474e7: 65 48 63 0a movslq %gs:(%rdx),%rcx
> 54d05d: 65 48 63 0d 00 00 00 movslq %gs:0x0(%rip),%rcx

Could you please also quote a 'before' assembly sequence, at least once
per group of propagations?

Ie. readers will be able to see what kind of code generation changes
result in this kind of text size reduction:

> text data bss dec hex filename
> 25508862 4386540 808388 30703790 1d480ae vmlinux-vanilla.o
> 25500922 4386532 808388 30695842 1d461a2 vmlinux-new.o

Thanks,

Ingo

2023-10-04 16:41:28

by Ingo Molnar

Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors


* Ingo Molnar <[email protected]> wrote:

>
> * Uros Bizjak <[email protected]> wrote:
>
> > The percpu code mostly uses inline assembly. Using segment qualifiers
> > allows to use C code instead, which enables the compiler to perform
> > various optimizations (e.g. propagation of memory arguments). Convert
> > percpu read and write accessors to C code, so the memory argument can
> > be propagated to the instruction that uses this argument.
> >
> > Some examples of propagations:
> >
> > a) into sign/zero extensions:
> >
> > 110b54: 65 0f b6 05 00 00 00 movzbl %gs:0x0(%rip),%eax
> > 11ab90: 65 0f b6 15 00 00 00 movzbl %gs:0x0(%rip),%edx
> > 14484a: 65 0f b7 35 00 00 00 movzwl %gs:0x0(%rip),%esi
> > 1a08a9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
> > 1a08f9: 65 0f b6 43 78 movzbl %gs:0x78(%rbx),%eax
> >
> > 4ab29a: 65 48 63 15 00 00 00 movslq %gs:0x0(%rip),%rdx
> > 4be128: 65 4c 63 25 00 00 00 movslq %gs:0x0(%rip),%r12
> > 547468: 65 48 63 1f movslq %gs:(%rdi),%rbx
> > 5474e7: 65 48 63 0a movslq %gs:(%rdx),%rcx
> > 54d05d: 65 48 63 0d 00 00 00 movslq %gs:0x0(%rip),%rcx
>
> Could you please also quote a 'before' assembly sequence, at least once
> per group of propagations?

Ie. for any changes to x86 code generation, please follow the changelog
format of:

7c097ca50d2b ("x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op")

...
Move the load of %rsi outside inline asm, so the compiler can
reuse the value. The code in slub.o improves from:

55ac: 49 8b 3c 24 mov (%r12),%rdi
55b0: 48 8d 4a 40 lea 0x40(%rdx),%rcx
55b4: 49 8b 1c 07 mov (%r15,%rax,1),%rbx
55b8: 4c 89 f8 mov %r15,%rax
55bb: 48 8d 37 lea (%rdi),%rsi
55be: e8 00 00 00 00 callq 55c3 <...>
55bf: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4
55c3: 75 a3 jne 5568 <...>
55c5: ...

0000000000000000 <.altinstr_replacement>:
5: 65 48 0f c7 0f cmpxchg16b %gs:(%rdi)

to:

55ac: 49 8b 34 24 mov (%r12),%rsi
55b0: 48 8d 4a 40 lea 0x40(%rdx),%rcx
55b4: 49 8b 1c 07 mov (%r15,%rax,1),%rbx
55b8: 4c 89 f8 mov %r15,%rax
55bb: e8 00 00 00 00 callq 55c0 <...>
55bc: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4
55c0: 75 a6 jne 5568 <...>
55c2: ...

Where the alternative replacement instruction now uses %rsi:

0000000000000000 <.altinstr_replacement>:
5: 65 48 0f c7 0e cmpxchg16b %gs:(%rsi)

The instruction (effectively a reg-reg move) at 55bb: in the original
assembly is removed. Also, both the CALL and replacement CMPXCHG16B
are 5 bytes long, removing the need for NOPs in the asm code.
...

Thanks,

Ingo

2023-10-04 19:24:48

by Uros Bizjak

Subject: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors

The percpu code mostly uses inline assembly. Using segment qualifiers
allows the use of C code instead, which enables the compiler to perform
various optimizations (e.g. propagation of memory arguments). Convert
percpu read and write accessors to C code, so the memory argument can
be propagated to the instruction that uses this argument.

Some examples of propagations:

a) into sign/zero extensions:

the code improves from:

65 8a 05 00 00 00 00 mov %gs:0x0(%rip),%al
0f b6 c0 movzbl %al,%eax

to:

65 0f b6 05 00 00 00 movzbl %gs:0x0(%rip),%eax
00

and in a similar way for:

movzbl %gs:0x0(%rip),%edx
movzwl %gs:0x0(%rip),%esi
movzbl %gs:0x78(%rbx),%eax

movslq %gs:0x0(%rip),%rdx
movslq %gs:(%rdi),%rbx

b) into compares:

the code improves from:

65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax
a9 00 00 0f 00 test $0xf0000,%eax

to:

65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
00 00 0f 00

and in a similar way for:

testl $0xf0000,%gs:0x0(%rip)
testb $0x1,%gs:0x0(%rip)
testl $0xff00,%gs:0x0(%rip)

cmpb $0x0,%gs:0x0(%rip)
cmp %gs:0x0(%rip),%r14d
cmpw $0x8,%gs:0x0(%rip)
cmpb $0x0,%gs:(%rax)

c) into other insns:

the code improves from:

1a355: 83 fa ff cmp $0xffffffff,%edx
1a358: 75 07 jne 1a361 <...>
1a35a: 65 8b 15 00 00 00 00 mov %gs:0x0(%rip),%edx
1a361:

to:

1a35a: 83 fa ff cmp $0xffffffff,%edx
1a35d: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
1a364: 00

The above propagations result in the following code size
improvements for the current mainline kernel (with the default config),
compiled with:

gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

text data bss dec hex filename
25508862 4386540 808388 30703790 1d480ae vmlinux-vanilla.o
25500922 4386532 808388 30695842 1d461a2 vmlinux-new.o

Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Co-developed-by: Nadav Amit <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
---
v2: Rewrite code examples in the commit message.
---
arch/x86/include/asm/percpu.h | 65 +++++++++++++++++++++++++++++------
1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index da451202a1b9..60ea7755c0fe 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -400,13 +400,66 @@ do { \
#define this_cpu_read_stable_8(pcp) percpu_stable_op(8, "mov", pcp)
#define this_cpu_read_stable(pcp) __pcpu_size_call_return(this_cpu_read_stable_, pcp)

+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp) \
+({ \
+ *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
+})
+
+#define __raw_cpu_write(qual, pcp, val) \
+do { \
+ *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+} while (0)
+
+#define raw_cpu_read_1(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_1(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_read_2(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_read_4(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_write_1(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_2(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_4(pcp, val) __raw_cpu_write(volatile, pcp, val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_write_8(pcp, val) __raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_8(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_write_8(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#endif
+
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
+
#define raw_cpu_read_1(pcp) percpu_from_op(1, , "mov", pcp)
#define raw_cpu_read_2(pcp) percpu_from_op(2, , "mov", pcp)
#define raw_cpu_read_4(pcp) percpu_from_op(4, , "mov", pcp)
-
#define raw_cpu_write_1(pcp, val) percpu_to_op(1, , "mov", (pcp), val)
#define raw_cpu_write_2(pcp, val) percpu_to_op(2, , "mov", (pcp), val)
#define raw_cpu_write_4(pcp, val) percpu_to_op(4, , "mov", (pcp), val)
+
+#define this_cpu_read_1(pcp) percpu_from_op(1, volatile, "mov", pcp)
+#define this_cpu_read_2(pcp) percpu_from_op(2, volatile, "mov", pcp)
+#define this_cpu_read_4(pcp) percpu_from_op(4, volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val) percpu_to_op(1, volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val) percpu_to_op(2, volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val) percpu_to_op(4, volatile, "mov", (pcp), val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp) percpu_from_op(8, , "mov", pcp)
+#define raw_cpu_write_8(pcp, val) percpu_to_op(8, , "mov", (pcp), val)
+
+#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
+#endif
+
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
#define raw_cpu_add_1(pcp, val) percpu_add_op(1, , (pcp), val)
#define raw_cpu_add_2(pcp, val) percpu_add_op(2, , (pcp), val)
#define raw_cpu_add_4(pcp, val) percpu_add_op(4, , (pcp), val)
@@ -432,12 +485,6 @@ do { \
#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val)
#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val)

-#define this_cpu_read_1(pcp) percpu_from_op(1, volatile, "mov", pcp)
-#define this_cpu_read_2(pcp) percpu_from_op(2, volatile, "mov", pcp)
-#define this_cpu_read_4(pcp) percpu_from_op(4, volatile, "mov", pcp)
-#define this_cpu_write_1(pcp, val) percpu_to_op(1, volatile, "mov", (pcp), val)
-#define this_cpu_write_2(pcp, val) percpu_to_op(2, volatile, "mov", (pcp), val)
-#define this_cpu_write_4(pcp, val) percpu_to_op(4, volatile, "mov", (pcp), val)
#define this_cpu_add_1(pcp, val) percpu_add_op(1, volatile, (pcp), val)
#define this_cpu_add_2(pcp, val) percpu_add_op(2, volatile, (pcp), val)
#define this_cpu_add_4(pcp, val) percpu_add_op(4, volatile, (pcp), val)
@@ -476,8 +523,6 @@ do { \
* 32 bit must fall back to generic operations.
*/
#ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp) percpu_from_op(8, , "mov", pcp)
-#define raw_cpu_write_8(pcp, val) percpu_to_op(8, , "mov", (pcp), val)
#define raw_cpu_add_8(pcp, val) percpu_add_op(8, , (pcp), val)
#define raw_cpu_and_8(pcp, val) percpu_to_op(8, , "and", (pcp), val)
#define raw_cpu_or_8(pcp, val) percpu_to_op(8, , "or", (pcp), val)
@@ -486,8 +531,6 @@ do { \
#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, , pcp, oval, nval)
#define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval) percpu_try_cmpxchg_op(8, , pcp, ovalp, nval)

-#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
#define this_cpu_add_8(pcp, val) percpu_add_op(8, volatile, (pcp), val)
#define this_cpu_and_8(pcp, val) percpu_to_op(8, volatile, "and", (pcp), val)
#define this_cpu_or_8(pcp, val) percpu_to_op(8, volatile, "or", (pcp), val)
--
2.41.0

2023-10-04 19:42:58

by Linus Torvalds

Subject: Re: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors

Unrelated reaction..

On Wed, 4 Oct 2023 at 12:24, Uros Bizjak <[email protected]> wrote:
>
> the code improves from:
>
> 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax
> a9 00 00 0f 00 test $0xf0000,%eax
>
> to:
>
> 65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
> 00 00 0f 00

Funky.

Why does gcc generate that full-width load from memory, and not demote
it to a byte test?

IOW, it should not be

65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
00 00 0f 00

after optimizing it, it should be three bytes shorter at

65 f6 05 00 00 00 00 testb $0xf,%gs:0x0(%rip)
0f

instead (this is "objdump", so it doesn't show that the relocation
entry has changed by +2 to compensate).

Now, doing the access narrowing is a bad idea for stores (because it
can cause subsequent loads to have conflicts in the store buffer), but
for loads it should always be a win to narrow the access.

I wonder why gcc doesn't do it. This is not related to __seg_gs - I
tried it with regular memory accesses too, and gcc kept those as
32-bit accesses too.

And no, the assembler can't optimize that operation either, since I
think changing the testl to a testb would change the 'P' bit in the
resulting eflags, so this is a "the compiler could pick a better
instruction choice" thing.

I'm probably missing some reason why gcc wouldn't do this. But clang
does seem to do this obvious optimization.

Linus

2023-10-04 20:08:14

by Uros Bizjak

Subject: Re: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors

On Wed, Oct 4, 2023 at 9:42 PM Linus Torvalds
<[email protected]> wrote:
>
> Unrelated reaction..
>
> On Wed, 4 Oct 2023 at 12:24, Uros Bizjak <[email protected]> wrote:
> >
> > the code improves from:
> >
> > 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax
> > a9 00 00 0f 00 test $0xf0000,%eax
> >
> > to:
> >
> > 65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
> > 00 00 0f 00
>
> Funky.
>
> Why does gcc generate that full-width load from memory, and not demote
> it to a byte test?

It does when the LSB is accessed at the same address. For example:

int m;
_Bool foo (void) { return m & 0x0f; }

compiles to:

0: f6 05 00 00 00 00 0f testb $0xf,0x0(%rip) # 7 <foo+0x7>

>
> IOW, it should not be
>
> 65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
> 00 00 0f 00
>
> after optimizing it, it should be three bytes shorter at
>
> 65 f6 05 00 00 00 00 testb $0xf,%gs:0x0(%rip)
> 0f
>
> instead (this is "objdump", so it doesn't show that the relocation
> entry has changed by +2 to compensate).
>
> Now, doing the access narrowing is a bad idea for stores (because it
> can cause subsequent loads to have conflicts in the store buffer), but
> for loads it should always be a win to narrow the access.
>
> I wonder why gcc doesn't do it. This is not related to __seg_gs - I
> tried it with regular memory accesses too, and gcc kept those as
> 32-bit accesses too.
>
> And no, the assembler can't optimize that operation either, since I
> think changing the testl to a testb would change the 'P' bit in the
> resulting eflags, so this is a "the compiler could pick a better
> instruction choice" thing.
>
> I'm probably missing some reason why gcc wouldn't do this. But clang
> does seem to do this obvious optimization.

You get a store forwarding stall when you write a bigger operand to
memory and then read part of it, if the smaller part doesn't start at
the same address.
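
(To illustrate the pattern I mean with a made-up example:

	union { u32 word; u8 bytes[4]; } u;

	u.word = x;		/* 32-bit store */
	b = u.bytes[2];		/* narrower load at a different offset
				   within the stored range */

on cores with that limitation the load cannot be forwarded from the
store buffer and stalls until the store completes.)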

Uros.

2023-10-04 20:13:08

by Linus Torvalds

Subject: Re: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors

On Wed, 4 Oct 2023 at 13:08, Uros Bizjak <[email protected]> wrote:
>
> You get a store forwarding stall when you write a bigger operand to
> memory and then read part of it, if the smaller part doesn't start at
> the same address.

I don't think that has been true for over a decade now.

Afaik, any half-way modern Intel and AMD cores will forward any fully
contained load.

The whole "same address" was a P4 thing, iirc.

Linus

2023-10-04 20:21:35

by Linus Torvalds

Subject: Re: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors

On Wed, 4 Oct 2023 at 13:12, Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 4 Oct 2023 at 13:08, Uros Bizjak <[email protected]> wrote:
> >
> > You get a store forwarding stall when you write a bigger operand to
> > memory and then read part of it, if the smaller part doesn't start at
> > the same address.
>
> I don't think that has been true for over a decade now.
>
> Afaik, any half-way modern Intel and AMD cores will forward any fully
> contained load.

https://www.agner.org/optimize/microarchitecture.pdf

See for example pg 136 (Sandy Bridge / Ivy Bridge):

"Store forwarding works in the following cases:
..
• When a write of 64 bits or less is followed by a read of a smaller
size which is fully contained in the write address range, regardless
of alignment"

and for AMD Zen cores:

"Store forwarding of a write to a subsequent read works very well in
all cases, including reads from a part of the written data"

So forget the whole "same address" rule. It's simply not true or
relevant any more.

Linus

2023-10-04 20:23:13

by Uros Bizjak

Subject: Re: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors

On Wed, Oct 4, 2023 at 10:20 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 4 Oct 2023 at 13:12, Linus Torvalds
> <[email protected]> wrote:
> >
> > On Wed, 4 Oct 2023 at 13:08, Uros Bizjak <[email protected]> wrote:
> > >
> > > You get a store forwarding stall when you write a bigger operand to
> > > memory and then read part of it, if the smaller part doesn't start at
> > > the same address.
> >
> > I don't think that has been true for over a decade now.
> >
> > Afaik, any half-way modern Intel and AMD cores will forward any fully
> > contained load.
>
> https://www.agner.org/optimize/microarchitecture.pdf
>
> See for example pg 136 (Sandy Bridge / Ivy Bridge):
>
> "Store forwarding works in the following cases:
> ..
> • When a write of 64 bits or less is followed by a read of a smaller
> size which is fully contained in the write address range, regardless
> of alignment"
>
> and for AMD Zen cores:
>
> "Store forwarding of a write to a subsequent read works very well in
> all cases, including reads from a part of the written data"
>
> So forget the whole "same address" rule. It's simply not true or
> relevant any more.

No problem then, we will implement the optimization in the compiler.

Thanks,
Uros.

2023-10-05 14:07:42

by Ingo Molnar

Subject: Re: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors


* Uros Bizjak <[email protected]> wrote:

> The percpu code mostly uses inline assembly. Using segment qualifiers
> allows to use C code instead, which enables the compiler to perform
> various optimizations (e.g. propagation of memory arguments). Convert
> percpu read and write accessors to C code, so the memory argument can
> be propagated to the instruction that uses this argument.

> text data bss dec hex filename
> 25508862 4386540 808388 30703790 1d480ae vmlinux-vanilla.o
> 25500922 4386532 808388 30695842 1d461a2 vmlinux-new.o

Ok, this all looks like a pretty nice optimization.

As discussed previously, I've created a new tip:x86/percpu topic branch
for this, based on tip:x86/asm that carries the other percpu patches.
This branch will be merged in v6.8, best-case scenario.

Also note that I lowered the version cutoff from GCC 13.1 to 12.1, for
the simple, selfish reason of including my own daily systems in test coverage.

Is there any known bug fixed in the GCC 12.1 ... 13.1 version range that
could make this approach problematic?

Thanks,

Ingo

Subject: [tip: x86/percpu] x86/percpu: Use compiler segment prefix qualifier

The following commit has been merged into the x86/percpu branch of tip:

Commit-ID: 9a462b9eafa6dda16ea8429b151edb1fb535d744
Gitweb: https://git.kernel.org/tip/9a462b9eafa6dda16ea8429b151edb1fb535d744
Author: Nadav Amit <[email protected]>
AuthorDate: Wed, 04 Oct 2023 16:49:43 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 05 Oct 2023 09:01:52 +02:00

x86/percpu: Use compiler segment prefix qualifier

Using a segment prefix qualifier is cleaner than using a segment prefix
in the inline assembly, and provides the compiler with more information,
telling it that __seg_gs:[addr] is different than [addr] when it
analyzes data dependencies. It also enables various optimizations that
will be implemented in the next patches.

Use segment prefix qualifiers when they are supported. Unfortunately,
gcc does not provide a way to remove segment qualifiers, which is needed
to use typeof() to create local instances of the per-CPU variable. For
this reason, do not use the segment qualifier for per-CPU variables, and
do casting using the segment qualifier instead.

Uros: Improve compiler support detection and update the patch
to the current mainline.

Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/percpu.h | 68 ++++++++++++++++++++++-----------
arch/x86/include/asm/preempt.h | 2 +-
2 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 20624b8..da45120 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -28,26 +28,50 @@
#include <linux/stringify.h>

#ifdef CONFIG_SMP
+
+#ifdef CONFIG_CC_HAS_NAMED_AS
+
+#ifdef CONFIG_X86_64
+#define __percpu_seg_override __seg_gs
+#else
+#define __percpu_seg_override __seg_fs
+#endif
+
+#define __percpu_prefix ""
+
+#else /* CONFIG_CC_HAS_NAMED_AS */
+
+#define __percpu_seg_override
#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
+
+#endif /* CONFIG_CC_HAS_NAMED_AS */
+
+#define __force_percpu_prefix "%%"__stringify(__percpu_seg)":"
#define __my_cpu_offset this_cpu_read(this_cpu_off)

/*
* Compared to the generic __my_cpu_offset version, the following
* saves one instruction and avoids clobbering a temp register.
*/
-#define arch_raw_cpu_ptr(ptr) \
-({ \
- unsigned long tcp_ptr__; \
- asm ("add " __percpu_arg(1) ", %0" \
- : "=r" (tcp_ptr__) \
- : "m" (this_cpu_off), "0" (ptr)); \
- (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
+#define arch_raw_cpu_ptr(ptr) \
+({ \
+ unsigned long tcp_ptr__; \
+ asm ("add " __percpu_arg(1) ", %0" \
+ : "=r" (tcp_ptr__) \
+ : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr)); \
+ (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
})
-#else
+#else /* CONFIG_SMP */
+#define __percpu_seg_override
#define __percpu_prefix ""
-#endif
+#define __force_percpu_prefix ""
+#endif /* CONFIG_SMP */

+#define __my_cpu_type(var) typeof(var) __percpu_seg_override
+#define __my_cpu_ptr(ptr) (__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
+#define __my_cpu_var(var) (*__my_cpu_ptr(&var))
#define __percpu_arg(x) __percpu_prefix "%" #x
+#define __force_percpu_arg(x) __force_percpu_prefix "%" #x

/*
* Initialized pointers to per-cpu variables needed for the boot
@@ -107,14 +131,14 @@ do { \
(void)pto_tmp__; \
} \
asm qual(__pcpu_op2_##size(op, "%[val]", __percpu_arg([var])) \
- : [var] "+m" (_var) \
+ : [var] "+m" (__my_cpu_var(_var)) \
: [val] __pcpu_reg_imm_##size(pto_val__)); \
} while (0)

#define percpu_unary_op(size, qual, op, _var) \
({ \
asm qual (__pcpu_op1_##size(op, __percpu_arg([var])) \
- : [var] "+m" (_var)); \
+ : [var] "+m" (__my_cpu_var(_var))); \
})

/*
@@ -144,14 +168,14 @@ do { \
__pcpu_type_##size pfo_val__; \
asm qual (__pcpu_op2_##size(op, __percpu_arg([var]), "%[val]") \
: [val] __pcpu_reg_##size("=", pfo_val__) \
- : [var] "m" (_var)); \
+ : [var] "m" (__my_cpu_var(_var))); \
(typeof(_var))(unsigned long) pfo_val__; \
})

#define percpu_stable_op(size, op, _var) \
({ \
__pcpu_type_##size pfo_val__; \
- asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
+ asm(__pcpu_op2_##size(op, __force_percpu_arg(P[var]), "%[val]") \
: [val] __pcpu_reg_##size("=", pfo_val__) \
: [var] "p" (&(_var))); \
(typeof(_var))(unsigned long) pfo_val__; \
@@ -166,7 +190,7 @@ do { \
asm qual (__pcpu_op2_##size("xadd", "%[tmp]", \
__percpu_arg([var])) \
: [tmp] __pcpu_reg_##size("+", paro_tmp__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: : "memory"); \
(typeof(_var))(unsigned long) (paro_tmp__ + _val); \
})
@@ -187,7 +211,7 @@ do { \
__percpu_arg([var])) \
"\n\tjnz 1b" \
: [oval] "=&a" (pxo_old__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: [nval] __pcpu_reg_##size(, pxo_new__) \
: "memory"); \
(typeof(_var))(unsigned long) pxo_old__; \
@@ -204,7 +228,7 @@ do { \
asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]", \
__percpu_arg([var])) \
: [oval] "+a" (pco_old__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: [nval] __pcpu_reg_##size(, pco_new__) \
: "memory"); \
(typeof(_var))(unsigned long) pco_old__; \
@@ -221,7 +245,7 @@ do { \
CC_SET(z) \
: CC_OUT(z) (success), \
[oval] "+a" (pco_old__), \
- [var] "+m" (_var) \
+ [var] "+m" (__my_cpu_var(_var)) \
: [nval] __pcpu_reg_##size(, pco_new__) \
: "memory"); \
if (unlikely(!success)) \
@@ -244,7 +268,7 @@ do { \
\
asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
"cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
- : [var] "+m" (_var), \
+ : [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -276,7 +300,7 @@ do { \
"cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
CC_SET(z) \
: CC_OUT(z) (success), \
- [var] "+m" (_var), \
+ [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -313,7 +337,7 @@ do { \
\
asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
"cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
- : [var] "+m" (_var), \
+ : [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -345,7 +369,7 @@ do { \
"cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
CC_SET(z) \
: CC_OUT(z) (success), \
- [var] "+m" (_var), \
+ [var] "+m" (__my_cpu_var(_var)), \
"+a" (old__.low), \
"+d" (old__.high) \
: "b" (new__.low), \
@@ -494,7 +518,7 @@ static inline bool x86_this_cpu_variable_test_bit(int nr,
asm volatile("btl "__percpu_arg(2)",%1"
CC_SET(c)
: CC_OUT(c) (oldbit)
- : "m" (*(unsigned long __percpu *)addr), "Ir" (nr));
+ : "m" (*__my_cpu_ptr((unsigned long __percpu *)(addr))), "Ir" (nr));

return oldbit;
}
diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
index 4527e14..4b2a35d 100644
--- a/arch/x86/include/asm/preempt.h
+++ b/arch/x86/include/asm/preempt.h
@@ -92,7 +92,7 @@ static __always_inline void __preempt_count_sub(int val)
*/
static __always_inline bool __preempt_count_dec_and_test(void)
{
- return GEN_UNARY_RMWcc("decl", pcpu_hot.preempt_count, e,
+ return GEN_UNARY_RMWcc("decl", __my_cpu_var(pcpu_hot.preempt_count), e,
__percpu_arg([var]));
}

2023-10-05 16:42:33

by Uros Bizjak

Subject: Re: [PATCH v2 4/4] x86/percpu: Use C for percpu read/write accessors

On Thu, Oct 5, 2023 at 9:06 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Uros Bizjak <[email protected]> wrote:
>
> > The percpu code mostly uses inline assembly. Using segment qualifiers
> > allows to use C code instead, which enables the compiler to perform
> > various optimizations (e.g. propagation of memory arguments). Convert
> > percpu read and write accessors to C code, so the memory argument can
> > be propagated to the instruction that uses this argument.
>
> > text data bss dec hex filename
> > 25508862 4386540 808388 30703790 1d480ae vmlinux-vanilla.o
> > 25500922 4386532 808388 30695842 1d461a2 vmlinux-new.o
>
> Ok, this all looks like a pretty nice optimization.

This is just the beginning; named AS enables several other
optimizations, as can be seen in Nadav's original patch series. If we
want to make the kernel PIE, then we have to get rid of the absolute
address in percpu_stable_op (AKA this_cpu_read_stable). We can use
__raw_cpu_read, but PATCH 7/7 [1] optimizes even further.
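
To make the effect concrete, here is a minimal sketch (hypothetical
variable name, not kernel code) of the kind of plain C access that
__raw_cpu_read boils down to with the __seg_gs qualifier:

extern __seg_gs unsigned int demo_counter;  /* hypothetical per-CPU slot */

unsigned int demo_read(void)
{
        /* An ordinary C load; the compiler emits a single %gs-relative
         * mov and is free to fold the access into a later zero/sign
         * extension, compare or cmov instead of first loading the value
         * into a temporary register, as the asm-based accessor forces. */
        return demo_counter;
}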

> As discussed previously, I've created a new tip:x86/percpu topic branch
> for this, based on tip:x86/asm that carries the other percpu patches.
> This branch will be merged in v6.8, best-case scenario.
>
> Also note that I lowered the version cutoff from GCC 13.1 to 12.1, for
> the simple selfish reason to include my own daily systems in test coverage.
>
> Is there any known bug fixed in the GCC 12.1 ... 13.1 version range that
> could make this approach problematic?

Not that I know of. I have done all of the work with GCC 12.3.1 (the
default Fedora 37 compiler) and additionally tested with GCC 13.2.1
(Fedora 38). I have made the patched kernel the default kernel on my
main workstation, and haven't encountered any problems since I
installed it a week ago.

If there are any problems encountered with the compiler, we (the GCC
compiler authors) can and will fix them promptly. I'd push for all
supported GCC versions, but maybe not just yet ;)

[1] https://lore.kernel.org/lkml/[email protected]/

Uros.

Subject: [tip: x86/percpu] x86/percpu: Use C for percpu read/write accessors

The following commit has been merged into the x86/percpu branch of tip:

Commit-ID: ca4256348660cb2162668ec3d13d1f921d05374a
Gitweb: https://git.kernel.org/tip/ca4256348660cb2162668ec3d13d1f921d05374a
Author: Uros Bizjak <[email protected]>
AuthorDate: Wed, 04 Oct 2023 21:23:08 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 05 Oct 2023 09:01:53 +02:00

x86/percpu: Use C for percpu read/write accessors

The percpu code mostly uses inline assembly. Using segment qualifiers
allows to use C code instead, which enables the compiler to perform
various optimizations (e.g. propagation of memory arguments). Convert
percpu read and write accessors to C code, so the memory argument can
be propagated to the instruction that uses this argument.

Some examples of propagations:

a) into sign/zero extensions:

the code improves from:

65 8a 05 00 00 00 00 mov %gs:0x0(%rip),%al
0f b6 c0 movzbl %al,%eax

to:

65 0f b6 05 00 00 00 movzbl %gs:0x0(%rip),%eax
00

and in a similar way for:

movzbl %gs:0x0(%rip),%edx
movzwl %gs:0x0(%rip),%esi
movzbl %gs:0x78(%rbx),%eax

movslq %gs:0x0(%rip),%rdx
movslq %gs:(%rdi),%rbx

b) into compares:

the code improves from:

65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax
a9 00 00 0f 00 test $0xf0000,%eax

to:

65 f7 05 00 00 00 00 testl $0xf0000,%gs:0x0(%rip)
00 00 0f 00

and in a similar way for:

testl $0xf0000,%gs:0x0(%rip)
testb $0x1,%gs:0x0(%rip)
testl $0xff00,%gs:0x0(%rip)

cmpb $0x0,%gs:0x0(%rip)
cmp %gs:0x0(%rip),%r14d
cmpw $0x8,%gs:0x0(%rip)
cmpb $0x0,%gs:(%rax)

c) into other insns:

the code improves from:

1a355: 83 fa ff cmp $0xffffffff,%edx
1a358: 75 07 jne 1a361 <...>
1a35a: 65 8b 15 00 00 00 00 mov %gs:0x0(%rip),%edx
1a361:

to:

1a35a: 83 fa ff cmp $0xffffffff,%edx
1a35d: 65 0f 44 15 00 00 00 cmove %gs:0x0(%rip),%edx
1a364: 00

The above propagations result in the following code size
improvements for current mainline kernel (with the default config),
compiled with:

# gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1)

text data bss dec filename
25508862 4386540 808388 30703790 vmlinux-vanilla.o
25500922 4386532 808388 30695842 vmlinux-new.o

Co-developed-by: Nadav Amit <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/percpu.h | 65 ++++++++++++++++++++++++++++------
1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index da45120..60ea775 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -400,13 +400,66 @@ do { \
#define this_cpu_read_stable_8(pcp) percpu_stable_op(8, "mov", pcp)
#define this_cpu_read_stable(pcp) __pcpu_size_call_return(this_cpu_read_stable_, pcp)

+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp) \
+({ \
+ *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
+})
+
+#define __raw_cpu_write(qual, pcp, val) \
+do { \
+ *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+} while (0)
+
+#define raw_cpu_read_1(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_1(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_read_2(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_read_4(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_write_1(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_2(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#define this_cpu_write_4(pcp, val) __raw_cpu_write(volatile, pcp, val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp) __raw_cpu_read(, pcp)
+#define raw_cpu_write_8(pcp, val) __raw_cpu_write(, pcp, val)
+
+#define this_cpu_read_8(pcp) __raw_cpu_read(volatile, pcp)
+#define this_cpu_write_8(pcp, val) __raw_cpu_write(volatile, pcp, val)
+#endif
+
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
+
#define raw_cpu_read_1(pcp) percpu_from_op(1, , "mov", pcp)
#define raw_cpu_read_2(pcp) percpu_from_op(2, , "mov", pcp)
#define raw_cpu_read_4(pcp) percpu_from_op(4, , "mov", pcp)
-
#define raw_cpu_write_1(pcp, val) percpu_to_op(1, , "mov", (pcp), val)
#define raw_cpu_write_2(pcp, val) percpu_to_op(2, , "mov", (pcp), val)
#define raw_cpu_write_4(pcp, val) percpu_to_op(4, , "mov", (pcp), val)
+
+#define this_cpu_read_1(pcp) percpu_from_op(1, volatile, "mov", pcp)
+#define this_cpu_read_2(pcp) percpu_from_op(2, volatile, "mov", pcp)
+#define this_cpu_read_4(pcp) percpu_from_op(4, volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val) percpu_to_op(1, volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val) percpu_to_op(2, volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val) percpu_to_op(4, volatile, "mov", (pcp), val)
+
+#ifdef CONFIG_X86_64
+#define raw_cpu_read_8(pcp) percpu_from_op(8, , "mov", pcp)
+#define raw_cpu_write_8(pcp, val) percpu_to_op(8, , "mov", (pcp), val)
+
+#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
+#endif
+
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
#define raw_cpu_add_1(pcp, val) percpu_add_op(1, , (pcp), val)
#define raw_cpu_add_2(pcp, val) percpu_add_op(2, , (pcp), val)
#define raw_cpu_add_4(pcp, val) percpu_add_op(4, , (pcp), val)
@@ -432,12 +485,6 @@ do { \
#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val)
#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val)

-#define this_cpu_read_1(pcp) percpu_from_op(1, volatile, "mov", pcp)
-#define this_cpu_read_2(pcp) percpu_from_op(2, volatile, "mov", pcp)
-#define this_cpu_read_4(pcp) percpu_from_op(4, volatile, "mov", pcp)
-#define this_cpu_write_1(pcp, val) percpu_to_op(1, volatile, "mov", (pcp), val)
-#define this_cpu_write_2(pcp, val) percpu_to_op(2, volatile, "mov", (pcp), val)
-#define this_cpu_write_4(pcp, val) percpu_to_op(4, volatile, "mov", (pcp), val)
#define this_cpu_add_1(pcp, val) percpu_add_op(1, volatile, (pcp), val)
#define this_cpu_add_2(pcp, val) percpu_add_op(2, volatile, (pcp), val)
#define this_cpu_add_4(pcp, val) percpu_add_op(4, volatile, (pcp), val)
@@ -476,8 +523,6 @@ do { \
* 32 bit must fall back to generic operations.
*/
#ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp) percpu_from_op(8, , "mov", pcp)
-#define raw_cpu_write_8(pcp, val) percpu_to_op(8, , "mov", (pcp), val)
#define raw_cpu_add_8(pcp, val) percpu_add_op(8, , (pcp), val)
#define raw_cpu_and_8(pcp, val) percpu_to_op(8, , "and", (pcp), val)
#define raw_cpu_or_8(pcp, val) percpu_to_op(8, , "or", (pcp), val)
@@ -486,8 +531,6 @@ do { \
#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(8, , pcp, oval, nval)
#define raw_cpu_try_cmpxchg_8(pcp, ovalp, nval) percpu_try_cmpxchg_op(8, , pcp, ovalp, nval)

-#define this_cpu_read_8(pcp) percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val) percpu_to_op(8, volatile, "mov", (pcp), val)
#define this_cpu_add_8(pcp, val) percpu_add_op(8, volatile, (pcp), val)
#define this_cpu_and_8(pcp, val) percpu_to_op(8, volatile, "and", (pcp), val)
#define this_cpu_or_8(pcp, val) percpu_to_op(8, volatile, "or", (pcp), val)

Subject: [tip: x86/percpu] x86/percpu: Enable named address spaces with known compiler version

The following commit has been merged into the x86/percpu branch of tip:

Commit-ID: 1ca3683cc6d2c2ce4204df519c4e4730d037905a
Gitweb: https://git.kernel.org/tip/1ca3683cc6d2c2ce4204df519c4e4730d037905a
Author: Uros Bizjak <[email protected]>
AuthorDate: Wed, 04 Oct 2023 16:49:42 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 05 Oct 2023 09:01:52 +02:00

x86/percpu: Enable named address spaces with known compiler version

Enable named address spaces with known compiler versions
(GCC 12.1 and later) in order to avoid possible issues with named
address spaces with older compilers. Set CC_HAS_NAMED_AS when the
compiler satisfies version requirements and set USE_X86_SEG_SUPPORT
to signal when segment qualifiers could be used.

Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 982b777..ecb2569 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2388,6 +2388,13 @@ source "kernel/livepatch/Kconfig"

endmenu

+config CC_HAS_NAMED_AS
+ def_bool CC_IS_GCC && GCC_VERSION >= 120100
+
+config USE_X86_SEG_SUPPORT
+ def_bool y
+ depends on CC_HAS_NAMED_AS && SMP
+
config CC_HAS_SLS
def_bool $(cc-option,-mharden-sls=all)

2023-10-08 18:01:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Wed, 4 Oct 2023 at 07:51, Uros Bizjak <[email protected]> wrote:
>
> The percpu code mostly uses inline assembly. Using segment qualifiers
> allows to use C code instead, which enables the compiler to perform
> various optimizations (e.g. propagation of memory arguments). Convert
> percpu read and write accessors to C code, so the memory argument can
> be propagated to the instruction that uses this argument.

So apparently this causes boot failures.

It might be worth testing a version where this:

> +#define raw_cpu_read_1(pcp) __raw_cpu_read(, pcp)
> +#define raw_cpu_read_2(pcp) __raw_cpu_read(, pcp)
> +#define raw_cpu_read_4(pcp) __raw_cpu_read(, pcp)
> +#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val)
> +#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val)
> +#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val)

and this

> +#ifdef CONFIG_X86_64
> +#define raw_cpu_read_8(pcp) __raw_cpu_read(, pcp)
> +#define raw_cpu_write_8(pcp, val) __raw_cpu_write(, pcp, val)

was all using 'volatile' in the qualifier argument and see if that
makes the boot failure go away.
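
I.e., roughly this kind of change (a sketch using the macros from the
patch, shown here just for the 8-byte variants):

#define raw_cpu_read_8(pcp)             __raw_cpu_read(volatile, pcp)
#define raw_cpu_write_8(pcp, val)       __raw_cpu_write(volatile, pcp, val)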

Because while the old code wasn't "asm volatile", even just a *plain*
asm() is certainly a lot more serialized than a normal access.

For example, the asm() version of raw_cpu_write() used "+m" for the
destination modifier, which means that if you did multiple percpu
writes to the same variable, gcc would output multiple asm calls,
because it would see the subsequent ones as reading the old value
(even if they don't *actually* do so).

That's admittedly really just because it uses a common macro for
raw_cpu_write() and the updates (like the percpu_add() code), so the
fact that it uses "+m" instead of "=m" is just a random odd artifact
of the inline asm version, but maybe we have code that ends up working
just by accident.
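
A minimal sketch of that difference (hypothetical variable name, not
kernel code): with a "+m" destination the compiler has to assume each
asm also reads the current value, so back-to-back writes all survive,
while plain stores through the segment qualifier can be merged like any
other C stores:

extern __seg_gs int demo_var;           /* hypothetical per-CPU slot */

void demo_asm_writes(void)
{
        /* "+m" makes demo_var an input too, so the second asm is seen
         * as consuming the first one's result and neither is dropped. */
        asm("movl $1, %0" : "+m" (demo_var));
        asm("movl $2, %0" : "+m" (demo_var));
}

void demo_c_writes(void)
{
        /* Ordinary stores: the first one is dead and may be removed. */
        demo_var = 1;
        demo_var = 2;
}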

Also, I'm not sure gcc re-orders asms wrt each other - even when they
aren't marked volatile.

So it might be worth at least a trivial "make everything volatile"
test to see if that affects anything.

Linus

2023-10-08 19:18:20

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Sun, Oct 8, 2023 at 8:00 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 4 Oct 2023 at 07:51, Uros Bizjak <[email protected]> wrote:
> >
> > The percpu code mostly uses inline assembly. Using segment qualifiers
> > allows to use C code instead, which enables the compiler to perform
> > various optimizations (e.g. propagation of memory arguments). Convert
> > percpu read and write accessors to C code, so the memory argument can
> > be propagated to the instruction that uses this argument.
>
> So apparently this causes boot failures.
>
> It might be worth testing a version where this:
>
> > +#define raw_cpu_read_1(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_read_2(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_read_4(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val)
> > +#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val)
> > +#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val)
>
> and this
>
> > +#ifdef CONFIG_X86_64
> > +#define raw_cpu_read_8(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_write_8(pcp, val) __raw_cpu_write(, pcp, val)
>
> was all using 'volatile' in the qualifier argument and see if that
> makes the boot failure go away.
>
> Because while the old code wasn't "asm volatile", even just a *plain*
> asm() is certainly a lot more serialized than a normal access.
>
> For example, the asm() version of raw_cpu_write() used "+m" for the
> destination modifier, which means that if you did multiple percpu
> writes to the same variable, gcc would output multiple asm calls,
> because it would see the subsequent ones as reading the old value
> (even if they don't *actually* do so).
>
> That's admittedly really just because it uses a common macro for
> raw_cpu_write() and the updates (like the percpu_add() code), so the
> fact that it uses "+m" instead of "=m" is just a random odd artifact
> of the inline asm version, but maybe we have code that ends up working
> just by accident.
>
> Also, I'm not sure gcc re-orders asms wrt each other - even when they
> aren't marked volatile.
>
> So it might be worth at least a trivial "make everything volatile"
> test to see if that affects anything.

I have managed to reproduce the bug, and I'm trying the following path:

Scrap the last patch and just add:

#define __raw_cpu_read_new(size, qual, pcp) \
({ \
*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
})

#define __raw_cpu_read(size, qual, _var) \
({ \
__pcpu_type_##size pfo_val__; \
asm qual (__pcpu_op2_##size("mov", __percpu_arg([var]), "%[val]") \
: [val] __pcpu_reg_##size("=", pfo_val__) \
: [var] "m" (__my_cpu_var(_var))); \
(typeof(_var))(unsigned long) pfo_val__; \
})

Then changing *only*

#define raw_cpu_read_8(pcp) __raw_cpu_read_new(8, , pcp)

brings the boot process far enough to report:

[ 0.463711][ T0] Dentry cache hash table entries: 1048576
(order: 11, 8388608 bytes, linear)
[ 0.465859][ T0] Inode-cache hash table entries: 524288 (order:
10, 4194304 bytes, linear)
PANIC: early exception 0x0d IP 10:ffffffff810c4cb9 error 0 cr2
0xffff8881ab1ff000
[ 0.469084][ T0] CPU: 0 PID: 0 Comm: swapper Not tainted
6.5.0-11417-gca4256348660-dirty #7
[ 0.470756][ T0] RIP: 0010:cpu_init_exception_handling+0x179/0x740
[ 0.472045][ T0] Code: be 0f 00 00 00 4a 03 04 ed 40 19 15 85 48
89 c7 e8 9c bb ff ff 48 c7 c0 10 73 02 00 48 ba 00 00 00 00 00 fc ff
df 48 c1 e8 03 <80> 3c 10
00 0f 85 21 05 00 00 65 48 8b 05 45 26 f6 7e 48 8d 7b 24
[ 0.475784][ T0] RSP: 0000:ffffffff85207e38 EFLAGS: 00010002
ORIG_RAX: 0000000000000000
[ 0.477384][ T0] RAX: 0000000000004e62 RBX: ffff88817700a000
RCX: 0000000000000010
[ 0.479093][ T0] RDX: dffffc0000000000 RSI: ffffffff85207e60
RDI: ffff88817700f078
[ 0.481178][ T0] RBP: 000000000000f000 R08: 0040f50000000000
R09: 0040f50000000000
[ 0.482655][ T0] R10: ffff8881ab02a000 R11: 0000000000000000
R12: 1ffffffff0a40fc8
[ 0.484128][ T0] R13: 0000000000000000 R14: 0000000000000000
R15: ffffffff85151940
[ 0.485604][ T0] FS: 0000000000000000(0000)
GS:ffff888177000000(0000) knlGS:0000000000000000
[ 0.487246][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.488515][ T0] CR2: ffff8881ab1ff000 CR3: 00000000052d7000
CR4: 00000000000000b0
[ 0.490002][ T0] Call Trace:
[ 0.490600][ T0] <TASK>
[ 0.491145][ T0] ? early_fixup_exception+0x10e/0x280
[ 0.492176][ T0] ? early_idt_handler_common+0x2f/0x40
[ 0.493222][ T0] ? cpu_init_exception_handling+0x179/0x740
[ 0.494348][ T0] ? cpu_init_exception_handling+0x164/0x740
[ 0.495472][ T0] ? syscall_init+0x1c0/0x1c0
[ 0.496351][ T0] ? per_cpu_ptr_to_phys+0x1ca/0x2c0
[ 0.497336][ T0] ? setup_cpu_entry_areas+0x138/0x980
[ 0.498365][ T0] trap_init+0xa/0x40

Let me see what happens here. I have changed *only* raw_cpu_read_8,
but the GP fault is reported in cpu_init_exception_handling, which
uses this_cpu_ptr. Please note that all per-cpu initializations go
through existing {this|raw}_cpu_write.

void cpu_init_exception_handling(void)
{
struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
int cpu = raw_smp_processor_id();
...

I have tried the trick with all reads volatile (and writes as they
were before the patch), but it didn't make a difference. Also, the
kernel from the git/tip branch works OK for default config, so I think
there is some config option that somehow disturbs the named-as enabled
kernel.

Uros.

2023-10-08 20:14:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Sun, 8 Oct 2023 at 12:18, Uros Bizjak <[email protected]> wrote:
>
> Let me see what happens here. I have changed *only* raw_cpu_read_8,
> but the GP fault is reported in cpu_init_exception_handling, which
> uses this_cpu_ptr. Please note that all per-cpu initializations go
> through existing {this|raw}_cpu_write.

I think it's an ordering issue, and I think you may hit some issue
with loading TR or the GDT or whatever.

For example, we have this

set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);

followed by

asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));

in native_load_tr_desc(), and I think we might want to add a "memory"
clobber to it to make sure it is serialized with any stores to the GDT
entries in question.
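
I.e. something along these lines (sketch of the suggested tweak):

        asm volatile("ltr %w0" : : "q" (GDT_ENTRY_TSS*8) : "memory");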

I don't think *that* particular thing is the issue (because you kept
the writes as-is and still hit things), but I think it's an example of
some lazy inline asm constraints that could possibly cause problems if
the ordering changes.

And yes, this code ends up depending on things like
CONFIG_PARAVIRT_XXL for whether it uses the native TR loading or uses
some paravirt version, so config options can make a difference.

Again: I don't think it's that "ltr" instruction. I'm calling it out
just as a "that function does some funky things", and the load TR is
*one* of the funky things, and it looks like it could be the same type
of thing that then causes issues.

Things like CONFIG_SMP might also matter, because the percpu setup is
different. On UP, the *segment* use goes away, but I think the whole
"use inline asm vs regular memory ops" remains (admittedly I did *not*
verify that, I might be speaking out of my *ss).

Your dump does end up being close to a %gs access:

0: 4a 03 04 ed 40 19 15 add -0x7aeae6c0(,%r13,8),%rax
7: 85
8: 48 89 c7 mov %rax,%rdi
b: e8 9c bb ff ff call 0xffffffffffffbbac
10: 48 c7 c0 10 73 02 00 mov $0x27310,%rax
17: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
1e: fc ff df
21: 48 c1 e8 03 shr $0x3,%rax
25:* 80 3c 10 00 cmpb $0x0,(%rax,%rdx,1) <-- trapping instruction
29: 0f 85 21 05 00 00 jne 0x550
2f: 65 48 8b 05 45 26 f6 mov %gs:0x7ef62645(%rip),%rax # 0x7ef6267c
36: 7e
37: 48 8d 7b 24 lea 0x24(%rbx),%rdi

but I don't know what the "call" before is, so I wasn't able to match
it up with any obvious code in there.

Linus

2023-10-08 20:52:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Sun, 8 Oct 2023 at 13:13, Linus Torvalds
<[email protected]> wrote:
>
> Your dump does end up being close to a %gs access:

Bah. I should have looked closer at the instructions before the oops.

Because I think that's exactly the problem here. That's the KASAN
checks that have been added, and we have this insane code:

> 10: 48 c7 c0 10 73 02 00 mov $0x27310,%rax
> 17: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
> 1e: fc ff df
> 21: 48 c1 e8 03 shr $0x3,%rax
> 25:* 80 3c 10 00 cmpb $0x0,(%rax,%rdx,1) <-- trapping instruction

Look how both %rax and %rdx are constants, yet then gcc has generated
that crazy "shift a constant value right by three bits, and then use
an addressing mode to add it to another constant".

And that 0xdffffc0000000000 constant is KASAN_SHADOW_OFFSET.

So what I think is going on is trivial - and has nothing to do with ordering.

I think gcc is simply doing a KASAN check on a percpu address.

Which it shouldn't do, and didn't use to do because we did the access
using inline asm.

But now that gcc does the accesses as normal (albeit special address
space) memory accesses, the KASAN code triggers on them too, and it
all goes to hell in a handbasket very quickly.
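
For reference, the instrumented check in the dump corresponds roughly
to the usual KASAN shadow lookup; a sketch (not actual KASAN source,
constants taken from the disassembly above):

static inline int demo_shadow_is_poisoned(unsigned long addr)
{
        unsigned char shadow;

        /* Shift the address right by 3 and add KASAN_SHADOW_OFFSET
         * (0xdffffc0000000000) to find its shadow byte.  For a
         * %gs-relative per-CPU "address" this mapping is meaningless,
         * hence the early fault. */
        shadow = *(unsigned char *)((addr >> 3) + 0xdffffc0000000000UL);

        return shadow != 0;
}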

End result: those percpu accessor functions need to disable any KASAN
checking or other sanitizer checking. Not on the percpu address,
because that's not a "real" address, it's obviously just the offset
from the segment register.

We have some other cases like that, see __read_once_word_nocheck().
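
(For comparison, that helper in include/asm-generic/rwonce.h is
roughly:

static __no_sanitize_or_inline
unsigned long __read_once_word_nocheck(const void *addr)
{
        /* Plain volatile load, with the sanitizers kept away by the
         * attribute; the exact spelling may differ between kernel
         * versions. */
        return (*(const volatile unsigned long *)addr);
}

i.e. an uninstrumented accessor used by READ_ONCE_NOCHECK().)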

And gcc should probably not have generated such code in the first
place, so arguably this is a bug with -fsanitize=kernel-address. How
does gcc handle the thread pointers with address sanitizer? Does it
convert them into real pointers first, and didn't realize that it
can't do it for __seg_gs?

Linus

2023-10-08 21:42:07

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Sun, Oct 8, 2023 at 10:48 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sun, 8 Oct 2023 at 13:13, Linus Torvalds
> <[email protected]> wrote:
> >
> > Your dump does end up being close to a %gs access:
>
> Bah. I should have looked closer at the instructions before the oops.
>
> Because I think that's exactly the problem here. That's the KASAN
> checks that have been added, and we have this insane code:
>
> > 10: 48 c7 c0 10 73 02 00 mov $0x27310,%rax
> > 17: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
> > 1e: fc ff df
> > 21: 48 c1 e8 03 shr $0x3,%rax
> > 25:* 80 3c 10 00 cmpb $0x0,(%rax,%rdx,1) <-- trapping instruction
>
> Look how both %rax and %rdx are constants, yet then gcc has generated
> that crazy "shift a constant value right by three bits, and then use
> an addressing mode to add it to another constant".

Hm, the compiler knows perfectly well how to make compound addresses,
but all this KASAN stuff is a bit special.

> And that 0xdffffc0000000000 constant is KASAN_SHADOW_OFFSET.
>
> So what I think is going on is trivial - and has nothing to do with ordering.
>
> I think gcc is simply doing a KASAN check on a percpu address.
>
> Which it shouldn't do, and didn't use to do because we did the access
> using inline asm.
>
> But now that gcc does the accesses as normal (albeit special address
> space) memory accesses, the KASAN code triggers on them too, and it
> all goes to hell in a handbasket very quickly.

Yes, I can confirm that. The failing .config from Linux Kernel Test
project works perfectly well after KASAN has been switched off.

So, the patch to fix the issue could be as simple as the one attached
to the message.

> End result: those percpu accessor functions need to disable any KASAN
> checking or other sanitizer checking. Not on the percpu address,
> because that's not a "real" address, it's obviously just the offset
> from the segment register.
>
> We have some other cases like that, see __read_once_word_nocheck().
>
> And gcc should probably not have generated such code in the first
> place, so arguably this is a bug with -fsanitize=kernel-address. How
> does gcc handle the thread pointers with address sanitizer? Does it
> convert them into real pointers first, and didn't realize that it
> can't do it for __seg_gs?

I don't know this part of the compiler well, but it should not touch
non-default namespaces. I'll file a bug report there.

Thanks,
Uros.


Attachments:
p.diff.txt (367.00 B)

2023-10-09 11:41:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors


* Uros Bizjak <[email protected]> wrote:

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index ecb256954351..1edf4a5b93ca 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
>
> config USE_X86_SEG_SUPPORT
> def_bool y
> - depends on CC_HAS_NAMED_AS && SMP
> + depends on CC_HAS_NAMED_AS && SMP && !KASAN

So I'd rather express this as a Kconfig quirk line, and explain each quirk.

Something like:

depends on CC_HAS_NAMED_AS
depends on SMP
#
# -fsanitize=kernel-address (KASAN) is at the moment incompatible
# with named address spaces - see GCC bug #12345.
#
depends on !KASAN

... or so.

BTW., please also document the reason why !SMP is excluded.

Thanks,

Ingo

2023-10-09 11:42:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors


* Uros Bizjak <[email protected]> wrote:

> I have tried the trick with all reads volatile (and writes as they were
> before the patch), but it didn't make a difference. Also, the kernel from
> the git/tip branch works OK for default config, so I think there is some
> config option that somehow disturbs the named-as enabled kernel.

Yeah, I made sure tip:x86/percpu boots & works fine on a number of systems
- but that testing wasn't comprehensive at all, and I didn't have KASAN
enabled either, which generates pretty intrusive instrumentation.

Thanks,

Ingo

2023-10-09 11:52:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors


* Ingo Molnar <[email protected]> wrote:

>
> * Uros Bizjak <[email protected]> wrote:
>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index ecb256954351..1edf4a5b93ca 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
> >
> > config USE_X86_SEG_SUPPORT
> > def_bool y
> > - depends on CC_HAS_NAMED_AS && SMP
> > + depends on CC_HAS_NAMED_AS && SMP && !KASAN
>
> So I'd rather express this as a Kconfig quirk line, and explain each quirk.
>
> Something like:
>
> depends on CC_HAS_NAMED_AS
> depends on SMP
> #
> # -fsanitize=kernel-address (KASAN) is at the moment incompatible
> # with named address spaces - see GCC bug #12345.
> #
> depends on !KASAN
>
> ... or so.

BTW., while this is OK for testing, this is too heavy-handed for release
purposes, so please only disable the KASAN instrumentation for the affected
percpu accessors.

See the various __no_sanitize* attributes available.

I'd even suggest introducing a new attribute variant, specific to x86,
prefixed with __no_sanitize_x86_seg or so, which would allow the eventual
KASAN-instrumentation of the percpu accessors once the underlying GCC bug
is fixed.

Thanks,

Ingo

2023-10-09 12:01:37

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Mon, Oct 9, 2023 at 1:51 PM Ingo Molnar <[email protected]> wrote:
>
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Uros Bizjak <[email protected]> wrote:
> >
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index ecb256954351..1edf4a5b93ca 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
> > >
> > > config USE_X86_SEG_SUPPORT
> > > def_bool y
> > > - depends on CC_HAS_NAMED_AS && SMP
> > > + depends on CC_HAS_NAMED_AS && SMP && !KASAN
> >
> > So I'd rather express this as a Kconfig quirk line, and explain each quirk.
> >
> > Something like:
> >
> > depends on CC_HAS_NAMED_AS
> > depends on SMP
> > #
> > # -fsanitize=kernel-address (KASAN) is at the moment incompatible
> > # with named address spaces - see GCC bug #12345.
> > #
> > depends on !KASAN
> >
> > ... or so.
>
> BTW., while this OK for testing, this is too heavy handed for release
> purposes, so please only disable the KASAN instrumentation for the affected
> percpu accessors.
>
> See the various __no_sanitize* attributes available.

These attributes are for function declarations. The percpu casts can
not be implemented with separate static inline functions. Also,
__no_sanitize_address is mutually exclusive with __always_inline.
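
To spell it out (a sketch, not a proposed patch): the accessor has to
stay a type-generic macro around a cast, while the attribute needs a
real function to attach to, and that function could not stay
__always_inline anyway:

/* What we have: a type-generic cast, no function to hang an attribute on. */
#define __raw_cpu_read(qual, pcp) \
        (*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)))

/* What an attribute would need (hypothetical helper; the two
 * attributes below are the mutually exclusive pair): */
static __no_sanitize_address __always_inline
unsigned long demo_raw_cpu_read_8(unsigned long __seg_gs *ptr)
{
        return *ptr;
}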

Uros.

> I'd even suggest introducing a new attribute variant, specific to x86,
> prefixed with __no_sanitize_x86_seg or so, which would allow the eventual
> KASAN-instrumentation of the percpu accessors once the underlying GCC bug
> is fixed.
>
> Thanks,
>
> Ingo

2023-10-09 12:20:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors


* Uros Bizjak <[email protected]> wrote:

> On Mon, Oct 9, 2023 at 1:51 PM Ingo Molnar <[email protected]> wrote:
> >
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Uros Bizjak <[email protected]> wrote:
> > >
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index ecb256954351..1edf4a5b93ca 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
> > > >
> > > > config USE_X86_SEG_SUPPORT
> > > > def_bool y
> > > > - depends on CC_HAS_NAMED_AS && SMP
> > > > + depends on CC_HAS_NAMED_AS && SMP && !KASAN
> > >
> > > So I'd rather express this as a Kconfig quirk line, and explain each quirk.
> > >
> > > Something like:
> > >
> > > depends on CC_HAS_NAMED_AS
> > > depends on SMP
> > > #
> > > # -fsanitize=kernel-address (KASAN) is at the moment incompatible
> > > # with named address spaces - see GCC bug #12345.
> > > #
> > > depends on !KASAN
> > >
> > > ... or so.
> >
> > BTW., while this OK for testing, this is too heavy handed for release
> > purposes, so please only disable the KASAN instrumentation for the affected
> > percpu accessors.
> >
> > See the various __no_sanitize* attributes available.
>
> These attributes are for function declarations. The percpu casts can
> not be implemented with separate static inline functions. Also,
> __no_sanitize_address is mutually exclusive with __always_inline.

Sigh - I guess the Kconfig toggle is the only solution then?

Thanks,

Ingo

2023-10-09 12:28:02

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Mon, Oct 9, 2023 at 1:41 PM Ingo Molnar <[email protected]> wrote:
>
>
> * Uros Bizjak <[email protected]> wrote:
>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index ecb256954351..1edf4a5b93ca 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
> >
> > config USE_X86_SEG_SUPPORT
> > def_bool y
> > - depends on CC_HAS_NAMED_AS && SMP
> > + depends on CC_HAS_NAMED_AS && SMP && !KASAN
>
> So I'd rather express this as a Kconfig quirk line, and explain each quirk.
>
> Something like:
>
> depends on CC_HAS_NAMED_AS
> depends on SMP
> #
> # -fsanitize=kernel-address (KASAN) is at the moment incompatible
> # with named address spaces - see GCC bug #12345.
> #
> depends on !KASAN
>
> ... or so.
>
> BTW., please also document the reason why !SMP is excluded.

Eh, thanks for pointing it out; it is not needed at all, it also works
for !SMP. Will fix it in a Kconfig patch.

Thanks,
Uros.

2023-10-09 12:33:51

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors



> On Oct 9, 2023, at 3:00 PM, Uros Bizjak <[email protected]> wrote:
>
>
> On Mon, Oct 9, 2023 at 1:51 PM Ingo Molnar <[email protected]> wrote:
>>
>> BTW., while this OK for testing, this is too heavy handed for release
>> purposes, so please only disable the KASAN instrumentation for the affected
>> percpu accessors.
>>
>> See the various __no_sanitize* attributes available.
>
> These attributes are for function declarations. The percpu casts can
> not be implemented with separate static inline functions. Also,
> __no_sanitize_address is mutually exclusive with __always_inline.

Right, but for GCC you may be able to do something like:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-fsanitize=address"

// Your code here...
#pragma GCC diagnostic pop

Not sure if there is something equivalent in CLANG, and it should be done with
the kernel’s _Pragma.

2023-10-09 12:43:11

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Mon, Oct 9, 2023 at 2:21 PM Nadav Amit <[email protected]> wrote:
>
>
>
> > On Oct 9, 2023, at 3:00 PM, Uros Bizjak <[email protected]> wrote:
> >
> >
> > On Mon, Oct 9, 2023 at 1:51 PM Ingo Molnar <[email protected]> wrote:
> >>
> >> BTW., while this OK for testing, this is too heavy handed for release
> >> purposes, so please only disable the KASAN instrumentation for the affected
> >> percpu accessors.
> >>
> >> See the various __no_sanitize* attributes available.
> >
> > These attributes are for function declarations. The percpu casts can
> > not be implemented with separate static inline functions. Also,
> > __no_sanitize_address is mutually exclusive with __always_inline.
>
> Right, but for GCC you may be able to do something like:
>
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-fsanitize=address"
>
> // Your code here...
> #pragma GCC diagnostic pop
>
> Not sure if there is something equivalent in CLANG, and it should be done with
> the kernel’s _Pragma.

Unfortunately, this is only for diagnostics and expects "-W..." to
suppress warnings. Here we want to disable the kernel sanitizer just for
the enclosing access, and I'm sure it won't work with diagnostic
pragmas. I don't think that "-fsanitize=..." is included in the target or
optimization options allowed in a pragma.

Uros.

2023-10-09 12:53:30

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors



> On Oct 9, 2023, at 3:42 PM, Uros Bizjak <[email protected]> wrote:
>
>
> On Mon, Oct 9, 2023 at 2:21 PM Nadav Amit <[email protected]> wrote:
>>
>>
>>
>>> On Oct 9, 2023, at 3:00 PM, Uros Bizjak <[email protected]> wrote:
>>>
>>>
>>> On Mon, Oct 9, 2023 at 1:51 PM Ingo Molnar <[email protected]> wrote:
>>>>
>>>> BTW., while this OK for testing, this is too heavy handed for release
>>>> purposes, so please only disable the KASAN instrumentation for the affected
>>>> percpu accessors.
>>>>
>>>> See the various __no_sanitize* attributes available.
>>>
>>> These attributes are for function declarations. The percpu casts can
>>> not be implemented with separate static inline functions. Also,
>>> __no_sanitize_address is mutually exclusive with __always_inline.
>>
>> Right, but for GCC you may be able to do something like:
>>
>> #pragma GCC diagnostic push
>> #pragma GCC diagnostic ignored "-fsanitize=address"
>>
>> // Your code here...
>> #pragma GCC diagnostic pop
>>
>> Not sure if there is something equivalent in CLANG, and it should be done with
>> the kernel’s _Pragma.
>
> Unfortunately, this is only for diagnostics and expects "-W..." to
> suppress warnings. Here we want to disable kernel sanitizer just for
> the enclosing access and I'm sure it won't work with diagnostics
> pragmas. I don't think that "-fsanitize=..." is included in target or
> optimization options allowed in Pragma.

Ugh. Sorry for the noise. You seem to be right.

2023-10-09 14:35:48

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Mon, Oct 9, 2023 at 1:41 PM Ingo Molnar <[email protected]> wrote:
>
>
> * Uros Bizjak <[email protected]> wrote:
>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index ecb256954351..1edf4a5b93ca 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
> >
> > config USE_X86_SEG_SUPPORT
> > def_bool y
> > - depends on CC_HAS_NAMED_AS && SMP
> > + depends on CC_HAS_NAMED_AS && SMP && !KASAN
>
> So I'd rather express this as a Kconfig quirk line, and explain each quirk.
>
> Something like:
>
> depends on CC_HAS_NAMED_AS
> depends on SMP
> #
> # -fsanitize=kernel-address (KASAN) is at the moment incompatible
> # with named address spaces - see GCC bug #12345.
> #
> depends on !KASAN

This is now PR sanitizer/111736 [1], but perhaps KASAN people [CC'd]
also want to be notified about this problem.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Thanks,
Uros.

2023-10-10 06:37:54

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Sun, Oct 8, 2023 at 8:00 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 4 Oct 2023 at 07:51, Uros Bizjak <[email protected]> wrote:
> >
> > The percpu code mostly uses inline assembly. Using segment qualifiers
> > allows to use C code instead, which enables the compiler to perform
> > various optimizations (e.g. propagation of memory arguments). Convert
> > percpu read and write accessors to C code, so the memory argument can
> > be propagated to the instruction that uses this argument.
>
> So apparently this causes boot failures.
>
> It might be worth testing a version where this:
>
> > +#define raw_cpu_read_1(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_read_2(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_read_4(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val)
> > +#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val)
> > +#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val)
>
> and this
>
> > +#ifdef CONFIG_X86_64
> > +#define raw_cpu_read_8(pcp) __raw_cpu_read(, pcp)
> > +#define raw_cpu_write_8(pcp, val) __raw_cpu_write(, pcp, val)
>
> was all using 'volatile' in the qualifier argument and see if that
> makes the boot failure go away.
>
> Because while the old code wasn't "asm volatile", even just a *plain*
> asm() is certainly a lot more serialized than a normal access.
>
> For example, the asm() version of raw_cpu_write() used "+m" for the
> destination modifier, which means that if you did multiple percpu
> writes to the same variable, gcc would output multiple asm calls,
> because it would see the subsequent ones as reading the old value
> (even if they don't *actually* do so).
>
> That's admittedly really just because it uses a common macro for
> raw_cpu_write() and the updates (like the percpu_add() code), so the
> fact that it uses "+m" instead of "=m" is just a random odd artifact
> of the inline asm version, but maybe we have code that ends up working
> just by accident.

FYI: While the emitted asm code is correct, the program flow depends
on an uninitialized value. The compiler is free to remove the whole insn
stream in this case. Admittedly, we have asm here, so the compiler is
a bit more forgiving, but it is a slippery slope nevertheless.
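
A stripped-down illustration (hypothetical, not kernel code):

void demo(void)
{
        int x;                          /* never initialized */

        /* "+m" makes x an input as well as an output, so the asm
         * formally consumes an indeterminate value before storing the
         * new one; the compiler is allowed to exploit that. */
        asm("movl $1, %0" : "+m" (x));
}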

Uros.

2024-04-10 11:13:46

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Mon, Oct 9, 2023 at 4:35 PM Uros Bizjak <[email protected]> wrote:
>
> On Mon, Oct 9, 2023 at 1:41 PM Ingo Molnar <[email protected]> wrote:
> >
> >
> > * Uros Bizjak <[email protected]> wrote:
> >
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index ecb256954351..1edf4a5b93ca 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
> > >
> > > config USE_X86_SEG_SUPPORT
> > > def_bool y
> > > - depends on CC_HAS_NAMED_AS && SMP
> > > + depends on CC_HAS_NAMED_AS && SMP && !KASAN
> >
> > So I'd rather express this as a Kconfig quirk line, and explain each quirk.
> >
> > Something like:
> >
> > depends on CC_HAS_NAMED_AS
> > depends on SMP
> > #
> > # -fsanitize=kernel-address (KASAN) is at the moment incompatible
> > # with named address spaces - see GCC bug #12345.
> > #
> > depends on !KASAN
>
> This is now PR sanitizer/111736 [1], but perhaps KASAN people [CC'd]
> also want to be notified about this problem.
>
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Filed a KASAN bug to track this:
https://bugzilla.kernel.org/show_bug.cgi?id=218703

Thanks!

2024-04-10 11:21:45

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Wed, Apr 10, 2024 at 1:11 PM Andrey Konovalov <[email protected]> wrote:
>
> On Mon, Oct 9, 2023 at 4:35 PM Uros Bizjak <[email protected]> wrote:
> >
> > On Mon, Oct 9, 2023 at 1:41 PM Ingo Molnar <[email protected]> wrote:
> > >
> > >
> > > * Uros Bizjak <[email protected]> wrote:
> > >
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index ecb256954351..1edf4a5b93ca 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -2393,7 +2393,7 @@ config CC_HAS_NAMED_AS
> > > >
> > > > config USE_X86_SEG_SUPPORT
> > > > def_bool y
> > > > - depends on CC_HAS_NAMED_AS && SMP
> > > > + depends on CC_HAS_NAMED_AS && SMP && !KASAN
> > >
> > > So I'd rather express this as a Kconfig quirk line, and explain each quirk.
> > >
> > > Something like:
> > >
> > > depends on CC_HAS_NAMED_AS
> > > depends on SMP
> > > #
> > > # -fsanitize=kernel-address (KASAN) is at the moment incompatible
> > > # with named address spaces - see GCC bug #12345.
> > > #
> > > depends on !KASAN
> >
> > This is now PR sanitizer/111736 [1], but perhaps KASAN people [CC'd]
> > also want to be notified about this problem.
> >
> > [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736
>
> Filed a KASAN bug to track this:
> https://bugzilla.kernel.org/show_bug.cgi?id=218703

Please note the fix in -tip tree that reenables sanitizers for fixed compilers:

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/percpu&id=9ebe5500d4b25ee4cde04eec59a6764361a60709

Thanks,
Uros.

2024-04-10 11:24:59

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86/percpu: Use C for percpu read/write accessors

On Wed, Apr 10, 2024 at 1:21 PM Uros Bizjak <[email protected]> wrote:
>
> > Filed a KASAN bug to track this:
> > https://bugzilla.kernel.org/show_bug.cgi?id=218703
>
> Please note the fix in -tip tree that reenables sanitizers for fixed compilers:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/percpu&id=9ebe5500d4b25ee4cde04eec59a6764361a60709
>
> Thanks,
> Uros.

Ah, awesome! I guess this will be in the mainline soon, so I'll close
the bug then. Thank you!