Hi all,
This is a collection of x86/percpu changes that I had pending and got reminded
of by Linus' comment yesterday about __this_cpu_xchg().
This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
Built and boot tested with CONFIG_DEBUG_PREEMPT=y.
---
arch/x86/include/asm/irq_regs.h | 4 +-
arch/x86/include/asm/percpu.h | 236 +++++++++++++++++++++-------------------
arch/x86/include/asm/smp.h | 3 +-
arch/x86/mm/tlb.c | 62 +++++------
include/linux/smp.h | 45 +++++---
kernel/sched/fair.c | 5 +-
6 files changed, 193 insertions(+), 162 deletions(-)
And because it's one of _those_ days, I forgot to include one patch...
---
Subject: x86/percpu: Optimize raw_cpu_xchg()
From: Peter Zijlstra <[email protected]>
Date: Wed Feb 27 11:09:56 CET 2019
Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
loop.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/percpu.h | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -407,9 +407,21 @@ do { \
#define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val)
#define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val)
#define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val)
-#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val)
-#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val)
-#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val)
+
+/*
+ * raw_cpu_xchg() can use a load-store since it is not required to be
+ * IRQ-safe.
+ */
+#define raw_percpu_xchg_op(var, nval) \
+({ \
+ typeof(var) pxo_ret__ = raw_cpu_read(var); \
+ raw_cpu_write(var, (nval)); \
+ pxo_ret__; \
+})
+
+#define raw_cpu_xchg_1(pcp, val) raw_percpu_xchg_op(pcp, val)
+#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val)
+#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val)
#define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp)
#define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp)
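For contrast, the IRQ-safe percpu_xchg_op() being replaced is a cmpxchg
loop. A rough C-level sketch of what its asm does - macro name here is
hypothetical and details are elided:

/*
 * Retry if an interrupt modified var between the read and the cmpxchg;
 * that is what makes this_cpu_xchg() IRQ-safe, and also what makes it
 * more expensive than the plain load-store used above.
 */
#define sketch_percpu_xchg_op(var, nval)				\
({									\
	typeof(var) pxo_old__, pxo_new__ = (nval);			\
	do {								\
		pxo_old__ = this_cpu_read(var);				\
	} while (this_cpu_cmpxchg(var, pxo_old__, pxo_new__) != pxo_old__); \
	pxo_old__;							\
})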
On Wed, Feb 27, 2019 at 11:24:45AM +0100, Peter Zijlstra wrote:
>
> And because it's one of _those_ days, I forgot to include one patch...
>
> ---
> Subject: x86/percpu: Optimize raw_cpu_xchg()
> From: Peter Zijlstra <[email protected]>
> Date: Wed Feb 27 11:09:56 CET 2019
>
> Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
> this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
> loop.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/include/asm/percpu.h | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -407,9 +407,21 @@ do { \
> #define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val)
> #define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val)
> #define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val)
> -#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val)
> -#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val)
> -#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val)
> +
> +/*
> + * raw_cpu_xchg() can use a load-store since it is not required to be
> + * IRQ-safe.
> + */
> +#define raw_percpu_xchg_op(var, nval) \
> +({ \
> + typeof(var) pxo_ret__ = raw_cpu_read(var); \
> + raw_cpu_write(var, (nval)); \
> + pxo_ret__; \
> +})
> +
> +#define raw_cpu_xchg_1(pcp, val) raw_percpu_xchg_op(pcp, val)
> +#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val)
> +#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val)
>
> #define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp)
> #define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp)
And yes, I just added raw_cpu_xchg_8... *sigh*
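Presumably the missing hunk is the obvious 64-bit counterpart of the above,
mirroring the 1/2/4-byte cases (exact context not shown here):

#define raw_cpu_xchg_8(pcp, val)	raw_percpu_xchg_op(pcp, val)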
> On Feb 27, 2019, at 2:12 AM, Peter Zijlstra <[email protected]> wrote:
>
> Hi all,
>
> This is a collection of x86/percpu changes that I had pending and got reminded
> of by Linus' comment yesterday about __this_cpu_xchg().
>
> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
>
> Built and boot tested with CONFIG_DEBUG_PREEMPT=y.
Overall this series affects 70 functions and shortens the code by 326 bytes.
_local_bh_enable() for example is shortened by 14 bytes (26%).
I must admit that although I pointed out some of these issues before, I am
not sure whether they are really important...
Recently, I tried to see how to make the compiler generate “better code”
from Linux. I sprinkled the “pure” attribute on many common functions (e.g.,
page_rmapping(), sg_next()); sprinkled the const attribute on some others
(e.g., jiffies_to_msecs()); created a const-alias variable so the compiler
would consider the kaslr variables and sme_me_mask as constant after
initialization; and so on.
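A minimal sketch of the kind of annotation meant here - function names and
bodies are made up, only the attributes are the point:

/*
 * "pure": the result depends only on the arguments and on readable global
 * memory, and the call has no side effects, so the compiler may CSE
 * repeated calls.
 * "const": stricter still - the result depends on the arguments alone.
 */
__attribute__((pure)) int table_lookup(const int *tbl, int idx)
{
	return tbl[idx];
}

__attribute__((const)) unsigned int my_jiffies_to_msecs(unsigned long j)
{
	return j * (1000 / 250);	/* e.g. HZ=250 */
}

int lookup_twice(const int *tbl, int idx)
{
	/* Thanks to "pure", the compiler can fold these into one call. */
	return table_lookup(tbl, idx) + table_lookup(tbl, idx);
}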
I was then looking at the changed code, and while some functions were
shorter and some longer, many common functions did look “better”. The only
problem was that none of the benchmarks I ran showed any measurable impact.
So perhaps it is a matter of measurement, but right now there is no clean
win.
On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote:
> This is a collection of x86/percpu changes that I had pending and got reminded
> of by Linus' comment yesterday about __this_cpu_xchg().
>
> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
(Sorry; this is going to have _wide_ output)
OK, so what I did is build 4 kernels (O=defconfig-build{,1,2,3}) with,
respectively, that many patches of this series applied.
When I look at just the vmlinux size output:
$ size defconfig-build*/vmlinux
text data bss dec hex filename
19540631 5040164 1871944 26452739 193a303 defconfig-build/vmlinux
19540635 5040164 1871944 26452743 193a307 defconfig-build1/vmlinux
19540685 5040164 1871944 26452793 193a339 defconfig-build2/vmlinux
19540685 5040164 1871944 26452793 193a339 defconfig-build3/vmlinux
Things appear to get slightly larger; however, when I look in more
detail using my newly written compare script (find attached), I get
things like:
$ ./compare.sh defconfig-build defconfig-build1
arch/x86/mm/fault.o 12850 12818 -32
kernel/power/process.o 3586 3706 +120
kernel/locking/rtmutex.o 1687 1671 -16
kernel/sched/core.o 7127 7181 +54
kernel/time/tick-sched.o 8941 8837 -104
kernel/exit.o 310 385 +75
kernel/softirq.o 1217 1199 -18
kernel/workqueue.o 3240 3288 +48
net/ipv6/tcp_ipv6.o 25434 25345 -89
net/ipv4/tcp_ipv4.o 301 305 +4
total 4768226 4768268 +42
When we look at just tick-sched.o:
$ ./compare.sh defconfig-build defconfig-build1 kernel/time/tick-sched.o
can_stop_idle_tick.isra.14 146 139 -7
we see a totally different number?!
$ ./compare.sh defconfig-build defconfig-build1 kernel/time/tick-sched.o can_stop_idle_tick.isra.14
0000 0000000000000680 <can_stop_idle_tick.isra.14>: | 0000 0000000000000680 <can_stop_idle_tick.isra.14>:
0000 680: 53 push %rbx | 0000 680: 53 push %rbx
0001 681: 89 f8 mov %edi,%eax | 0001 681: 89 f8 mov %edi,%eax
0003 683: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 68b <can_stop_id | 0003 683: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 68b <can_stop_id
000a 68a: 00 | 000a 68a: 00
0007 687: R_X86_64_PC32 __cpu_online_mask-0x4 | 0007 687: R_X86_64_PC32 __cpu_online_mask-0x4
000b 68b: 0f 92 c3 setb %bl | 000b 68b: 0f 92 c3 setb %bl
000e 68e: 73 67 jae 6f7 <can_stop_idle_tick.isra.14+0x77> \ 000e 68e: 73 48 jae 6d8 <can_stop_idle_tick.isra.14+0x58>
0010 690: 8b 06 mov (%rsi),%eax | 0010 690: 8b 06 mov (%rsi),%eax
0012 692: 85 c0 test %eax,%eax | 0012 692: 85 c0 test %eax,%eax
0014 694: 74 21 je 6b7 <can_stop_idle_tick.isra.14+0x37> | 0014 694: 74 21 je 6b7 <can_stop_idle_tick.isra.14+0x37>
0016 696: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax | 0016 696: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
001d 69d: 00 00 | 001d 69d: 00 00
001b 69b: R_X86_64_32S current_task | 001b 69b: R_X86_64_32S current_task
001f 69f: 48 8b 00 mov (%rax),%rax | 001f 69f: 48 8b 00 mov (%rax),%rax
0022 6a2: a8 08 test $0x8,%al | 0022 6a2: a8 08 test $0x8,%al
0024 6a4: 75 11 jne 6b7 <can_stop_idle_tick.isra.14+0x37> | 0024 6a4: 75 11 jne 6b7 <can_stop_idle_tick.isra.14+0x37>
0026 6a6: 65 66 8b 05 00 00 00 mov %gs:0x0(%rip),%ax # 6ae <can_stop \ 0026 6a6: 65 66 8b 35 00 00 00 mov %gs:0x0(%rip),%si # 6ae <can_stop
002d 6ad: 00 | 002d 6ad: 00
002a 6aa: R_X86_64_PC32 irq_stat-0x4 | 002a 6aa: R_X86_64_PC32 irq_stat-0x4
002e 6ae: 66 85 c0 test %ax,%ax \ 002e 6ae: 66 85 f6 test %si,%si
0031 6b1: 75 0a jne 6bd <can_stop_idle_tick.isra.14+0x3d> | 0031 6b1: 75 0a jne 6bd <can_stop_idle_tick.isra.14+0x3d>
0033 6b3: 89 d8 mov %ebx,%eax | 0033 6b3: 89 d8 mov %ebx,%eax
0035 6b5: 5b pop %rbx | 0035 6b5: 5b pop %rbx
0036 6b6: c3 retq | 0036 6b6: c3 retq
0037 6b7: 31 db xor %ebx,%ebx | 0037 6b7: 31 db xor %ebx,%ebx
0039 6b9: 89 d8 mov %ebx,%eax | 0039 6b9: 89 d8 mov %ebx,%eax
003b 6bb: 5b pop %rbx | 003b 6bb: 5b pop %rbx
003c 6bc: c3 retq | 003c 6bc: c3 retq
003d 6bd: 31 db xor %ebx,%ebx | 003d 6bd: 31 db xor %ebx,%ebx
003f 6bf: 83 3d 00 00 00 00 09 cmpl $0x9,0x0(%rip) # 6c6 <can_stop_id | 003f 6bf: 83 3d 00 00 00 00 09 cmpl $0x9,0x0(%rip) # 6c6 <can_stop_id
0041 6c1: R_X86_64_PC32 .bss-0x5 | 0041 6c1: R_X86_64_PC32 .bss-0x5
0046 6c6: 7f eb jg 6b3 <can_stop_idle_tick.isra.14+0x33> | 0046 6c6: 7f eb jg 6b3 <can_stop_idle_tick.isra.14+0x33>
0048 6c8: 65 66 8b 05 00 00 00 mov %gs:0x0(%rip),%ax # 6d0 <can_stop \ 0048 6c8: 0f b7 f6 movzwl %si,%esi
004f 6cf: 00 \ 004b 6cb: f7 c6 ff fd 00 00 test $0xfdff,%esi
004c 6cc: R_X86_64_PC32 irq_stat-0x4 \ 0051 6d1: 74 e0 je 6b3 <can_stop_idle_tick.isra.14+0x33>
0050 6d0: a9 ff fd 00 00 test $0xfdff,%eax \ 0053 6d3: e9 00 00 00 00 jmpq 6d8 <can_stop_idle_tick.isra.14+0x58>
0055 6d5: 74 dc je 6b3 <can_stop_idle_tick.isra.14+0x33> \ 0054 6d4: R_X86_64_PC32 .text.unlikely-0x4
0057 6d7: 65 66 8b 35 00 00 00 mov %gs:0x0(%rip),%si # 6df <can_stop \ 0058 6d8: 3b 3d 00 00 00 00 cmp 0x0(%rip),%edi # 6de <can_stop_id
005e 6de: 00 \ 005a 6da: R_X86_64_PC32 tick_do_timer_cpu-0x4
005b 6db: R_X86_64_PC32 irq_stat-0x4 \ 005e 6de: 75 0a jne 6ea <can_stop_idle_tick.isra.14+0x6a>
005f 6df: 48 c7 c7 00 00 00 00 mov $0x0,%rdi \ 0060 6e0: c7 05 00 00 00 00 ff movl $0xffffffff,0x0(%rip) # 6ea <can_
0062 6e2: R_X86_64_32S .rodata.str1.8 \ 0067 6e7: ff ff ff
0066 6e6: 0f b7 f6 movzwl %si,%esi \ 0062 6e2: R_X86_64_PC32 tick_do_timer_cpu-0x8
0069 6e9: e8 00 00 00 00 callq 6ee <can_stop_idle_tick.isra.14+0x6e> \ 006a 6ea: 48 c7 02 00 00 00 00 movq $0x0,(%rdx)
006a 6ea: R_X86_64_PLT32 printk-0x4 \ 0071 6f1: eb c0 jmp 6b3 <can_stop_idle_tick.isra.14+0x33>
006e 6ee: 83 05 00 00 00 00 01 addl $0x1,0x0(%rip) # 6f5 <can_stop_id \ 0073 6f3: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
0070 6f0: R_X86_64_PC32 .bss-0x5 \ 007a 6fa: 00 00 00 00
0075 6f5: eb bc jmp 6b3 <can_stop_idle_tick.isra.14+0x33> \ 007e 6fe: 66 90 xchg %ax,%ax
0077 6f7: 3b 3d 00 00 00 00 cmp 0x0(%rip),%edi # 6fd <can_stop_id \ fffffffffffff980
0079 6f9: R_X86_64_PC32 tick_do_timer_cpu-0x4 \ 0000 0000000000000000 <can_stop_idle_tick.isra.14.cold.23>:
007d 6fd: 75 0a jne 709 <can_stop_idle_tick.isra.14+0x89> \ 0000 0: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
007f 6ff: c7 05 00 00 00 00 ff movl $0xffffffff,0x0(%rip) # 709 <can_ \ 0003 3: R_X86_64_32S .rodata.str1.8
0086 706: ff ff ff \ 0007 7: e8 00 00 00 00 callq c <can_stop_idle_tick.isra.14.cold.23+0x
0081 701: R_X86_64_PC32 tick_do_timer_cpu-0x8 \ 0008 8: R_X86_64_PLT32 printk-0x4
0089 709: 48 c7 02 00 00 00 00 movq $0x0,(%rdx) \ 000c c: 83 05 00 00 00 00 01 addl $0x1,0x0(%rip) # 13 <can_stop_idl
0090 710: eb a1 jmp 6b3 <can_stop_idle_tick.isra.14+0x33> \ 000e e: R_X86_64_PC32 .bss-0x5
0092 712: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1) \ 0013 13: e9 00 00 00 00 jmpq 18 <__setup_setup_tick_nohz>
0099 719: 00 00 00 00 \ 0014 14: R_X86_64_PC32 .text+0x6af
009d 71d: 0f 1f 00 nopl (%rax) \
And we see that GCC created a .cold subfunction: because the first patch
removed the volatile from __this_cpu_read(), GCC was free to move the
unlikely path out of line.
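A hypothetical userspace illustration of the mechanism - not kernel code;
the asm merely stands in for a per-cpu read:

#include <stdio.h>

static int counter;

static inline int read_counter(void)
{
	int v;

	/* No volatile: an asm whose only effect is its output may be
	 * CSE'd and moved around by GCC. Add volatile and every call
	 * site must perform its own load, pinned in place. */
	asm("movl %1, %0" : "=r" (v) : "m" (counter));
	return v;
}

int demo(void)
{
	/* __builtin_expect() is what the kernel's unlikely() expands to;
	 * at -O2 GCC may split this branch into demo.cold, like the
	 * .cold subfunction in the right-hand column above. */
	if (__builtin_expect(read_counter() < 0, 0)) {
		puts("unexpected negative value");
		return -1;
	}
	return read_counter();	/* free to reuse the first load */
}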
Similarly for the second patch, which removes volatile from
smp_processor_id():
$ ./compare.sh defconfig-build1 defconfig-build2
arch/x86/events/amd/ibs.o 667 757 +90
arch/x86/kernel/cpu/mce/core.o 2677 2696 +19
arch/x86/kernel/cpu/mce/therm_throt.o 508 527 +19
arch/x86/kernel/cpu/mtrr/generic.o 9523 9203 -320
arch/x86/kernel/acpi/sleep.o 3152 3088 -64
arch/x86/kernel/nmi.o 338 562 +224
arch/x86/kernel/process.o 1554 1586 +32
arch/x86/kernel/tsc_sync.o 5591 5377 -214
kernel/irq/spurious.o 5835 5771 -64
kernel/irq/cpuhotplug.o 2253 2189 -64
kernel/time/clocksource.o 480 593 +113
total 4768268 4768039 -229
we get smaller total executable sections; and even when there is growth:
$ ./compare.sh defconfig-build1 defconfig-build2 arch/x86/events/amd/ibs.o setup_APIC_ibs
0000 0000000000000420 <setup_APIC_ibs>: | 0000 0000000000000420 <setup_APIC_ibs>:
0000 420: 53 push %rbx | 0000 420: 53 push %rbx
0001 421: b9 3a 10 01 c0 mov $0xc001103a,%ecx | 0001 421: b9 3a 10 01 c0 mov $0xc001103a,%ecx
0006 426: 0f 32 rdmsr | 0006 426: 0f 32 rdmsr
0008 428: 48 c1 e2 20 shl $0x20,%rdx | 0008 428: 48 c1 e2 20 shl $0x20,%rdx
000c 42c: 48 89 d3 mov %rdx,%rbx | 000c 42c: 48 89 d3 mov %rdx,%rbx
000f 42f: 48 09 c3 or %rax,%rbx | 000f 42f: 48 09 c3 or %rax,%rbx
0012 432: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) | 0012 432: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0017 437: f6 c7 01 test $0x1,%bh | 0017 437: f6 c7 01 test $0x1,%bh
001a 43a: 74 2a je 466 <setup_APIC_ibs+0x46> \ 001a 43a: 0f 84 00 00 00 00 je 440 <setup_APIC_ibs+0x20>
001c 43c: 89 df mov %ebx,%edi \ 001c 43c: R_X86_64_PC32 .text.unlikely-0x4
001e 43e: 31 c9 xor %ecx,%ecx \ 0020 440: 89 df mov %ebx,%edi
0020 440: 31 f6 xor %esi,%esi \ 0022 442: 31 c9 xor %ecx,%ecx
0022 442: ba 04 00 00 00 mov $0x4,%edx \ 0024 444: 31 f6 xor %esi,%esi
0027 447: 83 e7 0f and $0xf,%edi \ 0026 446: ba 04 00 00 00 mov $0x4,%edx
002a 44a: e8 00 00 00 00 callq 44f <setup_APIC_ibs+0x2f> \ 002b 44b: 83 e7 0f and $0xf,%edi
002b 44b: R_X86_64_PLT32 setup_APIC_eilvt-0x4 \ 002e 44e: e8 00 00 00 00 callq 453 <setup_APIC_ibs+0x33>
002f 44f: 85 c0 test %eax,%eax \ 002f 44f: R_X86_64_PLT32 setup_APIC_eilvt-0x4
0031 451: 75 13 jne 466 <setup_APIC_ibs+0x46> \ 0033 453: 85 c0 test %eax,%eax
0033 453: 5b pop %rbx \ 0035 455: 0f 85 00 00 00 00 jne 45b <setup_APIC_ibs+0x3b>
0034 454: c3 retq \ 0037 457: R_X86_64_PC32 .text.unlikely-0x4
0035 455: 31 d2 xor %edx,%edx \ 003b 45b: 5b pop %rbx
0037 457: 48 89 de mov %rbx,%rsi \ 003c 45c: c3 retq
003a 45a: bf 3a 10 01 c0 mov $0xc001103a,%edi \ 003d 45d: 31 d2 xor %edx,%edx
003f 45f: e8 00 00 00 00 callq 464 <setup_APIC_ibs+0x44> \ 003f 45f: 48 89 de mov %rbx,%rsi
0040 460: R_X86_64_PLT32 do_trace_read_msr-0x4 \ 0042 462: bf 3a 10 01 c0 mov $0xc001103a,%edi
0044 464: eb d1 jmp 437 <setup_APIC_ibs+0x17> \ 0047 467: e8 00 00 00 00 callq 46c <setup_APIC_ibs+0x4c>
0046 466: 65 8b 35 00 00 00 00 mov %gs:0x0(%rip),%esi # 46d <setup_A \ 0048 468: R_X86_64_PLT32 do_trace_read_msr-0x4
0049 469: R_X86_64_PC32 cpu_number-0x4 \ 004c 46c: eb c9 jmp 437 <setup_APIC_ibs+0x17>
004d 46d: 48 c7 c7 00 00 00 00 mov $0x0,%rdi \ 004e 46e: 66 90 xchg %ax,%ax
0050 470: R_X86_64_32S .rodata.str1.8 \ fffffffffffffbe0
0054 474: 5b pop %rbx \ 0000 0000000000000000 <setup_APIC_ibs.cold.9>:
0055 475: e9 00 00 00 00 jmpq 47a <setup_APIC_ibs+0x5a> \ 0000 0: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
0056 476: R_X86_64_PLT32 printk-0x4 \ 0003 3: R_X86_64_32S .rodata.str1.8
005a 47a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) \ 0007 7: 5b pop %rbx
fffffffffffffbe0 \ 0008 8: 65 8b 35 00 00 00 00 mov %gs:0x0(%rip),%esi # f <setup_API
\ 000b b: R_X86_64_PC32 cpu_number-0x4
\ 000f f: e9 00 00 00 00 jmpq 14 <force_ibs_eilvt_setup.cold.10>
\ 0010 10: R_X86_64_PLT32 printk-0x4
\ 0000
It is because of cold subfunction creation, with a reduction in size of
the regular path.
The third build included patches 3 and 4 (because they don't overlap
much); these give some meagre savings:
$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/kernel/irq.o
do_IRQ 195 187 -8
smp_x86_platform_ipi 234 222 -12
smp_kvm_posted_intr_ipi 74 66 -8
smp_kvm_posted_intr_wakeup_ipi 86 78 -8
smp_kvm_posted_intr_nested_ipi 74 66 -8
$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/mm/tlb.o
flush_tlb_func_common.constprop.13 728 719 -9
switch_mm_irqs_off 1528 1524 -4
Now, I realize you particularly hate the tlb patch; and I'll see if I
can get these same savings with fewer __'s added.
But in general, I think these patches are worth it, esp. since I've
already done them :-)
> On Mar 8, 2019, at 6:50 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote:
>
>> This is a collection of x86/percpu changes that I had pending and got reminded
>> of by Linus' comment yesterday about __this_cpu_xchg().
>>
>> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
>
> (Sorry; this is going to have _wide_ output)
>
> OK, so what I did is build 4 kernels (O=defconfig-build{,1,2,3}) with,
> respectively, that many patches of this series applied.
>
> When I look at just the vmlinux size output:
>
> $ size defconfig-build*/vmlinux
> text data bss dec hex filename
> 19540631 5040164 1871944 26452739 193a303 defconfig-build/vmlinux
> 19540635 5040164 1871944 26452743 193a307 defconfig-build1/vmlinux
> 19540685 5040164 1871944 26452793 193a339 defconfig-build2/vmlinux
> 19540685 5040164 1871944 26452793 193a339 defconfig-build3/vmlinux
>
> Things appear to get slightly larger; however, when I look in more
> detail using my newly written compare script (find attached), I get
> things like:
Nice script! I keep asking myself how comparing two binaries can provide
some “number” to indicate how “good” a binary is (at least relative to
another one) - either during compilation or after. Code size, as you show,
is the wrong metric.
Anyhow, I am a little disappointed (and surprised) that in most cases I
played with, this kind of optimization has a marginal impact on performance
at best, even when the binary changes “a lot” and when microbenchmarks are
used.
On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:
> Nice script! I keep asking myself how comparing two binaries can provide
> some “number” to indicate how “good” a binary is (at least relative to
> another one) - either during compilation or after. Code size, as you show,
> is the wrong metric.
Right; I'm still pondering other metrics, like:
readelf -WS vmlinux | grep AX | grep -v -e init -e exit -e altinstr -e unlikely -e fixup
which is only 'fast' path text.
> Anyhow, I am a little disappointed (and surprised) that in most cases I
> played with, this kind of optimization has a marginal impact on performance
> at best, even when the binary changes “a lot” and when microbenchmarks are
> used.
Right, but if we don't care, it'll be death by 1000 cuts.
Anyway, can anybody explain percpu_stable_op() vs percpu_from_op()?
I'm thinking of a variant of Linus' patch, but I'm confused about the
above.
On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:
> Nice script!
Find attached a new one; this one is fast enough to run a symbol diff on vmlinux.o
On Fri, Mar 08, 2019 at 09:56:37PM +0100, Peter Zijlstra wrote:
> Anyway, can anybody explain percpu_stable_op() vs percpu_from_op()?
>
> I'm thinking of a variant of Linus' patch, but I'm confused about the
> above.
So whatever I tried with +m only made things worse and always affected
thousands of symbols.
Now, afaict the whole percpu_stable_op thing is an ugly hack because some
earlier compiler would not CSE the regular percpu_from_op. But since it
does do that today, esp. after my first patch, I tried implementing
this_cpu_read_stable() with percpu_from_op() (no volatile, obv).
That also affects _lots_ of sites, but also significantly shrinks the
kernel image.
It's 2307 symbols affected, but:
17642871 2157438 747808 20548117 1398a15 defconfig-build1/vmlinux.o (patch 1)
17639081 2157438 747808 20544327 1397b47 defconfig-build0/vmlinux.o (patch 1 - percpu_stable_op)
So I think I'll add a patch removing percpu_stable_op and all its users.
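For anyone else wondering what the difference is: a rough sketch of the two
forms, 4-byte case only, reconstructed from memory of percpu.h, so treat the
details as approximate:

/* percpu_from_op(): an ordinary gs-relative load through an "m" operand;
 * qual is either empty (raw_cpu_read) or volatile (this_cpu_read). */
#define sketch_percpu_from_op(qual, var)			\
({								\
	typeof(var) pfo_ret__;					\
	asm qual ("movl %%gs:%1, %0"				\
		  : "=r" (pfo_ret__) : "m" (var));		\
	pfo_ret__;						\
})

/* percpu_stable_op(): the address goes in through a "p" constraint, so
 * GCC treats the asm as a pure function of a constant pointer and will
 * CSE repeated reads - the this_cpu_read_stable() behaviour. */
#define sketch_percpu_stable_op(var)				\
({								\
	typeof(var) pfo_ret__;					\
	asm ("movl %%gs:%P1, %0"				\
	     : "=r" (pfo_ret__) : "p" (&(var)));		\
	pfo_ret__;						\
})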