2019-02-21 22:21:26

by Daniel Borkmann

Subject: [PATCH] x86, retpolines: raise limit for generating indirect calls from switch-case

From the networking side, there have been numerous attempts to get
rid of indirect calls in the fast path wherever feasible in order
to avoid the cost of retpolines, to name just a few:

* 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
* aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
* 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
* 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
* 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
* 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
[...]

Recent work on XDP from Björn and Magnus additionally found that
manually transforming the XDP return code switch statement with
more than 5 cases into an if-else combination results in a
considerable speedup in the XDP layer due to the avoidance of
indirect calls in CONFIG_RETPOLINE enabled builds. On the i40e
driver with an XDP prog attached, a 20-26% speedup has been
observed [0]. Aside from XDP, there are many other places later in
the networking stack's critical path with similar switch-case
processing. Rather than fixing every XDP-enabled driver and every
such location in the stack by hand, it is better to raise the
threshold at which gcc emits expensive indirect calls from a switch
under retpolines, and to stick with the default as-is for
!retpoline configured kernels. This also has the advantage that,
on archs where this is not necessary, we let the compiler select
the underlying target optimization for these constructs and avoid
potential slow-downs from hand-written if-else rewrites.
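
As an illustration of the transform in question, a minimal sketch
(the handler functions are hypothetical stand-ins; the XDP_* values
mirror the numbering in include/uapi/linux/bpf.h):

```c
#include <assert.h>

/* XDP return codes, numbered as in include/uapi/linux/bpf.h. */
enum xdp_action { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

static int last_action = -1; /* records which handler ran (illustration only) */

static void handle_aborted(void)  { last_action = XDP_ABORTED; }
static void handle_drop(void)     { last_action = XDP_DROP; }
static void handle_pass(void)     { last_action = XDP_PASS; }
static void handle_tx(void)       { last_action = XDP_TX; }
static void handle_redirect(void) { last_action = XDP_REDIRECT; }

/* 5 cases: at or above gcc's default threshold, so at -O2 this can
 * be lowered to a jump table, i.e. an indirect jump that retpolines
 * make expensive. */
static void dispatch_switch(int act)
{
	switch (act) {
	case XDP_DROP:     handle_drop();     break;
	case XDP_PASS:     handle_pass();     break;
	case XDP_TX:       handle_tx();       break;
	case XDP_REDIRECT: handle_redirect(); break;
	default:           handle_aborted();  break;
	}
}

/* The manual rewrite: direct, predictable conditional branches only. */
static void dispatch_if_else(int act)
{
	if (act == XDP_DROP)          handle_drop();
	else if (act == XDP_PASS)     handle_pass();
	else if (act == XDP_TX)       handle_tx();
	else if (act == XDP_REDIRECT) handle_redirect();
	else                          handle_aborted();
}
```

Both forms are semantically identical; the point of the patch is to
have the compiler emit the second shape from the first automatically
under retpolines.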

In the case of gcc, this setting is controlled by
case-values-threshold, which has an architecture-global default
that selects 4 or 5 (the latter if the target does not have a case
insn that compares the bounds). Some arch back ends like arm64 or
s390 override it with their own target hooks; for example, in gcc
commit db7a90aa0de5 ("S/390: Disable prediction of indirect
branches") a threshold of 20 pretty much disables jump tables under
retpoline builds. Comparing gcc's and clang's default code
generation on x86-64 at the O2 level with a retpoline build gives
the following outcome for 5 switch cases:

* gcc with -mindirect-branch=thunk-inline -mindirect-branch-register:

# gdb -batch -ex 'disassemble dispatch' ./c-switch
Dump of assembler code for function dispatch:
0x0000000000400be0 <+0>: cmp $0x4,%edi
0x0000000000400be3 <+3>: ja 0x400c35 <dispatch+85>
0x0000000000400be5 <+5>: lea 0x915f8(%rip),%rdx # 0x4921e4
0x0000000000400bec <+12>: mov %edi,%edi
0x0000000000400bee <+14>: movslq (%rdx,%rdi,4),%rax
0x0000000000400bf2 <+18>: add %rdx,%rax
0x0000000000400bf5 <+21>: callq 0x400c01 <dispatch+33>
0x0000000000400bfa <+26>: pause
0x0000000000400bfc <+28>: lfence
0x0000000000400bff <+31>: jmp 0x400bfa <dispatch+26>
0x0000000000400c01 <+33>: mov %rax,(%rsp)
0x0000000000400c05 <+37>: retq
0x0000000000400c06 <+38>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c10 <+48>: jmpq 0x400c90 <fn_3>
0x0000000000400c15 <+53>: nopl (%rax)
0x0000000000400c18 <+56>: jmpq 0x400c70 <fn_2>
0x0000000000400c1d <+61>: nopl (%rax)
0x0000000000400c20 <+64>: jmpq 0x400c50 <fn_1>
0x0000000000400c25 <+69>: nopl (%rax)
0x0000000000400c28 <+72>: jmpq 0x400c40 <fn_0>
0x0000000000400c2d <+77>: nopl (%rax)
0x0000000000400c30 <+80>: jmpq 0x400cb0 <fn_4>
0x0000000000400c35 <+85>: push %rax
0x0000000000400c36 <+86>: callq 0x40dd80 <abort>
End of assembler dump.

* clang with -mretpoline emitting search tree:

# gdb -batch -ex 'disassemble dispatch' ./c-switch
Dump of assembler code for function dispatch:
0x0000000000400b30 <+0>: cmp $0x1,%edi
0x0000000000400b33 <+3>: jle 0x400b44 <dispatch+20>
0x0000000000400b35 <+5>: cmp $0x2,%edi
0x0000000000400b38 <+8>: je 0x400b4d <dispatch+29>
0x0000000000400b3a <+10>: cmp $0x3,%edi
0x0000000000400b3d <+13>: jne 0x400b52 <dispatch+34>
0x0000000000400b3f <+15>: jmpq 0x400c50 <fn_3>
0x0000000000400b44 <+20>: test %edi,%edi
0x0000000000400b46 <+22>: jne 0x400b5c <dispatch+44>
0x0000000000400b48 <+24>: jmpq 0x400c20 <fn_0>
0x0000000000400b4d <+29>: jmpq 0x400c40 <fn_2>
0x0000000000400b52 <+34>: cmp $0x4,%edi
0x0000000000400b55 <+37>: jne 0x400b66 <dispatch+54>
0x0000000000400b57 <+39>: jmpq 0x400c60 <fn_4>
0x0000000000400b5c <+44>: cmp $0x1,%edi
0x0000000000400b5f <+47>: jne 0x400b66 <dispatch+54>
0x0000000000400b61 <+49>: jmpq 0x400c30 <fn_1>
0x0000000000400b66 <+54>: push %rax
0x0000000000400b67 <+55>: callq 0x40dd20 <abort>
End of assembler dump.

For sake of comparison, clang without -mretpoline:

# gdb -batch -ex 'disassemble dispatch' ./c-switch
Dump of assembler code for function dispatch:
0x0000000000400b30 <+0>: cmp $0x4,%edi
0x0000000000400b33 <+3>: ja 0x400b57 <dispatch+39>
0x0000000000400b35 <+5>: mov %edi,%eax
0x0000000000400b37 <+7>: jmpq *0x492148(,%rax,8)
0x0000000000400b3e <+14>: jmpq 0x400bf0 <fn_0>
0x0000000000400b43 <+19>: jmpq 0x400c30 <fn_4>
0x0000000000400b48 <+24>: jmpq 0x400c10 <fn_2>
0x0000000000400b4d <+29>: jmpq 0x400c20 <fn_3>
0x0000000000400b52 <+34>: jmpq 0x400c00 <fn_1>
0x0000000000400b57 <+39>: push %rax
0x0000000000400b58 <+40>: callq 0x40dcf0 <abort>
End of assembler dump.
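
For reference, the dumps above can be reproduced with a test program
along these lines (a reconstruction; the actual c-switch source was
not posted, so the function bodies are assumptions):

```c
#include <assert.h>
#include <stdlib.h>

/* Five distinct targets so the compiler sees 5 switch cases. */
int fn_0(void) { return 10; }
int fn_1(void) { return 11; }
int fn_2(void) { return 12; }
int fn_3(void) { return 13; }
int fn_4(void) { return 14; }

/* Build e.g. with:
 *   gcc -O2 -mindirect-branch=thunk-inline -mindirect-branch-register
 * and disassemble dispatch as shown above to see the jump table
 * feeding a retpoline sequence.
 */
int dispatch(int x)
{
	switch (x) {
	case 0: return fn_0();
	case 1: return fn_1();
	case 2: return fn_2();
	case 3: return fn_3();
	case 4: return fn_4();
	}
	abort();
}
```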

Raising the number of cases to something high (e.g. 100) still
results in a similar code generation pattern for clang and gcc as
above. In other words, clang generally turns off jump table
emission under retpoline builds: it has an extra expansion pass
that turns indirectbr instructions from its IR back into switch
instructions, as a built-in -mno-jump-table lowering of a switch
(in this case, even if the IR input already contained an indirect
branch).

For gcc, adding --param=case-values-threshold=20, in a similar
fashion to s390, in order to raise the limit for x86
retpoline-enabled builds results in a small vmlinux size increase
of only 0.13% (before=18,027,528 after=18,051,192). For clang the
option is ignored, i) because it is not needed as mentioned above
and ii) because clang does not have this cmdline parameter.
Non-retpoline-enabled builds with gcc continue to use the default
case-values-threshold setting, so nothing changes there.
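
The quoted size delta checks out; a quick sanity calculation using
the before/after numbers above:

```c
#include <assert.h>

/* vmlinux sizes quoted above */
static const long vmlinux_before = 18027528;
static const long vmlinux_after  = 18051192;

/* Growth: 18,051,192 - 18,027,528 = 23,664 bytes. */
static long growth_bytes(void)
{
	return vmlinux_after - vmlinux_before;
}

/* Relative growth: 23,664 / 18,027,528, roughly 0.13%. */
static double growth_percent(void)
{
	return 100.0 * growth_bytes() / vmlinux_before;
}
```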

[0] https://lore.kernel.org/netdev/[email protected]/
and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track:
- http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf
- http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf

Signed-off-by: Daniel Borkmann <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Björn Töpel <[email protected]>
Cc: Magnus Karlsson <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
arch/x86/Makefile | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 9c5a67d1b9c1..f55420a67164 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -217,6 +217,11 @@ KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
# Avoid indirect branches in kernel to deal with Spectre
ifdef CONFIG_RETPOLINE
KBUILD_CFLAGS += $(RETPOLINE_CFLAGS)
+ # Additionally, avoid generating expensive indirect jumps which
+ # are subject to retpolines for small number of switch cases.
+ # clang turns off jump table generation by default when under
+ # retpoline builds, however, gcc does not for x86.
+ KBUILD_CFLAGS += $(call cc-option,--param=case-values-threshold=20)
endif

archscripts: scripts_basic
--
2.17.1



2019-02-21 22:29:01

by Linus Torvalds

Subject: Re: [PATCH] x86, retpolines: raise limit for generating indirect calls from switch-case

On Thu, Feb 21, 2019 at 2:20 PM Daniel Borkmann <[email protected]> wrote:
>
> In case of gcc, this setting is controlled by case-values-threshold
> which has an architecture global default that selects 4 or 5 (

Ack. For retpoline, that's much too low.

Patch looks sane, although it would be good to verify just which
versions of gcc this works for. All versions with retpoline?

Linus

2019-02-21 22:57:41

by Daniel Borkmann

Subject: Re: [PATCH] x86, retpolines: raise limit for generating indirect calls from switch-case

On 02/21/2019 11:27 PM, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 2:20 PM Daniel Borkmann <[email protected]> wrote:
>>
>> In case of gcc, this setting is controlled by case-values-threshold
>> which has an architecture global default that selects 4 or 5 (
>
> Ack. For retpoline, that's much too low.
>
> Patch looks sane, although it would be good to verify just which
> versions of gcc this works for. All versions with retpoline?

The feature was first added in gcc 4.7 [0], under "General Optimizer Improvements":

Support for a new parameter --param case-values-threshold=n was added to allow
users to control the cutoff between doing switch statements as a series of if
statements and using a jump table.

From what I can tell, the original author (H.J. Lu) provided
backports back to gcc 4.8, and distros seem to have pulled them
from his github branch [1], as upstream gcc does not handle
backports for stable versions that old.

Thanks,
Daniel

[0] https://www.gnu.org/software/gcc/gcc-4.7/changes.html
[1] https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1749261
https://github.com/hjl-tools/gcc/tree/hjl/indirect/gcc-4_8-branch/master

2019-02-22 07:32:24

by Jesper Dangaard Brouer

Subject: Re: [PATCH] x86, retpolines: raise limit for generating indirect calls from switch-case

On Thu, 21 Feb 2019 23:19:41 +0100
Daniel Borkmann <[email protected]> wrote:

> Recent work on XDP from Björn and Magnus additionally found that
> manually transforming the XDP return code switch statement with
> more than 5 cases into if-else combination would result in a
> considerable speedup in XDP layer due to avoidance of indirect
> calls in CONFIG_RETPOLINE enabled builds. On i40e driver with
> XDP prog attached, a 20-26% speedup has been observed [0]. Aside
> from XDP, there are many other places later in the networking
> stack's critical path with similar switch-case processing. Rather
> than fixing every XDP-enabled driver and locations in stack by
> hand, it would be good to instead raise the limit where gcc would
> emit expensive indirect calls from the switch under retpolines

I'm very happy to see this. Thanks to Björn for finding, analyzing and
providing hand-coded-if-else code that demonstrated the performance
issue for XDP. But I do think this GCC case-values-threshold param is
a better and more generic solution to the issue we observed and
measured in XDP land. And hopefully other parts of the network stack
and kernel will also benefit.

Acked-by: Jesper Dangaard Brouer <[email protected]>

Thanks for following up on this Daniel,
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2019-02-26 18:22:53

by Björn Töpel

Subject: Re: [PATCH] x86, retpolines: raise limit for generating indirect calls from switch-case

On Fri, 22 Feb 2019 at 08:32, Jesper Dangaard Brouer <[email protected]> wrote:
>
> On Thu, 21 Feb 2019 23:19:41 +0100
> Daniel Borkmann <[email protected]> wrote:
>
> > Recent work on XDP from Björn and Magnus additionally found that
> > manually transforming the XDP return code switch statement with
> > more than 5 cases into if-else combination would result in a
> > considerable speedup in XDP layer due to avoidance of indirect
> > calls in CONFIG_RETPOLINE enabled builds. On i40e driver with
> > XDP prog attached, a 20-26% speedup has been observed [0]. Aside
> > from XDP, there are many other places later in the networking
> > stack's critical path with similar switch-case processing. Rather
> > than fixing every XDP-enabled driver and locations in stack by
> > hand, it would be good to instead raise the limit where gcc would
> > emit expensive indirect calls from the switch under retpolines
>
> I'm very happy to see this. Thanks to Björn for finding, analyzing and
> providing hand-coded-if-else code that demonstrated the performance
> issue for XDP. But I do think this GCC case-values-threshold param is
> a better and more generic solution to the issue we observed and
> measured in XDP land. And hopefully other parts of the network stack
> and kernel will also benefit.
>
> Acked-by: Jesper Dangaard Brouer <[email protected]>
>
> Thanks for following up on this Daniel,

I definitely prefer a switch-statement over the if-else-messiness in
this context. Thanks for doing the deep-dive, Daniel!

FWIW,
Acked-by: Björn Töpel <[email protected]>

> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer

2019-02-28 12:18:29

by David Woodhouse

Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
> Commit-ID: ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> Gitweb: https://git.kernel.org/tip/ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> Author: Daniel Borkmann <[email protected]>
> AuthorDate: Thu, 21 Feb 2019 23:19:41 +0100
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Thu, 28 Feb 2019 12:10:31 +0100
>
> x86, retpolines: Raise limit for generating indirect calls from switch-case
>
> From networking side, there are numerous attempts to get rid of indirect
> calls in fast-path wherever feasible in order to avoid the cost of
> retpolines, for example, just to name a few:
>
> * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
> * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
> * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
> * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
> * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
> * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
> [...]
>
> Recent work on XDP from Björn and Magnus additionally found that manually
> transforming the XDP return code switch statement with more than 5 cases
> into if-else combination would result in a considerable speedup in XDP
> layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
> builds.

+HJL

This is a GCC bug, surely? It should know how expensive each
instruction is, and choose which to use accordingly. That should be
true even when the indirect branch "instruction" is a retpoline, and
thus enormously expensive.

I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
please at least reference that bug, and be prepared to turn this hack
off when GCC is fixed.



2019-02-28 13:28:03

by H.J. Lu

Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <[email protected]> wrote:
>
> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
> > Commit-ID: ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> > Gitweb: https://git.kernel.org/tip/ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> > Author: Daniel Borkmann <[email protected]>
> > AuthorDate: Thu, 21 Feb 2019 23:19:41 +0100
> > Committer: Thomas Gleixner <[email protected]>
> > CommitDate: Thu, 28 Feb 2019 12:10:31 +0100
> >
> > x86, retpolines: Raise limit for generating indirect calls from switch-case
> >
> > From networking side, there are numerous attempts to get rid of indirect
> > calls in fast-path wherever feasible in order to avoid the cost of
> > retpolines, for example, just to name a few:
> >
> > * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
> > * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
> > * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
> > * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
> > * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
> > * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
> > [...]
> >
> > Recent work on XDP from Björn and Magnus additionally found that manually
> > transforming the XDP return code switch statement with more than 5 cases
> > into if-else combination would result in a considerable speedup in XDP
> > layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
> > builds.
>
> +HJL
>
> This is a GCC bug, surely? It should know how expensive each
> instruction is, and choose which to use accordingly. That should be
> true even when the indirect branch "instruction" is a retpoline, and
> thus enormously expensive.
>
> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
> please at least reference that bug, and be prepared to turn this hack
> off when GCC is fixed.

We couldn't find a testcase showing that a jump table with an
indirect branch is slower than direct branches.

--
H.J.

2019-02-28 18:20:47

by H.J. Lu

Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

On Thu, Feb 28, 2019 at 8:18 AM Daniel Borkmann <[email protected]> wrote:
>
> On 02/28/2019 01:53 PM, H.J. Lu wrote:
> > On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <[email protected]> wrote:
> >> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
> >>> Commit-ID: ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> >>> Gitweb: https://git.kernel.org/tip/ce02ef06fcf7a399a6276adb83f37373d10cbbe1
> >>> Author: Daniel Borkmann <[email protected]>
> >>> AuthorDate: Thu, 21 Feb 2019 23:19:41 +0100
> >>> Committer: Thomas Gleixner <[email protected]>
> >>> CommitDate: Thu, 28 Feb 2019 12:10:31 +0100
> >>>
> >>> x86, retpolines: Raise limit for generating indirect calls from switch-case
> >>>
> >>> From networking side, there are numerous attempts to get rid of indirect
> >>> calls in fast-path wherever feasible in order to avoid the cost of
> >>> retpolines, for example, just to name a few:
> >>>
> >>> * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
> >>> * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
> >>> * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
> >>> * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
> >>> * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
> >>> * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
> >>> [...]
> >>>
> >>> Recent work on XDP from Björn and Magnus additionally found that manually
> >>> transforming the XDP return code switch statement with more than 5 cases
> >>> into if-else combination would result in a considerable speedup in XDP
> >>> layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
> >>> builds.
> >>
> >> +HJL
> >>
> >> This is a GCC bug, surely? It should know how expensive each
> >> instruction is, and choose which to use accordingly. That should be
> >> true even when the indirect branch "instruction" is a retpoline, and
> >> thus enormously expensive.
> >>
> >> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
> >> please at least reference that bug, and be prepared to turn this hack
> >> off when GCC is fixed.
> >
> > We couldn't find a testcase to show jump table with indirect branch
> > is slower than direct branches.
>
> Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
> with the below on top.
>
> Makefile | 6 +++---
> switch.c | 2 +-
> test.c | 6 ++++--
> 3 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index bd83233..ea81520 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1,16 +1,16 @@
> CC=gcc
> CFLAGS=-g -I.
> -CFLAGS+=-O2 -mindirect-branch=thunk
> +CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Does the slowdown show up only with -mindirect-branch=thunk-inline?

> ASFLAGS=-g
>
> EXE=test
>
> OBJS=test.o switch-no-table.o switch.o
>
> -switch-no-table.o switch-no-table.s: CFLAGS += -fno-jump-tables
> +switch-no-table.o switch-no-table.s: CFLAGS += --param=case-values-threshold=20
>
> all: $(EXE)
> - ./$(EXE)
> + taskset 1 ./$(EXE)
>
> $(EXE): $(OBJS)
> $(CC) -o $@ $^
> diff --git a/switch.c b/switch.c
> index fe0a8b0..233ec14 100644
> --- a/switch.c
> +++ b/switch.c
> @@ -3,7 +3,7 @@ int global;
> int
> foo (int x)
> {
> - switch (x) {
> + switch (x & 0xf) {
> case 0:
> return 11;
> case 1:
> diff --git a/test.c b/test.c
> index 3d1e0da..7fc22a4 100644
> --- a/test.c
> +++ b/test.c
> @@ -15,21 +15,23 @@ main ()
> unsigned long long start, end;
> unsigned long long diff1, diff2;
>
> + global = 0;
> start = __rdtscp (&i);
> for (i = 0; i < LOOP; i++)
> foo_no_table (i);
> end = __rdtscp (&i);
> diff1 = end - start;
>
> - printf ("no jump table: %lld\n", diff1);
> + printf ("global:%d no jump table: %lld\n", global, diff1);
>
> + global = 0;
> start = __rdtscp (&i);
> for (i = 0; i < LOOP; i++)
> foo (i);
> end = __rdtscp (&i);
> diff2 = end - start;
>
> - printf ("jump table : %lld (%.2f%%)\n", diff2, 100.0f * diff2 / diff1);
> + printf ("global:%d jump table : %lld (%.2f%%)\n", global, diff2, 100.0f * diff2 / diff1);
>
> return 0;
> }
> --
> 2.17.1
>
> ** This basically iterates through the cases:
>
> Below I'm getting ~twice the time needed for jump table vs no jump table
> for the flags the kernel is using:
>
> # make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:50000000 no jump table: 6329361694
> global:50000000 jump table : 13745181180 (217.17%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6328846466
> global:50000000 jump table : 13746479870 (217.20%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6326922428
> global:50000000 jump table : 13745139496 (217.25%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6327943506
> global:50000000 jump table : 13744388354 (217.20%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6332503572
> global:50000000 jump table : 13729817800 (216.82%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6328378006
> global:50000000 jump table : 13747069902 (217.23%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6326481236
> global:50000000 jump table : 13749345724 (217.33%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6329332628
> global:50000000 jump table : 13745879704 (217.18%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 6327734850
> global:50000000 jump table : 13746412678 (217.24%)
>
> For comparison, both are at 100% when raising the limit is _not_ in use
> (which is expected of course, but just to make sure):
>
> root@snat:~/microbenchmark# make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:50000000 no jump table: 13704083238
> global:50000000 jump table : 13746838060 (100.31%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13753854740
> global:50000000 jump table : 13746624470 (99.95%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13707053714
> global:50000000 jump table : 13746682002 (100.29%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13708843624
> global:50000000 jump table : 13749733040 (100.30%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13707365404
> global:50000000 jump table : 13747683096 (100.29%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13707014114
> global:50000000 jump table : 13746444272 (100.29%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13709596158
> global:50000000 jump table : 13750499176 (100.30%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13709484118
> global:50000000 jump table : 13747952446 (100.28%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:50000000 no jump table: 13708873570
> global:50000000 jump table : 13748950096 (100.29%)
>
> ** Next case would be constantly hitting first switch case:
>
> diff --git a/switch.c b/switch.c
> index 233ec14..fe0a8b0 100644
> --- a/switch.c
> +++ b/switch.c
> @@ -3,7 +3,7 @@ int global;
> int
> foo (int x)
> {
> - switch (x & 0xf) {
> + switch (x) {
> case 0:
> return 11;
> case 1:
> diff --git a/test.c b/test.c
> index 7fc22a4..2849112 100644
> --- a/test.c
> +++ b/test.c
> @@ -5,6 +5,7 @@ extern int foo (int);
> extern int foo_no_table (int);
>
> int global = 20;
> +int j = 0;
>
> #define LOOP 800000000
>
> @@ -18,7 +19,7 @@ main ()
> global = 0;
> start = __rdtscp (&i);
> for (i = 0; i < LOOP; i++)
> - foo_no_table (i);
> + foo_no_table (j);
> end = __rdtscp (&i);
> diff1 = end - start;
>
> @@ -27,7 +28,7 @@ main ()
> global = 0;
> start = __rdtscp (&i);
> for (i = 0; i < LOOP; i++)
> - foo (i);
> + foo (j);
> end = __rdtscp (&i);
> diff2 = end - start;
>
> # make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:0 no jump table: 6098109200
> global:0 jump table : 30717871980 (503.73%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097799330
> global:0 jump table : 30727386270 (503.91%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097559796
> global:0 jump table : 30715992452 (503.74%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6098532172
> global:0 jump table : 30716423870 (503.67%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097429586
> global:0 jump table : 30715774634 (503.75%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6097813848
> global:0 jump table : 30716476820 (503.73%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6096955736
> global:0 jump table : 30715385478 (503.78%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6096820240
> global:0 jump table : 30719682434 (503.86%)
>
> ** And next case would be constantly hitting default case:
>
> diff --git a/test.c b/test.c
> index 2849112..be9bfc1 100644
> --- a/test.c
> +++ b/test.c
> @@ -5,7 +5,7 @@ extern int foo (int);
> extern int foo_no_table (int);
>
> int global = 20;
> -int j = 0;
> +int j = 1000;
>
> #define LOOP 800000000
>
> # make
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
> gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
> gcc -o test test.o switch-no-table.o switch.o
> taskset 1 ./test
> global:0 no jump table: 6422890064
> global:0 jump table : 6866072454 (106.90%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6423267608
> global:0 jump table : 6866266176 (106.90%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6424721624
> global:0 jump table : 6866607842 (106.88%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6424225664
> global:0 jump table : 6866843372 (106.89%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6424073830
> global:0 jump table : 6866467050 (106.89%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6426515396
> global:0 jump table : 6867031640 (106.85%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6425126656
> global:0 jump table : 6866352988 (106.87%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6423040024
> global:0 jump table : 6867233670 (106.92%)
> root@snat:~/microbenchmark# make
> taskset 1 ./test
> global:0 no jump table: 6422256136
> global:0 jump table : 6865902094 (106.91%)
>
> I could also try different distributions perhaps for the case selector,
> but observations match in that direction with what Bjorn et al also have
> been seen in XDP case.
>
> Thanks,
> Daniel



--
H.J.

2019-02-28 18:41:49

by Daniel Borkmann

Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

On 02/28/2019 05:25 PM, H.J. Lu wrote:
> On Thu, Feb 28, 2019 at 8:18 AM Daniel Borkmann <[email protected]> wrote:
>> On 02/28/2019 01:53 PM, H.J. Lu wrote:
>>> On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <[email protected]> wrote:
>>>> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
>>>>> x86, retpolines: Raise limit for generating indirect calls from switch-case
>>>>> [...]
>>>>
>>>> +HJL
>>>>
>>>> This is a GCC bug, surely? It should know how expensive each
>>>> instruction is, and choose which to use accordingly. That should be
>>>> true even when the indirect branch "instruction" is a retpoline, and
>>>> thus enormously expensive.
>>>>
>>>> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
>>>> please at least reference that bug, and be prepared to turn this hack
>>>> off when GCC is fixed.
>>>
>>> We couldn't find a testcase to show jump table with indirect branch
>>> is slower than direct branches.
>>
>> Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
>> with the below on top.
>>
>> Makefile | 6 +++---
>> switch.c | 2 +-
>> test.c | 6 ++++--
>> 3 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/Makefile b/Makefile
>> index bd83233..ea81520 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -1,16 +1,16 @@
>> CC=gcc
>> CFLAGS=-g -I.
>> -CFLAGS+=-O2 -mindirect-branch=thunk
>> +CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Does slowdown show up only with -mindirect-branch=thunk-inline?

Not really, the numbers are in a similar range with the same outcome. Additionally,
I also tried on a somewhat bigger machine (Xeon Gold 5120 this time). First run is
thunk-inline, second is thunk, and third is without raising the limit, for
comparison; first test (from the last mail) on that machine:

root@test:~/microbenchmark# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
./test
no jump table: 5624962964
jump table : 13016449922 (231.41%)
root@test:~/microbenchmark# make
./test
no jump table: 5619612366
jump table : 13014680544 (231.59%)
root@test:~/microbenchmark# make
./test
no jump table: 5619725000
jump table : 13003825442 (231.40%)
root@test:~/microbenchmark# make
./test
no jump table: 5619668520
jump table : 13011259440 (231.53%)
root@test:~/microbenchmark# make
./test
no jump table: 5623093740
jump table : 13044403684 (231.98%)

root@test:~/microbenchmark# make
gcc -g -I. -O2 -mindirect-branch=thunk -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
./test
no jump table: 5620474618
jump table : 13373059114 (237.93%)
root@test:~/microbenchmark# make
./test
no jump table: 5619791082
jump table : 13325518382 (237.12%)
root@test:~/microbenchmark# make
./test
no jump table: 5621678214
jump table : 13335416770 (237.21%)
root@test:~/microbenchmark# make
./test
no jump table: 5621402772
jump table : 13345090466 (237.40%)

root@test:~/microbenchmark# make
gcc -g -I. -O2 -mindirect-branch=thunk -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
./test
no jump table: 13658170002
jump table : 13404815232 (98.15%)
root@test:~/microbenchmark# make
./test
no jump table: 13664287098
jump table : 13407352204 (98.12%)
root@test:~/microbenchmark# make
./test
no jump table: 13667680182
jump table : 13422187370 (98.20%)
root@test:~/microbenchmark# make
./test
no jump table: 13665625094
jump table : 13420373364 (98.21%)
root@test:~/microbenchmark#
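
(Aside, for anyone reproducing: a minimal sketch of the kind of call site these
flags rewrite — the function names below are made up for illustration. Under
-mindirect-branch=thunk, gcc emits the indirect call as a call to a shared
__x86_indirect_thunk_<reg> retpoline; under =thunk-inline it expands the
pause/lfence sequence at the call site itself. Semantics are identical either
way, only code size and layout differ, which is why the numbers above come out
in the same range.)

```c
/* Sketch: an indirect call of the sort -mindirect-branch rewrites.
 * With =thunk, gcc emits e.g. `call __x86_indirect_thunk_rax` in place
 * of `call *%rax`; with =thunk-inline the retpoline loop is inlined at
 * each such call site instead. */
typedef int (*op_fn)(int, int);

static int op_add(int a, int b) { return a + b; }
static int op_mul(int a, int b) { return a * b; }

/* The call through `op` is the indirect branch in question. */
static int apply(op_fn op, int a, int b)
{
	return op(a, b);
}
```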

Thanks,
Daniel

2019-02-28 18:44:08

by H.J. Lu

Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

On Thu, Feb 28, 2019 at 9:58 AM Daniel Borkmann <[email protected]> wrote:
>
> On 02/28/2019 05:25 PM, H.J. Lu wrote:
> > On Thu, Feb 28, 2019 at 8:18 AM Daniel Borkmann <[email protected]> wrote:
> >> On 02/28/2019 01:53 PM, H.J. Lu wrote:
> >>> On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <[email protected]> wrote:
> >>>> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
> >>>>> x86, retpolines: Raise limit for generating indirect calls from switch-case
> >>>>> [...]
> >>>>
> >>>> +HJL
> >>>>
> >>>> This is a GCC bug, surely? It should know how expensive each
> >>>> instruction is, and choose which to use accordingly. That should be
> >>>> true even when the indirect branch "instruction" is a retpoline, and
> >>>> thus enormously expensive.
> >>>>
> >>>> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
> >>>> please at least reference that bug, and be prepared to turn this hack
> >>>> off when GCC is fixed.
> >>>
> >>> We couldn't find a testcase to show jump table with indirect branch
> >>> is slower than direct branches.
> >>
> >> Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
> >> with the below on top.
> >>
> >> Makefile | 6 +++---
> >> switch.c | 2 +-
> >> test.c | 6 ++++--
> >> 3 files changed, 8 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/Makefile b/Makefile
> >> index bd83233..ea81520 100644
> >> --- a/Makefile
> >> +++ b/Makefile
> >> @@ -1,16 +1,16 @@
> >> CC=gcc
> >> CFLAGS=-g -I.
> >> -CFLAGS+=-O2 -mindirect-branch=thunk
> >> +CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Does slowdown show up only with -mindirect-branch=thunk-inline?
>
> Not really, numbers are in similar range / outcome. Additionally, I also tried
> on a bit bigger machine (Xeon Gold 5120 this time). First is thunk-inline, second
> is thunk, and third is w/o raising limit for comparison; first test (from last
> mail) on that machine:

Please re-open:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952

with new info.

--
H.J.

2019-02-28 18:44:44

by Daniel Borkmann

Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

On 02/28/2019 07:09 PM, H.J. Lu wrote:
> On Thu, Feb 28, 2019 at 9:58 AM Daniel Borkmann <[email protected]> wrote:
>> On 02/28/2019 05:25 PM, H.J. Lu wrote:
>>> On Thu, Feb 28, 2019 at 8:18 AM Daniel Borkmann <[email protected]> wrote:
>>>> On 02/28/2019 01:53 PM, H.J. Lu wrote:
>>>>> On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <[email protected]> wrote:
>>>>>> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
>>>>>>> x86, retpolines: Raise limit for generating indirect calls from switch-case
>>>>>>> [...]
>>>>>>
>>>>>> +HJL
>>>>>>
>>>>>> This is a GCC bug, surely? It should know how expensive each
>>>>>> instruction is, and choose which to use accordingly. That should be
>>>>>> true even when the indirect branch "instruction" is a retpoline, and
>>>>>> thus enormously expensive.
>>>>>>
>>>>>> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
>>>>>> please at least reference that bug, and be prepared to turn this hack
>>>>>> off when GCC is fixed.
>>>>>
>>>>> We couldn't find a testcase to show jump table with indirect branch
>>>>> is slower than direct branches.
>>>>
>>>> Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
>>>> with the below on top.
>>>>
>>>> Makefile | 6 +++---
>>>> switch.c | 2 +-
>>>> test.c | 6 ++++--
>>>> 3 files changed, 8 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/Makefile b/Makefile
>>>> index bd83233..ea81520 100644
>>>> --- a/Makefile
>>>> +++ b/Makefile
>>>> @@ -1,16 +1,16 @@
>>>> CC=gcc
>>>> CFLAGS=-g -I.
>>>> -CFLAGS+=-O2 -mindirect-branch=thunk
>>>> +CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Does slowdown show up only with -mindirect-branch=thunk-inline?
>>
>> Not really, numbers are in similar range / outcome. Additionally, I also tried
>> on a bit bigger machine (Xeon Gold 5120 this time). First is thunk-inline, second
>> is thunk, and third is w/o raising limit for comparison; first test (from last
>> mail) on that machine:
>
> Please re-open:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952
>
> with new info.

Yeah will do, thanks!

2019-02-28 19:35:20

by Daniel Borkmann

Subject: Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

On 02/28/2019 01:53 PM, H.J. Lu wrote:
> On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <[email protected]> wrote:
>> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
>>> x86, retpolines: Raise limit for generating indirect calls from switch-case
>>> [...]
>>
>> +HJL
>>
>> This is a GCC bug, surely? It should know how expensive each
>> instruction is, and choose which to use accordingly. That should be
>> true even when the indirect branch "instruction" is a retpoline, and
>> thus enormously expensive.
>>
>> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
>> please at least reference that bug, and be prepared to turn this hack
>> off when GCC is fixed.
>
> We couldn't find a testcase to show jump table with indirect branch
> is slower than direct branches.

Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
with the below on top.

Makefile | 6 +++---
switch.c | 2 +-
test.c | 6 ++++--
3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Makefile b/Makefile
index bd83233..ea81520 100644
--- a/Makefile
+++ b/Makefile
@@ -1,16 +1,16 @@
CC=gcc
CFLAGS=-g -I.
-CFLAGS+=-O2 -mindirect-branch=thunk
+CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
ASFLAGS=-g

EXE=test

OBJS=test.o switch-no-table.o switch.o

-switch-no-table.o switch-no-table.s: CFLAGS += -fno-jump-tables
+switch-no-table.o switch-no-table.s: CFLAGS += --param=case-values-threshold=20

all: $(EXE)
- ./$(EXE)
+ taskset 1 ./$(EXE)

$(EXE): $(OBJS)
$(CC) -o $@ $^
diff --git a/switch.c b/switch.c
index fe0a8b0..233ec14 100644
--- a/switch.c
+++ b/switch.c
@@ -3,7 +3,7 @@ int global;
int
foo (int x)
{
- switch (x) {
+ switch (x & 0xf) {
case 0:
return 11;
case 1:
diff --git a/test.c b/test.c
index 3d1e0da..7fc22a4 100644
--- a/test.c
+++ b/test.c
@@ -15,21 +15,23 @@ main ()
unsigned long long start, end;
unsigned long long diff1, diff2;

+ global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
foo_no_table (i);
end = __rdtscp (&i);
diff1 = end - start;

- printf ("no jump table: %lld\n", diff1);
+ printf ("global:%d no jump table: %lld\n", global, diff1);

+ global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
foo (i);
end = __rdtscp (&i);
diff2 = end - start;

- printf ("jump table : %lld (%.2f%%)\n", diff2, 100.0f * diff2 / diff1);
+ printf ("global:%d jump table : %lld (%.2f%%)\n", global, diff2, 100.0f * diff2 / diff1);

return 0;
}
--
2.17.1
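
(For readers without the repo at hand, the switch under test looks roughly like
the sketch below. Only `case 0: return 11;` and the `& 0xf` mask are visible in
the diffs, so the remaining return values and which arm bumps `global` are
assumptions — though the `global:50000000` lines in the runs are consistent with
exactly one of the 16 arms incrementing it, i.e. LOOP/16 hits.)

```c
int global;

/* Sketch of the benchmarked function: 16 dense cases, with a side effect
 * in one arm so gcc cannot collapse the switch into a plain value lookup.
 * Case 0 is from the diffs above; the rest is assumed for illustration. */
int
foo (int x)
{
  switch (x & 0xf) {
  case 0:  return 11;
  case 1:  return 22;
  case 2:  return 33;
  case 3:  return 44;
  case 4:  return 55;
  case 5:  global++; return 66;   /* assumed: the one arm touching global */
  case 6:  return 77;
  case 7:  return 88;
  case 8:  return 99;
  case 9:  return 110;
  case 10: return 121;
  case 11: return 132;
  case 12: return 143;
  case 13: return 154;
  case 14: return 165;
  case 15: return 176;
  }
  return -1;  /* unreachable given the & 0xf mask */
}
```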

** This basically iterates through the cases:

Below I'm getting ~twice the time needed for the jump table vs no jump table
with the flags the kernel is using:

# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:50000000 no jump table: 6329361694
global:50000000 jump table : 13745181180 (217.17%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6328846466
global:50000000 jump table : 13746479870 (217.20%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6326922428
global:50000000 jump table : 13745139496 (217.25%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6327943506
global:50000000 jump table : 13744388354 (217.20%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6332503572
global:50000000 jump table : 13729817800 (216.82%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6328378006
global:50000000 jump table : 13747069902 (217.23%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6326481236
global:50000000 jump table : 13749345724 (217.33%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6329332628
global:50000000 jump table : 13745879704 (217.18%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6327734850
global:50000000 jump table : 13746412678 (217.24%)

For comparison, both are at ~100% when raising the limit is _not_ in use
(which is expected of course, but just to make sure):

root@snat:~/microbenchmark# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:50000000 no jump table: 13704083238
global:50000000 jump table : 13746838060 (100.31%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13753854740
global:50000000 jump table : 13746624470 (99.95%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13707053714
global:50000000 jump table : 13746682002 (100.29%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13708843624
global:50000000 jump table : 13749733040 (100.30%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13707365404
global:50000000 jump table : 13747683096 (100.29%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13707014114
global:50000000 jump table : 13746444272 (100.29%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13709596158
global:50000000 jump table : 13750499176 (100.30%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13709484118
global:50000000 jump table : 13747952446 (100.28%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13708873570
global:50000000 jump table : 13748950096 (100.29%)

** Next case would be constantly hitting the first switch case:

diff --git a/switch.c b/switch.c
index 233ec14..fe0a8b0 100644
--- a/switch.c
+++ b/switch.c
@@ -3,7 +3,7 @@ int global;
int
foo (int x)
{
- switch (x & 0xf) {
+ switch (x) {
case 0:
return 11;
case 1:
diff --git a/test.c b/test.c
index 7fc22a4..2849112 100644
--- a/test.c
+++ b/test.c
@@ -5,6 +5,7 @@ extern int foo (int);
extern int foo_no_table (int);

int global = 20;
+int j = 0;

#define LOOP 800000000

@@ -18,7 +19,7 @@ main ()
global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
- foo_no_table (i);
+ foo_no_table (j);
end = __rdtscp (&i);
diff1 = end - start;

@@ -27,7 +28,7 @@ main ()
global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
- foo (i);
+ foo (j);
end = __rdtscp (&i);
diff2 = end - start;

# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:0 no jump table: 6098109200
global:0 jump table : 30717871980 (503.73%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097799330
global:0 jump table : 30727386270 (503.91%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097559796
global:0 jump table : 30715992452 (503.74%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6098532172
global:0 jump table : 30716423870 (503.67%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097429586
global:0 jump table : 30715774634 (503.75%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097813848
global:0 jump table : 30716476820 (503.73%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6096955736
global:0 jump table : 30715385478 (503.78%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6096820240
global:0 jump table : 30719682434 (503.86%)

** And the next case would be constantly hitting the default case:

diff --git a/test.c b/test.c
index 2849112..be9bfc1 100644
--- a/test.c
+++ b/test.c
@@ -5,7 +5,7 @@ extern int foo (int);
extern int foo_no_table (int);

int global = 20;
-int j = 0;
+int j = 1000;

#define LOOP 800000000

# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:0 no jump table: 6422890064
global:0 jump table : 6866072454 (106.90%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6423267608
global:0 jump table : 6866266176 (106.90%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6424721624
global:0 jump table : 6866607842 (106.88%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6424225664
global:0 jump table : 6866843372 (106.89%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6424073830
global:0 jump table : 6866467050 (106.89%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6426515396
global:0 jump table : 6867031640 (106.85%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6425126656
global:0 jump table : 6866352988 (106.87%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6423040024
global:0 jump table : 6867233670 (106.92%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6422256136
global:0 jump table : 6865902094 (106.91%)

I could perhaps also try different distributions for the case selector, but
the observations point in the same direction as what Björn et al have seen
in the XDP case.
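
(For context, the hand-rewrite that motivated this is essentially the
transformation sketched below — the action values follow the uapi XDP enum,
but the dispatch function itself is hypothetical, not actual driver code.
The if/else chain compiles to direct compares regardless of
case-values-threshold, whereas a >4-case switch may become a retpolined
indirect jump under CONFIG_RETPOLINE.)

```c
/* Sketch of the switch -> if/else rewrite done by hand in the XDP work.
 * Values mirror the uapi XDP action codes; the handler is illustrative. */
enum xdp_action_sketch {
	XDP_ABORTED = 0,
	XDP_DROP,
	XDP_PASS,
	XDP_TX,
	XDP_REDIRECT,
};

/* Direct compare/branch chain: no jump table, hence no retpoline cost. */
static const char *xdp_act_name(int act)
{
	if (act == XDP_PASS)
		return "pass";
	else if (act == XDP_TX)
		return "tx";
	else if (act == XDP_REDIRECT)
		return "redirect";
	else if (act == XDP_DROP)
		return "drop";
	else
		return "aborted";
}
```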

Thanks,
Daniel