2019-07-18 01:38:20

by Josh Poimboeuf

[permalink] [raw]
Subject: [PATCH v2 10/22] bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()

On x86-64, with CONFIG_RETPOLINE=n, GCC's "global common subexpression
elimination" optimization results in ___bpf_prog_run()'s jumptable code
changing from this:

select_insn:
jmp *jumptable(, %rax, 8)
...
ALU64_ADD_X:
...
jmp *jumptable(, %rax, 8)
ALU_ADD_X:
...
jmp *jumptable(, %rax, 8)

to this:

select_insn:
mov jumptable, %r12
jmp *(%r12, %rax, 8)
...
ALU64_ADD_X:
...
jmp *(%r12, %rax, 8)
ALU_ADD_X:
...
jmp *(%r12, %rax, 8)

The jumptable address is placed in a register once, at the beginning of
the function. The function execution can then go through multiple
indirect jumps which rely on that same register value. This has a few
issues:

1) Objtool isn't smart enough to be able to track such a register value
across multiple recursive indirect jumps through the jump table.

2) With CONFIG_RETPOLINE enabled, this optimization actually results in
a small slowdown. I measured a ~4.7% slowdown in the test_bpf
"tcpdump port 22" selftest.

This slowdown is actually predicted by the GCC manual:

Note: When compiling a program using computed gotos, a GCC
extension, you may get better run-time performance if you
disable the global common subexpression elimination pass by
adding -fno-gcse to the command line.

So just disable the optimization for this function.

Fixes: e55a73251da3 ("bpf: Fix ORC unwinding in non-JIT BPF code")
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
---
include/linux/compiler-gcc.h | 2 ++
include/linux/compiler_types.h | 4 ++++
kernel/bpf/core.c | 2 +-
3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index e8579412ad21..d7ee4c6bad48 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -170,3 +170,5 @@
#else
#define __diag_GCC_8(s)
#endif
+
+#define __no_fgcse __attribute__((optimize("-fno-gcse")))
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 095d55c3834d..599c27b56c29 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -189,6 +189,10 @@ struct ftrace_likely_data {
#define asm_volatile_goto(x...) asm goto(x)
#endif

+#ifndef __no_fgcse
+# define __no_fgcse
+#endif
+
/* Are two types/vars the same type (ignoring qualifiers)? */
#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7e98f36a14e2..8191a7db2777 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1295,7 +1295,7 @@ bool bpf_opcode_in_insntable(u8 code)
*
* Decode and execute eBPF instructions.
*/
-static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
+static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
{
#define BPF_INSN_2_LBL(x, y) [BPF_##x | BPF_##y] = &&x##_##y
#define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z
--
2.20.1


Subject: [tip:core/urgent] bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()

Commit-ID: 3193c0836f203a91bef96d88c64cccf0be090d9c
Gitweb: https://git.kernel.org/tip/3193c0836f203a91bef96d88c64cccf0be090d9c
Author: Josh Poimboeuf <[email protected]>
AuthorDate: Wed, 17 Jul 2019 20:36:45 -0500
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Jul 2019 21:01:06 +0200

bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()

On x86-64, with CONFIG_RETPOLINE=n, GCC's "global common subexpression
elimination" optimization results in ___bpf_prog_run()'s jumptable code
changing from this:

select_insn:
jmp *jumptable(, %rax, 8)
...
ALU64_ADD_X:
...
jmp *jumptable(, %rax, 8)
ALU_ADD_X:
...
jmp *jumptable(, %rax, 8)

to this:

select_insn:
mov jumptable, %r12
jmp *(%r12, %rax, 8)
...
ALU64_ADD_X:
...
jmp *(%r12, %rax, 8)
ALU_ADD_X:
...
jmp *(%r12, %rax, 8)

The jumptable address is placed in a register once, at the beginning of
the function. The function execution can then go through multiple
indirect jumps which rely on that same register value. This has a few
issues:

1) Objtool isn't smart enough to be able to track such a register value
across multiple recursive indirect jumps through the jump table.

2) With CONFIG_RETPOLINE enabled, this optimization actually results in
a small slowdown. I measured a ~4.7% slowdown in the test_bpf
"tcpdump port 22" selftest.

This slowdown is actually predicted by the GCC manual:

Note: When compiling a program using computed gotos, a GCC
extension, you may get better run-time performance if you
disable the global common subexpression elimination pass by
adding -fno-gcse to the command line.

So just disable the optimization for this function.

Fixes: e55a73251da3 ("bpf: Fix ORC unwinding in non-JIT BPF code")
Reported-by: Randy Dunlap <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/30c3ca29ba037afcbd860a8672eef0021addf9fe.1563413318.git.jpoimboe@redhat.com

---
include/linux/compiler-gcc.h | 2 ++
include/linux/compiler_types.h | 4 ++++
kernel/bpf/core.c | 2 +-
3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index e8579412ad21..d7ee4c6bad48 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -170,3 +170,5 @@
#else
#define __diag_GCC_8(s)
#endif
+
+#define __no_fgcse __attribute__((optimize("-fno-gcse")))
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 095d55c3834d..599c27b56c29 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -189,6 +189,10 @@ struct ftrace_likely_data {
#define asm_volatile_goto(x...) asm goto(x)
#endif

+#ifndef __no_fgcse
+# define __no_fgcse
+#endif
+
/* Are two types/vars the same type (ignoring qualifiers)? */
#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7e98f36a14e2..8191a7db2777 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1295,7 +1295,7 @@ bool bpf_opcode_in_insntable(u8 code)
*
* Decode and execute eBPF instructions.
*/
-static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
+static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
{
#define BPF_INSN_2_LBL(x, y) [BPF_##x | BPF_##y] = &&x##_##y
#define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z

2020-04-29 21:56:14

by Josh Poimboeuf

[permalink] [raw]
Subject: BPF vs objtool again

On Thu, Jul 18, 2019 at 12:14:08PM -0700, tip-bot for Josh Poimboeuf wrote:
> Commit-ID: 3193c0836f203a91bef96d88c64cccf0be090d9c
> Gitweb: https://git.kernel.org/tip/3193c0836f203a91bef96d88c64cccf0be090d9c
> Author: Josh Poimboeuf <[email protected]>
> AuthorDate: Wed, 17 Jul 2019 20:36:45 -0500
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Thu, 18 Jul 2019 21:01:06 +0200
>
> bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()

For some reason, this

__attribute__((optimize("-fno-gcse")))

is disabling frame pointers in ___bpf_prog_run(). If you compile with
CONFIG_FRAME_POINTER it'll show something like:

kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup

Also, since GCC 9.1, the GCC docs say "The optimize attribute should be
used for debugging purposes only. It is not suitable in production
code." That doesn't sound too promising.

So it seems like this commit should be reverted. But then we're back to
objtool being broken again in the RETPOLINE=n case, which means no ORC
coverage in this function. (See above commit for the details)

Some ideas:

- Skip objtool checking of that func/file (at least for RETPOLINE=n) --
but then it won't have ORC coverage.

- Get rid of the "double goto" in ___bpf_prog_run(), which simplifies it
enough for objtool to understand -- but then the text explodes for
RETPOLINE=y.

- Add -fno-gfcse to the Makefile for kernel/bpf/core.c -- but then that
affects the optimization of other functions in the file. However I
don't think the impact is significant.

- Move ___bpf_prog_run() to its own file with the -fno-gfcse flag. I'm
thinking this could be the least bad option. Alexei?

--
Josh

2020-04-29 22:04:09

by Arvind Sankar

[permalink] [raw]
Subject: Re: BPF vs objtool again

On Wed, Apr 29, 2020 at 04:51:59PM -0500, Josh Poimboeuf wrote:
> On Thu, Jul 18, 2019 at 12:14:08PM -0700, tip-bot for Josh Poimboeuf wrote:
> > Commit-ID: 3193c0836f203a91bef96d88c64cccf0be090d9c
> > Gitweb: https://git.kernel.org/tip/3193c0836f203a91bef96d88c64cccf0be090d9c
> > Author: Josh Poimboeuf <[email protected]>
> > AuthorDate: Wed, 17 Jul 2019 20:36:45 -0500
> > Committer: Thomas Gleixner <[email protected]>
> > CommitDate: Thu, 18 Jul 2019 21:01:06 +0200
> >
> > bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
>
> For some reason, this
>
> __attribute__((optimize("-fno-gcse")))
>
> is disabling frame pointers in ___bpf_prog_run(). If you compile with
> CONFIG_FRAME_POINTER it'll show something like:
>
> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup
>
> Also, since GCC 9.1, the GCC docs say "The optimize attribute should be
> used for debugging purposes only. It is not suitable in production
> code." That doesn't sound too promising.
>

It turns out that the optimize attribute doesn't append options to the
command-line arguments, it starts from the defaults and only adds
whatever you specify in the attribute. So it's not very useful for
production code.

See this for eg where the same thing came up in a different context.
https://lore.kernel.org/lkml/[email protected]/

2020-04-29 23:45:59

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: BPF vs objtool again

On Wed, Apr 29, 2020 at 04:51:59PM -0500, Josh Poimboeuf wrote:
> On Thu, Jul 18, 2019 at 12:14:08PM -0700, tip-bot for Josh Poimboeuf wrote:
> > Commit-ID: 3193c0836f203a91bef96d88c64cccf0be090d9c
> > Gitweb: https://git.kernel.org/tip/3193c0836f203a91bef96d88c64cccf0be090d9c
> > Author: Josh Poimboeuf <[email protected]>
> > AuthorDate: Wed, 17 Jul 2019 20:36:45 -0500
> > Committer: Thomas Gleixner <[email protected]>
> > CommitDate: Thu, 18 Jul 2019 21:01:06 +0200
> >
> > bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
>
> For some reason, this
>
> __attribute__((optimize("-fno-gcse")))
>
> is disabling frame pointers in ___bpf_prog_run(). If you compile with
> CONFIG_FRAME_POINTER it'll show something like:
>
> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup

you mean it started to disable frame pointers from some version of gcc?
It wasn't doing this before, since objtool wasn't complaining, right?
Sounds like gcc bug?

> Also, since GCC 9.1, the GCC docs say "The optimize attribute should be
> used for debugging purposes only. It is not suitable in production
> code." That doesn't sound too promising.
>
> So it seems like this commit should be reverted. But then we're back to
> objtool being broken again in the RETPOLINE=n case, which means no ORC
> coverage in this function. (See above commit for the details)
>
> Some ideas:
>
> - Skip objtool checking of that func/file (at least for RETPOLINE=n) --
> but then it won't have ORC coverage.
>
> - Get rid of the "double goto" in ___bpf_prog_run(), which simplifies it
> enough for objtool to understand -- but then the text explodes for
> RETPOLINE=y.

How that will look like?
That could be the best option.

> - Add -fno-gfcse to the Makefile for kernel/bpf/core.c -- but then that
> affects the optimization of other functions in the file. However I
> don't think the impact is significant.
>
> - Move ___bpf_prog_run() to its own file with the -fno-gfcse flag. I'm
> thinking this could be the least bad option. Alexei?

I think it would be easier to move some of the hot path
functions out of core.c instead.
Like *ksym*, BPF_CALL*, bpf_jit*, bpf_prog*.
I think resulting churn will be less.
imo it's more important to keep git blame history for interpreter
than for the other funcs.
Sounds like it's a fix that needs to be sent for the next RC ?
Please send a patch for bpf tree then.

Daniel, thoughts?

2020-04-30 00:15:12

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: BPF vs objtool again

On Wed, Apr 29, 2020 at 04:41:59PM -0700, Alexei Starovoitov wrote:
> On Wed, Apr 29, 2020 at 04:51:59PM -0500, Josh Poimboeuf wrote:
> > On Thu, Jul 18, 2019 at 12:14:08PM -0700, tip-bot for Josh Poimboeuf wrote:
> > > Commit-ID: 3193c0836f203a91bef96d88c64cccf0be090d9c
> > > Gitweb: https://git.kernel.org/tip/3193c0836f203a91bef96d88c64cccf0be090d9c
> > > Author: Josh Poimboeuf <[email protected]>
> > > AuthorDate: Wed, 17 Jul 2019 20:36:45 -0500
> > > Committer: Thomas Gleixner <[email protected]>
> > > CommitDate: Thu, 18 Jul 2019 21:01:06 +0200
> > >
> > > bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
> >
> > For some reason, this
> >
> > __attribute__((optimize("-fno-gcse")))
> >
> > is disabling frame pointers in ___bpf_prog_run(). If you compile with
> > CONFIG_FRAME_POINTER it'll show something like:
> >
> > kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup
>
> you mean it started to disable frame pointers from some version of gcc?
> It wasn't doing this before, since objtool wasn't complaining, right?
> Sounds like gcc bug?

I actually think this warning has been around for a while. I just only
recently looked at it. I don't think anything changed in GCC, it's just
that almost nobody uses CONFIG_FRAME_POINTER, so it wasn't really
noticed.

> > Also, since GCC 9.1, the GCC docs say "The optimize attribute should be
> > used for debugging purposes only. It is not suitable in production
> > code." That doesn't sound too promising.
> >
> > So it seems like this commit should be reverted. But then we're back to
> > objtool being broken again in the RETPOLINE=n case, which means no ORC
> > coverage in this function. (See above commit for the details)
> >
> > Some ideas:
> >
> > - Skip objtool checking of that func/file (at least for RETPOLINE=n) --
> > but then it won't have ORC coverage.
> >
> > - Get rid of the "double goto" in ___bpf_prog_run(), which simplifies it
> > enough for objtool to understand -- but then the text explodes for
> > RETPOLINE=y.
>
> How that will look like?
> That could be the best option.

For example:

#define GOTO ({ goto *jumptable[insn->code]; })

and then replace all 'goto select_insn' with 'GOTO;'

The problem is that with RETPOLINE=y, the function text size grows from
5k to 7k, because for each of the 160+ retpoline JMPs, GCC (stupidly)
reloads the jump table register into a scratch register.

> > - Add -fno-gfcse to the Makefile for kernel/bpf/core.c -- but then that
> > affects the optimization of other functions in the file. However I
> > don't think the impact is significant.
> >
> > - Move ___bpf_prog_run() to its own file with the -fno-gfcse flag. I'm
> > thinking this could be the least bad option. Alexei?
>
> I think it would be easier to move some of the hot path
> functions out of core.c instead.
> Like *ksym*, BPF_CALL*, bpf_jit*, bpf_prog*.
> I think resulting churn will be less.
> imo it's more important to keep git blame history for interpreter
> than for the other funcs.
> Sounds like it's a fix that needs to be sent for the next RC ?
> Please send a patch for bpf tree then.

I can make a patch, what file would you recommend moving those hot path
functions to?

--
Josh

2020-04-30 02:12:51

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: BPF vs objtool again

On Wed, Apr 29, 2020 at 07:13:00PM -0500, Josh Poimboeuf wrote:
> On Wed, Apr 29, 2020 at 04:41:59PM -0700, Alexei Starovoitov wrote:
> > On Wed, Apr 29, 2020 at 04:51:59PM -0500, Josh Poimboeuf wrote:
> > > On Thu, Jul 18, 2019 at 12:14:08PM -0700, tip-bot for Josh Poimboeuf wrote:
> > > > Commit-ID: 3193c0836f203a91bef96d88c64cccf0be090d9c
> > > > Gitweb: https://git.kernel.org/tip/3193c0836f203a91bef96d88c64cccf0be090d9c
> > > > Author: Josh Poimboeuf <[email protected]>
> > > > AuthorDate: Wed, 17 Jul 2019 20:36:45 -0500
> > > > Committer: Thomas Gleixner <[email protected]>
> > > > CommitDate: Thu, 18 Jul 2019 21:01:06 +0200
> > > >
> > > > bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
> > >
> > > For some reason, this
> > >
> > > __attribute__((optimize("-fno-gcse")))
> > >
> > > is disabling frame pointers in ___bpf_prog_run(). If you compile with
> > > CONFIG_FRAME_POINTER it'll show something like:
> > >
> > > kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup
> >
> > you mean it started to disable frame pointers from some version of gcc?
> > It wasn't doing this before, since objtool wasn't complaining, right?
> > Sounds like gcc bug?
>
> I actually think this warning has been around for a while. I just only
> recently looked at it. I don't think anything changed in GCC, it's just
> that almost nobody uses CONFIG_FRAME_POINTER, so it wasn't really
> noticed.
>
> > > Also, since GCC 9.1, the GCC docs say "The optimize attribute should be
> > > used for debugging purposes only. It is not suitable in production
> > > code." That doesn't sound too promising.
> > >
> > > So it seems like this commit should be reverted. But then we're back to
> > > objtool being broken again in the RETPOLINE=n case, which means no ORC
> > > coverage in this function. (See above commit for the details)
> > >
> > > Some ideas:
> > >
> > > - Skip objtool checking of that func/file (at least for RETPOLINE=n) --
> > > but then it won't have ORC coverage.
> > >
> > > - Get rid of the "double goto" in ___bpf_prog_run(), which simplifies it
> > > enough for objtool to understand -- but then the text explodes for
> > > RETPOLINE=y.
> >
> > How that will look like?
> > That could be the best option.
>
> For example:
>
> #define GOTO ({ goto *jumptable[insn->code]; })
>
> and then replace all 'goto select_insn' with 'GOTO;'
>
> The problem is that with RETPOLINE=y, the function text size grows from
> 5k to 7k, because for each of the 160+ retpoline JMPs, GCC (stupidly)
> reloads the jump table register into a scratch register.

that would be a tiny change, right?
I'd rather go with that and gate it with 'ifdef CONFIG_FRAME_POINTER'
Like:
#ifndef CONFIG_FRAME_POINTER
#define CONT ({ insn++; goto select_insn; })
#define CONT_JMP ({ insn++; goto select_insn; })
#else
#define CONT ({ insn++; goto *jumptable[insn->code]; })
#define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
#endif

The reason this CONT and CONT_JMP macros are there because a combination
of different gcc versions together with different cpus make branch predictor
behave 'unpredictably'.

I've played with CONT and CONT_JMP either both doing direct goto or
indirect goto and observed quite different performance characteristics
from the interpreter.
What you see right now was the best tune I could get from a set of cpus
I had to play with and compilers. If I did the same tuning today the outcome
could have been different.
So I think it's totally fine to use above code. I think some cpus may actually
see performance gains with certain versions of gcc.
The retpoline text increase is unfortunate but when retpoline is on
other security knobs should be on too. In particular CONFIG_BPF_JIT_ALWAYS_ON
should be on as well. Which will remove interpreter from .text completely.

2020-04-30 03:57:33

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: BPF vs objtool again

On Wed, Apr 29, 2020 at 07:10:52PM -0700, Alexei Starovoitov wrote:
> > For example:
> >
> > #define GOTO ({ goto *jumptable[insn->code]; })
> >
> > and then replace all 'goto select_insn' with 'GOTO;'
> >
> > The problem is that with RETPOLINE=y, the function text size grows from
> > 5k to 7k, because for each of the 160+ retpoline JMPs, GCC (stupidly)
> > reloads the jump table register into a scratch register.
>
> that would be a tiny change, right?
> I'd rather go with that and gate it with 'ifdef CONFIG_FRAME_POINTER'
> Like:
> #ifndef CONFIG_FRAME_POINTER
> #define CONT ({ insn++; goto select_insn; })
> #define CONT_JMP ({ insn++; goto select_insn; })
> #else
> #define CONT ({ insn++; goto *jumptable[insn->code]; })
> #define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
> #endif
>
> The reason this CONT and CONT_JMP macros are there because a combination
> of different gcc versions together with different cpus make branch predictor
> behave 'unpredictably'.
>
> I've played with CONT and CONT_JMP either both doing direct goto or
> indirect goto and observed quite different performance characteristics
> from the interpreter.
> What you see right now was the best tune I could get from a set of cpus
> I had to play with and compilers. If I did the same tuning today the outcome
> could have been different.
> So I think it's totally fine to use above code. I think some cpus may actually
> see performance gains with certain versions of gcc.
> The retpoline text increase is unfortunate but when retpoline is on
> other security knobs should be on too. In particular CONFIG_BPF_JIT_ALWAYS_ON
> should be on as well. Which will remove interpreter from .text completely.

This would actually be contingent on RETPOLINE, not FRAME_POINTER.

(FRAME_POINTER was the other issue with the "optimize" attribute, which
we're reverting so it'll no longer be a problem.)

So if you're not concerned about the retpoline text growth, it could be
as simple as:

#define CONT ({ insn++; goto *jumptable[insn->code]; })
#define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })


Or, if you wanted to avoid the text growth, it could be:

#ifdef CONFIG_RETPOLINE
/*
* Avoid a 40% increase in function text size by getting GCC to generate a
* single retpoline jump instead of 160+.
*/
#define CONT ({ insn++; goto select_insn; })
#define CONT_JMP ({ insn++; goto select_insn; })

select_insn:
goto *jumptable[insn->code];

#else /* !CONFIG_RETPOLINE */
#define CONT ({ insn++; goto *jumptable[insn->code]; })
#define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
#endif /* CONFIG_RETPOLINE */


But since this is legacy path, I think the first one is much nicer.


Also, JMP_TAIL_CALL has a "goto select_insn", is it ok to convert that
to CONT?

--
Josh

2020-04-30 04:25:53

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: BPF vs objtool again

On Wed, Apr 29, 2020 at 10:53:15PM -0500, Josh Poimboeuf wrote:
> On Wed, Apr 29, 2020 at 07:10:52PM -0700, Alexei Starovoitov wrote:
> > > For example:
> > >
> > > #define GOTO ({ goto *jumptable[insn->code]; })
> > >
> > > and then replace all 'goto select_insn' with 'GOTO;'
> > >
> > > The problem is that with RETPOLINE=y, the function text size grows from
> > > 5k to 7k, because for each of the 160+ retpoline JMPs, GCC (stupidly)
> > > reloads the jump table register into a scratch register.
> >
> > that would be a tiny change, right?
> > I'd rather go with that and gate it with 'ifdef CONFIG_FRAME_POINTER'
> > Like:
> > #ifndef CONFIG_FRAME_POINTER
> > #define CONT ({ insn++; goto select_insn; })
> > #define CONT_JMP ({ insn++; goto select_insn; })
> > #else
> > #define CONT ({ insn++; goto *jumptable[insn->code]; })
> > #define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
> > #endif
> >
> > The reason this CONT and CONT_JMP macros are there because a combination
> > of different gcc versions together with different cpus make branch predictor
> > behave 'unpredictably'.
> >
> > I've played with CONT and CONT_JMP either both doing direct goto or
> > indirect goto and observed quite different performance characteristics
> > from the interpreter.
> > What you see right now was the best tune I could get from a set of cpus
> > I had to play with and compilers. If I did the same tuning today the outcome
> > could have been different.
> > So I think it's totally fine to use above code. I think some cpus may actually
> > see performance gains with certain versions of gcc.
> > The retpoline text increase is unfortunate but when retpoline is on
> > other security knobs should be on too. In particular CONFIG_BPF_JIT_ALWAYS_ON
> > should be on as well. Which will remove interpreter from .text completely.
>
> This would actually be contingent on RETPOLINE, not FRAME_POINTER.
>
> (FRAME_POINTER was the other issue with the "optimize" attribute, which
> we're reverting so it'll no longer be a problem.)
>
> So if you're not concerned about the retpoline text growth, it could be
> as simple as:
>
> #define CONT ({ insn++; goto *jumptable[insn->code]; })
> #define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
>
>
> Or, if you wanted to avoid the text growth, it could be:
>
> #ifdef CONFIG_RETPOLINE

I'm a bit lost. So objtool is fine with the asm when retpoline is on?
Then pls do:
#if defined(CONFIG_RETPOLINE) || !defined(CONFIG_X86)

since there is no need to mess with other archs.

> /*
> * Avoid a 40% increase in function text size by getting GCC to generate a
> * single retpoline jump instead of 160+.
> */
> #define CONT ({ insn++; goto select_insn; })
> #define CONT_JMP ({ insn++; goto select_insn; })
>
> select_insn:
> goto *jumptable[insn->code];
>
> #else /* !CONFIG_RETPOLINE */
> #define CONT ({ insn++; goto *jumptable[insn->code]; })
> #define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
> #endif /* CONFIG_RETPOLINE */
>
>
> But since this is legacy path, I think the first one is much nicer.
>
>
> Also, JMP_TAIL_CALL has a "goto select_insn", is it ok to convert that
> to CONT?

yep

2020-04-30 04:45:10

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: BPF vs objtool again

On Wed, Apr 29, 2020 at 09:24:00PM -0700, Alexei Starovoitov wrote:
> > This would actually be contingent on RETPOLINE, not FRAME_POINTER.
> >
> > (FRAME_POINTER was the other issue with the "optimize" attribute, which
> > we're reverting so it'll no longer be a problem.)
> >
> > So if you're not concerned about the retpoline text growth, it could be
> > as simple as:
> >
> > #define CONT ({ insn++; goto *jumptable[insn->code]; })
> > #define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
> >
> >
> > Or, if you wanted to avoid the text growth, it could be:
> >
> > #ifdef CONFIG_RETPOLINE
>
> I'm a bit lost. So objtool is fine with the asm when retpoline is on?

Yeah, it's confusing... this has been quite an adventure with GCC.

Objtool is fine with the RETPOLINE double goto. It's only the
!RETPOLINE double goto which is the problem, because that triggers
more GCC weirdness (see 3193c0836f20).

> Then pls do:
> #if defined(CONFIG_RETPOLINE) || !defined(CONFIG_X86)
>
> since there is no need to mess with other archs.

Getting rid of select_insn altogether would make the code a lot simpler,
but it's your call. I'll make a patch soon.

--
Josh