LinuxLists.cc - [PATCH bpf-next v2 0/3] arm64 BPF JIT Optimizations

2020-05-08 18:17:48

Subject: [PATCH bpf-next v2 0/3] arm64 BPF JIT Optimizations

This patch series introduces several optimizations to the arm64 BPF JIT.
The optimizations make use of arm64 immediate instructions to avoid
loading BPF immediates to temporary registers, when possible.

In the process, we discovered two bugs in the logical immediate encoding
function in arch/arm64/kernel/insn.c using Serval. The series also fixes
the two bugs before introducing the optimizations.

Tested on aarch64 QEMU virt machine using test_bpf and test_verifier.

v2:
- Cleaned up patch to insn.c.
(Marc Zyngier, Will Deacon)

Luke Nelson (3):
arm64: insn: Fix two bugs in encoding 32-bit logical immediates
bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical
immediates
bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates

arch/arm64/kernel/insn.c | 14 +++----
arch/arm64/net/bpf_jit.h | 22 +++++++++++
arch/arm64/net/bpf_jit_comp.c | 73 ++++++++++++++++++++++++++++-------
3 files changed, 88 insertions(+), 21 deletions(-)

Cc: Xi Wang <[email protected]>

--
2.17.1

2020-05-08 18:18:02

by Luke Nelson

[permalink] [raw]

Subject: [PATCH bpf-next v2 1/3] arm64: insn: Fix two bugs in encoding 32-bit logical immediates

This patch fixes two issues present in the current function for encoding
arm64 logical immediates when using the 32-bit variants of instructions.

First, the code does not correctly reject an all-ones 32-bit immediate,
and returns an undefined instruction encoding.

Second, the code incorrectly rejects some 32-bit immediates that are
actually encodable as logical immediates. The root cause is that the code
uses a default mask of 64-bit all-ones, even for 32-bit immediates.
This causes an issue later on when the default mask is used to fill the
top bits of the immediate with ones, shown here:

/*
* Pattern: 0..01..10..01..1
*
* Fill the unused top bits with ones, and check if
* the result is a valid immediate (all ones with a
* contiguous ranges of zeroes).
*/
imm |= ~mask;
if (!range_of_ones(~imm))
return AARCH64_BREAK_FAULT;

To see the problem, consider an immediate of the form 0..01..10..01..1,
where the upper 32 bits are zero, such as 0x80000001. The code checks
if ~(imm | ~mask) contains a range of ones: the incorrect mask yields
1..10..01..10..0, which fails the check; the correct mask yields
0..01..10..0, which succeeds.

The fix for both issues is to generate a correct mask based on the
instruction immediate size, and use the mask to check for all-ones,
all-zeroes, and values wider than the mask.

Currently, arch/arm64/kvm/va_layout.c is the only user of this function,
which uses 64-bit immediates and therefore won't trigger these bugs.

We tested the new code against llvm-mc with all 1,302 encodable 32-bit
logical immediates and all 5,334 encodable 64-bit logical immediates.

Fixes: ef3935eeebff ("arm64: insn: Add encoder for bitwise operations using literals")
Co-developed-by: Xi Wang <[email protected]>
Signed-off-by: Xi Wang <[email protected]>
Signed-off-by: Luke Nelson <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Suggested-by: Will Deacon <[email protected]>
---
arch/arm64/kernel/insn.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
index 4a9e773a177f..cc2f3d901c91 100644
--- a/arch/arm64/kernel/insn.c
+++ b/arch/arm64/kernel/insn.c
@@ -1535,16 +1535,10 @@ static u32 aarch64_encode_immediate(u64 imm,
u32 insn)
{
unsigned int immr, imms, n, ones, ror, esz, tmp;
- u64 mask = ~0UL;
-
- /* Can't encode full zeroes or full ones */
- if (!imm || !~imm)
- return AARCH64_BREAK_FAULT;
+ u64 mask;

switch (variant) {
case AARCH64_INSN_VARIANT_32BIT:
- if (upper_32_bits(imm))
- return AARCH64_BREAK_FAULT;
esz = 32;
break;
case AARCH64_INSN_VARIANT_64BIT:
@@ -1556,6 +1550,12 @@ static u32 aarch64_encode_immediate(u64 imm,
return AARCH64_BREAK_FAULT;
}

+ mask = GENMASK(esz - 1, 0);
+
+ /* Can't encode full zeroes, full ones, or value wider than the mask */
+ if (!imm || imm == mask || imm & ~mask)
+ return AARCH64_BREAK_FAULT;
+
/*
* Inverse of Replicate(). Try to spot a repeating pattern
* with a pow2 stride.
--
2.17.1

2020-05-08 18:18:36

by Luke Nelson

[permalink] [raw]

Subject: [PATCH bpf-next v2 3/3] bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates

The current code for BPF_{ADD,SUB} BPF_K loads the BPF immediate to a
temporary register before performing the addition/subtraction. Similarly,
BPF_JMP BPF_K cases load the immediate to a temporary register before
comparison.

This patch introduces optimizations that use arm64 immediate add, sub,
cmn, or cmp instructions when the BPF immediate fits. If the immediate
does not fit, it falls back to using a temporary register.

Example of generated code for BPF_ALU64_IMM(BPF_ADD, R0, 2):

without optimization:

24: mov x10, #0x2
28: add x7, x7, x10

with optimization:

24: add x7, x7, #0x2

The code could use A64_{ADD,SUB}_I directly and check if it returns
AARCH64_BREAK_FAULT, similar to how logical immediates are handled.
However, aarch64_insn_gen_add_sub_imm from insn.c prints error messages
when the immediate does not fit, and it's simpler to check if the
immediate fits ahead of time.

Co-developed-by: Xi Wang <[email protected]>
Signed-off-by: Xi Wang <[email protected]>
Signed-off-by: Luke Nelson <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
---
arch/arm64/net/bpf_jit.h | 8 ++++++++
arch/arm64/net/bpf_jit_comp.c | 36 +++++++++++++++++++++++++++++------
2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h
index f36a779949e6..923ae7ff68c8 100644
--- a/arch/arm64/net/bpf_jit.h
+++ b/arch/arm64/net/bpf_jit.h
@@ -100,6 +100,14 @@
/* Rd = Rn OP imm12 */
#define A64_ADD_I(sf, Rd, Rn, imm12) A64_ADDSUB_IMM(sf, Rd, Rn, imm12, ADD)
#define A64_SUB_I(sf, Rd, Rn, imm12) A64_ADDSUB_IMM(sf, Rd, Rn, imm12, SUB)
+#define A64_ADDS_I(sf, Rd, Rn, imm12) \
+ A64_ADDSUB_IMM(sf, Rd, Rn, imm12, ADD_SETFLAGS)
+#define A64_SUBS_I(sf, Rd, Rn, imm12) \
+ A64_ADDSUB_IMM(sf, Rd, Rn, imm12, SUB_SETFLAGS)
+/* Rn + imm12; set condition flags */
+#define A64_CMN_I(sf, Rn, imm12) A64_ADDS_I(sf, A64_ZR, Rn, imm12)
+/* Rn - imm12; set condition flags */
+#define A64_CMP_I(sf, Rn, imm12) A64_SUBS_I(sf, A64_ZR, Rn, imm12)
/* Rd = Rn */
#define A64_MOV(sf, Rd, Rn) A64_ADD_I(sf, Rd, Rn, 0)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 083e5d8a5e2c..561a2fea9cdd 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -167,6 +167,12 @@ static inline int epilogue_offset(const struct jit_ctx *ctx)
return to - from;
}

+static bool is_addsub_imm(u32 imm)
+{
+ /* Either imm12 or shifted imm12. */
+ return !(imm & ~0xfff) || !(imm & ~0xfff000);
+}
+
/* Stack must be multiples of 16B */
#define STACK_ALIGN(sz) (((sz) + 15) & ~15)

@@ -479,13 +485,25 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
/* dst = dst OP imm */
case BPF_ALU | BPF_ADD | BPF_K:
case BPF_ALU64 | BPF_ADD | BPF_K:
- emit_a64_mov_i(is64, tmp, imm, ctx);
- emit(A64_ADD(is64, dst, dst, tmp), ctx);
+ if (is_addsub_imm(imm)) {
+ emit(A64_ADD_I(is64, dst, dst, imm), ctx);
+ } else if (is_addsub_imm(-imm)) {
+ emit(A64_SUB_I(is64, dst, dst, -imm), ctx);
+ } else {
+ emit_a64_mov_i(is64, tmp, imm, ctx);
+ emit(A64_ADD(is64, dst, dst, tmp), ctx);
+ }
break;
case BPF_ALU | BPF_SUB | BPF_K:
case BPF_ALU64 | BPF_SUB | BPF_K:
- emit_a64_mov_i(is64, tmp, imm, ctx);
- emit(A64_SUB(is64, dst, dst, tmp), ctx);
+ if (is_addsub_imm(imm)) {
+ emit(A64_SUB_I(is64, dst, dst, imm), ctx);
+ } else if (is_addsub_imm(-imm)) {
+ emit(A64_ADD_I(is64, dst, dst, -imm), ctx);
+ } else {
+ emit_a64_mov_i(is64, tmp, imm, ctx);
+ emit(A64_SUB(is64, dst, dst, tmp), ctx);
+ }
break;
case BPF_ALU | BPF_AND | BPF_K:
case BPF_ALU64 | BPF_AND | BPF_K:
@@ -639,8 +657,14 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
case BPF_JMP32 | BPF_JSLT | BPF_K:
case BPF_JMP32 | BPF_JSGE | BPF_K:
case BPF_JMP32 | BPF_JSLE | BPF_K:
- emit_a64_mov_i(is64, tmp, imm, ctx);
- emit(A64_CMP(is64, dst, tmp), ctx);
+ if (is_addsub_imm(imm)) {
+ emit(A64_CMP_I(is64, dst, imm), ctx);
+ } else if (is_addsub_imm(-imm)) {
+ emit(A64_CMN_I(is64, dst, -imm), ctx);
+ } else {
+ emit_a64_mov_i(is64, tmp, imm, ctx);
+ emit(A64_CMP(is64, dst, tmp), ctx);
+ }
goto emit_cond_jmp;
case BPF_JMP | BPF_JSET | BPF_K:
case BPF_JMP32 | BPF_JSET | BPF_K:
--
2.17.1

2020-05-08 18:18:52

by Luke Nelson

[permalink] [raw]

Subject: [PATCH bpf-next v2 2/3] bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates

The current code for BPF_{AND,OR,XOR,JSET} BPF_K loads the immediate to
a temporary register before use.

This patch changes the code to avoid using a temporary register
when the BPF immediate is encodable using an arm64 logical immediate
instruction. If the encoding fails (due to the immediate not being
encodable), it falls back to using a temporary register.

Example of generated code for BPF_ALU32_IMM(BPF_AND, R0, 0x80000001):

without optimization:

24: mov w10, #0x8000ffff
28: movk w10, #0x1
2c: and w7, w7, w10

with optimization:

24: and w7, w7, #0x80000001

Since the encoding process is quite complex, the JIT reuses existing
functionality in arch/arm64/kernel/insn.c for encoding logical immediates
rather than duplicate it in the JIT.

Co-developed-by: Xi Wang <[email protected]>
Signed-off-by: Xi Wang <[email protected]>
Signed-off-by: Luke Nelson <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
---
arch/arm64/net/bpf_jit.h | 14 +++++++++++++
arch/arm64/net/bpf_jit_comp.c | 37 +++++++++++++++++++++++++++--------
2 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h
index eb73f9f72c46..f36a779949e6 100644
--- a/arch/arm64/net/bpf_jit.h
+++ b/arch/arm64/net/bpf_jit.h
@@ -189,4 +189,18 @@
/* Rn & Rm; set condition flags */
#define A64_TST(sf, Rn, Rm) A64_ANDS(sf, A64_ZR, Rn, Rm)

+/* Logical (immediate) */
+#define A64_LOGIC_IMM(sf, Rd, Rn, imm, type) ({ \
+ u64 imm64 = (sf) ? (u64)imm : (u64)(u32)imm; \
+ aarch64_insn_gen_logical_immediate(AARCH64_INSN_LOGIC_##type, \
+ A64_VARIANT(sf), Rn, Rd, imm64); \
+})
+/* Rd = Rn OP imm */
+#define A64_AND_I(sf, Rd, Rn, imm) A64_LOGIC_IMM(sf, Rd, Rn, imm, AND)
+#define A64_ORR_I(sf, Rd, Rn, imm) A64_LOGIC_IMM(sf, Rd, Rn, imm, ORR)
+#define A64_EOR_I(sf, Rd, Rn, imm) A64_LOGIC_IMM(sf, Rd, Rn, imm, EOR)
+#define A64_ANDS_I(sf, Rd, Rn, imm) A64_LOGIC_IMM(sf, Rd, Rn, imm, AND_SETFLAGS)
+/* Rn & imm; set condition flags */
+#define A64_TST_I(sf, Rn, imm) A64_ANDS_I(sf, A64_ZR, Rn, imm)
+
#endif /* _BPF_JIT_H */
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index cdc79de0c794..083e5d8a5e2c 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -356,6 +356,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
const bool isdw = BPF_SIZE(code) == BPF_DW;
u8 jmp_cond, reg;
s32 jmp_offset;
+ u32 a64_insn;

#define check_imm(bits, imm) do { \
if ((((imm) > 0) && ((imm) >> (bits))) || \
@@ -488,18 +489,33 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
break;
case BPF_ALU | BPF_AND | BPF_K:
case BPF_ALU64 | BPF_AND | BPF_K:
- emit_a64_mov_i(is64, tmp, imm, ctx);
- emit(A64_AND(is64, dst, dst, tmp), ctx);
+ a64_insn = A64_AND_I(is64, dst, dst, imm);
+ if (a64_insn != AARCH64_BREAK_FAULT) {
+ emit(a64_insn, ctx);
+ } else {
+ emit_a64_mov_i(is64, tmp, imm, ctx);
+ emit(A64_AND(is64, dst, dst, tmp), ctx);
+ }
break;
case BPF_ALU | BPF_OR | BPF_K:
case BPF_ALU64 | BPF_OR | BPF_K:
- emit_a64_mov_i(is64, tmp, imm, ctx);
- emit(A64_ORR(is64, dst, dst, tmp), ctx);
+ a64_insn = A64_ORR_I(is64, dst, dst, imm);
+ if (a64_insn != AARCH64_BREAK_FAULT) {
+ emit(a64_insn, ctx);
+ } else {
+ emit_a64_mov_i(is64, tmp, imm, ctx);
+ emit(A64_ORR(is64, dst, dst, tmp), ctx);
+ }
break;
case BPF_ALU | BPF_XOR | BPF_K:
case BPF_ALU64 | BPF_XOR | BPF_K:
- emit_a64_mov_i(is64, tmp, imm, ctx);
- emit(A64_EOR(is64, dst, dst, tmp), ctx);
+ a64_insn = A64_EOR_I(is64, dst, dst, imm);
+ if (a64_insn != AARCH64_BREAK_FAULT) {
+ emit(a64_insn, ctx);
+ } else {
+ emit_a64_mov_i(is64, tmp, imm, ctx);
+ emit(A64_EOR(is64, dst, dst, tmp), ctx);
+ }
break;
case BPF_ALU | BPF_MUL | BPF_K:
case BPF_ALU64 | BPF_MUL | BPF_K:
@@ -628,8 +644,13 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
goto emit_cond_jmp;
case BPF_JMP | BPF_JSET | BPF_K:
case BPF_JMP32 | BPF_JSET | BPF_K:
- emit_a64_mov_i(is64, tmp, imm, ctx);
- emit(A64_TST(is64, dst, tmp), ctx);
+ a64_insn = A64_TST_I(is64, dst, imm);
+ if (a64_insn != AARCH64_BREAK_FAULT) {
+ emit(a64_insn, ctx);
+ } else {
+ emit_a64_mov_i(is64, tmp, imm, ctx);
+ emit(A64_TST(is64, dst, tmp), ctx);
+ }
goto emit_cond_jmp;
/* function call */
case BPF_JMP | BPF_CALL:
--
2.17.1

2020-05-11 12:59:17

by Will Deacon

[permalink] [raw]

Subject: Re: [PATCH bpf-next v2 0/3] arm64 BPF JIT Optimizations

On Fri, 8 May 2020 11:15:43 -0700, Luke Nelson wrote:
> This patch series introduces several optimizations to the arm64 BPF JIT.
> The optimizations make use of arm64 immediate instructions to avoid
> loading BPF immediates to temporary registers, when possible.
>
> In the process, we discovered two bugs in the logical immediate encoding
> function in arch/arm64/kernel/insn.c using Serval. The series also fixes
> the two bugs before introducing the optimizations.
>
> [...]

Applied to arm64 (for-next/bpf), thanks!

[1/3] arm64: insn: Fix two bugs in encoding 32-bit logical immediates
https://git.kernel.org/arm64/c/579d1b3faa37
[2/3] bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates
https://git.kernel.org/arm64/c/fd49591cb49b
[3/3] bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates
https://git.kernel.org/arm64/c/fd868f148189

Cheers,
--
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev