2024-04-09 09:51:05

by Puranjay Mohan

[permalink] [raw]
Subject: [PATCH bpf] arm32, bpf: Fix sign-extension mov instruction

The current implementation of the mov instruction with sign extension
clobbers the source register because it sign extends the source and then
moves it to the destination.

Fix this by moving the src to a temporary register before doing the sign
extension only if src is not an emulated register (on the scratch stack).

Also fix the emit_a32_movsx_r64() to put the register back on scratch
stack if that register is emulated on stack.

Fixes: fc832653fa0d ("arm32, bpf: add support for sign-extension mov instruction")
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Puranjay Mohan <[email protected]>
---
arch/arm/net/bpf_jit_32.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 1d672457d02f..8fde6ab66cb4 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -878,6 +878,13 @@ static inline void emit_a32_mov_r(const s8 dst, const s8 src, const u8 off,

rt = arm_bpf_get_reg32(src, tmp[0], ctx);
if (off && off != 32) {
+ /* If rt is not a stacked register, move it to tmp, so it doesn't get clobbered by
+ * the shift operations.
+ */
+ if (rt == src) {
+ emit(ARM_MOV_R(tmp[0], rt), ctx);
+ rt = tmp[0];
+ }
emit(ARM_LSL_I(rt, rt, 32 - off), ctx);
emit(ARM_ASR_I(rt, rt, 32 - off), ctx);
}
@@ -919,15 +926,15 @@ static inline void emit_a32_movsx_r64(const bool is64, const u8 off, const s8 ds
const s8 *tmp = bpf2a32[TMP_REG_1];
const s8 *rt;

- rt = arm_bpf_get_reg64(dst, tmp, ctx);
-
emit_a32_mov_r(dst_lo, src_lo, off, ctx);
if (!is64) {
if (!ctx->prog->aux->verifier_zext)
/* Zero out high 4 bytes */
emit_a32_mov_i(dst_hi, 0, ctx);
} else {
+ rt = arm_bpf_get_reg64(dst, tmp, ctx);
emit(ARM_ASR_I(rt[0], rt[1], 31), ctx);
+ arm_bpf_put_reg64(dst, rt, ctx);
}
}

--
2.40.1



2024-04-09 15:42:04

by Russell King (Oracle)

[permalink] [raw]
Subject: Re: [PATCH bpf] arm32, bpf: Fix sign-extension mov instruction

On Tue, Apr 09, 2024 at 09:50:38AM +0000, Puranjay Mohan wrote:
> The current implementation of the mov instruction with sign extension
> clobbers the source register because it sign extends the source and then
> moves it to the destination.
>
> Fix this by moving the src to a temporary register before doing the sign
> extension only if src is not an emulated register (on the scratch stack).
>
> Also fix the emit_a32_movsx_r64() to put the register back on scratch
> stack if that register is emulated on stack.

It would be good to include in the commit message an example or two of
the resulting assembly code so that it's clear what the expected
generation is. Instead, I'm going to have to work it out myself, but
I'm quite sure this is information you already have.

> Fixes: fc832653fa0d ("arm32, bpf: add support for sign-extension mov instruction")
> Reported-by: [email protected]
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Puranjay Mohan <[email protected]>
> ---
> arch/arm/net/bpf_jit_32.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index 1d672457d02f..8fde6ab66cb4 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -878,6 +878,13 @@ static inline void emit_a32_mov_r(const s8 dst, const s8 src, const u8 off,
>
> rt = arm_bpf_get_reg32(src, tmp[0], ctx);
> if (off && off != 32) {
> + /* If rt is not a stacked register, move it to tmp, so it doesn't get clobbered by
> + * the shift operations.
> + */
> + if (rt == src) {
> + emit(ARM_MOV_R(tmp[0], rt), ctx);
> + rt = tmp[0];
> + }

This change is adding inefficiency, don't we want to have the JIT
creating as efficient code as possible within the bounds of
reasonableness?

> emit(ARM_LSL_I(rt, rt, 32 - off), ctx);
> emit(ARM_ASR_I(rt, rt, 32 - off), ctx);

LSL and ASR can very easily take a different source register to the
destination register. All this needs to be is:

emit(ARM_LSL_I(tmp[0], rt, 32 - off), ctx);
emit(ARM_ASR_I(tmp[0], tmp[0], 32 - off), ctx);
rt = tmp[0];

This will generate:

lsl tmp[0], src, #32-off
asr tmp[0], tmp[0], #32-off

and then the store to the output register will occur.

What about the high-32 bits of the register pair - should that be
taking any value?

> }

I notice in passing that the comments are out of sync with the
code - please update the comments along with code changes.

> @@ -919,15 +926,15 @@ static inline void emit_a32_movsx_r64(const bool is64, const u8 off, const s8 ds
> const s8 *tmp = bpf2a32[TMP_REG_1];
> const s8 *rt;
>
> - rt = arm_bpf_get_reg64(dst, tmp, ctx);
> -
> emit_a32_mov_r(dst_lo, src_lo, off, ctx);
> if (!is64) {
> if (!ctx->prog->aux->verifier_zext)
> /* Zero out high 4 bytes */
> emit_a32_mov_i(dst_hi, 0, ctx);
> } else {
> + rt = arm_bpf_get_reg64(dst, tmp, ctx);
> emit(ARM_ASR_I(rt[0], rt[1], 31), ctx);
> + arm_bpf_put_reg64(dst, rt, ctx);
> }
> }

Why oh why oh why are we emitting code to read the source register
(which may be a load), then write it to the destination (which may
be a store) to only then immediately reload from the destination
to then do the sign extension? This is madness.

Please... apply some thought to the code generation from the JIT...
or I will remove you from being a maintainer of this code. I spent
time crafting some parts of the JIT to generate efficient code and
I'm seeing that a lot of that work is now being thrown away by
someone who seemingly doesn't care about generating "good" code.

Why not read the source 32-bit register (potentially into a temporary
register), store it to the destination low register, then do the
sign extension into the destination high register or zero the high
register. We _could_ be a bit more optimal here by checking whether
dst_hi is a stacked register and use that directly for the ASR
instruction, omitting the need to move it there afterwards - whether
that's worth it or not depends on the performance we expect from this
eBPF opcode.

rt = arm_bpf_get_reg32(src_lo, tmp[1], ctx);
/* rt may be either src[1] or tmp[1] */

/* write dst_lo */
arm_bpf_put_reg32(dst_lo, rt, ctx)

if (is64) {
emit(ARM_ASR_I(tmp[0], rt, 31), ctx);
arm_bpf_put_reg32(dst_hi, tmp[0], ctx);
} else if (!ctx->prog->aux->verifier_zext) {
emit_a32_mov_i(dst_hi, 0, ctx);
}

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

2024-04-19 18:52:47

by Puranjay Mohan

[permalink] [raw]
Subject: Re: [PATCH bpf] arm32, bpf: Fix sign-extension mov instruction

"Russell King (Oracle)" <[email protected]> writes:

> On Tue, Apr 09, 2024 at 09:50:38AM +0000, Puranjay Mohan wrote:
>> The current implementation of the mov instruction with sign extension
>> clobbers the source register because it sign extends the source and then
>> moves it to the destination.
>>
>> Fix this by moving the src to a temporary register before doing the sign
>> extension only if src is not an emulated register (on the scratch stack).
>>
>> Also fix the emit_a32_movsx_r64() to put the register back on scratch
>> stack if that register is emulated on stack.
>
> It would be good to include in the commit message an example or two of
> the resulting assembly code so that it's clear what the expected
> generation is. Instead, I'm going to have to work it out myself, but
> I'm quite sure this is information you already have.
>
>> Fixes: fc832653fa0d ("arm32, bpf: add support for sign-extension mov instruction")
>> Reported-by: [email protected]
>> Closes: https://lore.kernel.org/all/[email protected]/
>> Signed-off-by: Puranjay Mohan <[email protected]>
>> ---
>> arch/arm/net/bpf_jit_32.c | 11 +++++++++--
>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
>> index 1d672457d02f..8fde6ab66cb4 100644
>> --- a/arch/arm/net/bpf_jit_32.c
>> +++ b/arch/arm/net/bpf_jit_32.c
>> @@ -878,6 +878,13 @@ static inline void emit_a32_mov_r(const s8 dst, const s8 src, const u8 off,
>>
>> rt = arm_bpf_get_reg32(src, tmp[0], ctx);
>> if (off && off != 32) {
>> + /* If rt is not a stacked register, move it to tmp, so it doesn't get clobbered by
>> + * the shift operations.
>> + */
>> + if (rt == src) {
>> + emit(ARM_MOV_R(tmp[0], rt), ctx);
>> + rt = tmp[0];
>> + }
>
> This change is adding inefficiency, don't we want to have the JIT
> creating as efficient code as possible within the bounds of
> reasonableness?
>
>> emit(ARM_LSL_I(rt, rt, 32 - off), ctx);
>> emit(ARM_ASR_I(rt, rt, 32 - off), ctx);
>
> LSL and ASR can very easily take a different source register to the
> destination register. All this needs to be is:
>
> emit(ARM_LSL_I(tmp[0], rt, 32 - off), ctx);
> emit(ARM_ASR_I(tmp[0], tmp[0], 32 - off), ctx);
> rt = tmp[0];
>
> This will generate:
>
> lsl tmp[0], src, #32-off
> asr tmp[0], tmp[0], #32-off
>
> and then the store to the output register will occur.
>
> What about the high-32 bits of the register pair - should that be
> taking any value?
>
>> }
>
> I notice in passing that the comments are out of sync with the
> code - please update the comments along with code changes.
>
>> @@ -919,15 +926,15 @@ static inline void emit_a32_movsx_r64(const bool is64, const u8 off, const s8 ds
>> const s8 *tmp = bpf2a32[TMP_REG_1];
>> const s8 *rt;
>>
>> - rt = arm_bpf_get_reg64(dst, tmp, ctx);
>> -
>> emit_a32_mov_r(dst_lo, src_lo, off, ctx);
>> if (!is64) {
>> if (!ctx->prog->aux->verifier_zext)
>> /* Zero out high 4 bytes */
>> emit_a32_mov_i(dst_hi, 0, ctx);
>> } else {
>> + rt = arm_bpf_get_reg64(dst, tmp, ctx);
>> emit(ARM_ASR_I(rt[0], rt[1], 31), ctx);
>> + arm_bpf_put_reg64(dst, rt, ctx);
>> }
>> }
>
> Why oh why oh why are we emitting code to read the source register
> (which may be a load), then write it to the destination (which may
> be a store) to only then immediately reload from the destination
> to then do the sign extension? This is madness.
>
> Please... apply some thought to the code generation from the JIT...
> or I will remove you from being a maintainer of this code. I spent
> time crafting some parts of the JIT to generate efficient code and
> I'm seeing that a lot of that work is now being thrown away by
> someone who seemingly doesn't care about generating "good" code.
>

Sorry for this, I also like to make sure the JITs are as efficient as
possible. I was too focused on fixing this as fast as possible and
didn't pay attention that day.

I have reimplemented the whole thing again to make sure all bugs are
fixed. The commit message has the generated assembly for all cases:

https://lore.kernel.org/all/[email protected]/

Thanks,
Puranjay