With BPF_PROBE_MEM, BPF allows dereferencing an untrusted pointer. To
thwart invalid memory accesses, the JITs add an exception table entry
for all such accesses. But if src_reg + offset overflows and turns
into a userspace address, the BPF program might read that memory if
the user has mapped it.
There are architectural features that prevent the kernel from accessing
userspace memory, like Privileged Access Never (PAN) on ARM64,
Supervisor Mode Access Prevention (SMAP) on x86-64, Supervisor User
Memory access (SUM) on RISC-V, etc. But BPF should not rely on the
existence of these features.
Make the verifier add guard instructions around such memory accesses and
skip the load if the address falls into the userspace region.
The JITs need to implement bpf_arch_uaddress_limit() to define where
the userspace addresses end for that architecture; otherwise TASK_SIZE
is used as the default.
The implementation is as follows:

  REG_AX = SRC_REG
  if (offset)
        REG_AX += offset;
  REG_AX >>= 32;
  if (REG_AX <= (uaddress_limit >> 32))
        DST_REG = 0;
  else
        DST_REG = *(size *)(SRC_REG + offset);
Comparing just the upper 32 bits of the load address with the upper
32 bits of uaddress_limit effectively aligns both values down to a
4GB boundary before the comparison.
This means that all loads with address <= uaddress_limit + 4GB are
skipped. This is acceptable because there is a large hole (much larger
than 4GB) between userspace and kernel space memory, so a correctly
functioning BPF program should not access this 4GB region above the
userspace limit.
Let's analyze what this patch does to the following fentry program
dereferencing an untrusted pointer:
SEC("fentry/tcp_v4_connect")
int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
{
*(volatile long *)sk;
return 0;
}
BPF Program before | BPF Program after
------------------ | -----------------
0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0)
-----------------------------------------------------------------------
1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1
----------------------------\ \ 2: (77) r11 >>= 32
2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto pc+2
3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0)
\ 5: (05) goto pc+1
\ 6: (b7) r1 = 0
\--------------------------------------
7: (b7) r0 = 0
8: (95) exit
As you can see from above, in the best case (off=0), 5 extra instructions
are emitted.
Now, we analyse the same program after it has gone through the JITs of
X86-64, ARM64, and RISC-V architectures. We follow the single load
instruction that has the untrusted pointer and see what instrumentation
has been added around it.
x86-64 JIT
==========
JIT's Instrumentation Verifier's Instrumentation
(upstream) (This patch)
--------------------- --------------------------
0: nopl 0x0(%rax,%rax,1) 0: nopl 0x0(%rax,%rax,1)
5: xchg %ax,%ax 5: xchg %ax,%ax
7: push %rbp 7: push %rbp
8: mov %rsp,%rbp 8: mov %rsp,%rbp
b: mov 0x0(%rdi),%rdi b: mov 0x0(%rdi),%rdi
------------------------------------------------------------------------
f: movabs $0x800000000000,%r11 f: mov %rdi,%r10
19: cmp %r11,%rdi 12: shr $0x20,%r10
1c: jb 0x000000000000002a 16: cmp $0x8000,%r10
1e: mov %rdi,%r11 1d: jbe 0x0000000000000025
21: add $0x0,%r11 /--> 1f: mov 0x0(%rdi),%rdi
28: jae 0x000000000000002e / 23: jmp 0x0000000000000027
2a: xor %edi,%edi / 25: xor %edi,%edi
2c: jmp 0x0000000000000032 / /------------------------------------
2e: mov 0x0(%rdi),%rdi ---/ / 27: xor %eax,%eax
---------------------------------/ 29: leave
32: xor %eax,%eax 2a: ret
34: leave
35: ret
The x86-64 JIT already emits some instructions to protect against user
memory access. The implementation in this patch leads to a smaller
number of instructions being emitted. In the worst case the JIT will
emit 9 extra instructions and this patch decreases it to 7.
ARM64 JIT
=========
No Instrumentation Verifier's Instrumentation
(upstream) (This patch)
----------------- --------------------------
0: add x9, x30, #0x0 0: add x9, x30, #0x0
4: nop 4: nop
8: paciasp 8: paciasp
c: stp x29, x30, [sp, #-16]! c: stp x29, x30, [sp, #-16]!
10: mov x29, sp 10: mov x29, sp
14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, [sp, #-16]!
18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, [sp, #-16]!
1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]!
20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, [sp, #-16]!
24: mov x25, sp 24: mov x25, sp
28: mov x26, #0x0 28: mov x26, #0x0
2c: sub x27, x25, #0x0 2c: sub x27, x25, #0x0
30: sub sp, sp, #0x0 30: sub sp, sp, #0x0
34: ldr x0, [x0] 34: ldr x0, [x0]
--------------------------------------------------------------------------------
38: ldr x0, [x0] ----------\ 38: add x9, x0, #0x0
-----------------------------------\\ 3c: lsr x9, x9, #32
3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, lsl #12
40: mov sp, sp \\ 44: b.ls 0x0000000000000050
44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0]
48: ldp x25, x26, [sp], #16 \ 4c: b 0x0000000000000054
4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0
50: ldp x19, x20, [sp], #16 \---------------------------------------
54: ldp x29, x30, [sp], #16 54: mov x7, #0x0
58: add x0, x7, #0x0 58: mov sp, sp
5c: autiasp 5c: ldp x27, x28, [sp], #16
60: ret 60: ldp x25, x26, [sp], #16
64: nop 64: ldp x21, x22, [sp], #16
68: ldr x10, 0x0000000000000070 68: ldp x19, x20, [sp], #16
6c: br x10 6c: ldp x29, x30, [sp], #16
70: add x0, x7, #0x0
74: autiasp
78: ret
7c: nop
80: ldr x10, 0x0000000000000088
84: br x10
There are 6 extra instructions added in ARM64 in the best case. This will
become 7 in the worst case (off != 0).
RISC-V JIT (RISCV_ISA_C Disabled)
==========
No Instrumentation Verifier's Instrumentation
(upstream) (This patch)
----------------- --------------------------
0: nop 0: nop
4: nop 4: nop
8: li a6, 33 8: li a6, 33
c: addi sp, sp, -16 c: addi sp, sp, -16
10: sd s0, 8(sp) 10: sd s0, 8(sp)
14: addi s0, sp, 16 14: addi s0, sp, 16
18: ld a0, 0(a0) 18: ld a0, 0(a0)
---------------------------------------------------------------
1c: ld a0, 0(a0) --\ 1c: mv t0, a0
--------------------------\ \ 20: srli t0, t0, 32
20: li a5, 0 \ \ 24: lui t1, 4096
24: ld s0, 8(sp) \ \ 28: sext.w t1, t1
28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12
2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0)
30: ret \ 34: j 8
\ 38: li a0, 0
\------------------------------
3c: li a5, 0
40: ld s0, 8(sp)
44: addi sp, sp, 16
48: sext.w a0, a5
4c: ret
There are 7 extra instructions added in RISC-V.
Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables")
Reported-by: Breno Leitao <[email protected]>
Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Puranjay Mohan <[email protected]>
---
arch/x86/net/bpf_jit_comp.c | 72 +++++--------------------------------
include/linux/filter.h | 1 +
kernel/bpf/core.c | 9 +++++
kernel/bpf/verifier.c | 30 ++++++++++++++++
4 files changed, 48 insertions(+), 64 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4900b1ee019f..9b3136187938 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1327,7 +1327,6 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
u8 b2 = 0, b3 = 0;
u8 *start_of_ldx;
s64 jmp_offset;
- s16 insn_off;
u8 jmp_cond;
u8 *func;
int nops;
@@ -1802,78 +1801,18 @@ st: if (is_imm8(insn->off))
case BPF_LDX | BPF_PROBE_MEMSX | BPF_B:
case BPF_LDX | BPF_PROBE_MEMSX | BPF_H:
case BPF_LDX | BPF_PROBE_MEMSX | BPF_W:
- insn_off = insn->off;
-
- if (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
- BPF_MODE(insn->code) == BPF_PROBE_MEMSX) {
- /* Conservatively check that src_reg + insn->off is a kernel address:
- * src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
- * src_reg is used as scratch for src_reg += insn->off and restored
- * after emit_ldx if necessary
- */
-
- u64 limit = TASK_SIZE_MAX + PAGE_SIZE;
- u8 *end_of_jmp;
-
- /* At end of these emitted checks, insn->off will have been added
- * to src_reg, so no need to do relative load with insn->off offset
- */
- insn_off = 0;
-
- /* movabsq r11, limit */
- EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
- EMIT((u32)limit, 4);
- EMIT(limit >> 32, 4);
-
- if (insn->off) {
- /* add src_reg, insn->off */
- maybe_emit_1mod(&prog, src_reg, true);
- EMIT2_off32(0x81, add_1reg(0xC0, src_reg), insn->off);
- }
-
- /* cmp src_reg, r11 */
- maybe_emit_mod(&prog, src_reg, AUX_REG, true);
- EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
-
- /* if unsigned '>=', goto load */
- EMIT2(X86_JAE, 0);
- end_of_jmp = prog;
-
- /* xor dst_reg, dst_reg */
- emit_mov_imm32(&prog, false, dst_reg, 0);
- /* jmp byte_after_ldx */
- EMIT2(0xEB, 0);
-
- /* populate jmp_offset for JAE above to jump to start_of_ldx */
- start_of_ldx = prog;
- end_of_jmp[-1] = start_of_ldx - end_of_jmp;
- }
+ start_of_ldx = prog;
if (BPF_MODE(insn->code) == BPF_PROBE_MEMSX ||
BPF_MODE(insn->code) == BPF_MEMSX)
- emit_ldsx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn_off);
+ emit_ldsx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
else
- emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn_off);
+ emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
if (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
BPF_MODE(insn->code) == BPF_PROBE_MEMSX) {
struct exception_table_entry *ex;
u8 *_insn = image + proglen + (start_of_ldx - temp);
s64 delta;
- /* populate jmp_offset for JMP above */
- start_of_ldx[-1] = prog - start_of_ldx;
-
- if (insn->off && src_reg != dst_reg) {
- /* sub src_reg, insn->off
- * Restore src_reg after "add src_reg, insn->off" in prev
- * if statement. But if src_reg == dst_reg, emit_ldx
- * above already clobbered src_reg, so no need to restore.
- * If add src_reg, insn->off was unnecessary, no need to
- * restore either.
- */
- maybe_emit_1mod(&prog, src_reg, true);
- EMIT2_off32(0x81, add_1reg(0xE8, src_reg), insn->off);
- }
-
if (!bpf_prog->aux->extable)
break;
@@ -3473,3 +3412,8 @@ bool bpf_jit_supports_ptr_xchg(void)
{
return true;
}
+
+u64 bpf_arch_uaddress_limit(void)
+{
+ return TASK_SIZE_MAX + PAGE_SIZE;
+}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index c0d51bff8f96..cf12bfa2a78c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -965,6 +965,7 @@ bool bpf_jit_supports_far_kfunc_call(void);
bool bpf_jit_supports_exceptions(void);
bool bpf_jit_supports_ptr_xchg(void);
bool bpf_jit_supports_arena(void);
+u64 bpf_arch_uaddress_limit(void);
void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
bool bpf_helper_changes_pkt_data(void *func);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5aacb1d3c4cc..bb8024eacc31 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2958,6 +2958,15 @@ bool __weak bpf_jit_supports_arena(void)
return false;
}
+u64 __weak bpf_arch_uaddress_limit(void)
+{
+#ifdef CONFIG_64BIT
+ return TASK_SIZE;
+#else
+ return -ENOTSUPP;
+#endif
+}
+
/* Return TRUE if the JIT backend satisfies the following two conditions:
* 1) JIT backend supports atomic_xchg() on pointer-sized words.
* 2) Under the specific arch, the implementation of xchg() is the same
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index de7813947981..6eb0d7643af1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19657,6 +19657,36 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
goto next_insn;
}
+ /* Make it impossible to de-reference a userspace address */
+ if (BPF_CLASS(insn->code) == BPF_LDX &&
+ (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
+ BPF_MODE(insn->code) == BPF_PROBE_MEMSX)) {
+ struct bpf_insn *patch = &insn_buf[0];
+ u64 uaddress_limit = bpf_arch_uaddress_limit();
+
+ if (uaddress_limit < 0)
+ goto next_insn;
+
+ *patch++ = BPF_MOV64_REG(BPF_REG_AX, insn->src_reg);
+ if (insn->off)
+ *patch++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_AX, insn->off);
+ *patch++ = BPF_ALU64_IMM(BPF_RSH, BPF_REG_AX, 32);
+ *patch++ = BPF_JMP_IMM(BPF_JLE, BPF_REG_AX, uaddress_limit >> 32, 2);
+ *patch++ = *insn;
+ *patch++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
+ *patch++ = BPF_MOV64_IMM(insn->dst_reg, 0);
+
+ cnt = patch - insn_buf;
+ new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+ if (!new_prog)
+ return -ENOMEM;
+
+ delta += cnt - 1;
+ env->prog = prog = new_prog;
+ insn = new_prog->insnsi + i + delta;
+ goto next_insn;
+ }
+
/* Implement LD_ABS and LD_IND with a rewrite, if supported by the program type. */
if (BPF_CLASS(insn->code) == BPF_LD &&
(BPF_MODE(insn->code) == BPF_ABS ||
--
2.42.0
On Wed, Mar 20, 2024 at 3:55 AM Puranjay Mohan <[email protected]> wrote:
>
> The JITs need to implement bpf_arch_uaddress_limit() to define where
> the userspace addresses end for that architecture or TASK_SIZE is taken
> as default.
>
> The implementation is as follows:
>
> REG_AX = SRC_REG
> if(offset)
> REG_AX += offset;
> REG_AX >>= 32;
> if (REG_AX <= (uaddress_limit >> 32))
> DST_REG = 0;
> else
> DST_REG = *(size *)(SRC_REG + offset);
The patch looks good, but it seems to be causing s390 CI failures.
Ilya,
could you help us understand whether this check is needed on s390,
and if so, what the uaddress_limit should be?
On Wed, Mar 20, 2024 at 11:08:00PM -0700, Alexei Starovoitov wrote:
> On Wed, Mar 20, 2024 at 3:55 AM Puranjay Mohan <[email protected]> wrote:
> >
> > The JITs need to implement bpf_arch_uaddress_limit() to define where
> > the userspace addresses end for that architecture or TASK_SIZE is taken
> > as default.
> >
> > The implementation is as follows:
> >
> > REG_AX = SRC_REG
> > if(offset)
> > REG_AX += offset;
> > REG_AX >>= 32;
> > if (REG_AX <= (uaddress_limit >> 32))
> > DST_REG = 0;
> > else
> > DST_REG = *(size *)(SRC_REG + offset);
>
> The patch looks good, but it seems to be causing s390 CI failures.
>
> Ilya,
> could you help us understand is this check needed on s390
> and if so, what should be the uaddress_limit ?
s390x does not define ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
Userspace and kernel run in completely different and isolated address
spaces, so it's not possible to determine from a pointer value whether
it's a user or a kernel pointer.
But the good news is that whatever you dereference without using
special instruction sequences will refer to the kernel address space.
So I wonder if we could somehow disable this check on s390x altogether?
And if we are not sure whether it's a valid pointer, use BPF_PROBE_MEM
as always.
Ilya Leoshkevich <[email protected]> writes:
> On Wed, Mar 20, 2024 at 11:08:00PM -0700, Alexei Starovoitov wrote:
>> On Wed, Mar 20, 2024 at 3:55 AM Puranjay Mohan <[email protected]> wrote:
>> >
>> > The JITs need to implement bpf_arch_uaddress_limit() to define where
>> > the userspace addresses end for that architecture or TASK_SIZE is taken
>> > as default.
>> >
>> > The implementation is as follows:
>> >
>> > REG_AX = SRC_REG
>> > if(offset)
>> > REG_AX += offset;
>> > REG_AX >>= 32;
>> > if (REG_AX <= (uaddress_limit >> 32))
>> > DST_REG = 0;
>> > else
>> > DST_REG = *(size *)(SRC_REG + offset);
>>
>> The patch looks good, but it seems to be causing s390 CI failures.
>>
>> Ilya,
>> could you help us understand is this check needed on s390
>> and if so, what should be the uaddress_limit ?
>
> s390x does not define ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
> Userspace and kernel run in completely different and isolated address
> spaces, so it's not possible to determine from a pointer value whether
> it's a user or a kernel pointer.
> But the good news is that whatever you deference without using
> special instruction sequences will refer to the kernel address space.
> So I wonder if we could somehow disable this check on s390x altogether?
> And if we are not sure whether it's a valid pointer, use BPF_PROBE_MEM
> as always.
Thanks for the details. I understand that s390x doesn't need this extra
check because all normal accesses are in the kernel address space and they
will be marked with BPF_PROBE_MEM by the verifier if the pointer is
untrusted.
I have sent v2 of this patch with this check disabled on s390x
https://lore.kernel.org/bpf/[email protected]/
Thanks,
Puranjay
From: Alexei Starovoitov
> Sent: 21 March 2024 06:08
>
> On Wed, Mar 20, 2024 at 3:55 AM Puranjay Mohan <[email protected]> wrote:
> >
> > The JITs need to implement bpf_arch_uaddress_limit() to define where
> > the userspace addresses end for that architecture or TASK_SIZE is taken
> > as default.
> >
> > The implementation is as follows:
> >
> > REG_AX = SRC_REG
> > if(offset)
> > REG_AX += offset;
> > REG_AX >>= 32;
> > if (REG_AX <= (uaddress_limit >> 32))
> > DST_REG = 0;
> > else
> > DST_REG = *(size *)(SRC_REG + offset);
>
> The patch looks good, but it seems to be causing s390 CI failures.
I'm confused by the need for this check (and, IIRC, some other bpf
code that does kernel copies that can fault - and return an error).
I thought that the entire point of bpf was that it sanitised and
verified everything to limit what the 'program' could do in order
to stop it overwriting (or even reading) kernel structures that
it wasn't supposed to access.
So it just shouldn't have an address that might be (in any way)
invalid.
The only possible address verify is access_ok() to ensure that
a user address really is a user address.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Sun, Mar 24, 2024 at 1:05 PM David Laight <[email protected]> wrote:
>
> From: Alexei Starovoitov
> > Sent: 21 March 2024 06:08
> >
> > On Wed, Mar 20, 2024 at 3:55 AM Puranjay Mohan <[email protected]> wrote:
> > >
> > > The JITs need to implement bpf_arch_uaddress_limit() to define where
> > > the userspace addresses end for that architecture or TASK_SIZE is taken
> > > as default.
> > >
> > > The implementation is as follows:
> > >
> > > REG_AX = SRC_REG
> > > if(offset)
> > > REG_AX += offset;
> > > REG_AX >>= 32;
> > > if (REG_AX <= (uaddress_limit >> 32))
> > > DST_REG = 0;
> > > else
> > > DST_REG = *(size *)(SRC_REG + offset);
> >
> > The patch looks good, but it seems to be causing s390 CI failures.
>
> I'm confused by the need for this check (and, IIRC, some other bpf
> code that does kernel copies that can fault - and return an error).
>
> I though that the entire point of bpf was that is sanitised and
> verified everything to limit what the 'program' could do in order
> to stop it overwriting (or even reading) kernel structures that
> is wasn't supposed to access.
>
> So it just shouldn't have a address that might be (in any way)
> invalid.
bpf tracing progs can call bpf_probe_read_kernel() which
can read any kernel memory.
This is nothing but an inlined version of it.
> The only possible address verify is access_ok() to ensure that
> a uses address really is a user address.
access_ok() considerations don't apply.
We're not dealing with user memory access.
From: Alexei Starovoitov
> Sent: 24 March 2024 20:43
>
> On Sun, Mar 24, 2024 at 1:05 PM David Laight <[email protected]> wrote:
> >
> > From: Alexei Starovoitov
> > > Sent: 21 March 2024 06:08
> > >
> > > On Wed, Mar 20, 2024 at 3:55 AM Puranjay Mohan <[email protected]> wrote:
> > > >
> > > > The JITs need to implement bpf_arch_uaddress_limit() to define where
> > > > the userspace addresses end for that architecture or TASK_SIZE is taken
> > > > as default.
> > > >
> > > > The implementation is as follows:
> > > >
> > > > REG_AX = SRC_REG
> > > > if(offset)
> > > > REG_AX += offset;
> > > > REG_AX >>= 32;
> > > > if (REG_AX <= (uaddress_limit >> 32))
> > > > DST_REG = 0;
> > > > else
> > > > DST_REG = *(size *)(SRC_REG + offset);
> > >
> > > The patch looks good, but it seems to be causing s390 CI failures.
> >
> > I'm confused by the need for this check (and, IIRC, some other bpf
> > code that does kernel copies that can fault - and return an error).
> >
> > I though that the entire point of bpf was that is sanitised and
> > verified everything to limit what the 'program' could do in order
> > to stop it overwriting (or even reading) kernel structures that
> > is wasn't supposed to access.
> >
> > So it just shouldn't have a address that might be (in any way)
> > invalid.
>
> bpf tracing progs can call bpf_probe_read_kernel() which
> can read any kernel memory.
> This is nothing but an inlined version of it.
It was the getsockopt() code where I saw the copy_nocheck() calls.
Those have to be broken.
Although the way some of the options use the ptr:len supplied by
the application, you stand no chance of doing an in-kernel call
without a proper buffer descriptor argument (with separate optlen
and bufferlen fields).
>
> > The only possible address verify is access_ok() to ensure that
> > a uses address really is a user address.
>
> access_ok() considerations don't apply.
> We're not dealing with user memory access.
If you do need a check for 'not a user address', don't you want to
just require that access_ok() fails?
That would be architecture independent.
David
On Sun, Mar 24, 2024 at 3:30 PM David Laight <[email protected]> wrote:
>
> From: Alexei Starovoitov
> > Sent: 24 March 2024 20:43
> >
> > On Sun, Mar 24, 2024 at 1:05 PM David Laight <[email protected]> wrote:
> > >
> > > From: Alexei Starovoitov
> > > > Sent: 21 March 2024 06:08
> > > >
> > > > On Wed, Mar 20, 2024 at 3:55 AM Puranjay Mohan <[email protected]> wrote:
> > > > >
> > > > > The JITs need to implement bpf_arch_uaddress_limit() to define where
> > > > > the userspace addresses end for that architecture or TASK_SIZE is taken
> > > > > as default.
> > > > >
> > > > > The implementation is as follows:
> > > > >
> > > > > REG_AX = SRC_REG
> > > > > if(offset)
> > > > > REG_AX += offset;
> > > > > REG_AX >>= 32;
> > > > > if (REG_AX <= (uaddress_limit >> 32))
> > > > > DST_REG = 0;
> > > > > else
> > > > > DST_REG = *(size *)(SRC_REG + offset);
> > > >
> > > > The patch looks good, but it seems to be causing s390 CI failures.
> > >
> > > I'm confused by the need for this check (and, IIRC, some other bpf
> > > code that does kernel copies that can fault - and return an error).
> > >
> > > I though that the entire point of bpf was that is sanitised and
> > > verified everything to limit what the 'program' could do in order
> > > to stop it overwriting (or even reading) kernel structures that
> > > is wasn't supposed to access.
> > >
> > > So it just shouldn't have a address that might be (in any way)
> > > invalid.
> >
> > bpf tracing progs can call bpf_probe_read_kernel() which
> > can read any kernel memory.
> > This is nothing but an inlined version of it.
>
> It was the getsockopt() code were I saw the copy_nocheck() calls.
> Those have to be broken.
No. If you mean csum_partial_copy_nocheck() then they're fine.
> Although the way some of the options use the ptr:len supplied by
> the application you stand no chance of do an in-kernel call
> without a proper buffer descriptor argument (with separate optlen
> and bufferlen fields.)
>
> >
> > > The only possible address verify is access_ok() to ensure that
> > > a uses address really is a user address.
> >
> > access_ok() considerations don't apply.
> > We're not dealing with user memory access.
>
> If you do need a check for 'not a user address' don't you want to just
> require access_ok() fail?
> That would be architecture independent.
No. access_ok() can only be used on the user addr.
access_ok() == false on the kernel addr doesn't mean that
it's a kernel addr.