hi,
as part of the effort on speeding up the uprobes [0] coming with
return uprobe optimization by using syscall instead of the trap
on the uretprobe trampoline.
The speed up depends on instruction type that uprobe is installed
and depends on specific HW type, please check patch 1 for details.
I added extra selftests to check on registers values before and
after uretprobe to make sure the syscall saves all the needed regs
and that uprobe consumer can still change registers.
Patch 1 is based on:
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
probes/for-next
and patches 2,3 are based on bpf-next/master and can be taken
separately through bpf-next tree without patch 1.
v2 changes:
- reposted to proper tracing mailing list [Masami]
- renamed selftest trigger functions [Andrii]
- kept bpf_testmod register call pattern [Andrii]
- added acks [Andrii/Oleg]
thanks,
jirka
[0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
---
Jiri Olsa (3):
uprobe: Add uretprobe syscall to speed up return probe
selftests/bpf: Add uretprobe syscall test for regs integrity
selftests/bpf: Add uretprobe syscall test for regs changes
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/kernel/uprobes.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 2 ++
include/linux/uprobes.h | 2 ++
include/uapi/asm-generic/unistd.h | 5 +++-
kernel/events/uprobes.c | 18 +++++++++---
kernel/sys_ni.c | 2 ++
tools/include/linux/compiler.h | 4 +++
tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 230 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tools/testing/selftests/bpf/progs/uprobe_syscall.c | 15 ++++++++++
11 files changed, 478 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c
Adding uretprobe syscall instead of trap to speed up return probe.
At the moment the uretprobe setup/path is:
- install entry uprobe
- when the uprobe is hit, it overwrites probed function's return address
on stack with address of the trampoline that contains breakpoint
instruction
- the breakpoint trap code handles the uretprobe consumers execution and
jumps back to original return address
This patch replaces the above trampoline's breakpoint instruction with new
ureprobe syscall call. This syscall does exactly the same job as the trap
with some more extra work:
- syscall trampoline must save original value for rax/r11/rcx registers
on stack - rax is set to syscall number and r11/rcx are changed and
used by syscall instruction
- the syscall code reads the original values of those registers and
restore those values in task's pt_regs area
Even with the extra registers handling code the having uretprobes handled
by syscalls shows speed improvement.
On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
current:
base : 15.888 ± 0.033M/s
uprobe-nop : 3.016 ± 0.000M/s
uprobe-push : 2.832 ± 0.005M/s
uprobe-ret : 1.104 ± 0.000M/s
uretprobe-nop : 1.487 ± 0.000M/s
uretprobe-push : 1.456 ± 0.000M/s
uretprobe-ret : 0.816 ± 0.001M/s
with the fix:
base : 15.116 ± 0.045M/s
uprobe-nop : 3.001 ± 0.045M/s
uprobe-push : 2.831 ± 0.004M/s
uprobe-ret : 1.102 ± 0.001M/s
uretprobe-nop : 1.969 ± 0.001M/s < 32% speedup
uretprobe-push : 1.905 ± 0.004M/s < 30% speedup
uretprobe-ret : 0.933 ± 0.002M/s < 14% speedup
On Amd (AMD Ryzen 7 5700U)
current:
base : 5.105 ± 0.003M/s
uprobe-nop : 1.552 ± 0.002M/s
uprobe-push : 1.408 ± 0.003M/s
uprobe-ret : 0.827 ± 0.001M/s
uretprobe-nop : 0.779 ± 0.001M/s
uretprobe-push : 0.750 ± 0.001M/s
uretprobe-ret : 0.539 ± 0.001M/s
with the fix:
base : 5.119 ± 0.002M/s
uprobe-nop : 1.523 ± 0.003M/s
uprobe-push : 1.384 ± 0.003M/s
uprobe-ret : 0.826 ± 0.002M/s
uretprobe-nop : 0.866 ± 0.002M/s < 11% speedup
uretprobe-push : 0.826 ± 0.002M/s < 10% speedup
uretprobe-ret : 0.581 ± 0.001M/s < 7% speedup
Reviewed-by: Oleg Nesterov <[email protected]>
Suggested-by: Andrii Nakryiko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
---
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/kernel/uprobes.c | 83 ++++++++++++++++++++++++++
include/linux/syscalls.h | 2 +
include/linux/uprobes.h | 2 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/events/uprobes.c | 18 ++++--
kernel/sys_ni.c | 2 +
7 files changed, 108 insertions(+), 5 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..af0a33ab06ee 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
459 common lsm_get_self_attr sys_lsm_get_self_attr
460 common lsm_set_self_attr sys_lsm_set_self_attr
461 common lsm_list_modules sys_lsm_list_modules
+462 64 uretprobe sys_uretprobe
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..6fc5d16f6c17 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,6 +12,7 @@
#include <linux/ptrace.h>
#include <linux/uprobes.h>
#include <linux/uaccess.h>
+#include <linux/syscalls.h>
#include <linux/kdebug.h>
#include <asm/processor.h>
@@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
}
#ifdef CONFIG_X86_64
+
+asm (
+ ".pushsection .rodata\n"
+ ".global uretprobe_syscall_entry\n"
+ "uretprobe_syscall_entry:\n"
+ "pushq %rax\n"
+ "pushq %rcx\n"
+ "pushq %r11\n"
+ "movq $" __stringify(__NR_uretprobe) ", %rax\n"
+ "syscall\n"
+ "popq %r11\n"
+ "popq %rcx\n"
+
+ /* The uretprobe syscall replaces stored %rax value with final
+ * return address, so we don't restore %rax in here and just
+ * call ret.
+ */
+ "retq\n"
+ ".global uretprobe_syscall_end\n"
+ "uretprobe_syscall_end:\n"
+ ".popsection\n"
+);
+
+extern u8 uretprobe_syscall_entry[];
+extern u8 uretprobe_syscall_end[];
+
+void *arch_uprobe_trampoline(unsigned long *psize)
+{
+ *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+ return uretprobe_syscall_entry;
+}
+
+SYSCALL_DEFINE0(uretprobe)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ unsigned long err, ip, sp, r11_cx_ax[3];
+
+ err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
+ WARN_ON_ONCE(err);
+
+ /* expose the "right" values of r11/cx/ax/sp to uprobe_consumer/s */
+ regs->r11 = r11_cx_ax[0];
+ regs->cx = r11_cx_ax[1];
+ regs->ax = r11_cx_ax[2];
+ regs->sp += sizeof(r11_cx_ax);
+ regs->orig_ax = -1;
+
+ ip = regs->ip;
+ sp = regs->sp;
+
+ uprobe_handle_trampoline(regs);
+
+ /*
+ * uprobe_consumer has changed sp, we can do nothing,
+ * just return via iret
+ */
+ if (regs->sp != sp)
+ return regs->ax;
+ regs->sp -= sizeof(r11_cx_ax);
+
+ /* for the case uprobe_consumer has changed r11/cx */
+ r11_cx_ax[0] = regs->r11;
+ r11_cx_ax[1] = regs->cx;
+
+ /*
+ * ax register is passed through as return value, so we can use
+ * its space on stack for ip value and jump to it through the
+ * trampoline's ret instruction
+ */
+ r11_cx_ax[2] = regs->ip;
+ regs->ip = ip;
+
+ err = copy_to_user((void __user *)regs->sp, r11_cx_ax, sizeof(r11_cx_ax));
+ WARN_ON_ONCE(err);
+
+ /* ensure sysret, see do_syscall_64() */
+ regs->r11 = regs->flags;
+ regs->cx = regs->ip;
+
+ return regs->ax;
+}
+
/*
* If arch_uprobe->insn doesn't use rip-relative addressing, return
* immediately. Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..db150794f89d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, size_t *size, u32 flags);
/* x86 */
asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
+asmlinkage long sys_uretprobe(void);
+
/* pciconfig: alpha, arm, arm64, ia64, sparc */
asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..a490146ad89d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+extern void *arch_uprobe_trampoline(unsigned long *psize);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..8a747cd1d735 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
#define __NR_lsm_list_modules 461
__SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
+#define __NR_uretprobe 462
+__SYSCALL(__NR_uretprobe, sys_uretprobe)
+
#undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
/*
* 32 bit systems traditionally used different
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..90395b16bde0 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,11 +1474,20 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
return ret;
}
+void * __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+ static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+ *psize = UPROBE_SWBP_INSN_SIZE;
+ return &insn;
+}
+
static struct xol_area *__create_xol_area(unsigned long vaddr)
{
struct mm_struct *mm = current->mm;
- uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+ unsigned long insns_size;
struct xol_area *area;
+ void *insns;
area = kmalloc(sizeof(*area), GFP_KERNEL);
if (unlikely(!area))
@@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
/* Reserve the 1st slot for get_trampoline_vaddr() */
set_bit(0, area->bitmap);
atomic_set(&area->slot_count, 1);
- arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+ insns = arch_uprobe_trampoline(&insns_size);
+ arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
if (!xol_add_vma(mm, area))
return area;
@@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
return ri;
}
-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
{
struct uprobe_task *utask;
struct return_instance *ri, *next;
@@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
bp_vaddr = uprobe_get_swbp_addr(regs);
if (bp_vaddr == get_trampoline_vaddr())
- return handle_trampoline(regs);
+ return uprobe_handle_trampoline(regs);
uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
if (!uprobe) {
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..be6195e0d078 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+
+COND_SYSCALL(uretprobe);
--
2.44.0
Add uretprobe test that compares register values before and
after the uretprobe is hit. It also compares the register
values seen from attached bpf program.
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
---
tools/include/linux/compiler.h | 4 +
.../selftests/bpf/prog_tests/uprobe_syscall.c | 163 ++++++++++++++++++
.../selftests/bpf/progs/uprobe_syscall.c | 15 ++
3 files changed, 182 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c
diff --git a/tools/include/linux/compiler.h b/tools/include/linux/compiler.h
index 8a63a9913495..6f7f22ac9da5 100644
--- a/tools/include/linux/compiler.h
+++ b/tools/include/linux/compiler.h
@@ -62,6 +62,10 @@
#define __nocf_check __attribute__((nocf_check))
#endif
+#ifndef __naked
+#define __naked __attribute__((__naked__))
+#endif
+
/* Are two types/vars the same type (ignoring qualifiers)? */
#ifndef __same_type
# define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
new file mode 100644
index 000000000000..311ac19d8992
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+
+#ifdef __x86_64__
+
+#include <unistd.h>
+#include <asm/ptrace.h>
+#include <linux/compiler.h>
+#include "uprobe_syscall.skel.h"
+
+__naked unsigned long uretprobe_regs_trigger(void)
+{
+ asm volatile (
+ "movq $0xdeadbeef, %rax\n"
+ "ret\n"
+ );
+}
+
+__naked void uretprobe_regs(struct pt_regs *before, struct pt_regs *after)
+{
+ asm volatile (
+ "movq %r15, 0(%rdi)\n"
+ "movq %r14, 8(%rdi)\n"
+ "movq %r13, 16(%rdi)\n"
+ "movq %r12, 24(%rdi)\n"
+ "movq %rbp, 32(%rdi)\n"
+ "movq %rbx, 40(%rdi)\n"
+ "movq %r11, 48(%rdi)\n"
+ "movq %r10, 56(%rdi)\n"
+ "movq %r9, 64(%rdi)\n"
+ "movq %r8, 72(%rdi)\n"
+ "movq %rax, 80(%rdi)\n"
+ "movq %rcx, 88(%rdi)\n"
+ "movq %rdx, 96(%rdi)\n"
+ "movq %rsi, 104(%rdi)\n"
+ "movq %rdi, 112(%rdi)\n"
+ "movq $0, 120(%rdi)\n" /* orig_rax */
+ "movq $0, 128(%rdi)\n" /* rip */
+ "movq $0, 136(%rdi)\n" /* cs */
+ "pushf\n"
+ "pop %rax\n"
+ "movq %rax, 144(%rdi)\n" /* eflags */
+ "movq %rsp, 152(%rdi)\n" /* rsp */
+ "movq $0, 160(%rdi)\n" /* ss */
+
+ /* save 2nd argument */
+ "pushq %rsi\n"
+ "call uretprobe_regs_trigger\n"
+
+ /* save return value and load 2nd argument pointer to rax */
+ "pushq %rax\n"
+ "movq 8(%rsp), %rax\n"
+
+ "movq %r15, 0(%rax)\n"
+ "movq %r14, 8(%rax)\n"
+ "movq %r13, 16(%rax)\n"
+ "movq %r12, 24(%rax)\n"
+ "movq %rbp, 32(%rax)\n"
+ "movq %rbx, 40(%rax)\n"
+ "movq %r11, 48(%rax)\n"
+ "movq %r10, 56(%rax)\n"
+ "movq %r9, 64(%rax)\n"
+ "movq %r8, 72(%rax)\n"
+ "movq %rcx, 88(%rax)\n"
+ "movq %rdx, 96(%rax)\n"
+ "movq %rsi, 104(%rax)\n"
+ "movq %rdi, 112(%rax)\n"
+ "movq $0, 120(%rax)\n" /* orig_rax */
+ "movq $0, 128(%rax)\n" /* rip */
+ "movq $0, 136(%rax)\n" /* cs */
+
+ /* restore return value and 2nd argument */
+ "pop %rax\n"
+ "pop %rsi\n"
+
+ "movq %rax, 80(%rsi)\n"
+
+ "pushf\n"
+ "pop %rax\n"
+
+ "movq %rax, 144(%rsi)\n" /* eflags */
+ "movq %rsp, 152(%rsi)\n" /* rsp */
+ "movq $0, 160(%rsi)\n" /* ss */
+ "ret\n"
+);
+}
+
+static void test_uretprobe_regs_equal(void)
+{
+ struct uprobe_syscall *skel = NULL;
+ struct pt_regs before = {}, after = {};
+ unsigned long *pb = (unsigned long *) &before;
+ unsigned long *pa = (unsigned long *) &after;
+ unsigned long *pp;
+ unsigned int i, cnt;
+ int err;
+
+ skel = uprobe_syscall__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall__open_and_load"))
+ goto cleanup;
+
+ err = uprobe_syscall__attach(skel);
+ if (!ASSERT_OK(err, "uprobe_syscall__attach"))
+ goto cleanup;
+
+ uretprobe_regs(&before, &after);
+
+ pp = (unsigned long *) &skel->bss->regs;
+ cnt = sizeof(before)/sizeof(*pb);
+
+ for (i = 0; i < cnt; i++) {
+ unsigned int offset = i * sizeof(unsigned long);
+
+ /*
+ * Check register before and after uretprobe_regs_trigger call
+ * that triggers the uretprobe.
+ */
+ switch (offset) {
+ case offsetof(struct pt_regs, rax):
+ ASSERT_EQ(pa[i], 0xdeadbeef, "return value");
+ break;
+ default:
+ if (!ASSERT_EQ(pb[i], pa[i], "register before-after value check"))
+ fprintf(stdout, "failed register offset %u\n", offset);
+ }
+
+ /*
+ * Check register seen from bpf program and register after
+ * uretprobe_regs_trigger call
+ */
+ switch (offset) {
+ /*
+ * These values will be different (not set in uretprobe_regs),
+ * we don't care.
+ */
+ case offsetof(struct pt_regs, orig_rax):
+ case offsetof(struct pt_regs, rip):
+ case offsetof(struct pt_regs, cs):
+ case offsetof(struct pt_regs, rsp):
+ case offsetof(struct pt_regs, ss):
+ break;
+ default:
+ if (!ASSERT_EQ(pp[i], pa[i], "register prog-after value check"))
+ fprintf(stdout, "failed register offset %u\n", offset);
+ }
+ }
+
+cleanup:
+ uprobe_syscall__destroy(skel);
+}
+#else
+static void test_uretprobe_regs_equal(void)
+{
+ test__skip();
+}
+#endif
+
+void test_uprobe_syscall(void)
+{
+ if (test__start_subtest("uretprobe_regs_equal"))
+ test_uretprobe_regs_equal();
+}
diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall.c b/tools/testing/selftests/bpf/progs/uprobe_syscall.c
new file mode 100644
index 000000000000..8a4fa6c7ef59
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_syscall.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <string.h>
+
+struct pt_regs regs;
+
+char _license[] SEC("license") = "GPL";
+
+SEC("uretprobe//proc/self/exe:uretprobe_regs_trigger")
+int uretprobe(struct pt_regs *ctx)
+{
+ __builtin_memcpy(®s, ctx, sizeof(regs));
+ return 0;
+}
--
2.44.0
Adding test that creates uprobe consumer on uretprobe which changes some
of the registers. Making sure the changed registers are propagated to the
user space from the ureprobe trampoline on x86_64.
To be able to do this, adding support to bpf_testmod to create uprobe via
new attribute file:
/sys/kernel/bpf_testmod_uprobe
This file is expecting file offset and creates related uprobe on current
process exe file and removes existing uprobe if offset is 0. The can be
only single uprobe at any time.
The uprobe has specific consumer that changes registers used in ureprobe
trampoline and which are later checked in the test.
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 122 +++++++++++++++++-
.../selftests/bpf/prog_tests/uprobe_syscall.c | 67 ++++++++++
2 files changed, 188 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 39ad96a18123..704ebd7fb53d 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -10,6 +10,7 @@
#include <linux/percpu-defs.h>
#include <linux/sysfs.h>
#include <linux/tracepoint.h>
+#include <linux/namei.h>
#include "bpf_testmod.h"
#include "bpf_testmod_kfunc.h"
@@ -343,6 +344,118 @@ static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = {
.write = bpf_testmod_test_write,
};
+/* bpf_testmod_uprobe sysfs attribute is so far enabled for x86_64 only,
+ * please see test_uretprobe_regs_change test */
+#ifdef __x86_64__
+
+static int
+uprobe_ret_handler(struct uprobe_consumer *self, unsigned long func,
+ struct pt_regs *regs)
+
+{
+ regs->ax = 0x12345678deadbeef;
+ regs->cx = 0x87654321feebdaed;
+ regs->r11 = (u64) -1;
+ return true;
+}
+
+struct testmod_uprobe {
+ struct path path;
+ loff_t offset;
+ struct uprobe_consumer consumer;
+};
+
+static DEFINE_MUTEX(testmod_uprobe_mutex);
+
+static struct testmod_uprobe uprobe = {
+ .consumer.ret_handler = uprobe_ret_handler,
+};
+
+static int testmod_register_uprobe(loff_t offset)
+{
+ int err = -EBUSY;
+
+ if (uprobe.offset)
+ return -EBUSY;
+
+ mutex_lock(&testmod_uprobe_mutex);
+
+ if (uprobe.offset)
+ goto out;
+
+ err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path);
+ if (err)
+ goto out;
+
+ err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry),
+ offset, 0, &uprobe.consumer);
+ if (err)
+ path_put(&uprobe.path);
+ else
+ uprobe.offset = offset;
+
+out:
+ mutex_unlock(&testmod_uprobe_mutex);
+ return err;
+}
+
+static void testmod_unregister_uprobe(void)
+{
+ mutex_lock(&testmod_uprobe_mutex);
+
+ if (uprobe.offset) {
+ uprobe_unregister(d_real_inode(uprobe.path.dentry),
+ uprobe.offset, &uprobe.consumer);
+ uprobe.offset = 0;
+ }
+
+ mutex_unlock(&testmod_uprobe_mutex);
+}
+
+static ssize_t
+bpf_testmod_uprobe_write(struct file *file, struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t len)
+{
+ unsigned long offset;
+ int err;
+
+ if (kstrtoul(buf, 0, &offset))
+ return -EINVAL;
+
+ if (offset)
+ err = testmod_register_uprobe(offset);
+ else
+ testmod_unregister_uprobe();
+
+ return err ?: strlen(buf);
+}
+
+static struct bin_attribute bin_attr_bpf_testmod_uprobe_file __ro_after_init = {
+ .attr = { .name = "bpf_testmod_uprobe", .mode = 0666, },
+ .write = bpf_testmod_uprobe_write,
+};
+
+static int register_bpf_testmod_uprobe(void)
+{
+ return sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_uprobe_file);
+}
+
+static void unregister_bpf_testmod_uprobe(void)
+{
+ testmod_unregister_uprobe();
+ sysfs_remove_bin_file(kernel_kobj, &bin_attr_bpf_testmod_uprobe_file);
+}
+
+#else
+static int register_bpf_testmod_uprobe(void)
+{
+ return 0;
+}
+
+static void unregister_bpf_testmod_uprobe(void) { }
+#endif
+
BTF_KFUNCS_START(bpf_testmod_common_kfunc_ids)
BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW)
BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL)
@@ -650,7 +763,13 @@ static int bpf_testmod_init(void)
return ret;
if (bpf_fentry_test1(0) < 0)
return -EINVAL;
- return sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_file);
+ ret = sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_file);
+ if (ret < 0)
+ return ret;
+ ret = register_bpf_testmod_uprobe();
+ if (ret < 0)
+ return ret;
+ return 0;
}
static void bpf_testmod_exit(void)
@@ -664,6 +783,7 @@ static void bpf_testmod_exit(void)
msleep(20);
sysfs_remove_bin_file(kernel_kobj, &bin_attr_bpf_testmod_file);
+ unregister_bpf_testmod_uprobe();
}
module_init(bpf_testmod_init);
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 311ac19d8992..1a50cd35205d 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -149,15 +149,82 @@ static void test_uretprobe_regs_equal(void)
cleanup:
uprobe_syscall__destroy(skel);
}
+
+#define BPF_TESTMOD_UPROBE_TEST_FILE "/sys/kernel/bpf_testmod_uprobe"
+
+static int write_bpf_testmod_uprobe(unsigned long offset)
+{
+ size_t n, ret;
+ char buf[30];
+ int fd;
+
+ n = sprintf(buf, "%lu", offset);
+
+ fd = open(BPF_TESTMOD_UPROBE_TEST_FILE, O_WRONLY);
+ if (fd < 0)
+ return -errno;
+
+ ret = write(fd, buf, n);
+ close(fd);
+ return ret != n ? (int) ret : 0;
+}
+
+static void test_uretprobe_regs_change(void)
+{
+ struct pt_regs before = {}, after = {};
+ unsigned long *pb = (unsigned long *) &before;
+ unsigned long *pa = (unsigned long *) &after;
+ unsigned long cnt = sizeof(before)/sizeof(*pb);
+ unsigned int i, err, offset;
+
+ offset = get_uprobe_offset(uretprobe_regs_trigger);
+
+ err = write_bpf_testmod_uprobe(offset);
+ if (!ASSERT_OK(err, "register_uprobe"))
+ return;
+
+ uretprobe_regs(&before, &after);
+
+ err = write_bpf_testmod_uprobe(0);
+ if (!ASSERT_OK(err, "unregister_uprobe"))
+ return;
+
+ for (i = 0; i < cnt; i++) {
+ unsigned int offset = i * sizeof(unsigned long);
+
+ switch (offset) {
+ case offsetof(struct pt_regs, rax):
+ ASSERT_EQ(pa[i], 0x12345678deadbeef, "rax");
+ break;
+ case offsetof(struct pt_regs, rcx):
+ ASSERT_EQ(pa[i], 0x87654321feebdaed, "rcx");
+ break;
+ case offsetof(struct pt_regs, r11):
+ ASSERT_EQ(pa[i], (__u64) -1, "r11");
+ break;
+ default:
+ if (!ASSERT_EQ(pa[i], pb[i], "register before-after value check"))
+ fprintf(stdout, "failed register offset %u\n", offset);
+ }
+ }
+}
+
#else
static void test_uretprobe_regs_equal(void)
{
test__skip();
}
+
+static void test_uretprobe_regs_change(void)
+{
+ test__skip();
+}
#endif
void test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
test_uretprobe_regs_equal();
+ if (test__start_subtest("uretprobe_regs_change"))
+ test_uretprobe_regs_change();
}
--
2.44.0
Hi Jiri,
On Tue, 2 Apr 2024 11:33:00 +0200
Jiri Olsa <[email protected]> wrote:
> Adding uretprobe syscall instead of trap to speed up return probe.
This is interesting approach. But I doubt we need to add additional
syscall just for this purpose. Can't we use another syscall or ioctl?
Also, we should run syzkaller on this syscall. And if uretprobe is
set in the user function, what happen if the user function directly
calls this syscall? (maybe it consumes shadow stack?)
Thank you,
>
> At the moment the uretprobe setup/path is:
>
> - install entry uprobe
>
> - when the uprobe is hit, it overwrites probed function's return address
> on stack with address of the trampoline that contains breakpoint
> instruction
>
> - the breakpoint trap code handles the uretprobe consumers execution and
> jumps back to original return address
>
> This patch replaces the above trampoline's breakpoint instruction with new
> ureprobe syscall call. This syscall does exactly the same job as the trap
> with some more extra work:
>
> - syscall trampoline must save original value for rax/r11/rcx registers
> on stack - rax is set to syscall number and r11/rcx are changed and
> used by syscall instruction
>
> - the syscall code reads the original values of those registers and
> restore those values in task's pt_regs area
>
> Even with the extra registers handling code the having uretprobes handled
> by syscalls shows speed improvement.
>
> On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
>
> current:
>
> base : 15.888 ± 0.033M/s
> uprobe-nop : 3.016 ± 0.000M/s
> uprobe-push : 2.832 ± 0.005M/s
> uprobe-ret : 1.104 ± 0.000M/s
> uretprobe-nop : 1.487 ± 0.000M/s
> uretprobe-push : 1.456 ± 0.000M/s
> uretprobe-ret : 0.816 ± 0.001M/s
>
> with the fix:
>
> base : 15.116 ± 0.045M/s
> uprobe-nop : 3.001 ± 0.045M/s
> uprobe-push : 2.831 ± 0.004M/s
> uprobe-ret : 1.102 ± 0.001M/s
> uretprobe-nop : 1.969 ± 0.001M/s < 32% speedup
> uretprobe-push : 1.905 ± 0.004M/s < 30% speedup
> uretprobe-ret : 0.933 ± 0.002M/s < 14% speedup
>
> On Amd (AMD Ryzen 7 5700U)
>
> current:
>
> base : 5.105 ± 0.003M/s
> uprobe-nop : 1.552 ± 0.002M/s
> uprobe-push : 1.408 ± 0.003M/s
> uprobe-ret : 0.827 ± 0.001M/s
> uretprobe-nop : 0.779 ± 0.001M/s
> uretprobe-push : 0.750 ± 0.001M/s
> uretprobe-ret : 0.539 ± 0.001M/s
>
> with the fix:
>
> base : 5.119 ± 0.002M/s
> uprobe-nop : 1.523 ± 0.003M/s
> uprobe-push : 1.384 ± 0.003M/s
> uprobe-ret : 0.826 ± 0.002M/s
> uretprobe-nop : 0.866 ± 0.002M/s < 11% speedup
> uretprobe-push : 0.826 ± 0.002M/s < 10% speedup
> uretprobe-ret : 0.581 ± 0.001M/s < 7% speedup
>
> Reviewed-by: Oleg Nesterov <[email protected]>
> Suggested-by: Andrii Nakryiko <[email protected]>
> Acked-by: Andrii Nakryiko <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/kernel/uprobes.c | 83 ++++++++++++++++++++++++++
> include/linux/syscalls.h | 2 +
> include/linux/uprobes.h | 2 +
> include/uapi/asm-generic/unistd.h | 5 +-
> kernel/events/uprobes.c | 18 ++++--
> kernel/sys_ni.c | 2 +
> 7 files changed, 108 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 7e8d46f4147f..af0a33ab06ee 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -383,6 +383,7 @@
> 459 common lsm_get_self_attr sys_lsm_get_self_attr
> 460 common lsm_set_self_attr sys_lsm_set_self_attr
> 461 common lsm_list_modules sys_lsm_list_modules
> +462 64 uretprobe sys_uretprobe
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 6c07f6daaa22..6fc5d16f6c17 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -12,6 +12,7 @@
> #include <linux/ptrace.h>
> #include <linux/uprobes.h>
> #include <linux/uaccess.h>
> +#include <linux/syscalls.h>
>
> #include <linux/kdebug.h>
> #include <asm/processor.h>
> @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
> }
>
> #ifdef CONFIG_X86_64
> +
> +asm (
> + ".pushsection .rodata\n"
> + ".global uretprobe_syscall_entry\n"
> + "uretprobe_syscall_entry:\n"
> + "pushq %rax\n"
> + "pushq %rcx\n"
> + "pushq %r11\n"
> + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> + "syscall\n"
> + "popq %r11\n"
> + "popq %rcx\n"
> +
> + /* The uretprobe syscall replaces stored %rax value with final
> + * return address, so we don't restore %rax in here and just
> + * call ret.
> + */
> + "retq\n"
> + ".global uretprobe_syscall_end\n"
> + "uretprobe_syscall_end:\n"
> + ".popsection\n"
> +);
> +
> +extern u8 uretprobe_syscall_entry[];
> +extern u8 uretprobe_syscall_end[];
> +
> +void *arch_uprobe_trampoline(unsigned long *psize)
> +{
> + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> + return uretprobe_syscall_entry;
> +}
> +
> +SYSCALL_DEFINE0(uretprobe)
> +{
> + struct pt_regs *regs = task_pt_regs(current);
> + unsigned long err, ip, sp, r11_cx_ax[3];
> +
> + err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
> + WARN_ON_ONCE(err);
> +
> + /* expose the "right" values of r11/cx/ax/sp to uprobe_consumer/s */
> + regs->r11 = r11_cx_ax[0];
> + regs->cx = r11_cx_ax[1];
> + regs->ax = r11_cx_ax[2];
> + regs->sp += sizeof(r11_cx_ax);
> + regs->orig_ax = -1;
> +
> + ip = regs->ip;
> + sp = regs->sp;
> +
> + uprobe_handle_trampoline(regs);
> +
> + /*
> + * uprobe_consumer has changed sp, we can do nothing,
> + * just return via iret
> + */
> + if (regs->sp != sp)
> + return regs->ax;
> + regs->sp -= sizeof(r11_cx_ax);
> +
> + /* for the case uprobe_consumer has changed r11/cx */
> + r11_cx_ax[0] = regs->r11;
> + r11_cx_ax[1] = regs->cx;
> +
> + /*
> + * ax register is passed through as return value, so we can use
> + * its space on stack for ip value and jump to it through the
> + * trampoline's ret instruction
> + */
> + r11_cx_ax[2] = regs->ip;
> + regs->ip = ip;
> +
> + err = copy_to_user((void __user *)regs->sp, r11_cx_ax, sizeof(r11_cx_ax));
> + WARN_ON_ONCE(err);
> +
> + /* ensure sysret, see do_syscall_64() */
> + regs->r11 = regs->flags;
> + regs->cx = regs->ip;
> +
> + return regs->ax;
> +}
> +
> /*
> * If arch_uprobe->insn doesn't use rip-relative addressing, return
> * immediately. Otherwise, rewrite the instruction so that it accesses
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 77eb9b0e7685..db150794f89d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, size_t *size, u32 flags);
> /* x86 */
> asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
>
> +asmlinkage long sys_uretprobe(void);
> +
> /* pciconfig: alpha, arm, arm64, ia64, sparc */
> asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
> unsigned long off, unsigned long len,
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index f46e0ca0169c..a490146ad89d 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
> extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
> extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> void *src, unsigned long len);
> +extern void uprobe_handle_trampoline(struct pt_regs *regs);
> +extern void *arch_uprobe_trampoline(unsigned long *psize);
> #else /* !CONFIG_UPROBES */
> struct uprobes_state {
> };
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 75f00965ab15..8a747cd1d735 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
> #define __NR_lsm_list_modules 461
> __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
>
> +#define __NR_uretprobe 462
> +__SYSCALL(__NR_uretprobe, sys_uretprobe)
> +
> #undef __NR_syscalls
> -#define __NR_syscalls 462
> +#define __NR_syscalls 463
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 929e98c62965..90395b16bde0 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1474,11 +1474,20 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> return ret;
> }
>
> +void * __weak arch_uprobe_trampoline(unsigned long *psize)
> +{
> + static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> +
> + *psize = UPROBE_SWBP_INSN_SIZE;
> + return &insn;
> +}
> +
> static struct xol_area *__create_xol_area(unsigned long vaddr)
> {
> struct mm_struct *mm = current->mm;
> - uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> + unsigned long insns_size;
> struct xol_area *area;
> + void *insns;
>
> area = kmalloc(sizeof(*area), GFP_KERNEL);
> if (unlikely(!area))
> @@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> /* Reserve the 1st slot for get_trampoline_vaddr() */
> set_bit(0, area->bitmap);
> atomic_set(&area->slot_count, 1);
> - arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
> + insns = arch_uprobe_trampoline(&insns_size);
> + arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
>
> if (!xol_add_vma(mm, area))
> return area;
> @@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
> return ri;
> }
>
> -static void handle_trampoline(struct pt_regs *regs)
> +void uprobe_handle_trampoline(struct pt_regs *regs)
> {
> struct uprobe_task *utask;
> struct return_instance *ri, *next;
> @@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
>
> bp_vaddr = uprobe_get_swbp_addr(regs);
> if (bp_vaddr == get_trampoline_vaddr())
> - return handle_trampoline(regs);
> + return uprobe_handle_trampoline(regs);
>
> uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> if (!uprobe) {
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index faad00cce269..be6195e0d078 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
>
> /* restartable sequence */
> COND_SYSCALL(rseq);
> +
> +COND_SYSCALL(uretprobe);
> --
> 2.44.0
>
--
Masami Hiramatsu (Google) <[email protected]>
On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> Hi Jiri,
>
> On Tue, 2 Apr 2024 11:33:00 +0200
> Jiri Olsa <[email protected]> wrote:
>
> > Adding uretprobe syscall instead of trap to speed up return probe.
>
> This is interesting approach. But I doubt we need to add additional
> syscall just for this purpose. Can't we use another syscall or ioctl?
so the plan is to optimize entry uprobe in a similar way and given
the syscall is not a scarce resource I wanted to add another syscall
for that one as well
tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
possible to do that, the trampoline will just have to save one or
more additional registers, but adding new syscall seems cleaner to me
>
> Also, we should run syzkaller on this syscall. And if uretprobe is
right, I'll check on syzkaller
> set in the user function, what happen if the user function directly
> calls this syscall? (maybe it consumes shadow stack?)
the process should receive SIGILL if there's no pending uretprobe for
the current task, or it will trigger uretprobe if there's one pending
but we could limit the syscall to be executed just from the trampoline,
that should prevent all the user space use cases, I'll do that in next
version and add more tests for that
thanks,
jirka
>
> Thank you,
>
> >
> > At the moment the uretprobe setup/path is:
> >
> > - install entry uprobe
> >
> > - when the uprobe is hit, it overwrites probed function's return address
> > on stack with address of the trampoline that contains breakpoint
> > instruction
> >
> > - the breakpoint trap code handles the uretprobe consumers execution and
> > jumps back to original return address
> >
> > This patch replaces the above trampoline's breakpoint instruction with new
> > ureprobe syscall call. This syscall does exactly the same job as the trap
> > with some more extra work:
> >
> > - syscall trampoline must save original value for rax/r11/rcx registers
> > on stack - rax is set to syscall number and r11/rcx are changed and
> > used by syscall instruction
> >
> > - the syscall code reads the original values of those registers and
> > restore those values in task's pt_regs area
> >
> > Even with the extra registers handling code the having uretprobes handled
> > by syscalls shows speed improvement.
> >
> > On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
> >
> > current:
> >
> > base : 15.888 ? 0.033M/s
> > uprobe-nop : 3.016 ? 0.000M/s
> > uprobe-push : 2.832 ? 0.005M/s
> > uprobe-ret : 1.104 ? 0.000M/s
> > uretprobe-nop : 1.487 ? 0.000M/s
> > uretprobe-push : 1.456 ? 0.000M/s
> > uretprobe-ret : 0.816 ? 0.001M/s
> >
> > with the fix:
> >
> > base : 15.116 ? 0.045M/s
> > uprobe-nop : 3.001 ? 0.045M/s
> > uprobe-push : 2.831 ? 0.004M/s
> > uprobe-ret : 1.102 ? 0.001M/s
> > uretprobe-nop : 1.969 ? 0.001M/s < 32% speedup
> > uretprobe-push : 1.905 ? 0.004M/s < 30% speedup
> > uretprobe-ret : 0.933 ? 0.002M/s < 14% speedup
> >
> > On Amd (AMD Ryzen 7 5700U)
> >
> > current:
> >
> > base : 5.105 ? 0.003M/s
> > uprobe-nop : 1.552 ? 0.002M/s
> > uprobe-push : 1.408 ? 0.003M/s
> > uprobe-ret : 0.827 ? 0.001M/s
> > uretprobe-nop : 0.779 ? 0.001M/s
> > uretprobe-push : 0.750 ? 0.001M/s
> > uretprobe-ret : 0.539 ? 0.001M/s
> >
> > with the fix:
> >
> > base : 5.119 ? 0.002M/s
> > uprobe-nop : 1.523 ? 0.003M/s
> > uprobe-push : 1.384 ? 0.003M/s
> > uprobe-ret : 0.826 ? 0.002M/s
> > uretprobe-nop : 0.866 ? 0.002M/s < 11% speedup
> > uretprobe-push : 0.826 ? 0.002M/s < 10% speedup
> > uretprobe-ret : 0.581 ? 0.001M/s < 7% speedup
> >
> > Reviewed-by: Oleg Nesterov <[email protected]>
> > Suggested-by: Andrii Nakryiko <[email protected]>
> > Acked-by: Andrii Nakryiko <[email protected]>
> > Signed-off-by: Oleg Nesterov <[email protected]>
> > Signed-off-by: Jiri Olsa <[email protected]>
> > ---
> > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > arch/x86/kernel/uprobes.c | 83 ++++++++++++++++++++++++++
> > include/linux/syscalls.h | 2 +
> > include/linux/uprobes.h | 2 +
> > include/uapi/asm-generic/unistd.h | 5 +-
> > kernel/events/uprobes.c | 18 ++++--
> > kernel/sys_ni.c | 2 +
> > 7 files changed, 108 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 7e8d46f4147f..af0a33ab06ee 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -383,6 +383,7 @@
> > 459 common lsm_get_self_attr sys_lsm_get_self_attr
> > 460 common lsm_set_self_attr sys_lsm_set_self_attr
> > 461 common lsm_list_modules sys_lsm_list_modules
> > +462 64 uretprobe sys_uretprobe
> >
> > #
> > # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index 6c07f6daaa22..6fc5d16f6c17 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -12,6 +12,7 @@
> > #include <linux/ptrace.h>
> > #include <linux/uprobes.h>
> > #include <linux/uaccess.h>
> > +#include <linux/syscalls.h>
> >
> > #include <linux/kdebug.h>
> > #include <asm/processor.h>
> > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
> > }
> >
> > #ifdef CONFIG_X86_64
> > +
> > +asm (
> > + ".pushsection .rodata\n"
> > + ".global uretprobe_syscall_entry\n"
> > + "uretprobe_syscall_entry:\n"
> > + "pushq %rax\n"
> > + "pushq %rcx\n"
> > + "pushq %r11\n"
> > + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > + "syscall\n"
> > + "popq %r11\n"
> > + "popq %rcx\n"
> > +
> > + /* The uretprobe syscall replaces stored %rax value with final
> > + * return address, so we don't restore %rax in here and just
> > + * call ret.
> > + */
> > + "retq\n"
> > + ".global uretprobe_syscall_end\n"
> > + "uretprobe_syscall_end:\n"
> > + ".popsection\n"
> > +);
> > +
> > +extern u8 uretprobe_syscall_entry[];
> > +extern u8 uretprobe_syscall_end[];
> > +
> > +void *arch_uprobe_trampoline(unsigned long *psize)
> > +{
> > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> > + return uretprobe_syscall_entry;
> > +}
> > +
> > +SYSCALL_DEFINE0(uretprobe)
> > +{
> > + struct pt_regs *regs = task_pt_regs(current);
> > + unsigned long err, ip, sp, r11_cx_ax[3];
> > +
> > + err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
> > + WARN_ON_ONCE(err);
> > +
> > + /* expose the "right" values of r11/cx/ax/sp to uprobe_consumer/s */
> > + regs->r11 = r11_cx_ax[0];
> > + regs->cx = r11_cx_ax[1];
> > + regs->ax = r11_cx_ax[2];
> > + regs->sp += sizeof(r11_cx_ax);
> > + regs->orig_ax = -1;
> > +
> > + ip = regs->ip;
> > + sp = regs->sp;
> > +
> > + uprobe_handle_trampoline(regs);
> > +
> > + /*
> > + * uprobe_consumer has changed sp, we can do nothing,
> > + * just return via iret
> > + */
> > + if (regs->sp != sp)
> > + return regs->ax;
> > + regs->sp -= sizeof(r11_cx_ax);
> > +
> > + /* for the case uprobe_consumer has changed r11/cx */
> > + r11_cx_ax[0] = regs->r11;
> > + r11_cx_ax[1] = regs->cx;
> > +
> > + /*
> > + * ax register is passed through as return value, so we can use
> > + * its space on stack for ip value and jump to it through the
> > + * trampoline's ret instruction
> > + */
> > + r11_cx_ax[2] = regs->ip;
> > + regs->ip = ip;
> > +
> > + err = copy_to_user((void __user *)regs->sp, r11_cx_ax, sizeof(r11_cx_ax));
> > + WARN_ON_ONCE(err);
> > +
> > + /* ensure sysret, see do_syscall_64() */
> > + regs->r11 = regs->flags;
> > + regs->cx = regs->ip;
> > +
> > + return regs->ax;
> > +}
> > +
> > /*
> > * If arch_uprobe->insn doesn't use rip-relative addressing, return
> > * immediately. Otherwise, rewrite the instruction so that it accesses
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 77eb9b0e7685..db150794f89d 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, size_t *size, u32 flags);
> > /* x86 */
> > asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
> >
> > +asmlinkage long sys_uretprobe(void);
> > +
> > /* pciconfig: alpha, arm, arm64, ia64, sparc */
> > asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
> > unsigned long off, unsigned long len,
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index f46e0ca0169c..a490146ad89d 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
> > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
> > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > void *src, unsigned long len);
> > +extern void uprobe_handle_trampoline(struct pt_regs *regs);
> > +extern void *arch_uprobe_trampoline(unsigned long *psize);
> > #else /* !CONFIG_UPROBES */
> > struct uprobes_state {
> > };
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 75f00965ab15..8a747cd1d735 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
> > #define __NR_lsm_list_modules 461
> > __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
> >
> > +#define __NR_uretprobe 462
> > +__SYSCALL(__NR_uretprobe, sys_uretprobe)
> > +
> > #undef __NR_syscalls
> > -#define __NR_syscalls 462
> > +#define __NR_syscalls 463
> >
> > /*
> > * 32 bit systems traditionally used different
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..90395b16bde0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -1474,11 +1474,20 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> > return ret;
> > }
> >
> > +void * __weak arch_uprobe_trampoline(unsigned long *psize)
> > +{
> > + static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> > +
> > + *psize = UPROBE_SWBP_INSN_SIZE;
> > + return &insn;
> > +}
> > +
> > static struct xol_area *__create_xol_area(unsigned long vaddr)
> > {
> > struct mm_struct *mm = current->mm;
> > - uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> > + unsigned long insns_size;
> > struct xol_area *area;
> > + void *insns;
> >
> > area = kmalloc(sizeof(*area), GFP_KERNEL);
> > if (unlikely(!area))
> > @@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> > /* Reserve the 1st slot for get_trampoline_vaddr() */
> > set_bit(0, area->bitmap);
> > atomic_set(&area->slot_count, 1);
> > - arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
> > + insns = arch_uprobe_trampoline(&insns_size);
> > + arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
> >
> > if (!xol_add_vma(mm, area))
> > return area;
> > @@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
> > return ri;
> > }
> >
> > -static void handle_trampoline(struct pt_regs *regs)
> > +void uprobe_handle_trampoline(struct pt_regs *regs)
> > {
> > struct uprobe_task *utask;
> > struct return_instance *ri, *next;
> > @@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
> >
> > bp_vaddr = uprobe_get_swbp_addr(regs);
> > if (bp_vaddr == get_trampoline_vaddr())
> > - return handle_trampoline(regs);
> > + return uprobe_handle_trampoline(regs);
> >
> > uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> > if (!uprobe) {
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index faad00cce269..be6195e0d078 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
> >
> > /* restartable sequence */
> > COND_SYSCALL(rseq);
> > +
> > +COND_SYSCALL(uretprobe);
> > --
> > 2.44.0
> >
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>
I leave this to you and Masami, but...
On 04/03, Jiri Olsa wrote:
>
> On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> >
> > This is interesting approach. But I doubt we need to add additional
> > syscall just for this purpose. Can't we use another syscall or ioctl?
>
> so the plan is to optimize entry uprobe in a similar way and given
> the syscall is not a scarce resource I wanted to add another syscall
> for that one as well
>
> tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> possible to do that, the trampoline will just have to save one or
> more additional registers, but adding new syscall seems cleaner to me
Agreed.
> > Also, we should run syzkaller on this syscall. And if uretprobe is
>
> right, I'll check on syzkaller
I don't understand this concern...
> > set in the user function, what happen if the user function directly
> > calls this syscall? (maybe it consumes shadow stack?)
>
> the process should receive SIGILL if there's no pending uretprobe for
> the current task,
Yes,
> or it will trigger uretprobe if there's one pending
.. and corrupt the caller. So what?
> but we could limit the syscall to be executed just from the trampoline,
> that should prevent all the user space use cases, I'll do that in next
> version and add more tests for that
Yes, we can... well, ignoring the race with mremap() from another thread.
But why should we care?
Userspace should not call sys_uretprobe(). Likewise, it should not call
sys_restart_syscall(). Likewise, it should not jump to xol_area.
Of course, userspace (especially syzkaller) _can_ do this. So what?
I think the only thing we need to ensure is that the "malicious" task
which calls sys_uretprobe() can only harm itself, nothing more.
No?
Oleg.
On Wed, 3 Apr 2024 11:47:41 +0200
Jiri Olsa <[email protected]> wrote:
> On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > Hi Jiri,
> >
> > On Tue, 2 Apr 2024 11:33:00 +0200
> > Jiri Olsa <[email protected]> wrote:
> >
> > > Adding uretprobe syscall instead of trap to speed up return probe.
> >
> > This is interesting approach. But I doubt we need to add additional
> > syscall just for this purpose. Can't we use another syscall or ioctl?
>
> so the plan is to optimize entry uprobe in a similar way and given
> the syscall is not a scarce resource I wanted to add another syscall
> for that one as well
>
> tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> possible to do that, the trampoline will just have to save one or
> more additional registers, but adding new syscall seems cleaner to me
Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
>
> >
> > Also, we should run syzkaller on this syscall. And if uretprobe is
>
> right, I'll check on syzkaller
>
> > set in the user function, what happen if the user function directly
> > calls this syscall? (maybe it consumes shadow stack?)
>
> the process should receive SIGILL if there's no pending uretprobe for
> the current task, or it will trigger uretprobe if there's one pending
No, that is too aggressive and not safe. Since the syscall is exposed to
user program, it should return appropriate error code instead of SIGILL.
>
> but we could limit the syscall to be executed just from the trampoline,
> that should prevent all the user space use cases, I'll do that in next
> version and add more tests for that
Why not limit? :) The uprobe_handle_trampoline() expects it is called
only from the trampoline, so it is natural to check the caller address.
(and uprobe should know where is the trampoline)
Since the syscall is always exposed to the user program, it should
- Do nothing and return an error unless it is properly called.
- check the prerequisites for operation strictly.
I concern that new system calls introduce vulnerabilities.
Thank you,
>
> thanks,
> jirka
>
>
> >
> > Thank you,
> >
> > >
> > > At the moment the uretprobe setup/path is:
> > >
> > > - install entry uprobe
> > >
> > > - when the uprobe is hit, it overwrites probed function's return address
> > > on stack with address of the trampoline that contains breakpoint
> > > instruction
> > >
> > > - the breakpoint trap code handles the uretprobe consumers execution and
> > > jumps back to original return address
> > >
> > > This patch replaces the above trampoline's breakpoint instruction with new
> > > ureprobe syscall call. This syscall does exactly the same job as the trap
> > > with some more extra work:
> > >
> > > - syscall trampoline must save original value for rax/r11/rcx registers
> > > on stack - rax is set to syscall number and r11/rcx are changed and
> > > used by syscall instruction
> > >
> > > - the syscall code reads the original values of those registers and
> > > restore those values in task's pt_regs area
> > >
> > > Even with the extra registers handling code the having uretprobes handled
> > > by syscalls shows speed improvement.
> > >
> > > On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
> > >
> > > current:
> > >
> > > base : 15.888 ± 0.033M/s
> > > uprobe-nop : 3.016 ± 0.000M/s
> > > uprobe-push : 2.832 ± 0.005M/s
> > > uprobe-ret : 1.104 ± 0.000M/s
> > > uretprobe-nop : 1.487 ± 0.000M/s
> > > uretprobe-push : 1.456 ± 0.000M/s
> > > uretprobe-ret : 0.816 ± 0.001M/s
> > >
> > > with the fix:
> > >
> > > base : 15.116 ± 0.045M/s
> > > uprobe-nop : 3.001 ± 0.045M/s
> > > uprobe-push : 2.831 ± 0.004M/s
> > > uprobe-ret : 1.102 ± 0.001M/s
> > > uretprobe-nop : 1.969 ± 0.001M/s < 32% speedup
> > > uretprobe-push : 1.905 ± 0.004M/s < 30% speedup
> > > uretprobe-ret : 0.933 ± 0.002M/s < 14% speedup
> > >
> > > On Amd (AMD Ryzen 7 5700U)
> > >
> > > current:
> > >
> > > base : 5.105 ± 0.003M/s
> > > uprobe-nop : 1.552 ± 0.002M/s
> > > uprobe-push : 1.408 ± 0.003M/s
> > > uprobe-ret : 0.827 ± 0.001M/s
> > > uretprobe-nop : 0.779 ± 0.001M/s
> > > uretprobe-push : 0.750 ± 0.001M/s
> > > uretprobe-ret : 0.539 ± 0.001M/s
> > >
> > > with the fix:
> > >
> > > base : 5.119 ± 0.002M/s
> > > uprobe-nop : 1.523 ± 0.003M/s
> > > uprobe-push : 1.384 ± 0.003M/s
> > > uprobe-ret : 0.826 ± 0.002M/s
> > > uretprobe-nop : 0.866 ± 0.002M/s < 11% speedup
> > > uretprobe-push : 0.826 ± 0.002M/s < 10% speedup
> > > uretprobe-ret : 0.581 ± 0.001M/s < 7% speedup
> > >
> > > Reviewed-by: Oleg Nesterov <[email protected]>
> > > Suggested-by: Andrii Nakryiko <[email protected]>
> > > Acked-by: Andrii Nakryiko <[email protected]>
> > > Signed-off-by: Oleg Nesterov <[email protected]>
> > > Signed-off-by: Jiri Olsa <[email protected]>
> > > ---
> > > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > > arch/x86/kernel/uprobes.c | 83 ++++++++++++++++++++++++++
> > > include/linux/syscalls.h | 2 +
> > > include/linux/uprobes.h | 2 +
> > > include/uapi/asm-generic/unistd.h | 5 +-
> > > kernel/events/uprobes.c | 18 ++++--
> > > kernel/sys_ni.c | 2 +
> > > 7 files changed, 108 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > > index 7e8d46f4147f..af0a33ab06ee 100644
> > > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > > @@ -383,6 +383,7 @@
> > > 459 common lsm_get_self_attr sys_lsm_get_self_attr
> > > 460 common lsm_set_self_attr sys_lsm_set_self_attr
> > > 461 common lsm_list_modules sys_lsm_list_modules
> > > +462 64 uretprobe sys_uretprobe
> > >
> > > #
> > > # Due to a historical design error, certain syscalls are numbered differently
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index 6c07f6daaa22..6fc5d16f6c17 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -12,6 +12,7 @@
> > > #include <linux/ptrace.h>
> > > #include <linux/uprobes.h>
> > > #include <linux/uaccess.h>
> > > +#include <linux/syscalls.h>
> > >
> > > #include <linux/kdebug.h>
> > > #include <asm/processor.h>
> > > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
> > > }
> > >
> > > #ifdef CONFIG_X86_64
> > > +
> > > +asm (
> > > + ".pushsection .rodata\n"
> > > + ".global uretprobe_syscall_entry\n"
> > > + "uretprobe_syscall_entry:\n"
> > > + "pushq %rax\n"
> > > + "pushq %rcx\n"
> > > + "pushq %r11\n"
> > > + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > > + "syscall\n"
> > > + "popq %r11\n"
> > > + "popq %rcx\n"
> > > +
> > > + /* The uretprobe syscall replaces stored %rax value with final
> > > + * return address, so we don't restore %rax in here and just
> > > + * call ret.
> > > + */
> > > + "retq\n"
> > > + ".global uretprobe_syscall_end\n"
> > > + "uretprobe_syscall_end:\n"
> > > + ".popsection\n"
> > > +);
> > > +
> > > +extern u8 uretprobe_syscall_entry[];
> > > +extern u8 uretprobe_syscall_end[];
> > > +
> > > +void *arch_uprobe_trampoline(unsigned long *psize)
> > > +{
> > > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> > > + return uretprobe_syscall_entry;
> > > +}
> > > +
> > > +SYSCALL_DEFINE0(uretprobe)
> > > +{
> > > + struct pt_regs *regs = task_pt_regs(current);
> > > + unsigned long err, ip, sp, r11_cx_ax[3];
> > > +
> > > + err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
> > > + WARN_ON_ONCE(err);
> > > +
> > > + /* expose the "right" values of r11/cx/ax/sp to uprobe_consumer/s */
> > > + regs->r11 = r11_cx_ax[0];
> > > + regs->cx = r11_cx_ax[1];
> > > + regs->ax = r11_cx_ax[2];
> > > + regs->sp += sizeof(r11_cx_ax);
> > > + regs->orig_ax = -1;
> > > +
> > > + ip = regs->ip;
> > > + sp = regs->sp;
> > > +
> > > + uprobe_handle_trampoline(regs);
> > > +
> > > + /*
> > > + * uprobe_consumer has changed sp, we can do nothing,
> > > + * just return via iret
> > > + */
> > > + if (regs->sp != sp)
> > > + return regs->ax;
> > > + regs->sp -= sizeof(r11_cx_ax);
> > > +
> > > + /* for the case uprobe_consumer has changed r11/cx */
> > > + r11_cx_ax[0] = regs->r11;
> > > + r11_cx_ax[1] = regs->cx;
> > > +
> > > + /*
> > > + * ax register is passed through as return value, so we can use
> > > + * its space on stack for ip value and jump to it through the
> > > + * trampoline's ret instruction
> > > + */
> > > + r11_cx_ax[2] = regs->ip;
> > > + regs->ip = ip;
> > > +
> > > + err = copy_to_user((void __user *)regs->sp, r11_cx_ax, sizeof(r11_cx_ax));
> > > + WARN_ON_ONCE(err);
> > > +
> > > + /* ensure sysret, see do_syscall_64() */
> > > + regs->r11 = regs->flags;
> > > + regs->cx = regs->ip;
> > > +
> > > + return regs->ax;
> > > +}
> > > +
> > > /*
> > > * If arch_uprobe->insn doesn't use rip-relative addressing, return
> > > * immediately. Otherwise, rewrite the instruction so that it accesses
> > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > > index 77eb9b0e7685..db150794f89d 100644
> > > --- a/include/linux/syscalls.h
> > > +++ b/include/linux/syscalls.h
> > > @@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, size_t *size, u32 flags);
> > > /* x86 */
> > > asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
> > >
> > > +asmlinkage long sys_uretprobe(void);
> > > +
> > > /* pciconfig: alpha, arm, arm64, ia64, sparc */
> > > asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
> > > unsigned long off, unsigned long len,
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index f46e0ca0169c..a490146ad89d 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
> > > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
> > > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> > > void *src, unsigned long len);
> > > +extern void uprobe_handle_trampoline(struct pt_regs *regs);
> > > +extern void *arch_uprobe_trampoline(unsigned long *psize);
> > > #else /* !CONFIG_UPROBES */
> > > struct uprobes_state {
> > > };
> > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > > index 75f00965ab15..8a747cd1d735 100644
> > > --- a/include/uapi/asm-generic/unistd.h
> > > +++ b/include/uapi/asm-generic/unistd.h
> > > @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
> > > #define __NR_lsm_list_modules 461
> > > __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
> > >
> > > +#define __NR_uretprobe 462
> > > +__SYSCALL(__NR_uretprobe, sys_uretprobe)
> > > +
> > > #undef __NR_syscalls
> > > -#define __NR_syscalls 462
> > > +#define __NR_syscalls 463
> > >
> > > /*
> > > * 32 bit systems traditionally used different
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 929e98c62965..90395b16bde0 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -1474,11 +1474,20 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> > > return ret;
> > > }
> > >
> > > +void * __weak arch_uprobe_trampoline(unsigned long *psize)
> > > +{
> > > + static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> > > +
> > > + *psize = UPROBE_SWBP_INSN_SIZE;
> > > + return &insn;
> > > +}
> > > +
> > > static struct xol_area *__create_xol_area(unsigned long vaddr)
> > > {
> > > struct mm_struct *mm = current->mm;
> > > - uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> > > + unsigned long insns_size;
> > > struct xol_area *area;
> > > + void *insns;
> > >
> > > area = kmalloc(sizeof(*area), GFP_KERNEL);
> > > if (unlikely(!area))
> > > @@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
> > > /* Reserve the 1st slot for get_trampoline_vaddr() */
> > > set_bit(0, area->bitmap);
> > > atomic_set(&area->slot_count, 1);
> > > - arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
> > > + insns = arch_uprobe_trampoline(&insns_size);
> > > + arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
> > >
> > > if (!xol_add_vma(mm, area))
> > > return area;
> > > @@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
> > > return ri;
> > > }
> > >
> > > -static void handle_trampoline(struct pt_regs *regs)
> > > +void uprobe_handle_trampoline(struct pt_regs *regs)
> > > {
> > > struct uprobe_task *utask;
> > > struct return_instance *ri, *next;
> > > @@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
> > >
> > > bp_vaddr = uprobe_get_swbp_addr(regs);
> > > if (bp_vaddr == get_trampoline_vaddr())
> > > - return handle_trampoline(regs);
> > > + return uprobe_handle_trampoline(regs);
> > >
> > > uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> > > if (!uprobe) {
> > > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > > index faad00cce269..be6195e0d078 100644
> > > --- a/kernel/sys_ni.c
> > > +++ b/kernel/sys_ni.c
> > > @@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
> > >
> > > /* restartable sequence */
> > > COND_SYSCALL(rseq);
> > > +
> > > +COND_SYSCALL(uretprobe);
> > > --
> > > 2.44.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>
--
Masami Hiramatsu (Google) <[email protected]>
Again, I leave this to you and Jiri, but
On 04/03, Masami Hiramatsu wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.
..
> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.
We have sys_munmap(). should it check if the caller is going to unmap
the code region which contains regs->ip and do nothing?
I don't think it should. Userspace should blame itself, SIGSEGV is not
"too aggressive" in this case.
> I concern that new system calls introduce vulnerabilities.
Yes, we need to ensure that sys_uretprobe() can only damage the malicious
caller and nothing else.
Oleg.
On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu <[email protected]> wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> Jiri Olsa <[email protected]> wrote:
>
> > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > Hi Jiri,
> > >
> > > On Tue, 2 Apr 2024 11:33:00 +0200
> > > Jiri Olsa <[email protected]> wrote:
> > >
> > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > >
> > > This is interesting approach. But I doubt we need to add additional
> > > syscall just for this purpose. Can't we use another syscall or ioctl?
> >
> > so the plan is to optimize entry uprobe in a similar way and given
> > the syscall is not a scarce resource I wanted to add another syscall
> > for that one as well
> >
> > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > possible to do that, the trampoline will just have to save one or
> > more additional registers, but adding new syscall seems cleaner to me
>
> Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
I think both ptrace and prctl are for completely different use cases
and it would be an abuse of existing API to reuse them for uretprobe
tracing. Also, keep in mind, that any extra argument that has to be
passed into this syscall means that we need to complicate and slow
generated assembly code that is injected into user process (to
save/restore registers) and also kernel-side (again, to deal with all
the extra registers that would be stored/restored on stack).
Given syscalls are not some kind of scarce resources, what's the
downside to have a dedicated and simple syscall?
>
> >
> > >
> > > Also, we should run syzkaller on this syscall. And if uretprobe is
> >
> > right, I'll check on syzkaller
> >
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.
>
This is the way it is today with uretprobes even through interrupt.
E.g., it could happen that user process is using fibers and is
replacing stack pointer without kernel realizing this, which will
trigger some defensive checks in uretprobe handling code and kernel
will send SIGILL because it can't support such cases. This is
happening today already, and it works fine in practice (except for
applications that manually change stack pointer, too bad, you can't
trace them with uretprobes, unfortunately).
So I think it's absolutely adequate to have this behavior if the user
process is *intentionally* abusing this API.
> >
> > but we could limit the syscall to be executed just from the trampoline,
> > that should prevent all the user space use cases, I'll do that in next
> > version and add more tests for that
>
> Why not limit? :) The uprobe_handle_trampoline() expects it is called
> only from the trampoline, so it is natural to check the caller address.
> (and uprobe should know where is the trampoline)
>
> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.
> I concern that new system calls introduce vulnerabilities.
>
As Oleg and Jiri mentioned, this syscall can't harm kernel or other
processes, only the process that is abusing the API. So any extra
checks that would slow down this approach is an unnecessary overhead
and complication that will never be useful in practice.
Also note that sys_uretprobe is a kind of internal and unstable API
and it is explicitly called out that its contract can change at any
time and user space shouldn't rely on it. It's purely for the kernel's
own usage.
So let's please keep it fast and simple.
> Thank you,
>
>
> >
> > thanks,
> > jirka
> >
> >
> > >
[...]
On Wed, 3 Apr 2024 09:58:12 -0700
Andrii Nakryiko <[email protected]> wrote:
> On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu <[email protected]> wrote:
> >
> > On Wed, 3 Apr 2024 11:47:41 +0200
> > Jiri Olsa <[email protected]> wrote:
> >
> > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > Hi Jiri,
> > > >
> > > > On Tue, 2 Apr 2024 11:33:00 +0200
> > > > Jiri Olsa <[email protected]> wrote:
> > > >
> > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > >
> > > > This is interesting approach. But I doubt we need to add additional
> > > > syscall just for this purpose. Can't we use another syscall or ioctl?
> > >
> > > so the plan is to optimize entry uprobe in a similar way and given
> > > the syscall is not a scarce resource I wanted to add another syscall
> > > for that one as well
> > >
> > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > > possible to do that, the trampoline will just have to save one or
> > > more additional registers, but adding new syscall seems cleaner to me
> >
> > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
>
> I think both ptrace and prctl are for completely different use cases
> and it would be an abuse of existing API to reuse them for uretprobe
> tracing. Also, keep in mind, that any extra argument that has to be
> passed into this syscall means that we need to complicate and slow
> generated assembly code that is injected into user process (to
> save/restore registers) and also kernel-side (again, to deal with all
> the extra registers that would be stored/restored on stack).
>
> Given syscalls are not some kind of scarce resources, what's the
> downside to have a dedicated and simple syscall?
Syscalls are explicitly exposed to user space, thus, even if it is used
ONLY for a very specific situation, it is an official kernel interface,
and need to care about the compatibility. (If it causes SIGILL unless
a specific use case, I don't know there is a "compatibility".)
And the number of syscalls are limited resource.
I'm actually not sure how much we need to care of it, but adding a new
syscall is worth to be discussed carefully because all of them are
user-space compatibility.
> > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > >
> > > right, I'll check on syzkaller
> > >
> > > > set in the user function, what happen if the user function directly
> > > > calls this syscall? (maybe it consumes shadow stack?)
> > >
> > > the process should receive SIGILL if there's no pending uretprobe for
> > > the current task, or it will trigger uretprobe if there's one pending
> >
> > No, that is too aggressive and not safe. Since the syscall is exposed to
> > user program, it should return appropriate error code instead of SIGILL.
> >
>
> This is the way it is today with uretprobes even through interrupt.
I doubt that the interrupt (exception) and syscall should be handled
differently. Especially, this exception is injected by uprobes but
syscall will be caused by itself. But syscall can be called from user
program (of couse this works as sys_kill(self, SIGILL)).
> E.g., it could happen that user process is using fibers and is
> replacing stack pointer without kernel realizing this, which will
> trigger some defensive checks in uretprobe handling code and kernel
> will send SIGILL because it can't support such cases. This is
> happening today already, and it works fine in practice (except for
> applications that manually change stack pointer, too bad, you can't
> trace them with uretprobes, unfortunately).
OK, we at least need to document it.
>
> So I think it's absolutely adequate to have this behavior if the user
> process is *intentionally* abusing this API.
Of course user expected that it is abusing. So at least we need to
add a document that this syscall number is reserved to uprobes and
user program must not use it.
>
> > >
> > > but we could limit the syscall to be executed just from the trampoline,
> > > that should prevent all the user space use cases, I'll do that in next
> > > version and add more tests for that
> >
> > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > only from the trampoline, so it is natural to check the caller address.
> > (and uprobe should know where is the trampoline)
> >
> > Since the syscall is always exposed to the user program, it should
> > - Do nothing and return an error unless it is properly called.
> > - check the prerequisites for operation strictly.
> > I concern that new system calls introduce vulnerabilities.
> >
>
> As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> processes, only the process that is abusing the API. So any extra
> checks that would slow down this approach is an unnecessary overhead
> and complication that will never be useful in practice.
I think at least it should check the caller address to ensure the
address is in the trampoline.
But anyway, uprobes itself can break the target process, so no one
might care if this system call breaks the process now.
>
> Also note that sys_uretprobe is a kind of internal and unstable API
> and it is explicitly called out that its contract can change at any
> time and user space shouldn't rely on it. It's purely for the kernel's
> own usage.
Is that OK to use a syscall as "internal" and "unstable" API?
>
> So let's please keep it fast and simple.
>
>
> > Thank you,
> >
> >
> > >
> > > thanks,
> > > jirka
> > >
> > >
> > > >
>
> [...]
([OT] If we can add syscall so casually, I would like to add sys_traceevent
for recording user space events :-) .)
--
Masami Hiramatsu (Google) <[email protected]>
On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu <[email protected]> wrote:
>
> On Wed, 3 Apr 2024 09:58:12 -0700
> Andrii Nakryiko <[email protected]> wrote:
>
> > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu <[email protected]> wrote:
> > >
> > > On Wed, 3 Apr 2024 11:47:41 +0200
> > > Jiri Olsa <[email protected]> wrote:
> > >
> > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > > Hi Jiri,
> > > > >
> > > > > On Tue, 2 Apr 2024 11:33:00 +0200
> > > > > Jiri Olsa <[email protected]> wrote:
> > > > >
> > > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > > >
> > > > > This is interesting approach. But I doubt we need to add additional
> > > > > syscall just for this purpose. Can't we use another syscall or ioctl?
> > > >
> > > > so the plan is to optimize entry uprobe in a similar way and given
> > > > the syscall is not a scarce resource I wanted to add another syscall
> > > > for that one as well
> > > >
> > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > > > possible to do that, the trampoline will just have to save one or
> > > > more additional registers, but adding new syscall seems cleaner to me
> > >
> > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
> >
> > I think both ptrace and prctl are for completely different use cases
> > and it would be an abuse of existing API to reuse them for uretprobe
> > tracing. Also, keep in mind, that any extra argument that has to be
> > passed into this syscall means that we need to complicate and slow
> > generated assembly code that is injected into user process (to
> > save/restore registers) and also kernel-side (again, to deal with all
> > the extra registers that would be stored/restored on stack).
> >
> > Given syscalls are not some kind of scarce resources, what's the
> > downside to have a dedicated and simple syscall?
>
> Syscalls are explicitly exposed to user space, thus, even if it is used
> ONLY for a very specific situation, it is an official kernel interface,
> and need to care about the compatibility. (If it causes SIGILL unless
> a specific use case, I don't know there is a "compatibility".)
Check rt_sigreturn syscall (manpage at [0], for example).
sigreturn() exists only to allow the implementation of signal
handlers. It should never be called directly. (Indeed, a simple
sigreturn() wrapper in the GNU C library simply returns -1, with
errno set to ENOSYS.) Details of the arguments (if any) passed
to sigreturn() vary depending on the architecture. (On some
architectures, such as x86-64, sigreturn() takes no arguments,
since all of the information that it requires is available in the
stack frame that was previously created by the kernel on the
user-space stack.)
This is a very similar use case. Also, check its source code in
arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
on any sign of something not being right. It's exactly the same with
sys_uretprobe.
[0] https://man7.org/linux/man-pages/man2/sigreturn.2.html
> And the number of syscalls are limited resource.
We have almost 500 of them, it didn't seems like adding 1-2 for good
reasons would be a problem. Can you please point to where the limits
on syscalls as a resource are described? I'm curious to learn.
>
> I'm actually not sure how much we need to care of it, but adding a new
> syscall is worth to be discussed carefully because all of them are
> user-space compatibility.
Absolutely, it's a good discussion to have.
>
> > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > >
> > > > right, I'll check on syzkaller
> > > >
> > > > > set in the user function, what happen if the user function directly
> > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > >
> > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > the current task, or it will trigger uretprobe if there's one pending
> > >
> > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > user program, it should return appropriate error code instead of SIGILL.
> > >
> >
> > This is the way it is today with uretprobes even through interrupt.
>
> I doubt that the interrupt (exception) and syscall should be handled
> differently. Especially, this exception is injected by uprobes but
> syscall will be caused by itself. But syscall can be called from user
> program (of couse this works as sys_kill(self, SIGILL)).
Yep, I'd keep the behavior the same between uretprobes implemented
through int3 and sys_uretprobe.
>
> > E.g., it could happen that user process is using fibers and is
> > replacing stack pointer without kernel realizing this, which will
> > trigger some defensive checks in uretprobe handling code and kernel
> > will send SIGILL because it can't support such cases. This is
> > happening today already, and it works fine in practice (except for
> > applications that manually change stack pointer, too bad, you can't
> > trace them with uretprobes, unfortunately).
>
> OK, we at least need to document it.
+1, yep
>
> >
> > So I think it's absolutely adequate to have this behavior if the user
> > process is *intentionally* abusing this API.
>
> Of course user expected that it is abusing. So at least we need to
> add a document that this syscall number is reserved to uprobes and
> user program must not use it.
>
Totally agree about documenting this.
> >
> > > >
> > > > but we could limit the syscall to be executed just from the trampoline,
> > > > that should prevent all the user space use cases, I'll do that in next
> > > > version and add more tests for that
> > >
> > > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > > only from the trampoline, so it is natural to check the caller address.
> > > (and uprobe should know where is the trampoline)
> > >
> > > Since the syscall is always exposed to the user program, it should
> > > - Do nothing and return an error unless it is properly called.
> > > - check the prerequisites for operation strictly.
> > > I concern that new system calls introduce vulnerabilities.
> > >
> >
> > As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> > processes, only the process that is abusing the API. So any extra
> > checks that would slow down this approach is an unnecessary overhead
> > and complication that will never be useful in practice.
>
> I think at least it should check the caller address to ensure the
> address is in the trampoline.
> But anyway, uprobes itself can break the target process, so no one
> might care if this system call breaks the process now.
If we already have an expected range of addresses, then I think it's
fine to do a quick unlikely() check. I'd be more concerned if we need
to do another lookup or search to just validate this. I'm sure Jiri
will figure it out.
>
> >
> > Also note that sys_uretprobe is a kind of internal and unstable API
> > and it is explicitly called out that its contract can change at any
> > time and user space shouldn't rely on it. It's purely for the kernel's
> > own usage.
>
> Is that OK to use a syscall as "internal" and "unstable" API?
See above about rt_sigreturn. It seems like yes, for some highly
specialized syscalls it is the case already.
>
> >
> > So let's please keep it fast and simple.
> >
> >
> > > Thank you,
> > >
> > >
> > > >
> > > > thanks,
> > > > jirka
> > > >
> > > >
> > > > >
> >
> > [...]
>
>
> ([OT] If we can add syscall so casually, I would like to add sys_traceevent
> for recording user space events :-) .)
Have you proposed this upstream? :) I have no clue and no opinion about it..
>
> --
> Masami Hiramatsu (Google) <[email protected]>
On Wed, Apr 03, 2024 at 07:00:07PM -0700, Andrii Nakryiko wrote:
SNIP
> Check rt_sigreturn syscall (manpage at [0], for example).
>
> sigreturn() exists only to allow the implementation of signal
> handlers. It should never be called directly. (Indeed, a simple
> sigreturn() wrapper in the GNU C library simply returns -1, with
> errno set to ENOSYS.) Details of the arguments (if any) passed
> to sigreturn() vary depending on the architecture. (On some
> architectures, such as x86-64, sigreturn() takes no arguments,
> since all of the information that it requires is available in the
> stack frame that was previously created by the kernel on the
> user-space stack.)
>
> This is a very similar use case. Also, check its source code in
> arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
> on any sign of something not being right. It's exactly the same with
> sys_uretprobe.
>
> [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html
>
> > And the number of syscalls are limited resource.
>
> We have almost 500 of them, it didn't seems like adding 1-2 for good
> reasons would be a problem. Can you please point to where the limits
> on syscalls as a resource are described? I'm curious to learn.
>
> >
> > I'm actually not sure how much we need to care of it, but adding a new
> > syscall is worth to be discussed carefully because all of them are
> > user-space compatibility.
>
> Absolutely, it's a good discussion to have.
>
> >
> > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > > >
> > > > > right, I'll check on syzkaller
> > > > >
> > > > > > set in the user function, what happen if the user function directly
> > > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > >
> > > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > > the current task, or it will trigger uretprobe if there's one pending
> > > >
> > > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > > user program, it should return appropriate error code instead of SIGILL.
> > > >
> > >
> > > This is the way it is today with uretprobes even through interrupt.
> >
> > I doubt that the interrupt (exception) and syscall should be handled
> > differently. Especially, this exception is injected by uprobes but
> > syscall will be caused by itself. But syscall can be called from user
> > program (of couse this works as sys_kill(self, SIGILL)).
>
> Yep, I'd keep the behavior the same between uretprobes implemented
> through int3 and sys_uretprobe.
+1
>
> >
> > > E.g., it could happen that user process is using fibers and is
> > > replacing stack pointer without kernel realizing this, which will
> > > trigger some defensive checks in uretprobe handling code and kernel
> > > will send SIGILL because it can't support such cases. This is
> > > happening today already, and it works fine in practice (except for
> > > applications that manually change stack pointer, too bad, you can't
> > > trace them with uretprobes, unfortunately).
> >
> > OK, we at least need to document it.
>
> +1, yep
>
> >
> > >
> > > So I think it's absolutely adequate to have this behavior if the user
> > > process is *intentionally* abusing this API.
> >
> > Of course user expected that it is abusing. So at least we need to
> > add a document that this syscall number is reserved to uprobes and
> > user program must not use it.
> >
>
> Totally agree about documenting this.
ok there's map page on sigreturn.. do you think we should add man page
for uretprobe or you can think of some other place to document it?
>
> > >
> > > > >
> > > > > but we could limit the syscall to be executed just from the trampoline,
> > > > > that should prevent all the user space use cases, I'll do that in next
> > > > > version and add more tests for that
> > > >
> > > > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > > > only from the trampoline, so it is natural to check the caller address.
> > > > (and uprobe should know where is the trampoline)
> > > >
> > > > Since the syscall is always exposed to the user program, it should
> > > > - Do nothing and return an error unless it is properly called.
> > > > - check the prerequisites for operation strictly.
> > > > I concern that new system calls introduce vulnerabilities.
> > > >
> > >
> > > As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> > > processes, only the process that is abusing the API. So any extra
> > > checks that would slow down this approach is an unnecessary overhead
> > > and complication that will never be useful in practice.
> >
> > I think at least it should check the caller address to ensure the
> > address is in the trampoline.
> > But anyway, uprobes itself can break the target process, so no one
> > might care if this system call breaks the process now.
>
> If we already have an expected range of addresses, then I think it's
> fine to do a quick unlikely() check. I'd be more concerned if we need
> to do another lookup or search to just validate this. I'm sure Jiri
> will figure it out.
Oleg mentioned the trampoline address check could race with another
thread's mremap call, however trap is using that check as well, it
still seems like good idea to do it also in the uretprobe syscall
jirka
On Wed, 3 Apr 2024 19:00:07 -0700
Andrii Nakryiko <[email protected]> wrote:
> On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu <[email protected]> wrote:
> >
> > On Wed, 3 Apr 2024 09:58:12 -0700
> > Andrii Nakryiko <[email protected]> wrote:
> >
> > > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu <[email protected]> wrote:
> > > >
> > > > On Wed, 3 Apr 2024 11:47:41 +0200
> > > > Jiri Olsa <[email protected]> wrote:
> > > >
> > > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > > > Hi Jiri,
> > > > > >
> > > > > > On Tue, 2 Apr 2024 11:33:00 +0200
> > > > > > Jiri Olsa <[email protected]> wrote:
> > > > > >
> > > > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > > > >
> > > > > > This is interesting approach. But I doubt we need to add additional
> > > > > > syscall just for this purpose. Can't we use another syscall or ioctl?
> > > > >
> > > > > so the plan is to optimize entry uprobe in a similar way and given
> > > > > the syscall is not a scarce resource I wanted to add another syscall
> > > > > for that one as well
> > > > >
> > > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's
> > > > > possible to do that, the trampoline will just have to save one or
> > > > > more additional registers, but adding new syscall seems cleaner to me
> > > >
> > > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
> > >
> > > I think both ptrace and prctl are for completely different use cases
> > > and it would be an abuse of existing API to reuse them for uretprobe
> > > tracing. Also, keep in mind, that any extra argument that has to be
> > > passed into this syscall means that we need to complicate and slow
> > > generated assembly code that is injected into user process (to
> > > save/restore registers) and also kernel-side (again, to deal with all
> > > the extra registers that would be stored/restored on stack).
> > >
> > > Given syscalls are not some kind of scarce resources, what's the
> > > downside to have a dedicated and simple syscall?
> >
> > Syscalls are explicitly exposed to user space, thus, even if it is used
> > ONLY for a very specific situation, it is an official kernel interface,
> > and need to care about the compatibility. (If it causes SIGILL unless
> > a specific use case, I don't know there is a "compatibility".)
>
> Check rt_sigreturn syscall (manpage at [0], for example).
>
> sigreturn() exists only to allow the implementation of signal
> handlers. It should never be called directly. (Indeed, a simple
> sigreturn() wrapper in the GNU C library simply returns -1, with
> errno set to ENOSYS.) Details of the arguments (if any) passed
> to sigreturn() vary depending on the architecture. (On some
> architectures, such as x86-64, sigreturn() takes no arguments,
> since all of the information that it requires is available in the
> stack frame that was previously created by the kernel on the
> user-space stack.)
>
> This is a very similar use case. Also, check its source code in
> arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
> on any sign of something not being right. It's exactly the same with
> sys_uretprobe.
>
> [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html
Thanks for a good example.
Hm, in the case of rt_sigreturn, it has no other way to do it so it
needs to use syscall. OTOH, sys_uretprobe is only for performance
optimization, and the performance may depend on the architecture.
> > And the number of syscalls are limited resource.
>
> We have almost 500 of them, it didn't seems like adding 1-2 for good
> reasons would be a problem. Can you please point to where the limits
> on syscalls as a resource are described? I'm curious to learn.
Syscall table is compiled as a fixed array, so if we increase
the number, we need more tables. Of course this just increase 1 entry
and at least for x86 we already allocated bigger table, so it is OK.
But I'm just afraid if we can add more syscalls without any clear
rules, we may fill the tables with more specific syscalls.
Ah, we also should follow this document.
https://docs.kernel.org/process/adding-syscalls.html
Let me Cc [email protected].
> >
> > I'm actually not sure how much we need to care of it, but adding a new
> > syscall is worth to be discussed carefully because all of them are
> > user-space compatibility.
>
> Absolutely, it's a good discussion to have.
Thanks, if this is discussed enough and agreed from other maintainers,
I can safely pick this on my tree.
>
> >
> > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > > >
> > > > > right, I'll check on syzkaller
> > > > >
> > > > > > set in the user function, what happen if the user function directly
> > > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > >
> > > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > > the current task, or it will trigger uretprobe if there's one pending
> > > >
> > > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > > user program, it should return appropriate error code instead of SIGILL.
> > > >
> > >
> > > This is the way it is today with uretprobes even through interrupt.
> >
> > I doubt that the interrupt (exception) and syscall should be handled
> > differently. Especially, this exception is injected by uprobes but
> > syscall will be caused by itself. But syscall can be called from user
> > program (of couse this works as sys_kill(self, SIGILL)).
>
> Yep, I'd keep the behavior the same between uretprobes implemented
> through int3 and sys_uretprobe.
OK, so this syscall is something like coding int3 without debugger.
> >
> > > E.g., it could happen that user process is using fibers and is
> > > replacing stack pointer without kernel realizing this, which will
> > > trigger some defensive checks in uretprobe handling code and kernel
> > > will send SIGILL because it can't support such cases. This is
> > > happening today already, and it works fine in practice (except for
> > > applications that manually change stack pointer, too bad, you can't
> > > trace them with uretprobes, unfortunately).
> >
> > OK, we at least need to document it.
>
> +1, yep
Can we make this syscall and uprobe behavior clearer? As you said, if
the application use sigreturn or longjump, it may skip returns and
shadow stack entries are left in the kernel. In such cases, can uretprobe
detect it properly, or just crash the process (or process runs wrongly)?
>
> >
> > >
> > > So I think it's absolutely adequate to have this behavior if the user
> > > process is *intentionally* abusing this API.
> >
> > Of course user expected that it is abusing. So at least we need to
> > add a document that this syscall number is reserved to uprobes and
> > user program must not use it.
> >
>
> Totally agree about documenting this.
>
> > >
> > > > >
> > > > > but we could limit the syscall to be executed just from the trampoline,
> > > > > that should prevent all the user space use cases, I'll do that in next
> > > > > version and add more tests for that
> > > >
> > > > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > > > only from the trampoline, so it is natural to check the caller address.
> > > > (and uprobe should know where is the trampoline)
> > > >
> > > > Since the syscall is always exposed to the user program, it should
> > > > - Do nothing and return an error unless it is properly called.
> > > > - check the prerequisites for operation strictly.
> > > > I concern that new system calls introduce vulnerabilities.
> > > >
> > >
> > > As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> > > processes, only the process that is abusing the API. So any extra
> > > checks that would slow down this approach is an unnecessary overhead
> > > and complication that will never be useful in practice.
> >
> > I think at least it should check the caller address to ensure the
> > address is in the trampoline.
> > But anyway, uprobes itself can break the target process, so no one
> > might care if this system call breaks the process now.
>
> If we already have an expected range of addresses, then I think it's
> fine to do a quick unlikely() check. I'd be more concerned if we need
> to do another lookup or search to just validate this. I'm sure Jiri
> will figure it out.
Good.
>
> >
> > >
> > > Also note that sys_uretprobe is a kind of internal and unstable API
> > > and it is explicitly called out that its contract can change at any
> > > time and user space shouldn't rely on it. It's purely for the kernel's
> > > own usage.
> >
> > Is that OK to use a syscall as "internal" and "unstable" API?
>
> See above about rt_sigreturn. It seems like yes, for some highly
> specialized syscalls it is the case already.
OK, but as I said it is just for performance optimization, that is
a bit different from rt_sigreturn case.
Thank you,
> >
> > >
> > > So let's please keep it fast and simple.
> > >
> > >
> > > > Thank you,
> > > >
> > > >
> > > > >
> > > > > thanks,
> > > > > jirka
> > > > >
> > > > >
> > > > > >
> > >
> > > [...]
> >
> >
> > ([OT] If we can add syscall so casually, I would like to add sys_traceevent
> > for recording user space events :-) .)
>
> Have you proposed this upstream? :) I have no clue and no opinion about it...
>
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>
--
Masami Hiramatsu (Google) <[email protected]>
On 04/05, Masami Hiramatsu wrote:
>
> Can we make this syscall and uprobe behavior clearer? As you said, if
> the application use sigreturn or longjump, it may skip returns and
> shadow stack entries are left in the kernel. In such cases, can uretprobe
> detect it properly, or just crash the process (or process runs wrongly)?
Please see the comment in handle_trampoline(), it tries to detect this case.
This patch should not make any difference.
Oleg.
On Thu, 4 Apr 2024 13:58:43 +0200
Jiri Olsa <[email protected]> wrote:
> On Wed, Apr 03, 2024 at 07:00:07PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > Check rt_sigreturn syscall (manpage at [0], for example).
> >
> > sigreturn() exists only to allow the implementation of signal
> > handlers. It should never be called directly. (Indeed, a simple
> > sigreturn() wrapper in the GNU C library simply returns -1, with
> > errno set to ENOSYS.) Details of the arguments (if any) passed
> > to sigreturn() vary depending on the architecture. (On some
> > architectures, such as x86-64, sigreturn() takes no arguments,
> > since all of the information that it requires is available in the
> > stack frame that was previously created by the kernel on the
> > user-space stack.)
> >
> > This is a very similar use case. Also, check its source code in
> > arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
> > on any sign of something not being right. It's exactly the same with
> > sys_uretprobe.
> >
> > [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html
> >
> > > And the number of syscalls are limited resource.
> >
> > We have almost 500 of them, it didn't seems like adding 1-2 for good
> > reasons would be a problem. Can you please point to where the limits
> > on syscalls as a resource are described? I'm curious to learn.
> >
> > >
> > > I'm actually not sure how much we need to care of it, but adding a new
> > > syscall is worth to be discussed carefully because all of them are
> > > user-space compatibility.
> >
> > Absolutely, it's a good discussion to have.
> >
> > >
> > > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > > > >
> > > > > > right, I'll check on syzkaller
> > > > > >
> > > > > > > set in the user function, what happen if the user function directly
> > > > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > > >
> > > > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > > > the current task, or it will trigger uretprobe if there's one pending
> > > > >
> > > > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > > > user program, it should return appropriate error code instead of SIGILL.
> > > > >
> > > >
> > > > This is the way it is today with uretprobes even through interrupt.
> > >
> > > I doubt that the interrupt (exception) and syscall should be handled
> > > differently. Especially, this exception is injected by uprobes but
> > > syscall will be caused by itself. But syscall can be called from user
> > > program (of couse this works as sys_kill(self, SIGILL)).
> >
> > Yep, I'd keep the behavior the same between uretprobes implemented
> > through int3 and sys_uretprobe.
>
> +1
>
> >
> > >
> > > > E.g., it could happen that user process is using fibers and is
> > > > replacing stack pointer without kernel realizing this, which will
> > > > trigger some defensive checks in uretprobe handling code and kernel
> > > > will send SIGILL because it can't support such cases. This is
> > > > happening today already, and it works fine in practice (except for
> > > > applications that manually change stack pointer, too bad, you can't
> > > > trace them with uretprobes, unfortunately).
> > >
> > > OK, we at least need to document it.
> >
> > +1, yep
> >
> > >
> > > >
> > > > So I think it's absolutely adequate to have this behavior if the user
> > > > process is *intentionally* abusing this API.
> > >
> > > Of course user expected that it is abusing. So at least we need to
> > > add a document that this syscall number is reserved to uprobes and
> > > user program must not use it.
> > >
> >
> > Totally agree about documenting this.
>
> ok there's map page on sigreturn.. do you think we should add man page
> for uretprobe or you can think of some other place to document it?
I think it is better to have a man-page. Anyway, to discuss and explain
this syscall, the man-page is a good format to describe it.
>
> >
> > > >
> > > > > >
> > > > > > but we could limit the syscall to be executed just from the trampoline,
> > > > > > that should prevent all the user space use cases, I'll do that in next
> > > > > > version and add more tests for that
> > > > >
> > > > > Why not limit? :) The uprobe_handle_trampoline() expects it is called
> > > > > only from the trampoline, so it is natural to check the caller address.
> > > > > (and uprobe should know where is the trampoline)
> > > > >
> > > > > Since the syscall is always exposed to the user program, it should
> > > > > - Do nothing and return an error unless it is properly called.
> > > > > - check the prerequisites for operation strictly.
> > > > > I concern that new system calls introduce vulnerabilities.
> > > > >
> > > >
> > > > As Oleg and Jiri mentioned, this syscall can't harm kernel or other
> > > > processes, only the process that is abusing the API. So any extra
> > > > checks that would slow down this approach is an unnecessary overhead
> > > > and complication that will never be useful in practice.
> > >
> > > I think at least it should check the caller address to ensure the
> > > address is in the trampoline.
> > > But anyway, uprobes itself can break the target process, so no one
> > > might care if this system call breaks the process now.
> >
> > If we already have an expected range of addresses, then I think it's
> > fine to do a quick unlikely() check. I'd be more concerned if we need
> > to do another lookup or search to just validate this. I'm sure Jiri
> > will figure it out.
>
> Oleg mentioned the trampoline address check could race with another
> thread's mremap call, however trap is using that check as well, it
> still seems like good idea to do it also in the uretprobe syscall
Yeah, and also, can we add a stack pointer check if the trampoline is
shared with other probe points?
Thank you,
>
> jirka
--
Masami Hiramatsu (Google) <[email protected]>
On Thu, 4 Apr 2024 18:11:09 +0200
Oleg Nesterov <[email protected]> wrote:
> On 04/05, Masami Hiramatsu wrote:
> >
> > Can we make this syscall and uprobe behavior clearer? As you said, if
> > the application use sigreturn or longjump, it may skip returns and
> > shadow stack entries are left in the kernel. In such cases, can uretprobe
> > detect it properly, or just crash the process (or process runs wrongly)?
>
> Please see the comment in handle_trampoline(), it tries to detect this case.
> This patch should not make any difference.
I think you mean this loop will skip and discard the stacked return_instance
to find the valid one.
----
do {
/*
* We should throw out the frames invalidated by longjmp().
* If this chain is valid, then the next one should be alive
* or NULL; the latter case means that nobody but ri->func
* could hit this trampoline on return. TODO: sigaltstack().
*/
next = find_next_ret_chain(ri);
valid = !next || arch_uretprobe_is_alive(next, RP_CHECK_RET, regs);
instruction_pointer_set(regs, ri->orig_ret_vaddr);
do {
if (valid)
handle_uretprobe_chain(ri, regs);
ri = free_ret_instance(ri);
utask->depth--;
} while (ri != next);
} while (!valid);
----
I think this expects setjmp/longjmp as below
foo() { <- retprobe1
setjmp()
bar() { <- retprobe2
longjmp()
}
} <- return to trampoline
In this case, we need to skip retprobe2's instance.
My concern is, if we can not find appropriate return instance, what happen?
e.g.
foo() { <-- retprobe1
bar() { # sp is decremented
sys_uretprobe() <-- ??
}
}
It seems sys_uretprobe() will handle retprobe1 at that point instead of
SIGILL.
Can we avoid this with below strict check?
if (ri->stack != regs->sp + expected_offset)
goto sigill;
expected_offset should be 16 (push * 3 - ret) on x64 if we ri->stack is the
regs->sp right after call.
Thank you,
--
Masami Hiramatsu (Google) <[email protected]>
On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> On Thu, 4 Apr 2024 18:11:09 +0200
> Oleg Nesterov <[email protected]> wrote:
>
> > On 04/05, Masami Hiramatsu wrote:
> > >
> > > Can we make this syscall and uprobe behavior clearer? As you said, if
> > > the application use sigreturn or longjump, it may skip returns and
> > > shadow stack entries are left in the kernel. In such cases, can uretprobe
> > > detect it properly, or just crash the process (or process runs wrongly)?
> >
> > Please see the comment in handle_trampoline(), it tries to detect this case.
> > This patch should not make any difference.
>
> I think you mean this loop will skip and discard the stacked return_instance
> to find the valid one.
>
> ----
> do {
> /*
> * We should throw out the frames invalidated by longjmp().
> * If this chain is valid, then the next one should be alive
> * or NULL; the latter case means that nobody but ri->func
> * could hit this trampoline on return. TODO: sigaltstack().
> */
> next = find_next_ret_chain(ri);
> valid = !next || arch_uretprobe_is_alive(next, RP_CHECK_RET, regs);
>
> instruction_pointer_set(regs, ri->orig_ret_vaddr);
> do {
> if (valid)
> handle_uretprobe_chain(ri, regs);
> ri = free_ret_instance(ri);
> utask->depth--;
> } while (ri != next);
> } while (!valid);
> ----
>
> I think this expects setjmp/longjmp as below
>
> foo() { <- retprobe1
> setjmp()
> bar() { <- retprobe2
> longjmp()
> }
> } <- return to trampoline
>
> In this case, we need to skip retprobe2's instance.
> My concern is, if we can not find appropriate return instance, what happen?
> e.g.
>
> foo() { <-- retprobe1
> bar() { # sp is decremented
> sys_uretprobe() <-- ??
> }
> }
>
> It seems sys_uretprobe() will handle retprobe1 at that point instead of
> SIGILL.
yes, and I think it's fine, you get the consumer called in wrong place,
but it's your fault and kernel won't crash
this can be fixed by checking the syscall is called from the trampoline
and prevent handle_trampoline call if it's not
>
> Can we avoid this with below strict check?
>
> if (ri->stack != regs->sp + expected_offset)
> goto sigill;
hm the current uprobe 'alive' check makes sure the return_instance is above
or at the same stack address, not sure we can match it exactly, need to think
about that more
>
> expected_offset should be 16 (push * 3 - ret) on x64 if we ri->stack is the
> regs->sp right after call.
the syscall trampoline already updates the regs->sp before calling
handle_trampoline
regs->sp += sizeof(r11_cx_ax);
jirka
On 04/05, Jiri Olsa wrote:
>
> On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> >
> > I think this expects setjmp/longjmp as below
> >
> > foo() { <- retprobe1
> > setjmp()
> > bar() { <- retprobe2
> > longjmp()
> > }
> > } <- return to trampoline
> >
> > In this case, we need to skip retprobe2's instance.
Yes,
> > My concern is, if we can not find appropriate return instance, what happen?
> > e.g.
> >
> > foo() { <-- retprobe1
> > bar() { # sp is decremented
> > sys_uretprobe() <-- ??
> > }
> > }
> >
> > It seems sys_uretprobe() will handle retprobe1 at that point instead of
> > SIGILL.
>
> yes, and I think it's fine, you get the consumer called in wrong place,
> but it's your fault and kernel won't crash
Agreed.
With or without this patch userpace can also do
foo() { <-- retprobe1
bar() {
jump to xol_area
}
}
handle_trampoline() will handle retprobe1.
> this can be fixed by checking the syscall is called from the trampoline
> and prevent handle_trampoline call if it's not
Yes, but I still do not think this makes a lot of sense. But I won't argue.
And what should sys_uretprobe() do if it is not called from the trampoline?
I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
I agree very much with Andrii,
sigreturn() exists only to allow the implementation of signal handlers. It should never be
called directly. Details of the arguments (if any) passed to sigreturn() vary depending on
the architecture.
this is how sys_uretprobe() should be treated/documented.
sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
and return -EINVAL if this addr is not valid. But why?
Oleg.
On Fri, 5 Apr 2024 13:02:30 +0200
Oleg Nesterov <[email protected]> wrote:
> On 04/05, Jiri Olsa wrote:
> >
> > On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> > >
> > > I think this expects setjmp/longjmp as below
> > >
> > > foo() { <- retprobe1
> > > setjmp()
> > > bar() { <- retprobe2
> > > longjmp()
> > > }
> > > } <- return to trampoline
> > >
> > > In this case, we need to skip retprobe2's instance.
>
> Yes,
>
> > > My concern is, if we can not find appropriate return instance, what happen?
> > > e.g.
> > >
> > > foo() { <-- retprobe1
> > > bar() { # sp is decremented
> > > sys_uretprobe() <-- ??
> > > }
> > > }
> > >
> > > It seems sys_uretprobe() will handle retprobe1 at that point instead of
> > > SIGILL.
> >
> > yes, and I think it's fine, you get the consumer called in wrong place,
> > but it's your fault and kernel won't crash
>
> Agreed.
>
> With or without this patch userpace can also do
>
> foo() { <-- retprobe1
> bar() {
> jump to xol_area
> }
> }
>
> handle_trampoline() will handle retprobe1.
This is OK because the execution path has been changed to trampoline, but
the above will continue running bar() after sys_uretprobe().
>
> > this can be fixed by checking the syscall is called from the trampoline
> > and prevent handle_trampoline call if it's not
>
> Yes, but I still do not think this makes a lot of sense. But I won't argue.
>
> And what should sys_uretprobe() do if it is not called from the trampoline?
> I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
>
> I agree very much with Andrii,
>
> sigreturn() exists only to allow the implementation of signal handlers. It should never be
> called directly. Details of the arguments (if any) passed to sigreturn() vary depending on
> the architecture.
>
> this is how sys_uretprobe() should be treated/documented.
OK.
>
> sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
> and return -EINVAL if this addr is not valid. But why?
Because sigreturn() never returns, but sys_uretprobe() will return.
If it changes the regs->ip and directly returns to the original return address,
I think it is natural to send SIGILL.
Thank you,
--
Masami Hiramatsu (Google) <[email protected]>
On 04/06, Masami Hiramatsu wrote:
>
> On Fri, 5 Apr 2024 13:02:30 +0200
> Oleg Nesterov <[email protected]> wrote:
>
> > With or without this patch userpace can also do
> >
> > foo() { <-- retprobe1
> > bar() {
> > jump to xol_area
> > }
> > }
> >
> > handle_trampoline() will handle retprobe1.
>
> This is OK because the execution path has been changed to trampoline,
Agreed, in this case the misuse is more clear. But please see below.
> but the above will continue running bar() after sys_uretprobe().
.. and most probably crash
> > sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
> > and return -EINVAL if this addr is not valid. But why?
>
> Because sigreturn() never returns, but sys_uretprobe() will return.
You mean, sys_uretprobe() returns to the next insn after syscall.
Almost certainly yes, but this is not necessarily true. If one of consumers
changes regs->sp sys_uretprobe() "returns" to another location, just like
sys_rt_sigreturn().
That said.
Masami, it is not that I am trying to prove that you are "wrong" ;) No.
I see your points even if I am biased, I understand that my objections are
not 100% "fair".
I am just trying to explain why, rightly or not, I care much less about the
abuse of sys_uretprobe().
Thanks!
Oleg.
On Fri, 5 Apr 2024 10:56:15 +0200
Jiri Olsa <[email protected]> wrote:
> >
> > Can we avoid this with below strict check?
> >
> > if (ri->stack != regs->sp + expected_offset)
> > goto sigill;
>
> hm the current uprobe 'alive' check makes sure the return_instance is above
> or at the same stack address, not sure we can match it exactly, need to think
> about that more
>
> >
> > expected_offset should be 16 (push * 3 - ret) on x64 if we ri->stack is the
> > regs->sp right after call.
>
> the syscall trampoline already updates the regs->sp before calling
> handle_trampoline
>
> regs->sp += sizeof(r11_cx_ax);
Yes, that is "push * 3" part. And "- ret" is that the stack entry is consumed
by the "ret", which is stored by call.
1: |--------| <- sp at function entry == ri->stack
0: |ret-addr| <- call pushed it
0: |ret-addr| <- sp at return trampoline
3: |r11 | <- regs->sp at syscall
2: |rcx |
1: |rax | <- ri->stack
0: |ret-addr|
(Note: The lower the line, the larger the address.)
Thus, we can check the stack address by (regs->sp + 16 == ri->stack).
Thank you,
--
Masami Hiramatsu (Google) <[email protected]>
On Sat, 6 Apr 2024 19:55:59 +0200
Oleg Nesterov <[email protected]> wrote:
> On 04/06, Masami Hiramatsu wrote:
> >
> > On Fri, 5 Apr 2024 13:02:30 +0200
> > Oleg Nesterov <[email protected]> wrote:
> >
> > > With or without this patch userpace can also do
> > >
> > > foo() { <-- retprobe1
> > > bar() {
> > > jump to xol_area
> > > }
> > > }
> > >
> > > handle_trampoline() will handle retprobe1.
> >
> > This is OK because the execution path has been changed to trampoline,
>
> Agreed, in this case the misuse is more clear. But please see below.
>
> > but the above will continue running bar() after sys_uretprobe().
>
> .. and most probably crash
Yes, unless it returns with longjmp(). (but this is rare case and
maybe malicious program.)
>
> > > sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
> > > and return -EINVAL if this addr is not valid. But why?
> >
> > Because sigreturn() never returns, but sys_uretprobe() will return.
>
> You mean, sys_uretprobe() returns to the next insn after syscall.
>
> Almost certainly yes, but this is not necessarily true. If one of consumers
> changes regs->sp sys_uretprobe() "returns" to another location, just like
> sys_rt_sigreturn().
>
> That said.
>
> Masami, it is not that I am trying to prove that you are "wrong" ;) No.
>
> I see your points even if I am biased, I understand that my objections are
> not 100% "fair".
>
> I am just trying to explain why, rightly or not, I care much less about the
> abuse of sys_uretprobe().
I would like to clear that the abuse of this syscall will not possible to harm
the normal programs, and even if it is used by malicious code (e.g. injected by
stack overflow) it doesn't cause a problem. At least thsese points are cleared,
and documented. it is easier to push it as new Linux API.
Thank you,
>
> Thanks!
>
> Oleg.
>
>
--
Masami Hiramatsu (Google) <[email protected]>
On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote:
> On 04/05, Jiri Olsa wrote:
> >
> > On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> > >
> > > I think this expects setjmp/longjmp as below
> > >
> > > foo() { <- retprobe1
> > > setjmp()
> > > bar() { <- retprobe2
> > > longjmp()
> > > }
> > > } <- return to trampoline
> > >
> > > In this case, we need to skip retprobe2's instance.
>
> Yes,
>
> > > My concern is, if we can not find appropriate return instance, what happen?
> > > e.g.
> > >
> > > foo() { <-- retprobe1
> > > bar() { # sp is decremented
> > > sys_uretprobe() <-- ??
> > > }
> > > }
> > >
> > > It seems sys_uretprobe() will handle retprobe1 at that point instead of
> > > SIGILL.
> >
> > yes, and I think it's fine, you get the consumer called in wrong place,
> > but it's your fault and kernel won't crash
>
> Agreed.
>
> With or without this patch userpace can also do
>
> foo() { <-- retprobe1
> bar() {
> jump to xol_area
> }
> }
>
> handle_trampoline() will handle retprobe1.
>
> > this can be fixed by checking the syscall is called from the trampoline
> > and prevent handle_trampoline call if it's not
>
> Yes, but I still do not think this makes a lot of sense. But I won't argue.
>
> And what should sys_uretprobe() do if it is not called from the trampoline?
> I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
so the similar behaviour with int3 ends up with immediate SIGTRAP
and not invoking pending uretprobe consumers, like:
- setup uretprobe for foo
- foo() {
executes int 3 -> sends SIGTRAP
}
because the int3 handler checks if it got executed from the uretprobe's
trampoline.. if not it treats that int3 as regular trap
while for uretprobe syscall we have at the moment following behaviour:
- setup uretprobe for foo
- foo() {
uretprobe_syscall -> executes foo's uretprobe consumers
}
- at some point we get to the 'ret' instruction that jump into uretprobe
trampoline and the uretprobe_syscall won't find pending uretprobe and
will send SIGILL
so I think we should mimic int3 behaviour and:
- setup uretprobe for foo
- foo() {
uretprobe_syscall -> check if we got executed from uretprobe's
trampoline and send SIGILL if that's not the case
I think it's better to have the offending process killed right away,
rather than having more undefined behaviour, waiting for final 'ret'
instruction that jumps to uretprobe trampoline and causes SIGILL
>
> I agree very much with Andrii,
>
> sigreturn() exists only to allow the implementation of signal handlers. It should never be
> called directly. Details of the arguments (if any) passed to sigreturn() vary depending on
> the architecture.
>
> this is how sys_uretprobe() should be treated/documented.
yes, will include man page patch in new version
jirka
>
> sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
> and return -EINVAL if this addr is not valid. But why?
>
> Oleg.
>
On 04/08, Jiri Olsa wrote:
>
> On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote:
> >
> > And what should sys_uretprobe() do if it is not called from the trampoline?
> > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
>
> so the similar behaviour with int3 ends up with immediate SIGTRAP
> and not invoking pending uretprobe consumers, like:
>
> - setup uretprobe for foo
> - foo() {
> executes int 3 -> sends SIGTRAP
> }
>
> because the int3 handler checks if it got executed from the uretprobe's
> trampoline.
.. or the task has uprobe at this address
> if not it treats that int3 as regular trap
Yes this mimics the "default" behaviour without uprobes/uretprobes
> so I think we should mimic int3 behaviour and:
>
> - setup uretprobe for foo
> - foo() {
> uretprobe_syscall -> check if we got executed from uretprobe's
> trampoline and send SIGILL if that's not the case
Agreed,
> I think it's better to have the offending process killed right away,
> rather than having more undefined behaviour, waiting for final 'ret'
> instruction that jumps to uretprobe trampoline and causes SIGILL
Agreed. In fact I think it should be also killed if copy_to/from_user()
fails by the same reason.
Oleg.
On Mon, 8 Apr 2024 18:02:13 +0200
Jiri Olsa <[email protected]> wrote:
> On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote:
> > On 04/05, Jiri Olsa wrote:
> > >
> > > On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> > > >
> > > > I think this expects setjmp/longjmp as below
> > > >
> > > > foo() { <- retprobe1
> > > > setjmp()
> > > > bar() { <- retprobe2
> > > > longjmp()
> > > > }
> > > > } <- return to trampoline
> > > >
> > > > In this case, we need to skip retprobe2's instance.
> >
> > Yes,
> >
> > > > My concern is, if we can not find appropriate return instance, what happen?
> > > > e.g.
> > > >
> > > > foo() { <-- retprobe1
> > > > bar() { # sp is decremented
> > > > sys_uretprobe() <-- ??
> > > > }
> > > > }
> > > >
> > > > It seems sys_uretprobe() will handle retprobe1 at that point instead of
> > > > SIGILL.
> > >
> > > yes, and I think it's fine, you get the consumer called in wrong place,
> > > but it's your fault and kernel won't crash
> >
> > Agreed.
> >
> > With or without this patch userpace can also do
> >
> > foo() { <-- retprobe1
> > bar() {
> > jump to xol_area
> > }
> > }
> >
> > handle_trampoline() will handle retprobe1.
> >
> > > this can be fixed by checking the syscall is called from the trampoline
> > > and prevent handle_trampoline call if it's not
> >
> > Yes, but I still do not think this makes a lot of sense. But I won't argue.
> >
> > And what should sys_uretprobe() do if it is not called from the trampoline?
> > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
>
> so the similar behaviour with int3 ends up with immediate SIGTRAP
> and not invoking pending uretprobe consumers, like:
>
> - setup uretprobe for foo
> - foo() {
> executes int 3 -> sends SIGTRAP
> }
>
> because the int3 handler checks if it got executed from the uretprobe's
> trampoline.. if not it treats that int3 as regular trap
Yeah, that is consistent behavior. Sounds good to me.
>
> while for uretprobe syscall we have at the moment following behaviour:
>
> - setup uretprobe for foo
> - foo() {
> uretprobe_syscall -> executes foo's uretprobe consumers
> }
> - at some point we get to the 'ret' instruction that jump into uretprobe
> trampoline and the uretprobe_syscall won't find pending uretprobe and
> will send SIGILL
>
>
> so I think we should mimic int3 behaviour and:
>
> - setup uretprobe for foo
> - foo() {
> uretprobe_syscall -> check if we got executed from uretprobe's
> trampoline and send SIGILL if that's not the case
OK, this looks good to me.
>
> I think it's better to have the offending process killed right away,
> rather than having more undefined behaviour, waiting for final 'ret'
> instruction that jumps to uretprobe trampoline and causes SIGILL
>
> >
> > I agree very much with Andrii,
> >
> > sigreturn() exists only to allow the implementation of signal handlers. It should never be
> > called directly. Details of the arguments (if any) passed to sigreturn() vary depending on
> > the architecture.
> >
> > this is how sys_uretprobe() should be treated/documented.
>
> yes, will include man page patch in new version
And please follow Documentation/process/adding-syscalls.rst in new version,
then we can avoid repeating the same discussion :-)
Thank you!
>
> jirka
>
> >
> > sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
> > and return -EINVAL if this addr is not valid. But why?
> >
> > Oleg.
> >
--
Masami Hiramatsu (Google) <[email protected]>
On Tue, Apr 09, 2024 at 09:34:39AM +0900, Masami Hiramatsu wrote:
SNIP
> > >
> > > > this can be fixed by checking the syscall is called from the trampoline
> > > > and prevent handle_trampoline call if it's not
> > >
> > > Yes, but I still do not think this makes a lot of sense. But I won't argue.
> > >
> > > And what should sys_uretprobe() do if it is not called from the trampoline?
> > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
> >
> > so the similar behaviour with int3 ends up with immediate SIGTRAP
> > and not invoking pending uretprobe consumers, like:
> >
> > - setup uretprobe for foo
> > - foo() {
> > executes int 3 -> sends SIGTRAP
> > }
> >
> > because the int3 handler checks if it got executed from the uretprobe's
> > trampoline.. if not it treats that int3 as regular trap
>
> Yeah, that is consistent behavior. Sounds good to me.
>
> >
> > while for uretprobe syscall we have at the moment following behaviour:
> >
> > - setup uretprobe for foo
> > - foo() {
> > uretprobe_syscall -> executes foo's uretprobe consumers
> > }
> > - at some point we get to the 'ret' instruction that jump into uretprobe
> > trampoline and the uretprobe_syscall won't find pending uretprobe and
> > will send SIGILL
> >
> >
> > so I think we should mimic int3 behaviour and:
> >
> > - setup uretprobe for foo
> > - foo() {
> > uretprobe_syscall -> check if we got executed from uretprobe's
> > trampoline and send SIGILL if that's not the case
>
> OK, this looks good to me.
>
> >
> > I think it's better to have the offending process killed right away,
> > rather than having more undefined behaviour, waiting for final 'ret'
> > instruction that jumps to uretprobe trampoline and causes SIGILL
> >
> > >
> > > I agree very much with Andrii,
> > >
> > > sigreturn() exists only to allow the implementation of signal handlers. It should never be
> > > called directly. Details of the arguments (if any) passed to sigreturn() vary depending on
> > > the architecture.
> > >
> > > this is how sys_uretprobe() should be treated/documented.
> >
> > yes, will include man page patch in new version
>
> And please follow Documentation/process/adding-syscalls.rst in new version,
> then we can avoid repeating the same discussion :-)
yep, will do
thanks,
jirka
On Mon, Apr 08, 2024 at 06:22:59PM +0200, Oleg Nesterov wrote:
> On 04/08, Jiri Olsa wrote:
> >
> > On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote:
> > >
> > > And what should sys_uretprobe() do if it is not called from the trampoline?
> > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
> >
> > so the similar behaviour with int3 ends up with immediate SIGTRAP
> > and not invoking pending uretprobe consumers, like:
> >
> > - setup uretprobe for foo
> > - foo() {
> > executes int 3 -> sends SIGTRAP
> > }
> >
> > because the int3 handler checks if it got executed from the uretprobe's
> > trampoline.
>
> ... or the task has uprobe at this address
>
> > if not it treats that int3 as regular trap
>
> Yes this mimics the "default" behaviour without uprobes/uretprobes
>
> > so I think we should mimic int3 behaviour and:
> >
> > - setup uretprobe for foo
> > - foo() {
> > uretprobe_syscall -> check if we got executed from uretprobe's
> > trampoline and send SIGILL if that's not the case
>
> Agreed,
>
> > I think it's better to have the offending process killed right away,
> > rather than having more undefined behaviour, waiting for final 'ret'
> > instruction that jumps to uretprobe trampoline and causes SIGILL
>
> Agreed. In fact I think it should be also killed if copy_to/from_user()
> fails by the same reason.
+1 makes sense
jirka
>
> Oleg.
>
On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote:
SNIP
> #include <linux/kdebug.h>
> #include <asm/processor.h>
> @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
> }
>
> #ifdef CONFIG_X86_64
> +
> +asm (
> + ".pushsection .rodata\n"
> + ".global uretprobe_syscall_entry\n"
> + "uretprobe_syscall_entry:\n"
> + "pushq %rax\n"
> + "pushq %rcx\n"
> + "pushq %r11\n"
> + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> + "syscall\n"
> + "popq %r11\n"
> + "popq %rcx\n"
> +
> + /* The uretprobe syscall replaces stored %rax value with final
> + * return address, so we don't restore %rax in here and just
> + * call ret.
> + */
> + "retq\n"
> + ".global uretprobe_syscall_end\n"
> + "uretprobe_syscall_end:\n"
> + ".popsection\n"
> +);
> +
> +extern u8 uretprobe_syscall_entry[];
> +extern u8 uretprobe_syscall_end[];
> +
> +void *arch_uprobe_trampoline(unsigned long *psize)
> +{
> + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> + return uretprobe_syscall_entry;
fyi I realized this screws 32-bit programs, we either need to add
compat trampoline, or keep the standard breakpoint for them:
+ struct pt_regs *regs = task_pt_regs(current);
+ static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+ if (user_64bit_mode(regs)) {
+ *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+ return uretprobe_syscall_entry;
+ }
+
+ *psize = UPROBE_SWBP_INSN_SIZE;
+ return &insn;
not sure it's worth the effort to add the trampoline, I'll check
jirka
On Mon, Apr 15, 2024 at 1:25 AM Jiri Olsa <[email protected]> wrote:
>
> On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote:
>
> SNIP
>
> > #include <linux/kdebug.h>
> > #include <asm/processor.h>
> > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
> > }
> >
> > #ifdef CONFIG_X86_64
> > +
> > +asm (
> > + ".pushsection .rodata\n"
> > + ".global uretprobe_syscall_entry\n"
> > + "uretprobe_syscall_entry:\n"
> > + "pushq %rax\n"
> > + "pushq %rcx\n"
> > + "pushq %r11\n"
> > + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > + "syscall\n"
> > + "popq %r11\n"
> > + "popq %rcx\n"
> > +
> > + /* The uretprobe syscall replaces stored %rax value with final
> > + * return address, so we don't restore %rax in here and just
> > + * call ret.
> > + */
> > + "retq\n"
> > + ".global uretprobe_syscall_end\n"
> > + "uretprobe_syscall_end:\n"
> > + ".popsection\n"
> > +);
> > +
> > +extern u8 uretprobe_syscall_entry[];
> > +extern u8 uretprobe_syscall_end[];
> > +
> > +void *arch_uprobe_trampoline(unsigned long *psize)
> > +{
> > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> > + return uretprobe_syscall_entry;
>
> fyi I realized this screws 32-bit programs, we either need to add
> compat trampoline, or keep the standard breakpoint for them:
>
> + struct pt_regs *regs = task_pt_regs(current);
> + static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> +
> + if (user_64bit_mode(regs)) {
> + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> + return uretprobe_syscall_entry;
> + }
> +
> + *psize = UPROBE_SWBP_INSN_SIZE;
> + return &insn;
>
>
> not sure it's worth the effort to add the trampoline, I'll check
>
32-bit arch isn't a high-performance target anyways, so I'd probably
not bother and prioritize simplicity and long term maintenance.
>
> jirka