2023-11-20 20:55:08

by Mathieu Desnoyers

Subject: [PATCH v4 0/5] Faultable Tracepoints

Wire up the system call tracepoints with Tasks Trace RCU to allow
the ftrace, perf, and eBPF tracers to handle page faults.

This series does the initial wire-up allowing tracers to handle page
faults, but leaves out the actual handling of said page faults as future
work.

I have tested this against a feature branch of lttng-modules which
implements handling of page faults for the filename argument of the
openat(2) system call.

This series is based on kernel v6.6.2.

Thanks,

Mathieu

Mathieu Desnoyers (5):
tracing: Introduce faultable tracepoints (v4)
tracing/ftrace: Add support for faultable tracepoints
tracing/bpf-trace: add support for faultable tracepoints
tracing/perf: add support for faultable tracepoints
tracing: convert sys_enter/exit to faultable tracepoints

Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Cc: Joel Fernandes <[email protected]>

include/linux/tracepoint-defs.h | 14 +++++
include/linux/tracepoint.h | 88 ++++++++++++++++++++++---------
include/trace/bpf_probe.h | 21 ++++++--
include/trace/define_trace.h | 7 +++
include/trace/events/syscalls.h | 4 +-
include/trace/perf.h | 27 ++++++++--
include/trace/trace_events.h | 73 ++++++++++++++++++++++++--
init/Kconfig | 1 +
kernel/trace/bpf_trace.c | 10 +++-
kernel/trace/trace_events.c | 26 +++++++---
kernel/trace/trace_fprobe.c | 5 +-
kernel/trace/trace_syscalls.c | 92 +++++++++++++++++++++++----------
kernel/tracepoint.c | 58 +++++++++++----------
13 files changed, 325 insertions(+), 101 deletions(-)

--
2.25.1


2023-11-20 20:55:11

by Mathieu Desnoyers

Subject: [PATCH v4 5/5] tracing: convert sys_enter/exit to faultable tracepoints

Convert the definition of the system call enter/exit tracepoints to
faultable tracepoints now that all upstream tracers handle it.

This allows tracers to fault-in userspace system call arguments such as
path strings within their probe callbacks.
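
For illustration only (not part of this patch), a probe registered with
TRACEPOINT_MAY_FAULT could fault in the openat(2) filename roughly as
follows; the probe name and buffer handling here are hypothetical:

static void my_sys_enter_probe(void *data, struct pt_regs *regs, long id)
{
	unsigned long args[6];
	char path[256];

	if (id != __NR_openat)
		return;
	syscall_get_arguments(current, regs, args);
	/* args[1] is the userspace filename pointer; copying it may fault. */
	if (strncpy_from_user(path, (const char __user *)args[1],
			      sizeof(path)) > 0) {
		/* hand 'path' to the tracer's output buffer */
	}
}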

Link: https://lore.kernel.org/lkml/[email protected]/
Co-developed-by: Michael Jeanson <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Cc: Joel Fernandes <[email protected]>
---
include/trace/events/syscalls.h | 4 +-
kernel/trace/trace_syscalls.c | 92 +++++++++++++++++++++++----------
2 files changed, 68 insertions(+), 28 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..dc30e3004818 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@

#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS

-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_FN_MAY_FAULT(sys_enter,

TP_PROTO(struct pt_regs *regs, long id),

@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,

TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)

-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_FN_MAY_FAULT(sys_exit,

TP_PROTO(struct pt_regs *regs, long ret),

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index de753403cdaf..718a0723a0bc 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -299,27 +299,33 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
int syscall_nr;
int size;

+ /*
+ * Probe called with preemption enabled (may_fault), but ring buffer and
+ * per-cpu data require preemption to be disabled.
+ */
+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;

/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
if (!trace_file)
- return;
+ goto end;

if (trace_trigger_soft_disabled(trace_file))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;

entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
if (!entry)
- return;
+ goto end;

entry = ring_buffer_event_data(fbuffer.event);
entry->nr = syscall_nr;
@@ -327,6 +333,8 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);

trace_event_buffer_commit(&fbuffer);
+end:
+ preempt_enable_notrace();
}

static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
@@ -338,31 +346,39 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
struct trace_event_buffer fbuffer;
int syscall_nr;

+ /*
+ * Probe called with preemption enabled (may_fault), but ring buffer and
+ * per-cpu data require preemption to be disabled.
+ */
+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;

/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
if (!trace_file)
- return;
+ goto end;

if (trace_trigger_soft_disabled(trace_file))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
if (!entry)
- return;
+ goto end;

entry = ring_buffer_event_data(fbuffer.event);
entry->nr = syscall_nr;
entry->ret = syscall_get_return_value(current, regs);

trace_event_buffer_commit(&fbuffer);
+end:
+ preempt_enable_notrace();
}

static int reg_event_syscall_enter(struct trace_event_file *file,
@@ -377,7 +393,9 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_enter)
- ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+ ret = register_trace_prio_flags_sys_enter(ftrace_syscall_enter, tr,
+ TRACEPOINT_DEFAULT_PRIO,
+ TRACEPOINT_MAY_FAULT);
if (!ret) {
rcu_assign_pointer(tr->enter_syscall_files[num], file);
tr->sys_refcount_enter++;
@@ -415,7 +433,9 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_exit)
- ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+ ret = register_trace_prio_flags_sys_exit(ftrace_syscall_exit, tr,
+ TRACEPOINT_DEFAULT_PRIO,
+ TRACEPOINT_MAY_FAULT);
if (!ret) {
rcu_assign_pointer(tr->exit_syscall_files[num], file);
tr->sys_refcount_exit++;
@@ -582,20 +602,26 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
int rctx;
int size;

+ /*
+ * Probe called with preemption enabled (may_fault), but ring buffer and
+ * per-cpu data require preemption to be disabled.
+ */
+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;
if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

head = this_cpu_ptr(sys_data->enter_event->perf_events);
valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
if (!valid_prog_array && hlist_empty(head))
- return;
+ goto end;

/* get the size after alignment with the u32 buffer size field */
size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
@@ -604,7 +630,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)

rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
- return;
+ goto end;

rec->nr = syscall_nr;
syscall_get_arguments(current, regs, args);
@@ -614,12 +640,14 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
!perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
- return;
+ goto end;
}

perf_trace_buf_submit(rec, size, rctx,
sys_data->enter_event->event.type, 1, regs,
head, NULL);
+end:
+ preempt_enable_notrace();
}

static int perf_sysenter_enable(struct trace_event_call *call)
@@ -631,7 +659,9 @@ static int perf_sysenter_enable(struct trace_event_call *call)

mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_enter)
- ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+ ret = register_trace_prio_flags_sys_enter(perf_syscall_enter, NULL,
+ TRACEPOINT_DEFAULT_PRIO,
+ TRACEPOINT_MAY_FAULT);
if (ret) {
pr_info("event trace: Could not activate syscall entry trace point");
} else {
@@ -682,20 +712,26 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
int rctx;
int size;

+ /*
+ * Probe called with preemption enabled (may_fault), but ring buffer and
+ * per-cpu data require preemption to be disabled.
+ */
+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;
if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

head = this_cpu_ptr(sys_data->exit_event->perf_events);
valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
if (!valid_prog_array && hlist_empty(head))
- return;
+ goto end;

/* We can probably do that at build time */
size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
@@ -703,7 +739,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)

rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
- return;
+ goto end;

rec->nr = syscall_nr;
rec->ret = syscall_get_return_value(current, regs);
@@ -712,11 +748,13 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
!perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
- return;
+ goto end;
}

perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
1, regs, head, NULL);
+end:
+ preempt_enable_notrace();
}

static int perf_sysexit_enable(struct trace_event_call *call)
@@ -728,7 +766,9 @@ static int perf_sysexit_enable(struct trace_event_call *call)

mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_exit)
- ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+ ret = register_trace_prio_flags_sys_exit(perf_syscall_exit, NULL,
+ TRACEPOINT_DEFAULT_PRIO,
+ TRACEPOINT_MAY_FAULT);
if (ret) {
pr_info("event trace: Could not activate syscall exit trace point");
} else {
--
2.25.1

2023-11-20 20:55:45

by Mathieu Desnoyers

Subject: [PATCH v4 3/5] tracing/bpf-trace: add support for faultable tracepoints

In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that bpf can handle registering to
such tracepoints by explicitly disabling preemption within the bpf
tracepoint probes to respect the current expectations within bpf tracing
code.

This change does not yet allow bpf to take page faults per se within its
probe, but allows its existing probes to connect to faultable
tracepoints.
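
As a rough sketch (simplified), the probe that __BPF_DECLARE_TRACE()
generates for a may-fault event such as sys_enter (converted later in
this series) then looks like:

static notrace void __bpf_trace_sys_enter(void *__data, struct pt_regs *regs, long id)
{
	struct bpf_prog *prog = __data;

	might_fault();			/* faultable tracepoint call site */
	preempt_disable_notrace();	/* bpf_trace_run*() still expects preemption off */
	bpf_trace_run2(prog, (u64)(unsigned long)regs, (u64)id);
	preempt_enable_notrace();
}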

Link: https://lore.kernel.org/lkml/[email protected]/
Co-developed-by: Michael Jeanson <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Cc: Joel Fernandes <[email protected]>
---
include/trace/bpf_probe.h | 21 +++++++++++++++++----
kernel/trace/bpf_trace.c | 11 ++++++++---
2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index e609cd7da47e..dd4b42715e5d 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -42,17 +42,30 @@
/* tracepoints with more than 12 arguments will hit build error */
#define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)

-#define __BPF_DECLARE_TRACE(call, proto, args) \
+#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags) \
static notrace void \
__bpf_trace_##call(void *__data, proto) \
{ \
struct bpf_prog *prog = __data; \
+ \
+ if ((tp_flags) & TRACEPOINT_MAY_FAULT) { \
+ might_fault(); \
+ preempt_disable_notrace(); \
+ } \
+ \
CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args)); \
+ \
+ if ((tp_flags) & TRACEPOINT_MAY_FAULT) \
+ preempt_enable_notrace(); \
}

#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
- __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+ __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)
+
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+ __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT)

/*
* This part is compiled out, it is only here as a build time check
@@ -106,13 +119,13 @@ static inline void bpf_test_buffer_##call(void) \

#undef DECLARE_TRACE
#define DECLARE_TRACE(call, proto, args) \
- __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
+ __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)

#undef DECLARE_TRACE_WRITABLE
#define DECLARE_TRACE_WRITABLE(call, proto, args, size) \
__CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
- __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
+ __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size)

#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 344a41873445..1eb770dafa8a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2368,9 +2368,14 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *
if (prog->aux->max_tp_access > btp->writable_size)
return -EINVAL;

- return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
- prog, TRACEPOINT_DEFAULT_PRIO,
- TRACEPOINT_MAY_EXIST);
+ if (tp->flags & TRACEPOINT_MAY_FAULT)
+ return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
+ prog, TRACEPOINT_DEFAULT_PRIO,
+ TRACEPOINT_MAY_EXIST | TRACEPOINT_MAY_FAULT);
+ else
+ return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
+ prog, TRACEPOINT_DEFAULT_PRIO,
+ TRACEPOINT_MAY_EXIST);
}

int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
--
2.25.1

2023-11-20 20:56:23

by Mathieu Desnoyers

Subject: [PATCH v4 4/5] tracing/perf: add support for faultable tracepoints

In preparation for converting system call enter/exit instrumentation
into faultable tracepoints, make sure that perf can handle registering
to such tracepoints by explicitly disabling preemption within the perf
tracepoint probes to respect the current expectations within perf ring
buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to connect to faultable
tracepoints.

Link: https://lore.kernel.org/lkml/[email protected]/
Co-developed-by: Michael Jeanson <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Cc: Joel Fernandes <[email protected]>
---
include/trace/perf.h | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index 2c11181c82e0..fb47815f6eff 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -12,8 +12,8 @@
#undef __perf_task
#define __perf_task(t) (__task = (t))

-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags) \
static notrace void \
perf_trace_##call(void *__data, proto) \
{ \
@@ -28,13 +28,18 @@ perf_trace_##call(void *__data, proto) \
int __data_size; \
int rctx; \
\
+ if ((tp_flags) & TRACEPOINT_MAY_FAULT) { \
+ might_fault(); \
+ preempt_disable_notrace(); \
+ } \
+ \
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
if (!bpf_prog_array_valid(event_call) && \
__builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
- return; \
+ goto end; \
\
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
sizeof(u64)); \
@@ -42,7 +47,7 @@ perf_trace_##call(void *__data, proto) \
\
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx); \
if (!entry) \
- return; \
+ goto end; \
\
perf_fetch_caller_regs(__regs); \
\
@@ -53,8 +58,22 @@ perf_trace_##call(void *__data, proto) \
perf_trace_run_bpf_submit(entry, __entry_size, rctx, \
event_call, __count, __regs, \
head, __task); \
+end: \
+ if ((tp_flags) & TRACEPOINT_MAY_FAULT) \
+ preempt_enable_notrace(); \
}

+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+ _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAY_FAULT
+#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
+ _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print), \
+ TRACEPOINT_MAY_FAULT)
+
/*
* This part is compiled out, it is only here as a build time check
* to make sure that if the tracepoint handling changes, the
--
2.25.1

2023-11-20 20:57:09

by Mathieu Desnoyers

Subject: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use-case for tracers. However, tracepoints
currently disable preemption around iteration on the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow defining a faultable
tracepoint which invokes its callback with preemption enabled.

Also extend the tracepoint API to allow tracers to request specific
probes to be connected to those faultable tracepoints. When the
TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
callback will be called with preemption enabled, and is allowed to take
page faults. Faultable probes can only be registered on faultable
tracepoints and non-faultable probes on non-faultable tracepoints.
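
For example (sketch): once the sys_enter tracepoint is converted to a
faultable tracepoint later in this series, a tracer registers a faultable
probe as follows, where my_probe and my_data are placeholders. Registering
with a mismatched TRACEPOINT_MAY_FAULT flag returns -EINVAL:

	ret = register_trace_prio_flags_sys_enter(my_probe, my_data,
						  TRACEPOINT_DEFAULT_PRIO,
						  TRACEPOINT_MAY_FAULT);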

The tasks trace rcu mechanism is used to synchronize read-side
marshalling of the registered probes with respect to faultable probes
unregistration and teardown.

Link: https://lore.kernel.org/lkml/[email protected]/
Co-developed-by: Michael Jeanson <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Cc: Joel Fernandes <[email protected]>
---
Changes since v1:
- Cleanup __DO_TRACE() implementation.
- Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to
MAYFAULT, and use might_fault() rather than might_sleep(), to properly
convey that the tracepoints are meant to be able to take a page fault,
which requires to be able to sleep *and* to hold the mmap_sem.
Changes since v2:
- Rename MAYFAULT to MAY_FAULT.
- Rebased on 6.5.5.
- Introduce MAY_EXIST tracepoint flag.
Changes since v3:
- Rebased on 6.6.2.
---
include/linux/tracepoint-defs.h | 14 ++++++
include/linux/tracepoint.h | 88 +++++++++++++++++++++++----------
include/trace/define_trace.h | 7 +++
include/trace/trace_events.h | 6 +++
init/Kconfig | 1 +
kernel/trace/bpf_trace.c | 5 +-
kernel/trace/trace_fprobe.c | 5 +-
kernel/tracepoint.c | 58 ++++++++++++----------
8 files changed, 129 insertions(+), 55 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 4dc4955f0fbf..67bacfaa8fd0 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -29,6 +29,19 @@ struct tracepoint_func {
int prio;
};

+/**
+ * enum tracepoint_flags - Tracepoint flags
+ * @TRACEPOINT_MAY_EXIST: Don't return an error if the tracepoint does not
+ * exist upon registration.
+ * @TRACEPOINT_MAY_FAULT: The tracepoint probe callback will be called with
+ * preemption enabled, and is allowed to take page
+ * faults.
+ */
+enum tracepoint_flags {
+ TRACEPOINT_MAY_EXIST = (1 << 0),
+ TRACEPOINT_MAY_FAULT = (1 << 1),
+};
+
struct tracepoint {
const char *name; /* Tracepoint name */
struct static_key key;
@@ -39,6 +52,7 @@ struct tracepoint {
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
+ unsigned int flags;
};

#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 88c0ba623ee6..8a6b58a2bf3b 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -18,6 +18,7 @@
#include <linux/types.h>
#include <linux/cpumask.h>
#include <linux/rcupdate.h>
+#include <linux/rcupdate_trace.h>
#include <linux/tracepoint-defs.h>
#include <linux/static_call.h>

@@ -41,17 +42,10 @@ extern int
tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
int prio);
extern int
-tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, void *data,
- int prio);
+tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe, void *data,
+ int prio, unsigned int flags);
extern int
tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
-static inline int
-tracepoint_probe_register_may_exist(struct tracepoint *tp, void *probe,
- void *data)
-{
- return tracepoint_probe_register_prio_may_exist(tp, probe, data,
- TRACEPOINT_DEFAULT_PRIO);
-}
extern void
for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
void *priv);
@@ -90,6 +84,7 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
#ifdef CONFIG_TRACEPOINTS
static inline void tracepoint_synchronize_unregister(void)
{
+ synchronize_rcu_tasks_trace();
synchronize_srcu(&tracepoint_srcu);
synchronize_rcu();
}
@@ -192,9 +187,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* it_func[0] is never NULL because there is at least one element in the array
* when the array itself is non NULL.
*/
-#define __DO_TRACE(name, args, cond, rcuidle) \
+#define __DO_TRACE(name, args, cond, rcuidle, tp_flags) \
do { \
int __maybe_unused __idx = 0; \
+ bool mayfault = (tp_flags) & TRACEPOINT_MAY_FAULT; \
\
if (!(cond)) \
return; \
@@ -202,8 +198,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle))) \
return; \
\
- /* keep srcu and sched-rcu usage consistent */ \
- preempt_disable_notrace(); \
+ if (mayfault) { \
+ rcu_read_lock_trace(); \
+ } else { \
+ /* keep srcu and sched-rcu usage consistent */ \
+ preempt_disable_notrace(); \
+ } \
\
/* \
* For rcuidle callers, use srcu since sched-rcu \
@@ -221,20 +221,23 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
} \
\
- preempt_enable_notrace(); \
+ if (mayfault) \
+ rcu_read_unlock_trace(); \
+ else \
+ preempt_enable_notrace(); \
} while (0)

#ifndef MODULE
-#define __DECLARE_TRACE_RCU(name, proto, args, cond) \
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, tp_flags) \
static inline void trace_##name##_rcuidle(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(name, \
TP_ARGS(args), \
- TP_CONDITION(cond), 1); \
+ TP_CONDITION(cond), 1, tp_flags); \
}
#else
-#define __DECLARE_TRACE_RCU(name, proto, args, cond)
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, tp_flags)
#endif

/*
@@ -248,7 +251,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* site if it is not watching, as it will need to be active when the
* tracepoint is enabled.
*/
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, tp_flags) \
extern int __traceiter_##name(data_proto); \
DECLARE_STATIC_CALL(tp_func_##name, __traceiter_##name); \
extern struct tracepoint __tracepoint_##name; \
@@ -257,13 +260,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(name, \
TP_ARGS(args), \
- TP_CONDITION(cond), 0); \
+ TP_CONDITION(cond), 0, tp_flags); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
WARN_ON_ONCE(!rcu_is_watching()); \
} \
+ if ((tp_flags) & TRACEPOINT_MAY_FAULT) \
+ might_fault(); \
} \
__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
- PARAMS(cond)) \
+ PARAMS(cond), tp_flags) \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
{ \
@@ -278,6 +283,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
(void *)probe, data, prio); \
} \
static inline int \
+ register_trace_prio_flags_##name(void (*probe)(data_proto), void *data, \
+ int prio, unsigned int flags) \
+ { \
+ return tracepoint_probe_register_prio_flags(&__tracepoint_##name, \
+ (void *)probe, data, prio, flags); \
+ } \
+ static inline int \
unregister_trace_##name(void (*probe)(data_proto), void *data) \
{ \
return tracepoint_probe_unregister(&__tracepoint_##name,\
@@ -298,7 +310,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* structures, so we create an array of pointers that will be used for iteration
* on the tracepoints.
*/
-#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args) \
+#define DEFINE_TRACE_FN_FLAGS(_name, _reg, _unreg, proto, args, tp_flags) \
static const char __tpstrtab_##_name[] \
__section("__tracepoints_strings") = #_name; \
extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
@@ -314,7 +326,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
.probestub = &__probestub_##_name, \
.regfunc = _reg, \
.unregfunc = _unreg, \
- .funcs = NULL }; \
+ .funcs = NULL, \
+ .flags = (tp_flags), \
+ }; \
__TRACEPOINT_ENTRY(_name); \
int __traceiter_##_name(void *__data, proto) \
{ \
@@ -337,8 +351,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
} \
DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);

+#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args) \
+ DEFINE_TRACE_FN_FLAGS(_name, _reg, _unreg, PARAMS(proto), PARAMS(args), 0)
+
#define DEFINE_TRACE(name, proto, args) \
- DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args));
+ DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args))

#define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \
EXPORT_SYMBOL_GPL(__tracepoint_##name); \
@@ -351,7 +368,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)


#else /* !TRACEPOINTS_ENABLED */
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, tp_flags) \
static inline void trace_##name(proto) \
{ } \
static inline void trace_##name##_rcuidle(proto) \
@@ -363,6 +380,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
return -ENOSYS; \
} \
static inline int \
+ register_trace_prio_##name(void (*probe)(data_proto), \
+ void *data, int prio) \
+ { \
+ return -ENOSYS; \
+ } \
+ static inline int \
+ register_trace_prio_flags_##name(void (*probe)(data_proto), \
+ void *data, int prio, unsigned int flags) \
+ { \
+ return -ENOSYS; \
+ } \
+ static inline int \
unregister_trace_##name(void (*probe)(data_proto), \
void *data) \
{ \
@@ -377,6 +406,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
return false; \
}

+#define DEFINE_TRACE_FN_FLAGS(name, reg, unreg, proto, args, tp_flags)
#define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
#define DEFINE_TRACE(name, proto, args)
#define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
@@ -431,12 +461,17 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
#define DECLARE_TRACE(name, proto, args) \
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
cpu_online(raw_smp_processor_id()), \
- PARAMS(void *__data, proto))
+ PARAMS(void *__data, proto), 0)
+
+#define DECLARE_TRACE_MAY_FAULT(name, proto, args) \
+ __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
+ cpu_online(raw_smp_processor_id()), \
+ PARAMS(void *__data, proto), TRACEPOINT_MAY_FAULT)

#define DECLARE_TRACE_CONDITION(name, proto, args, cond) \
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
- PARAMS(void *__data, proto))
+ PARAMS(void *__data, proto), 0)

#define TRACE_EVENT_FLAGS(event, flag)

@@ -567,6 +602,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
#define TRACE_EVENT_FN(name, proto, args, struct, \
assign, print, reg, unreg) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
+#define TRACE_EVENT_FN_MAY_FAULT(name, proto, args, struct, \
+ assign, print, reg, unreg) \
+ DECLARE_TRACE_MAY_FAULT(name, PARAMS(proto), PARAMS(args))
#define TRACE_EVENT_FN_COND(name, proto, args, cond, struct, \
assign, print, reg, unreg) \
DECLARE_TRACE_CONDITION(name, PARAMS(proto), \
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 00723935dcc7..1b8ca143724a 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -41,6 +41,12 @@
assign, print, reg, unreg) \
DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))

+#undef TRACE_EVENT_FN_MAY_FAULT
+#define TRACE_EVENT_FN_MAY_FAULT(name, proto, args, tstruct, \
+ assign, print, reg, unreg) \
+ DEFINE_TRACE_FN_FLAGS(name, reg, unreg, PARAMS(proto), \
+ PARAMS(args), TRACEPOINT_MAY_FAULT)
+
#undef TRACE_EVENT_FN_COND
#define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct, \
assign, print, reg, unreg) \
@@ -106,6 +112,7 @@

#undef TRACE_EVENT
#undef TRACE_EVENT_FN
+#undef TRACE_EVENT_FN_MAY_FAULT
#undef TRACE_EVENT_FN_COND
#undef TRACE_EVENT_CONDITION
#undef TRACE_EVENT_NOP
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index c2f9cabf154d..df590eea8ae4 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -77,6 +77,12 @@
TRACE_EVENT(name, PARAMS(proto), PARAMS(args), \
PARAMS(tstruct), PARAMS(assign), PARAMS(print)) \

+#undef TRACE_EVENT_FN_MAY_FAULT
+#define TRACE_EVENT_FN_MAY_FAULT(name, proto, args, tstruct, \
+ assign, print, reg, unreg) \
+ TRACE_EVENT(name, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print)) \
+
#undef TRACE_EVENT_FN_COND
#define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct, \
assign, print, reg, unreg) \
diff --git a/init/Kconfig b/init/Kconfig
index 6d35728b94b2..f6fd53822868 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1919,6 +1919,7 @@ config BINDGEN_VERSION_TEXT
#
config TRACEPOINTS
bool
+ select TASKS_TRACE_RCU

source "kernel/Kconfig.kexec"

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 868008f56fec..344a41873445 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2368,8 +2368,9 @@ static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *
if (prog->aux->max_tp_access > btp->writable_size)
return -EINVAL;

- return tracepoint_probe_register_may_exist(tp, (void *)btp->bpf_func,
- prog);
+ return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
+ prog, TRACEPOINT_DEFAULT_PRIO,
+ TRACEPOINT_MAY_EXIST);
}

int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 8bfe23af9c73..22b935c82730 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -687,8 +687,9 @@ static int __register_trace_fprobe(struct trace_fprobe *tf)
* At first, put __probestub_##TP function on the tracepoint
* and put a fprobe on the stub function.
*/
- ret = tracepoint_probe_register_prio_may_exist(tpoint,
- tpoint->probestub, NULL, 0);
+ ret = tracepoint_probe_register_prio_flags(tpoint,
+ tpoint->probestub, NULL, 0,
+ TRACEPOINT_MAY_EXIST);
if (ret < 0)
return ret;
return register_fprobe_ips(&tf->fp, &ip, 1);
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 8d1507dd0724..1f137163bdc5 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -111,11 +111,16 @@ static inline void *allocate_probes(int count)
return p == NULL ? NULL : p->probes;
}

-static void srcu_free_old_probes(struct rcu_head *head)
+static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
{
kfree(container_of(head, struct tp_probes, rcu));
}

+static void srcu_free_old_probes(struct rcu_head *head)
+{
+ call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
+}
+
static void rcu_free_old_probes(struct rcu_head *head)
{
call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
@@ -136,7 +141,7 @@ static __init int release_early_probes(void)
return 0;
}

-/* SRCU is initialized at core_initcall */
+/* SRCU and Tasks Trace RCU are initialized at core_initcall */
postcore_initcall(release_early_probes);

static inline void release_probes(struct tracepoint_func *old)
@@ -146,8 +151,9 @@ static inline void release_probes(struct tracepoint_func *old)
struct tp_probes, probes[0]);

/*
- * We can't free probes if SRCU is not initialized yet.
- * Postpone the freeing till after SRCU is initialized.
+ * We can't free probes if SRCU and Tasks Trace RCU are not
+ * initialized yet. Postpone the freeing till after both are
+ * initialized.
*/
if (unlikely(!ok_to_free_tracepoints)) {
tp_probes->rcu.next = early_probes;
@@ -156,10 +162,9 @@ static inline void release_probes(struct tracepoint_func *old)
}

/*
- * Tracepoint probes are protected by both sched RCU and SRCU,
- * by calling the SRCU callback in the sched RCU callback we
- * cover both cases. So let us chain the SRCU and sched RCU
- * callbacks to wait for both grace periods.
+ * Tracepoint probes are protected by sched RCU, SRCU and
+ * Tasks Trace RCU by chaining the callbacks we cover all three
+ * cases and wait for all three grace periods.
*/
call_rcu(&tp_probes->rcu, rcu_free_old_probes);
}
@@ -460,30 +465,38 @@ static int tracepoint_remove_func(struct tracepoint *tp,
}

/**
- * tracepoint_probe_register_prio_may_exist - Connect a probe to a tracepoint with priority
+ * tracepoint_probe_register_prio_flags - Connect a probe to a tracepoint with priority and flags
* @tp: tracepoint
* @probe: probe handler
* @data: tracepoint data
* @prio: priority of this function over other registered functions
+ * @flags: tracepoint flags argument (enum tracepoint_flags bits)
*
- * Same as tracepoint_probe_register_prio() except that it will not warn
- * if the tracepoint is already registered.
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
*/
-int tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe,
- void *data, int prio)
+int tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe,
+ void *data, int prio, unsigned int flags)
{
struct tracepoint_func tp_func;
int ret;

+ if (((tp->flags & TRACEPOINT_MAY_FAULT) && !(flags & TRACEPOINT_MAY_FAULT)) ||
+ (!(tp->flags & TRACEPOINT_MAY_FAULT) && (flags & TRACEPOINT_MAY_FAULT)))
+ return -EINVAL;
+
mutex_lock(&tracepoints_mutex);
tp_func.func = probe;
tp_func.data = data;
tp_func.prio = prio;
- ret = tracepoint_add_func(tp, &tp_func, prio, false);
+ ret = tracepoint_add_func(tp, &tp_func, prio, flags & TRACEPOINT_MAY_EXIST);
mutex_unlock(&tracepoints_mutex);
return ret;
}
-EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_may_exist);
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_flags);

/**
* tracepoint_probe_register_prio - Connect a probe to a tracepoint with priority
@@ -501,16 +514,7 @@ EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_may_exist);
int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
void *data, int prio)
{
- struct tracepoint_func tp_func;
- int ret;
-
- mutex_lock(&tracepoints_mutex);
- tp_func.func = probe;
- tp_func.data = data;
- tp_func.prio = prio;
- ret = tracepoint_add_func(tp, &tp_func, prio, true);
- mutex_unlock(&tracepoints_mutex);
- return ret;
+ return tracepoint_probe_register_prio_flags(tp, probe, data, prio, 0);
}
EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);

@@ -520,6 +524,8 @@ EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);
* @probe: probe handler
* @data: tracepoint data
*
+ * Non-faultable probes can only be registered on non-faultable tracepoints.
+ *
* Returns 0 if ok, error value on error.
* Note: if @tp is within a module, the caller is responsible for
* unregistering the probe before the module is gone. This can be
@@ -528,7 +534,7 @@ EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);
*/
int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
{
- return tracepoint_probe_register_prio(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
+ return tracepoint_probe_register_prio_flags(tp, probe, data, TRACEPOINT_DEFAULT_PRIO, 0);
}
EXPORT_SYMBOL_GPL(tracepoint_probe_register);

--
2.25.1

2023-11-20 21:49:27

by Peter Zijlstra

Subject: Re: [PATCH v4 5/5] tracing: convert sys_enter/exit to faultable tracepoints

On Mon, Nov 20, 2023 at 03:54:18PM -0500, Mathieu Desnoyers wrote:

> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index de753403cdaf..718a0723a0bc 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -299,27 +299,33 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> int syscall_nr;
> int size;
>
> + /*
> + * Probe called with preemption enabled (may_fault), but ring buffer and
> + * per-cpu data require preemption to be disabled.
> + */
> + preempt_disable_notrace();

guard(preempt_notrace)();

and ditch all the goto crap.
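
That is, something along these lines (sketch, assuming the preempt_notrace
guard defined via DEFINE_LOCK_GUARD_0() in <linux/preempt.h>; the guard
re-enables preemption on every return path, so the early returns stay
plain returns):

	guard(preempt_notrace)();

	syscall_nr = trace_get_syscall_nr(current, regs);
	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
		return;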

> +
> syscall_nr = trace_get_syscall_nr(current, regs);
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> - return;
> + goto end;
>
> /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
> trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
> if (!trace_file)
> - return;
> + goto end;
>
> if (trace_trigger_soft_disabled(trace_file))
> - return;
> + goto end;
>
> sys_data = syscall_nr_to_meta(syscall_nr);
> if (!sys_data)
> - return;
> + goto end;
>
> size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
>
> entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
> if (!entry)
> - return;
> + goto end;
>
> entry = ring_buffer_event_data(fbuffer.event);
> entry->nr = syscall_nr;
> @@ -327,6 +333,8 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
>
> trace_event_buffer_commit(&fbuffer);
> +end:
> + preempt_enable_notrace();
> }
>
> static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
> @@ -338,31 +346,39 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
> struct trace_event_buffer fbuffer;
> int syscall_nr;
>
> + /*
> + * Probe called with preemption enabled (may_fault), but ring buffer and
> + * per-cpu data require preemption to be disabled.
> + */
> + preempt_disable_notrace();

Idem.

> +
> syscall_nr = trace_get_syscall_nr(current, regs);
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> - return;
> + goto end;
>
> /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
> trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
> if (!trace_file)
> - return;
> + goto end;
>
> if (trace_trigger_soft_disabled(trace_file))
> - return;
> + goto end;
>
> sys_data = syscall_nr_to_meta(syscall_nr);
> if (!sys_data)
> - return;
> + goto end;
>
> entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
> if (!entry)
> - return;
> + goto end;
>
> entry = ring_buffer_event_data(fbuffer.event);
> entry->nr = syscall_nr;
> entry->ret = syscall_get_return_value(current, regs);
>
> trace_event_buffer_commit(&fbuffer);
> +end:
> + preempt_enable_notrace();
> }
>
> static int reg_event_syscall_enter(struct trace_event_file *file,
> @@ -377,7 +393,9 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
> return -ENOSYS;
> mutex_lock(&syscall_trace_lock);
> if (!tr->sys_refcount_enter)
> - ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
> + ret = register_trace_prio_flags_sys_enter(ftrace_syscall_enter, tr,
> + TRACEPOINT_DEFAULT_PRIO,
> + TRACEPOINT_MAY_FAULT);
> if (!ret) {
> rcu_assign_pointer(tr->enter_syscall_files[num], file);
> tr->sys_refcount_enter++;
> @@ -415,7 +433,9 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
> return -ENOSYS;
> mutex_lock(&syscall_trace_lock);
> if (!tr->sys_refcount_exit)
> - ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
> + ret = register_trace_prio_flags_sys_exit(ftrace_syscall_exit, tr,
> + TRACEPOINT_DEFAULT_PRIO,
> + TRACEPOINT_MAY_FAULT);
> if (!ret) {
> rcu_assign_pointer(tr->exit_syscall_files[num], file);
> tr->sys_refcount_exit++;

/me hands you a bucket of {}, free of charge.

> @@ -582,20 +602,26 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
> int rctx;
> int size;
>
> + /*
> + * Probe called with preemption enabled (may_fault), but ring buffer and
> + * per-cpu data require preemption to be disabled.
> + */
> + preempt_disable_notrace();

Again, guard(preempt_notrace)();

> +
> syscall_nr = trace_get_syscall_nr(current, regs);
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> - return;
> + goto end;
> if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
> - return;
> + goto end;
>
> sys_data = syscall_nr_to_meta(syscall_nr);
> if (!sys_data)
> - return;
> + goto end;
>
> head = this_cpu_ptr(sys_data->enter_event->perf_events);
> valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
> if (!valid_prog_array && hlist_empty(head))
> - return;
> + goto end;
>
> /* get the size after alignment with the u32 buffer size field */
> size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
> @@ -604,7 +630,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>
> rec = perf_trace_buf_alloc(size, NULL, &rctx);
> if (!rec)
> - return;
> + goto end;
>
> rec->nr = syscall_nr;
> syscall_get_arguments(current, regs, args);
> @@ -614,12 +640,14 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
> !perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
> hlist_empty(head)) {
> perf_swevent_put_recursion_context(rctx);
> - return;
> + goto end;
> }
>
> perf_trace_buf_submit(rec, size, rctx,
> sys_data->enter_event->event.type, 1, regs,
> head, NULL);
> +end:
> + preempt_enable_notrace();
> }
>
> static int perf_sysenter_enable(struct trace_event_call *call)
> @@ -631,7 +659,9 @@ static int perf_sysenter_enable(struct trace_event_call *call)
>
> mutex_lock(&syscall_trace_lock);
> if (!sys_perf_refcount_enter)
> - ret = register_trace_sys_enter(perf_syscall_enter, NULL);
> + ret = register_trace_prio_flags_sys_enter(perf_syscall_enter, NULL,
> + TRACEPOINT_DEFAULT_PRIO,
> + TRACEPOINT_MAY_FAULT);

More {}

> if (ret) {
> pr_info("event trace: Could not activate syscall entry trace point");
> } else {
> @@ -682,20 +712,26 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
> int rctx;
> int size;
>
> + /*
> + * Probe called with preemption enabled (may_fault), but ring buffer and
> + * per-cpu data require preemption to be disabled.
> + */
> + preempt_disable_notrace();

Guess?

> +
> syscall_nr = trace_get_syscall_nr(current, regs);
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> - return;
> + goto end;
> if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
> - return;
> + goto end;
>
> sys_data = syscall_nr_to_meta(syscall_nr);
> if (!sys_data)
> - return;
> + goto end;
>
> head = this_cpu_ptr(sys_data->exit_event->perf_events);
> valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
> if (!valid_prog_array && hlist_empty(head))
> - return;
> + goto end;
>
> /* We can probably do that at build time */
> size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
> @@ -703,7 +739,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
>
> rec = perf_trace_buf_alloc(size, NULL, &rctx);
> if (!rec)
> - return;
> + goto end;
>
> rec->nr = syscall_nr;
> rec->ret = syscall_get_return_value(current, regs);
> @@ -712,11 +748,13 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
> !perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
> hlist_empty(head)) {
> perf_swevent_put_recursion_context(rctx);
> - return;
> + goto end;
> }
>
> perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
> 1, regs, head, NULL);
> +end:
> + preempt_enable_notrace();
> }
>
> static int perf_sysexit_enable(struct trace_event_call *call)
> @@ -728,7 +766,9 @@ static int perf_sysexit_enable(struct trace_event_call *call)
>
> mutex_lock(&syscall_trace_lock);
> if (!sys_perf_refcount_exit)
> - ret = register_trace_sys_exit(perf_syscall_exit, NULL);
> + ret = register_trace_prio_flags_sys_exit(perf_syscall_exit, NULL,
> + TRACEPOINT_DEFAULT_PRIO,
> + TRACEPOINT_MAY_FAULT);

And yet more {}

> if (ret) {
> pr_info("event trace: Could not activate syscall exit trace point");
> } else {
> --
> 2.25.1
>

2023-11-20 21:49:56

by Peter Zijlstra

Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Mon, Nov 20, 2023 at 03:54:14PM -0500, Mathieu Desnoyers wrote:
> When invoked from system call enter/exit instrumentation, accessing
> user-space data is a common use-case for tracers. However, tracepoints
> currently disable preemption around iteration on the registered
> tracepoint probes and invocation of the probe callbacks, which prevents
> tracers from handling page faults.
>
> Extend the tracepoint and trace event APIs to allow defining a faultable
> tracepoint which invokes its callback with preemption enabled.
>
> Also extend the tracepoint API to allow tracers to request specific
> probes to be connected to those faultable tracepoints. When the
> TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
> callback will be called with preemption enabled, and is allowed to take
> page faults. Faultable probes can only be registered on faultable
> tracepoints and non-faultable probes on non-faultable tracepoints.
>
> The tasks trace rcu mechanism is used to synchronize read-side
> marshalling of the registered probes with respect to faultable probes
> unregistration and teardown.

What is tasks trace rcu and why is it needed here? What's wrong with
SRCU?

2023-11-20 22:18:49

by Paul E. McKenney

Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Mon, Nov 20, 2023 at 10:47:42PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 20, 2023 at 03:54:14PM -0500, Mathieu Desnoyers wrote:
> > When invoked from system call enter/exit instrumentation, accessing
> > user-space data is a common use-case for tracers. However, tracepoints
> > currently disable preemption around iteration on the registered
> > tracepoint probes and invocation of the probe callbacks, which prevents
> > tracers from handling page faults.
> >
> > Extend the tracepoint and trace event APIs to allow defining a faultable
> > tracepoint which invokes its callback with preemption enabled.
> >
> > Also extend the tracepoint API to allow tracers to request specific
> > probes to be connected to those faultable tracepoints. When the
> > TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
> > callback will be called with preemption enabled, and is allowed to take
> > page faults. Faultable probes can only be registered on faultable
> > tracepoints and non-faultable probes on non-faultable tracepoints.
> >
> > The tasks trace rcu mechanism is used to synchronize read-side
> > marshalling of the registered probes with respect to faultable probes
> > unregistration and teardown.
>
> What is tasks trace rcu and why is it needed here? What's wrong with
> SRCU?

Tasks Trace RCU avoids SRCU's full barriers and the array accesses in the
read-side primitives. This can be important when tracing low-overhead
components of fast paths.
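
Roughly, the read-side difference looks like this (illustrative sketch,
not code from this series):

	/* Tasks Trace RCU reader: no full barrier, no per-CPU counter array. */
	rcu_read_lock_trace();
	/* ... iterate probes ... */
	rcu_read_unlock_trace();

	/* SRCU reader: full memory barrier plus per-CPU array access. */
	idx = srcu_read_lock_notrace(&tracepoint_srcu);
	/* ... iterate probes ... */
	srcu_read_unlock_notrace(&tracepoint_srcu, idx);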

Thanx, Paul

2023-11-20 22:21:41

by Steven Rostedt

Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Mon, 20 Nov 2023 15:54:14 -0500
Mathieu Desnoyers <[email protected]> wrote:

> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index 4dc4955f0fbf..67bacfaa8fd0 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -29,6 +29,19 @@ struct tracepoint_func {
> int prio;
> };
>
> +/**
> + * enum tracepoint_flags - Tracepoint flags
> + * @TRACEPOINT_MAY_EXIST: Don't return an error if the tracepoint does not
> + * exist upon registration.
> + * @TRACEPOINT_MAY_FAULT: The tracepoint probe callback will be called with
> + * preemption enabled, and is allowed to take page
> + * faults.
> + */
> +enum tracepoint_flags {
> + TRACEPOINT_MAY_EXIST = (1 << 0),
> + TRACEPOINT_MAY_FAULT = (1 << 1),
> +};
> +
> struct tracepoint {
> const char *name; /* Tracepoint name */
> struct static_key key;
> @@ -39,6 +52,7 @@ struct tracepoint {
> int (*regfunc)(void);
> void (*unregfunc)(void);
> struct tracepoint_func __rcu *funcs;
> + unsigned int flags;

Since faultable and non-faultable events are mutually exclusive, why not
just allocate them separately? Then you could have the __DO_TRACE() macro
get passed in whether the event can be faulted or not, by the created trace.


> };
>
> #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 88c0ba623ee6..8a6b58a2bf3b 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -18,6 +18,7 @@
> #include <linux/types.h>
> #include <linux/cpumask.h>
> #include <linux/rcupdate.h>
> +#include <linux/rcupdate_trace.h>
> #include <linux/tracepoint-defs.h>
> #include <linux/static_call.h>
>
> @@ -41,17 +42,10 @@ extern int
> tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
> int prio);
> extern int
> -tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, void *data,
> - int prio);
> +tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe, void *data,
> + int prio, unsigned int flags);
> extern int
> tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
> -static inline int
> -tracepoint_probe_register_may_exist(struct tracepoint *tp, void *probe,
> - void *data)
> -{
> - return tracepoint_probe_register_prio_may_exist(tp, probe, data,
> - TRACEPOINT_DEFAULT_PRIO);
> -}
> extern void
> for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
> void *priv);
> @@ -90,6 +84,7 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
> #ifdef CONFIG_TRACEPOINTS
> static inline void tracepoint_synchronize_unregister(void)
> {
> + synchronize_rcu_tasks_trace();

As Peter mentioned, why not use the srcu below?

> synchronize_srcu(&tracepoint_srcu);
> synchronize_rcu();
> }
> @@ -192,9 +187,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> * it_func[0] is never NULL because there is at least one element in the array
> * when the array itself is non NULL.
> */
> -#define __DO_TRACE(name, args, cond, rcuidle) \
> +#define __DO_TRACE(name, args, cond, rcuidle, tp_flags) \
> do { \
> int __maybe_unused __idx = 0; \
> + bool mayfault = (tp_flags) & TRACEPOINT_MAY_FAULT; \
> \
> if (!(cond)) \
> return; \
> @@ -202,8 +198,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle))) \
> return; \
> \
> - /* keep srcu and sched-rcu usage consistent */ \
> - preempt_disable_notrace(); \
> + if (mayfault) { \
> + rcu_read_lock_trace(); \
> + } else { \
> + /* keep srcu and sched-rcu usage consistent */ \
> + preempt_disable_notrace(); \
> + } \

Change the above comment and have:

	if (!mayfault)
		preempt_disable_notrace();

And we can have:

	if (rcuidle || mayfault) {
		__idx = srcu_read_lock_notrace(&tracepoint_srcu);
		if (!mayfault)
			ct_irq_enter_irqson();
	}

> \
> /* \
> * For rcuidle callers, use srcu since sched-rcu \
> @@ -221,20 +221,23 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> } \
> \
> - preempt_enable_notrace(); \
> + if (mayfault) \
> + rcu_read_unlock_trace(); \
> + else \
> + preempt_enable_notrace(); \
> } while (0)
>
> #ifndef MODULE
> -#define __DECLARE_TRACE_RCU(name, proto, args, cond) \
> +#define __DECLARE_TRACE_RCU(name, proto, args, cond, tp_flags) \
> static inline void trace_##name##_rcuidle(proto) \
> { \
> if (static_key_false(&__tracepoint_##name.key)) \
> __DO_TRACE(name, \
> TP_ARGS(args), \
> - TP_CONDITION(cond), 1); \
> + TP_CONDITION(cond), 1, tp_flags); \
> }
> #else
> -#define __DECLARE_TRACE_RCU(name, proto, args, cond)
> +#define __DECLARE_TRACE_RCU(name, proto, args, cond, tp_flags)
> #endif
>
> /*
> @@ -248,7 +251,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> * site if it is not watching, as it will need to be active when the
> * tracepoint is enabled.
> */
> -#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
> +#define __DECLARE_TRACE(name, proto, args, cond, data_proto, tp_flags) \

Instead of adding "tp_flags" just pass the "mayfault" boolean in.

> extern int __traceiter_##name(data_proto); \
> DECLARE_STATIC_CALL(tp_func_##name, __traceiter_##name); \
> extern struct tracepoint __tracepoint_##name; \
> @@ -257,13 +260,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> if (static_key_false(&__tracepoint_##name.key)) \
> __DO_TRACE(name, \
> TP_ARGS(args), \
> - TP_CONDITION(cond), 0); \
> + TP_CONDITION(cond), 0, tp_flags); \
> if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> WARN_ON_ONCE(!rcu_is_watching()); \
> } \
> + if ((tp_flags) & TRACEPOINT_MAY_FAULT) \
> + might_fault(); \
> } \
> __DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
> - PARAMS(cond)) \
> + PARAMS(cond), tp_flags) \
> static inline int \
> register_trace_##name(void (*probe)(data_proto), void *data) \
> { \
> @@ -278,6 +283,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> (void *)probe, data, prio); \
> } \
> static inline int \
> + register_trace_prio_flags_##name(void (*probe)(data_proto), void *data, \
> + int prio, unsigned int flags) \
> + { \
> + return tracepoint_probe_register_prio_flags(&__tracepoint_##name, \
> + (void *)probe, data, prio, flags); \
> + } \
> + static inline int \
> unregister_trace_##name(void (*probe)(data_proto), void *data) \
> { \
> return tracepoint_probe_unregister(&__tracepoint_##name,\
> @@ -298,7 +310,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> * structures, so we create an array of pointers that will be used for iteration
> * on the tracepoints.
> */
> -#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args) \
> +#define DEFINE_TRACE_FN_FLAGS(_name, _reg, _unreg, proto, args, tp_flags) \

Instead of passing in flags, I'm thinking that the faultable tracepoints
need to go into their own section, and possibly have a
register_trace_mayfault_##event() to make it highly distinguishable from
events that don't expect to fault.

Since everything is made by macros, it's not hard to keep all the above
code, and wrap it in other macros so that the faultable and non-faultable
tracepoints share most of the code.

But as tracepoints live in __section("__tracepoints"), I'm thinking we may
want __section("__tracepoints_mayfault") to keep them separate.
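
Something along these lines, purely as an illustration (the
__DEFINE_TRACE_FN() helper taking a section name does not exist today;
it stands for the factored-out common body of DEFINE_TRACE_FN_FLAGS
above):

#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args)		\
	__DEFINE_TRACE_FN(_name, _reg, _unreg, PARAMS(proto),		\
			  PARAMS(args), "__tracepoints")

#define DEFINE_TRACE_FN_MAYFAULT(_name, _reg, _unreg, proto, args)	\
	__DEFINE_TRACE_FN(_name, _reg, _unreg, PARAMS(proto),		\
			  PARAMS(args), "__tracepoints_mayfault")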

Thoughts?

-- Steve


> static const char __tpstrtab_##_name[] \
> __section("__tracepoints_strings") = #_name; \
> extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
> @@ -314,7 +326,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> .probestub = &__probestub_##_name, \
> .regfunc = _reg, \
> .unregfunc = _unreg, \
> - .funcs = NULL }; \
> + .funcs = NULL, \
> + .flags = (tp_flags), \
> + }; \
> __TRACEPOINT_ENTRY(_name); \
> int __traceiter_##_name(void *__data, proto) \
> { \
> @@ -337,8 +351,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> } \
> DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
>
> +#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args) \
> + DEFINE_TRACE_FN_FLAGS(_name, _reg, _unreg, PARAMS(proto), PARAMS(args), 0)
> +
> #define DEFINE_TRACE(name, proto, args) \
> - DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args));
> + DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args))
>
> #define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \
> EXPORT_SYMBOL_GPL(__tracepoint_##name); \
> @@ -351,7 +368,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>
>
> #else /* !TRACEPOINTS_ENABLED */
> -#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
> +#define __DECLARE_TRACE(name, proto, args, cond, data_proto, tp_flags) \
> static inline void trace_##name(proto) \
> { } \
> static inline void trace_##name##_rcuidle(proto) \
> @@ -363,6 +380,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> return -ENOSYS; \
> } \
> static inline int \
> + register_trace_prio_##name(void (*probe)(data_proto), \
> + void *data, int prio) \
> + { \
> + return -ENOSYS; \
> + } \
> + static inline int \
> + register_trace_prio_flags_##name(void (*probe)(data_proto), \
> + void *data, int prio, unsigned int flags) \
> + { \
> + return -ENOSYS; \
> + } \
> + static inline int \
> unregister_trace_##name(void (*probe)(data_proto), \
> void *data) \
> { \
> @@ -377,6 +406,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> return false; \
> }
>

2023-11-20 22:24:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Mon, Nov 20, 2023 at 02:18:29PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 20, 2023 at 10:47:42PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 20, 2023 at 03:54:14PM -0500, Mathieu Desnoyers wrote:
> > > When invoked from system call enter/exit instrumentation, accessing
> > > user-space data is a common use-case for tracers. However, tracepoints
> > > currently disable preemption around iteration on the registered
> > > tracepoint probes and invocation of the probe callbacks, which prevents
> > > tracers from handling page faults.
> > >
> > > Extend the tracepoint and trace event APIs to allow defining a faultable
> > > tracepoint which invokes its callback with preemption enabled.
> > >
> > > Also extend the tracepoint API to allow tracers to request specific
> > > probes to be connected to those faultable tracepoints. When the
> > > TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
> > > callback will be called with preemption enabled, and is allowed to take
> > > page faults. Faultable probes can only be registered on faultable
> > > tracepoints and non-faultable probes on non-faultable tracepoints.
> > >
> > > The tasks trace rcu mechanism is used to synchronize read-side
> > > marshalling of the registered probes with respect to faultable probes
> > > unregistration and teardown.
> >
> > What is trace-trace rcu and why is it needed here? What's wrong with
> > SRCU ?
>
> Tasks Trace RCU avoids SRCU's full barriers and the array accesses in the
> read-side primitives. This can be important when tracing low-overhead
> components of fast paths.

So why wasn't SRCU improved? That is, the above doesn't much explain.

What is the trade-off made to justify adding yet another RCU flavour?

2023-11-20 23:57:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Mon, Nov 20, 2023 at 11:23:11PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 20, 2023 at 02:18:29PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 20, 2023 at 10:47:42PM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 20, 2023 at 03:54:14PM -0500, Mathieu Desnoyers wrote:
> > > > When invoked from system call enter/exit instrumentation, accessing
> > > > user-space data is a common use-case for tracers. However, tracepoints
> > > > currently disable preemption around iteration on the registered
> > > > tracepoint probes and invocation of the probe callbacks, which prevents
> > > > tracers from handling page faults.
> > > >
> > > > Extend the tracepoint and trace event APIs to allow defining a faultable
> > > > tracepoint which invokes its callback with preemption enabled.
> > > >
> > > > Also extend the tracepoint API to allow tracers to request specific
> > > > probes to be connected to those faultable tracepoints. When the
> > > > TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
> > > > callback will be called with preemption enabled, and is allowed to take
> > > > page faults. Faultable probes can only be registered on faultable
> > > > tracepoints and non-faultable probes on non-faultable tracepoints.
> > > >
> > > > The tasks trace rcu mechanism is used to synchronize read-side
> > > > marshalling of the registered probes with respect to faultable probes
> > > > unregistration and teardown.
> > >
> > > What is trace-trace rcu and why is it needed here? What's wrong with
> > > SRCU ?
> >
> > Tasks Trace RCU avoids SRCU's full barriers and the array accesses in the
> > read-side primitives. This can be important when tracing low-overhead
> > components of fast paths.
>
> So why wasn't SRCU improved? That is, the above doesn't much explain.
>
> What is the trade-off made to justify adding yet another RCU flavour?

We didn't think you would be all that happy about having each and
every context switch iterating through many tens or even hundreds of
srcu_struct structures. For that matter, we didn't think that anyone
else would be all that happy either. Us included.

Thanx, Paul

2023-11-21 08:48:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Mon, Nov 20, 2023 at 03:56:30PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 20, 2023 at 11:23:11PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 20, 2023 at 02:18:29PM -0800, Paul E. McKenney wrote:
> > > On Mon, Nov 20, 2023 at 10:47:42PM +0100, Peter Zijlstra wrote:
> > > > On Mon, Nov 20, 2023 at 03:54:14PM -0500, Mathieu Desnoyers wrote:
> > > > > When invoked from system call enter/exit instrumentation, accessing
> > > > > user-space data is a common use-case for tracers. However, tracepoints
> > > > > currently disable preemption around iteration on the registered
> > > > > tracepoint probes and invocation of the probe callbacks, which prevents
> > > > > tracers from handling page faults.
> > > > >
> > > > > Extend the tracepoint and trace event APIs to allow defining a faultable
> > > > > tracepoint which invokes its callback with preemption enabled.
> > > > >
> > > > > Also extend the tracepoint API to allow tracers to request specific
> > > > > probes to be connected to those faultable tracepoints. When the
> > > > > TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
> > > > > callback will be called with preemption enabled, and is allowed to take
> > > > > page faults. Faultable probes can only be registered on faultable
> > > > > tracepoints and non-faultable probes on non-faultable tracepoints.
> > > > >
> > > > > The tasks trace rcu mechanism is used to synchronize read-side
> > > > > marshalling of the registered probes with respect to faultable probes
> > > > > unregistration and teardown.
> > > >
> > > > What is trace-trace rcu and why is it needed here? What's wrong with
> > > > SRCU ?
> > >
> > > Tasks Trace RCU avoids SRCU's full barriers and the array accesses in the
> > > read-side primitives. This can be important when tracing low-overhead
> > > components of fast paths.
> >
> > So why wasn't SRCU improved? That is, the above doesn't much explain.
> >
> > What is the trade-off made to justify adding yet another RCU flavour?
>
> We didn't think you would be all that happy about having each and
> every context switch iterating through many tens or even hundreds of
> srcu_struct structures. For that matter, we didn't think that anyone
> else would be all that happy either. Us included.

So again, what is task-trace RCU ? How does it differ from say
preemptible rcu, which AFAICT could be used here too, no?

2023-11-21 14:07:22

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On 2023-11-21 03:47, Peter Zijlstra wrote:
> On Mon, Nov 20, 2023 at 03:56:30PM -0800, Paul E. McKenney wrote:
>> On Mon, Nov 20, 2023 at 11:23:11PM +0100, Peter Zijlstra wrote:
>>> On Mon, Nov 20, 2023 at 02:18:29PM -0800, Paul E. McKenney wrote:
>>>> On Mon, Nov 20, 2023 at 10:47:42PM +0100, Peter Zijlstra wrote:
>>>>> On Mon, Nov 20, 2023 at 03:54:14PM -0500, Mathieu Desnoyers wrote:
>>>>>> When invoked from system call enter/exit instrumentation, accessing
>>>>>> user-space data is a common use-case for tracers. However, tracepoints
>>>>>> currently disable preemption around iteration on the registered
>>>>>> tracepoint probes and invocation of the probe callbacks, which prevents
>>>>>> tracers from handling page faults.
>>>>>>
>>>>>> Extend the tracepoint and trace event APIs to allow defining a faultable
>>>>>> tracepoint which invokes its callback with preemption enabled.
>>>>>>
>>>>>> Also extend the tracepoint API to allow tracers to request specific
>>>>>> probes to be connected to those faultable tracepoints. When the
>>>>>> TRACEPOINT_MAY_FAULT flag is provided on registration, the probe
>>>>>> callback will be called with preemption enabled, and is allowed to take
>>>>>> page faults. Faultable probes can only be registered on faultable
>>>>>> tracepoints and non-faultable probes on non-faultable tracepoints.
>>>>>>
>>>>>> The tasks trace rcu mechanism is used to synchronize read-side
>>>>>> marshalling of the registered probes with respect to faultable probes
>>>>>> unregistration and teardown.
>>>>>
>>>>> What is trace-trace rcu and why is it needed here? What's wrong with
>>>>> SRCU ?
>>>>
>>>> Tasks Trace RCU avoids SRCU's full barriers and the array accesses in the
>>>> read-side primitives. This can be important when tracing low-overhead
>>>> components of fast paths.
>>>
>>> So why wasn't SRCU improved? That is, the above doesn't much explain.
>>>
>>> What is the trade-off made to justify adding yet another RCU flavour?
>>
>> We didn't think you would be all that happy about having each and
>> every context switch iterating through many tens or even hundreds of
>> srcu_struct structures. For that matter, we didn't think that anyone
>> else would be all that happy either. Us included.
>
> So again, what is task-trace RCU ? How does it differ from say
> preemptible rcu, which AFAICT could be used here too, no?

Task trace RCU fits a niche that has the following set of requirements/tradeoffs:

- Allow page faults within RCU read-side (like SRCU),
- Has a low-overhead read lock-unlock (without the memory barrier overhead of SRCU),
- The tradeoff: Has a rather slow synchronize_rcu(), but tracers should not care about
that. Hence, this is not meant to be a generic replacement for SRCU.
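
The read-side and the (slow) update-side pairing look roughly like this
(illustrative sketch only, not code from this series; dst/usrc/len stand
in for the tracer's buffer, the user pointer and its length):

	rcu_read_lock_trace();
	ret = copy_from_user(dst, usrc, len);	/* may fault, may block on I/O */
	rcu_read_unlock_trace();

and, on unregistration:

	synchronize_rcu_tasks_trace();		/* waits for all such readers */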

Based on my reading of https://lwn.net/Articles/253651/ , preemptible RCU is not a good
fit for the following reasons:

- It disallows blocking within a RCU read-side on non-CONFIG_PREEMPT kernels,
- AFAIU the mmap_sem used within the page fault handler does not have priority inheritance.

Please let me know if I'm missing something.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-11-21 14:37:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 09:06:18AM -0500, Mathieu Desnoyers wrote:
> Task trace RCU fits a niche that has the following set of requirements/tradeoffs:
>
> - Allow page faults within RCU read-side (like SRCU),
> - Has a low-overhead read lock-unlock (without the memory barrier overhead of SRCU),
> - The tradeoff: Has a rather slow synchronize_rcu(), but tracers should not care about
> that. Hence, this is not meant to be a generic replacement for SRCU.
>
> Based on my reading of https://lwn.net/Articles/253651/ , preemptible RCU is not a good
> fit for the following reasons:
>
> - It disallows blocking within a RCU read-side on non-CONFIG_PREEMPT kernels,

Your counter points are confused, we simply don't build preemptible RCU
unless PREEMPT=y, but that could surely be fixed and exposed as a
separate flavour.

> - AFAIU the mmap_sem used within the page fault handler does not have priority inheritance.

What's that got to do with anything?

Still utterly confused about what task-tracing rcu is and how it is
different from preemptible rcu.

2023-11-21 14:40:20

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On 2023-11-21 09:36, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 09:06:18AM -0500, Mathieu Desnoyers wrote:
>> Task trace RCU fits a niche that has the following set of requirements/tradeoffs:
>>
>> - Allow page faults within RCU read-side (like SRCU),
>> - Has a low-overhead read lock-unlock (without the memory barrier overhead of SRCU),
>> - The tradeoff: Has a rather slow synchronize_rcu(), but tracers should not care about
>> that. Hence, this is not meant to be a generic replacement for SRCU.
>>
>> Based on my reading of https://lwn.net/Articles/253651/ , preemptible RCU is not a good
>> fit for the following reasons:
>>
>> - It disallows blocking within a RCU read-side on non-CONFIG_PREEMPT kernels,
>
> Your counter points are confused, we simply don't build preemptible RCU
> unless PREEMPT=y, but that could surely be fixed and exposed as a
> separate flavour.
>
>> - AFAIU the mmap_sem used within the page fault handler does not have priority inheritance.
>
> What's that got to do with anything?
>
> Still utterly confused about what task-tracing rcu is and how it is
> different from preemptible rcu.

In addition to taking the mmap_sem, the page fault handler needs to block
until its requested pages are faulted in, which may depend on disk I/O.
Is it acceptable to wait for I/O while holding preemptible RCU read-side?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-11-21 14:44:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, 21 Nov 2023 15:36:47 +0100
Peter Zijlstra <[email protected]> wrote:
>
> Still utterly confused about what task-tracing rcu is and how it is
> different from preemptible rcu.

Is this similar to synchronize_rcu_tasks()? As I understand that one (grace
period continues until all tasks have voluntarily scheduled or gone into
user space). But I'm a bit confused by synchronize_rcu_tasks_trace()?

Note that for synchronize_rcu_tasks() the critical sections must not call
schedule (although it is OK to be preempted).

-- Steve

2023-11-21 14:47:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 09:40:24AM -0500, Mathieu Desnoyers wrote:
> On 2023-11-21 09:36, Peter Zijlstra wrote:
> > On Tue, Nov 21, 2023 at 09:06:18AM -0500, Mathieu Desnoyers wrote:
> > > Task trace RCU fits a niche that has the following set of requirements/tradeoffs:
> > >
> > > - Allow page faults within RCU read-side (like SRCU),
> > > - Has a low-overhead read lock-unlock (without the memory barrier overhead of SRCU),
> > > - The tradeoff: Has a rather slow synchronize_rcu(), but tracers should not care about
> > > that. Hence, this is not meant to be a generic replacement for SRCU.
> > >
> > > Based on my reading of https://lwn.net/Articles/253651/ , preemptible RCU is not a good
> > > fit for the following reasons:
> > >
> > > - It disallows blocking within a RCU read-side on non-CONFIG_PREEMPT kernels,
> >
> > Your counter points are confused, we simply don't build preemptible RCU
> > unless PREEMPT=y, but that could surely be fixed and exposed as a
> > separate flavour.
> >
> > > - AFAIU the mmap_sem used within the page fault handler does not have priority inheritance.
> >
> > What's that got to do with anything?
> >
> > Still utterly confused about what task-tracing rcu is and how it is
> > different from preemptible rcu.
>
> In addition to taking the mmap_sem, the page fault handler need to block
> until its requested pages are faulted in, which may depend on disk I/O.
> Is it acceptable to wait for I/O while holding preemptible RCU read-side?

I don't know, preemptible rcu already needs to track task state anyway,
it needs to ensure all tasks have passed through a safe spot etc.. vs regular
RCU which only needs to ensure all CPUs have passed through start.

Why is this such a hard question?

2023-11-21 14:57:16

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On 2023-11-21 09:46, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 09:40:24AM -0500, Mathieu Desnoyers wrote:
>> On 2023-11-21 09:36, Peter Zijlstra wrote:
>>> On Tue, Nov 21, 2023 at 09:06:18AM -0500, Mathieu Desnoyers wrote:
>>>> Task trace RCU fits a niche that has the following set of requirements/tradeoffs:
>>>>
>>>> - Allow page faults within RCU read-side (like SRCU),
>>>> - Has a low-overhead read lock-unlock (without the memory barrier overhead of SRCU),
>>>> - The tradeoff: Has a rather slow synchronize_rcu(), but tracers should not care about
>>>> that. Hence, this is not meant to be a generic replacement for SRCU.
>>>>
>>>> Based on my reading of https://lwn.net/Articles/253651/ , preemptible RCU is not a good
>>>> fit for the following reasons:
>>>>
>>>> - It disallows blocking within a RCU read-side on non-CONFIG_PREEMPT kernels,
>>>
>>> Your counter points are confused, we simply don't build preemptible RCU
>>> unless PREEMPT=y, but that could surely be fixed and exposed as a
>>> separate flavour.
>>>
>>>> - AFAIU the mmap_sem used within the page fault handler does not have priority inheritance.
>>>
>>> What's that got to do with anything?
>>>
>>> Still utterly confused about what task-tracing rcu is and how it is
>>> different from preemptible rcu.
>>
>> In addition to taking the mmap_sem, the page fault handler need to block
>> until its requested pages are faulted in, which may depend on disk I/O.
>> Is it acceptable to wait for I/O while holding preemptible RCU read-side?
>
> I don't know, preemptible rcu already needs to track task state anyway,
> it needs to ensure all tasks have passed through a safe spot etc.. vs regular
> RCU which only needs to ensure all CPUs have passed through start.
>
> Why is this such a hard question?

Personally what I am looking for is a clear documentation of preemptible
rcu with respect to whether it is possible to block on I/O (take a page
fault, call schedule() explicitly) from within a preemptible rcu
critical section. I guess this is a hard question because there is no
clear statement to that effect in the kernel documentation.

If it is allowed (which I doubt), then I wonder about the effect of
those long readers on grace period delays. Things like expedited grace
periods may suffer.

Based on Documentation/RCU/rcu.rst:

Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
same effect, but require that the readers manipulate CPU-local
counters. These counters allow limited types of blocking within
RCU read-side critical sections. SRCU also uses CPU-local
counters, and permits general blocking within RCU read-side
critical sections. These variants of RCU detect grace periods
by sampling these counters.

Then we just have to find a definition of "limited types of blocking"
vs "general blocking".

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-11-21 15:53:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 09:56:55AM -0500, Mathieu Desnoyers wrote:
> On 2023-11-21 09:46, Peter Zijlstra wrote:
> > On Tue, Nov 21, 2023 at 09:40:24AM -0500, Mathieu Desnoyers wrote:
> > > On 2023-11-21 09:36, Peter Zijlstra wrote:
> > > > On Tue, Nov 21, 2023 at 09:06:18AM -0500, Mathieu Desnoyers wrote:
> > > > > Task trace RCU fits a niche that has the following set of requirements/tradeoffs:
> > > > >
> > > > > - Allow page faults within RCU read-side (like SRCU),
> > > > > - Has a low-overhead read lock-unlock (without the memory barrier overhead of SRCU),
> > > > > - The tradeoff: Has a rather slow synchronize_rcu(), but tracers should not care about
> > > > > that. Hence, this is not meant to be a generic replacement for SRCU.
> > > > >
> > > > > Based on my reading of https://lwn.net/Articles/253651/ , preemptible RCU is not a good
> > > > > fit for the following reasons:
> > > > >
> > > > > - It disallows blocking within a RCU read-side on non-CONFIG_PREEMPT kernels,
> > > >
> > > > Your counter points are confused, we simply don't build preemptible RCU
> > > > unless PREEMPT=y, but that could surely be fixed and exposed as a
> > > > separate flavour.
> > > >
> > > > > - AFAIU the mmap_sem used within the page fault handler does not have priority inheritance.
> > > >
> > > > What's that got to do with anything?

Preemptible RCU allows blocking/preemption only in those cases where
priority inheritance applies. As Mathieu says below, this prevents
indefinite postponement of a global grace period. Such postponement is
especially problematic in kernels built with PREEMPT_RCU=y. For but one
example, consider a situation where someone maps a file served by NFS.
We can debate the wisdom of creating such a map, but having the kernel
OOM would be a completely unacceptable "error message".

> > > > Still utterly confused about what task-tracing rcu is and how it is
> > > > different from preemptible rcu.

Task Trace RCU allows general blocking, which it tolerates by stringent
restrictions on what exactly it is used for (tracing in cases where the
memory to be included in the tracing might page fault). Preemptible RCU
does not permit general blocking.

Tasks Trace RCU is a very specialized tool for a very specific use case.

> > > In addition to taking the mmap_sem, the page fault handler need to block
> > > until its requested pages are faulted in, which may depend on disk I/O.
> > > Is it acceptable to wait for I/O while holding preemptible RCU read-side?
> >
> > I don't know, preemptible rcu already needs to track task state anyway,
> > it needs to ensure all tasks have passed through a safe spot etc.. vs regular
> > RCU which only needs to ensure all CPUs have passed through start.
> >
> > Why is this such a hard question?

It is not a hard question. Nor is the answer, which is that preemptible
RCU is not a good fit for this task for all the reasons that Mathieu has
laid out.

> Personally what I am looking for is a clear documentation of preemptible rcu
> with respect to whether it is possible to block on I/O (take a page fault,
> call schedule() explicitly) from within a preemptible rcu critical section.
> I guess this is a hard question because there is no clear statement to that
> effect in the kernel documentation.
>
> If it is allowed (which I doubt), then I wonder about the effect of those
> long readers on grace period delays. Things like expedited grace periods may
> suffer.
>
> Based on Documentation/RCU/rcu.rst:
>
> Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
> same effect, but require that the readers manipulate CPU-local
> counters. These counters allow limited types of blocking within
> RCU read-side critical sections. SRCU also uses CPU-local
> counters, and permits general blocking within RCU read-side
> critical sections. These variants of RCU detect grace periods
> by sampling these counters.
>
> Then we just have to find a definition of "limited types of blocking"
> vs "general blocking".

The key point is that you are not allowed to place any source code in
a preemptible RCU reader that would not be legal in a non-preemptible
RCU reader. The rationale again is that the cases in which preemptible
RCU readers call schedule are cases to which priority boosting applies.

It is quite possible that the documentation needs upgrading. ;-)

Thanx, Paul

2023-11-21 15:54:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 03:46:43PM +0100, Peter Zijlstra wrote:

> Why is this such a hard question?

Anyway, recapping from IRC:

preemptible, SRCU:
counter-array based, GP advances by increasing array index
and waiting for previous index to drop to 0.

notably, a GP can pass while a task is preempted but not within a
critical section.

SRCU has smp_mb() in the critical sections to improve GP.

tasks:
waits for every task to pass schedule()

ensures that any pieces of text rendered unreachable before, are
actually unused after.

tasks-rude:
like tasks, but different? built to handle tracing while rcu-idle,
even though that was already deemed bad?

tasks-tracing-rcu:
extension of tasks to have critical-sections ? Should this simply be
tasks?


Can someone complete, please?



2023-11-21 15:59:02

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 03:36:47PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 09:06:18AM -0500, Mathieu Desnoyers wrote:
> > Task trace RCU fits a niche that has the following set of requirements/tradeoffs:
> >
> > - Allow page faults within RCU read-side (like SRCU),
> > - Has a low-overhead read lock-unlock (without the memory barrier overhead of SRCU),
> > - The tradeoff: Has a rather slow synchronize_rcu(), but tracers should not care about
> > that. Hence, this is not meant to be a generic replacement for SRCU.
> >
> > Based on my reading of https://lwn.net/Articles/253651/ , preemptible RCU is not a good
> > fit for the following reasons:
> >
> > - It disallows blocking within a RCU read-side on non-CONFIG_PREEMPT kernels,
>
> Your counter points are confused, we simply don't build preemptible RCU
> unless PREEMPT=y, but that could surely be fixed and exposed as a
> separate flavour.

It certainly used to be available as a separate flavor, but only in
CONFIG_PREEMPT=y kernels. In CONFIG_PREEMPT=n kernels, the API mapped
to the non-preemptible flavor, as in synchronize_sched() and friends.
And we need tracing in the full set of kernels.

> > - AFAIU the mmap_sem used within the page fault handler does not have priority inheritance.
>
> What's that got to do with anything?
>
> Still utterly confused about what task-tracing rcu is and how it is
> different from preemptible rcu.

Tasks Trace RCU allows general blocking in its readers, not just the
subject-to-priority-boosting blocking permitted within preemptible RCU
readers. Restrictions on the use of Tasks Trace RCU are in place to allow
getting away with this general blocking. Even systems generously endowed
with memory are not going to do well when the RCU grace period is blocked
on I/O, especially if that I/O is across a network to a slow file server.

Which means a separate RCU instance is needed. Which is Tasks Trace RCU.

Thanx, Paul

2023-11-21 16:02:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On 2023-11-21 10:52, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 03:46:43PM +0100, Peter Zijlstra wrote:
>
>> Why is this such a hard question?
>
> Anyway, recapping from IRC:
>
> preemptible, SRCU:
> counter-array based, GP advances by increasing array index
> and waiting for previous index to drop to 0.
>
> notably, a GP can pass while a task is preempted but not within a
> critical section.
>
> SRCU has smp_mb() in the critical sections to improve GP.

Also:

preemptible only allows blocking when priority inheritance is
guaranteed, which excludes doing I/O, and thus page faults.
Otherwise a long I/O could cause the system to OOM.

SRCU allows all kinds of blocking, as long as the entire SRCU
domain does not mind waiting for a while before readers complete.

>
> tasks:
> waits for every task to pass schedule()
>
> ensures that any pieces of text rendered unreachable before, is
> actually unused after.
>
> tasks-rude:
> like tasks, but different? build to handle tracing while rcu-idle,
> even though that was already deemed bad?
>
> tasks-tracing-rcu:
> extention of tasks to have critical-sections ? Should this simply be
> tasks?

tasks-trace-rcu is meant to allow tasks to block/take a page fault
within the read-side. It is specialized for tracing and has a single
domain. It does not need the smp_mb on the read-side, which makes it
lower-overhead than SRCU.

Thanks,

Mathieu

>
>
> Can someone complete, please?
>
>
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-11-21 16:03:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 07:58:40AM -0800, Paul E. McKenney wrote:

> Tasks Trace RCU allows general blocking in its readers, not just the
> subject-to-priority-boosting blocking permitted within preemptible RCU
> readers. Restrictions on the use of Tasks Trace RCU are in place to allow
> getting away with this general blocking. Even systems generously endowed
> with memory are not going to do well when the RCU grace period is blocked
> on I/O, especially if that I/O is across a network to a slow file server.
>
> Which means a separate RCU instance is needed. Which is Tasks Trace RCU.

Separate instance not a problem, nor really the question.

What is the basic mechanism of task-tracing? Is it really the existing
tasks-rcu extended with read-side critical sections and call_rcu ?

If so, then why not have it be tasks-rcu?

Or is it a variant of the preemptible/SRCU class of RCUs that are
counter-array based? I suspect not.

So once again, what exactly is tasks-tracing ?

2023-11-21 16:10:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, 21 Nov 2023 11:00:13 -0500
Mathieu Desnoyers <[email protected]> wrote:

> > tasks-tracing-rcu:
> > extention of tasks to have critical-sections ? Should this simply be
> > tasks?
>
> tasks-trace-rcu is meant to allow tasks to block/take a page fault
> within the read-side. It is specialized for tracing and has a single
> domain. It does not need the smp_mb on the read-side, which makes it
> lower-overhead than SRCU.

IOW, task-trace-rcu allows the call to schedule in its critical section,
whereas task-rcu does not?

-- Steve

2023-11-21 16:12:43

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On 2023-11-21 11:07, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 11:00:13 -0500
> Mathieu Desnoyers <[email protected]> wrote:
>
>>> tasks-tracing-rcu:
>>> extention of tasks to have critical-sections ? Should this simply be
>>> tasks?
>>
>> tasks-trace-rcu is meant to allow tasks to block/take a page fault
>> within the read-side. It is specialized for tracing and has a single
>> domain. It does not need the smp_mb on the read-side, which makes it
>> lower-overhead than SRCU.
>
> IOW, task-trace-rcu allows the call to schedule in its critical section,
> whereas task-rcu does not?

Correct.

And unlike preemptible rcu, tasks-trace-rcu allows calls to schedule
which do not provide priority inheritance guarantees (such as I/O
triggered by page faults).

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-11-21 16:43:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 11:11:57AM -0500, Mathieu Desnoyers wrote:
> On 2023-11-21 11:07, Steven Rostedt wrote:
> > On Tue, 21 Nov 2023 11:00:13 -0500
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> > > > tasks-tracing-rcu:
> > > > extention of tasks to have critical-sections ? Should this simply be
> > > > tasks?
> > >
> > > tasks-trace-rcu is meant to allow tasks to block/take a page fault
> > > within the read-side. It is specialized for tracing and has a single
> > > domain. It does not need the smp_mb on the read-side, which makes it
> > > lower-overhead than SRCU.
> >
> > IOW, task-trace-rcu allows the call to schedule in its critical section,
> > whereas task-rcu does not?
>
> Correct.
>
> And unlike preemptible rcu, tasks-trace-rcu allows calls to schedule which
> do not provide priority inheritance guarantees (such as I/O triggered by
> page faults).

But please keep it to things allowed in vanilla RCU read-side critical
sections and I/O triggered by page faults. If you need something else,
that is a discussion between all the current users of RCU Tasks Trace.

Thanx, Paul

2023-11-21 16:43:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 04:52:56PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 03:46:43PM +0100, Peter Zijlstra wrote:
>
> > Why is this such a hard question?

The place to look is here:

https://docs.kernel.org/RCU/Design/Requirements/Requirements.html

Or, if you prefer, Documentation/RCU/Design/Requirements/Requirements.rst.

> Anyway, recapping from IRC:
>
> preemptible, SRCU:
> counter-array based, GP advances by increasing array index
> and waiting for previous index to drop to 0.
>
> notably, a GP can pass while a task is preempted but not within a
> critical section.
>
> SRCU has smp_mb() in the critical sections to improve GP.

https://docs.kernel.org/RCU/Design/Requirements/Requirements.html#sleepable-rcu

Allows general blocking in SRCU readers, which it tolerates by giving
each user its own SRCU via DEFINE_SRCU(), DEFINE_STATIC_SRCU() or
a srcu_struct structure. Users blocking too much in SRCU read-side
critical sections hurt only themselves. Yes, heavy-weight readers.
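
For comparison, the per-user pattern is roughly (sketch; "my_srcu" is
just an example name):

	DEFINE_STATIC_SRCU(my_srcu);

	idx = srcu_read_lock(&my_srcu);
	/* this reader may block; only users of my_srcu wait for it */
	srcu_read_unlock(&my_srcu, idx);

	/* update side, waits only for my_srcu readers */
	synchronize_srcu(&my_srcu);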

> tasks:
> waits for every task to pass schedule()
>
> ensures that any pieces of text rendered unreachable before, is
> actually unused after.

But does not wait for tasks where RCU is not watching, including the
idle loop.

> tasks-rude:
> like tasks, but different? build to handle tracing while rcu-idle,
> even though that was already deemed bad?

This waits for the tasks that RCU Tasks cannot wait for. If noinstr
is fully fixed, RCU Tasks Rude can go away.

> tasks-tracing-rcu:
> extention of tasks to have critical-sections ? Should this simply be
> tasks?

Tasks Trace RCU is its own thing. It uses rcu_read_lock_trace() and
rcu_read_unlock_trace() to mark its readers. It can detect quiescent
states even when the task in question does not call schedule().
Unlike Tasks RCU, Tasks Trace RCU does not scan the full task list.
(It used to, but that caused latency blows on non-realtime workloads.)
Tasks Trace RCU allows preemption and blocking for page faults in
its readers. Also blocking on non-raw spinlocks in PREEMPT_RT, but I
am not sure that anyone cares. If you want to block on anything else,
you need to talk to the current Tasks Trace RCU users.

Thanx, Paul

> Can someone complete, please?
>
>
>

2023-11-21 16:46:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 09:44:44AM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2023 15:36:47 +0100
> Peter Zijlstra <[email protected]> wrote:
> >
> > Still utterly confused about what task-tracing rcu is and how it is
> > different from preemptible rcu.
>
> Is this similar to synchronize_rcu_tasks()? As I understand that one (grace
> period continues until all tasks have voluntarily scheduled or gone into
> user space). But I'm a bit confused by synchronize_rcu_tasks_trace()?
>
> Note, that for syncronize_rcu_tasks() the critical sections must not call
> schedule (although it is OK to be preempted).

The synchronize_rcu_tasks() and synchronize_rcu_tasks_trace() functions
are quite different, as noted elsewhere in this thread.

Thanx, Paul

2023-11-21 16:46:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 05:03:00PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 07:58:40AM -0800, Paul E. McKenney wrote:
>
> > Tasks Trace RCU allows general blocking in its readers, not just the
> > subject-to-priority-boosting blocking permitted within preemptible RCU
> > readers. Restrictions on the use of Tasks Trace RCU are in place to allow
> > getting away with this general blocking. Even systems generously endowed
> > with memory are not going to do well when the RCU grace period is blocked
> > on I/O, especially if that I/O is across a network to a slow file server.
> >
> > Which means a separate RCU instance is needed. Which is Tasks Trace RCU.
>
> Separate instance not a problem, nor really the question.
>
> What is the basic mechanism of task-tracing? Is it really the existing
> tasks-rcu extended with read-side critical sections and call_rcu ?
>
> If so, then why not have it be tasks-rcu?
>
> Or is it a variant of the preemptible/SRCU class of RCUs that are
> counter-array based? I suspect not.
>
> So once again, what exactly is tasks-tracing ?

I think we covered this elsewhere in the thread, but if there are
lingering questions, you know where to find us. ;-)

Thanx, Paul

2023-11-21 16:54:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 11:00:13AM -0500, Mathieu Desnoyers wrote:
> On 2023-11-21 10:52, Peter Zijlstra wrote:
> > On Tue, Nov 21, 2023 at 03:46:43PM +0100, Peter Zijlstra wrote:
> >
> > > Why is this such a hard question?
> >
> > Anyway, recapping from IRC:
> >
> > preemptible, SRCU:
> > counter-array based, GP advances by increasing array index
> > and waiting for previous index to drop to 0.
> >
> > notably, a GP can pass while a task is preempted but not within a
> > critical section.
> >
> > SRCU has smp_mb() in the critical sections to improve GP.
>
> Also:
>
> preemptible only allows blocking when priority inheritance is
> guarantees, which excludes doing I/O, and thus page faults.
> Otherwise a long I/O could cause the system to OOM.
>
> SRCU allows all kind of blocking, as long as the entire SRCU
> domain does not mind waiting for a while before readers complete.

Well, no. Fundamentally both SRCU and preemptible (and many other
flavours) are just a counter-array. The non-blocking for preempt comes
from the fact that it is the main global rcu instance and allowing all
that would make GPs too rare and cause you memory trouble.

But that's not because of how it's implemented, but because of it being
the main global instance.

> > tasks:
> > waits for every task to pass schedule()
> >
> > ensures that any pieces of text rendered unreachable before, is
> > actually unused after.
> >
> > tasks-rude:
> > like tasks, but different? build to handle tracing while rcu-idle,
> > even though that was already deemed bad?
> >
> > tasks-tracing-rcu:
> > extention of tasks to have critical-sections ? Should this simply be
> > tasks?
>
> tasks-trace-rcu is meant to allow tasks to block/take a page fault within
> the read-side. It is specialized for tracing and has a single domain. It
> does not need the smp_mb on the read-side, which makes it lower-overhead
> than SRCU.

That's what it's meant for, not what it is.

Turns out that tasks-tracing is a per-task counter based thing, and as
such does not require all tasks to pass through schedule() and does not
imply the tasks flavour (nor the tasks-rude) despite the similarity in
naming.

But now I am again left wondering what the fundamental difference is
between a per-task counter and a per-cpu counter.

At the end of the day, you still have to wait for the thing to hit 0.

So I'm once again confused, ...

2023-11-21 17:32:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] tracing: Introduce faultable tracepoints

On Tue, Nov 21, 2023 at 05:50:29PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 11:00:13AM -0500, Mathieu Desnoyers wrote:
> > On 2023-11-21 10:52, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 03:46:43PM +0100, Peter Zijlstra wrote:
> > >
> > > > Why is this such a hard question?
> > >
> > > Anyway, recapping from IRC:
> > >
> > > preemptible, SRCU:
> > > counter-array based, GP advances by increasing array index
> > > and waiting for previous index to drop to 0.
> > >
> > > notably, a GP can pass while a task is preempted but not within a
> > > critical section.
> > >
> > > SRCU has smp_mb() in the critical sections to improve GP.
> >
> > Also:
> >
> > preemptible only allows blocking when priority inheritance is
> > guarantees, which excludes doing I/O, and thus page faults.
> > Otherwise a long I/O could cause the system to OOM.
> >
> > SRCU allows all kind of blocking, as long as the entire SRCU
> > domain does not mind waiting for a while before readers complete.
>
> Well, no. Fundamentally both SRCU and preemptible (and many other
> flavours) are just a counter-array. The non-blocking for preempt comes
> from the fact that it is the main global rcu instance and allowing all
> that would make GPs too rare and cause you memory trouble.
>
> But that's not because of how it's implemented, but because of it being
> the main global instance.
>
> > > tasks:
> > > waits for every task to pass schedule()
> > >
> > > ensures that any pieces of text rendered unreachable before, is
> > > actually unused after.
> > >
> > > tasks-rude:
> > > like tasks, but different? build to handle tracing while rcu-idle,
> > > even though that was already deemed bad?
> > >
> > > tasks-tracing-rcu:
> > > extention of tasks to have critical-sections ? Should this simply be
> > > tasks?
> >
> > tasks-trace-rcu is meant to allow tasks to block/take a page fault within
> > the read-side. It is specialized for tracing and has a single domain. It
> > does not need the smp_mb on the read-side, which makes it lower-overhead
> > than SRCU.
>
> That's what it's meant for, not what it is.
>
> Turns out that tasks-tracing is a per-task counter based thing, and as
> such does not require all tasks to pass through schedule() and does not
> imply the tasks flavour (nor the tasks-rude) despite the similarity in
> naming.
>
> But now I am again left wondering what the fundamental difference is
> between a per-task counter and a per-cpu counter.
>
> At the end of the day, you still have to wait for the thing to hit 0.
>
> So I'm once again confused, ...

Updating myself.. so task-tracing-rcu is in fact *very* similar to
regular preemptible-rcu but is slightly different mostly because it is
*not* the main global instance.

Both are a single per-task counter (and not the per-cpu summing that I
remember from many many *many* years ago; OLS'07), mostly because this
helps identify which task is to blame when things go sideways.