2020-10-23 23:49:09

by Michael Jeanson

Subject: [RFC PATCH 0/6] Sleepable tracepoints

When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use-case for tracers. However, tracepoints
currently disable preemption around iteration on the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow specific tracer
probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
called from sleepable context, and convert the system call enter/exit
instrumentation to sleepable tracepoints.
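
For illustration, a tracer probe attached to one of these sleepable
tracepoints could look like the sketch below. The probe body is
hypothetical (the user-space pointer taken from regs->di is made up for
the example and is x86-64 specific); the register_trace_maysleep_*
helpers are introduced in patch 1 of this series:

/* Hypothetical sleepable probe: may fault while copying user memory. */
static void my_sys_enter_probe(void *data, struct pt_regs *regs, long id)
{
        char buf[64];

        /* Preemption is enabled here and page faults are allowed. */
        if (copy_from_user(buf, (const char __user *)regs->di, sizeof(buf)))
                return; /* unresolvable fault, skip this record */
        /* ... hand buf over to the tracer's buffering scheme ... */
}

/* Registration fails with -EINVAL on a non-sleepable tracepoint. */
ret = register_trace_maysleep_sys_enter(my_sys_enter_probe, NULL);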

This series only implements the tracepoint infrastructure required to
allow tracers to handle page faults. Modifying each tracer to handle
those page faults would be a next step after we all agree on this piece
of instrumentation infrastructure.

This patchset is based on v5.9.1.

Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: [email protected]

Mathieu Desnoyers (1):
tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

Michael Jeanson (5):
tracing: introduce sleepable tracepoints
tracing: ftrace: add support for sleepable tracepoints
tracing: bpf-trace: add support for sleepable tracepoints
tracing: perf: add support for sleepable tracepoints
tracing: convert sys_enter/exit to sleepable tracepoints

include/linux/tracepoint-defs.h | 11 ++++
include/linux/tracepoint.h | 104 +++++++++++++++++++++-----------
include/trace/bpf_probe.h | 23 ++++++-
include/trace/define_trace.h | 7 +++
include/trace/events/syscalls.h | 4 +-
include/trace/perf.h | 26 ++++++--
include/trace/trace_events.h | 79 ++++++++++++++++++++++--
init/Kconfig | 1 +
kernel/trace/bpf_trace.c | 5 +-
kernel/trace/trace_events.c | 15 ++++-
kernel/trace/trace_syscalls.c | 68 +++++++++++++--------
kernel/tracepoint.c | 104 +++++++++++++++++++++++++-------
12 files changed, 351 insertions(+), 96 deletions(-)

--
2.25.1


2020-10-23 23:50:24

by Michael Jeanson

Subject: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints

When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use-case for tracers. However, tracepoints
currently disable preemption around iteration on the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow defining a sleepable
tracepoint which invokes its callback with preemption enabled.

Also extend the tracepoint API to allow tracers to request specific
probes to be connected to those sleepable tracepoints. When the
TRACEPOINT_MAYSLEEP flag is provided on registration, the probe callback
will be called with preemption enabled, and is allowed to take page
faults. Sleepable probes can only be registered on sleepable
tracepoints and non-sleepable probes on non-sleepable tracepoints.

The Tasks Trace RCU mechanism is used to synchronize read-side iteration
over the registered probes with respect to unregistration and teardown of
sleepable probes.
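
In simplified form, the sleepable read-side and the unregistration
synchronization pair up as follows (condensed from the __DO_TRACE() and
tracepoint_synchronize_unregister() changes below):

/* Probe invocation, sleepable case: */
rcu_read_lock_trace();
it_func_ptr = rcu_dereference_raw(tp->funcs);
if (it_func_ptr) {
        do {
                ((void (*)(void *))it_func_ptr->func)(it_func_ptr->data);
        } while ((++it_func_ptr)->func);
}
rcu_read_unlock_trace();

/* Unregistration waits for all read-side critical sections: */
synchronize_rcu_tasks_trace();
synchronize_srcu(&tracepoint_srcu);
synchronize_rcu();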

Co-developed-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: [email protected]
---
include/linux/tracepoint-defs.h | 11 ++++
include/linux/tracepoint.h | 85 +++++++++++++++++++++-----
include/trace/define_trace.h | 7 +++
include/trace/trace_events.h | 6 ++
init/Kconfig | 1 +
kernel/tracepoint.c | 103 ++++++++++++++++++++++++++------
6 files changed, 181 insertions(+), 32 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index b29950a19205..87ff40cf343f 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -27,12 +27,23 @@ struct tracepoint_func {
int prio;
};

+/**
+ * enum tracepoint_flags - Tracepoint flags
+ * @TRACEPOINT_MAYSLEEP: The tracepoint probe callback will be called with
+ * preemption enabled, and is allowed to take page
+ * faults.
+ */
+enum tracepoint_flags {
+ TRACEPOINT_MAYSLEEP = (1 << 0),
+};
+
struct tracepoint {
const char *name; /* Tracepoint name */
struct static_key key;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
+ unsigned int flags;
};

#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..0386b54cbcbb 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -18,6 +18,7 @@
#include <linux/types.h>
#include <linux/cpumask.h>
#include <linux/rcupdate.h>
+#include <linux/rcupdate_trace.h>
#include <linux/tracepoint-defs.h>

struct module;
@@ -37,9 +38,14 @@ extern struct srcu_struct tracepoint_srcu;
extern int
tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
extern int
+tracepoint_probe_register_maysleep(struct tracepoint *tp, void *probe, void *data);
+extern int
tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
int prio);
extern int
+tracepoint_probe_register_prio_maysleep(struct tracepoint *tp, void *probe, void *data,
+ int prio);
+extern int
tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
extern void
for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
@@ -79,6 +85,7 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
#ifdef CONFIG_TRACEPOINTS
static inline void tracepoint_synchronize_unregister(void)
{
+ synchronize_rcu_tasks_trace();
synchronize_srcu(&tracepoint_srcu);
synchronize_rcu();
}
@@ -157,12 +164,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* has a "void" prototype, then it is invalid to declare a function
* as "(void *, void)".
*/
-#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
+#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags) \
do { \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
void *__data; \
int __maybe_unused __idx = 0; \
+ bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP; \
\
if (!(cond)) \
return; \
@@ -170,8 +178,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
/* srcu can't be used from NMI */ \
WARN_ON_ONCE(rcuidle && in_nmi()); \
\
- /* keep srcu and sched-rcu usage consistent */ \
- preempt_disable_notrace(); \
+ if (maysleep) { \
+ might_sleep(); \
+ rcu_read_lock_trace(); \
+ } else { \
+ /* keep srcu and sched-rcu usage consistent */ \
+ preempt_disable_notrace(); \
+ } \
\
/* \
* For rcuidle callers, use srcu since sched-rcu \
@@ -197,21 +210,24 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
} \
\
- preempt_enable_notrace(); \
+ if (maysleep) \
+ rcu_read_unlock_trace(); \
+ else \
+ preempt_enable_notrace(); \
} while (0)

#ifndef MODULE
-#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args, tp_flags) \
static inline void trace_##name##_rcuidle(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
- TP_CONDITION(cond), 1); \
+ TP_CONDITION(cond), 1, tp_flags); \
}
#else
-#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args)
+#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args, tp_flags)
#endif

/*
@@ -226,7 +242,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* even when this tracepoint is off. This code has no purpose other than
* poking RCU a bit.
*/
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args, tp_flags) \
extern struct tracepoint __tracepoint_##name; \
static inline void trace_##name(proto) \
{ \
@@ -234,7 +250,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
- TP_CONDITION(cond), 0); \
+ TP_CONDITION(cond), 0, tp_flags); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
rcu_read_lock_sched_notrace(); \
rcu_dereference_sched(__tracepoint_##name.funcs);\
@@ -242,7 +258,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
} \
} \
__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
- PARAMS(cond), PARAMS(data_proto), PARAMS(data_args)) \
+ PARAMS(cond), PARAMS(data_proto), PARAMS(data_args), tp_flags) \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
{ \
@@ -250,6 +266,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
(void *)probe, data); \
} \
static inline int \
+ register_trace_maysleep_##name(void (*probe)(data_proto), void *data) \
+ { \
+ return tracepoint_probe_register_maysleep(&__tracepoint_##name, \
+ (void *)probe, data); \
+ } \
+ static inline int \
register_trace_prio_##name(void (*probe)(data_proto), void *data,\
int prio) \
{ \
@@ -257,6 +279,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
(void *)probe, data, prio); \
} \
static inline int \
+ register_trace_prio_maysleep_##name(void (*probe)(data_proto), \
+ void *data, int prio) \
+ { \
+ return tracepoint_probe_register_prio_maysleep(&__tracepoint_##name, \
+ (void *)probe, data, prio); \
+ } \
+ static inline int \
unregister_trace_##name(void (*probe)(data_proto), void *data) \
{ \
return tracepoint_probe_unregister(&__tracepoint_##name,\
@@ -277,14 +306,17 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* structures, so we create an array of pointers that will be used for iteration
* on the tracepoints.
*/
-#define DEFINE_TRACE_FN(name, reg, unreg) \
+#define DEFINE_TRACE_FN_FLAGS(name, reg, unreg, tp_flags) \
static const char __tpstrtab_##name[] \
__section(__tracepoints_strings) = #name; \
struct tracepoint __tracepoint_##name __used \
__section(__tracepoints) = \
- { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
+ { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, tp_flags };\
__TRACEPOINT_ENTRY(name);

+#define DEFINE_TRACE_FN(name, reg, unreg) \
+ DEFINE_TRACE_FN_FLAGS(name, reg, unreg, 0)
+
#define DEFINE_TRACE(name) \
DEFINE_TRACE_FN(name, NULL, NULL);

@@ -294,7 +326,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
EXPORT_SYMBOL(__tracepoint_##name)

#else /* !TRACEPOINTS_ENABLED */
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args, tp_flags) \
static inline void trace_##name(proto) \
{ } \
static inline void trace_##name##_rcuidle(proto) \
@@ -306,6 +338,18 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
return -ENOSYS; \
} \
static inline int \
+ register_trace_maysleep_##name(void (*probe)(data_proto), \
+ void *data) \
+ { \
+ return -ENOSYS; \
+ } \
+ static inline int \
+ register_trace_prio_maysleep_##name(void (*probe)(data_proto), \
+ void *data, int prio) \
+ { \
+ return -ENOSYS; \
+ } \
+ static inline int \
unregister_trace_##name(void (*probe)(data_proto), \
void *data) \
{ \
@@ -320,6 +364,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
return false; \
}

+#define DEFINE_TRACE_FN_FLAGS(name, reg, unreg, tp_flags)
#define DEFINE_TRACE_FN(name, reg, unreg)
#define DEFINE_TRACE(name)
#define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
@@ -375,13 +420,20 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
cpu_online(raw_smp_processor_id()), \
PARAMS(void *__data, proto), \
- PARAMS(__data, args))
+ PARAMS(__data, args), 0)
+
+#define DECLARE_TRACE_MAYSLEEP(name, proto, args) \
+ __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
+ cpu_online(raw_smp_processor_id()), \
+ PARAMS(void *__data, proto), \
+ PARAMS(__data, args), \
+ TRACEPOINT_MAYSLEEP)

#define DECLARE_TRACE_CONDITION(name, proto, args, cond) \
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
PARAMS(void *__data, proto), \
- PARAMS(__data, args))
+ PARAMS(__data, args), 0)

#define TRACE_EVENT_FLAGS(event, flag)

@@ -512,6 +564,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
#define TRACE_EVENT_FN(name, proto, args, struct, \
assign, print, reg, unreg) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
+#define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, struct, \
+ assign, print, reg, unreg) \
+ DECLARE_TRACE_MAYSLEEP(name, PARAMS(proto), PARAMS(args))
#define TRACE_EVENT_FN_COND(name, proto, args, cond, struct, \
assign, print, reg, unreg) \
DECLARE_TRACE_CONDITION(name, PARAMS(proto), \
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index bd75f97867b9..2b6ae7c978b3 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -41,6 +41,12 @@
assign, print, reg, unreg) \
DEFINE_TRACE_FN(name, reg, unreg)

+/* Define a trace event with the MAYSLEEP flag set */
+#undef TRACE_EVENT_FN_MAYSLEEP
+#define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, tstruct, \
+ assign, print, reg, unreg) \
+ DEFINE_TRACE_FN_FLAGS(name, reg, unreg, TRACEPOINT_MAYSLEEP)
+
#undef TRACE_EVENT_FN_COND
#define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct, \
assign, print, reg, unreg) \
@@ -106,6 +112,7 @@

#undef TRACE_EVENT
#undef TRACE_EVENT_FN
+#undef TRACE_EVENT_FN_MAYSLEEP
#undef TRACE_EVENT_FN_COND
#undef TRACE_EVENT_CONDITION
#undef TRACE_EVENT_NOP
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 1bc3e7bba9a4..8b3f4068a702 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -138,6 +138,12 @@ TRACE_MAKE_SYSTEM_STR();
TRACE_EVENT(name, PARAMS(proto), PARAMS(args), \
PARAMS(tstruct), PARAMS(assign), PARAMS(print)) \

+#undef TRACE_EVENT_FN_MAYSLEEP
+#define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, tstruct, \
+ assign, print, reg, unreg) \
+ TRACE_EVENT(name, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print)) \
+
#undef TRACE_EVENT_FN_COND
#define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct, \
assign, print, reg, unreg) \
diff --git a/init/Kconfig b/init/Kconfig
index d6a0b31b13dc..857f57562490 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2018,6 +2018,7 @@ config PROFILING
#
config TRACEPOINTS
bool
+ select TASKS_TRACE_RCU

endmenu # General setup

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 73956eaff8a9..8d8e41c5d8a5 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -60,11 +60,16 @@ static inline void *allocate_probes(int count)
return p == NULL ? NULL : p->probes;
}

-static void srcu_free_old_probes(struct rcu_head *head)
+static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
{
kfree(container_of(head, struct tp_probes, rcu));
}

+static void srcu_free_old_probes(struct rcu_head *head)
+{
+ call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
+}
+
static void rcu_free_old_probes(struct rcu_head *head)
{
call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
@@ -85,7 +90,7 @@ static __init int release_early_probes(void)
return 0;
}

-/* SRCU is initialized at core_initcall */
+/* SRCU and Tasks Trace RCU are initialized at core_initcall */
postcore_initcall(release_early_probes);

static inline void release_probes(struct tracepoint_func *old)
@@ -95,8 +100,9 @@ static inline void release_probes(struct tracepoint_func *old)
struct tp_probes, probes[0]);

/*
- * We can't free probes if SRCU is not initialized yet.
- * Postpone the freeing till after SRCU is initialized.
+ * We can't free probes if SRCU and Tasks Trace RCU are not
+ * initialized yet. Postpone the freeing till after both are
+ * initialized.
*/
if (unlikely(!ok_to_free_tracepoints)) {
tp_probes->rcu.next = early_probes;
@@ -105,10 +111,9 @@ static inline void release_probes(struct tracepoint_func *old)
}

/*
- * Tracepoint probes are protected by both sched RCU and SRCU,
- * by calling the SRCU callback in the sched RCU callback we
- * cover both cases. So let us chain the SRCU and sched RCU
- * callbacks to wait for both grace periods.
+ * Tracepoint probes are protected by sched RCU, SRCU and
+ * Tasks Trace RCU. By chaining the callbacks we cover all three
+ * cases and wait for all three grace periods.
*/
call_rcu(&tp_probes->rcu, rcu_free_old_probes);
}
@@ -289,6 +294,21 @@ static int tracepoint_remove_func(struct tracepoint *tp,
return 0;
}

+static int __tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
+ void *data, int prio)
+{
+ struct tracepoint_func tp_func;
+ int ret;
+
+ mutex_lock(&tracepoints_mutex);
+ tp_func.func = probe;
+ tp_func.data = data;
+ tp_func.prio = prio;
+ ret = tracepoint_add_func(tp, &tp_func, prio);
+ mutex_unlock(&tracepoints_mutex);
+ return ret;
+}
+
/**
* tracepoint_probe_register_prio - Connect a probe to a tracepoint with priority
* @tp: tracepoint
@@ -296,6 +316,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
* @data: tracepoint data
* @prio: priority of this function over other registered functions
*
+ * Non-sleepable probes can only be registered on non-sleepable tracepoints.
+ *
* Returns 0 if ok, error value on error.
* Note: if @tp is within a module, the caller is responsible for
* unregistering the probe before the module is gone. This can be
@@ -305,25 +327,49 @@ static int tracepoint_remove_func(struct tracepoint *tp,
int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
void *data, int prio)
{
- struct tracepoint_func tp_func;
- int ret;
+ if (tp->flags & TRACEPOINT_MAYSLEEP)
+ return -EINVAL;

- mutex_lock(&tracepoints_mutex);
- tp_func.func = probe;
- tp_func.data = data;
- tp_func.prio = prio;
- ret = tracepoint_add_func(tp, &tp_func, prio);
- mutex_unlock(&tracepoints_mutex);
- return ret;
+ return __tracepoint_probe_register_prio(tp, probe, data, prio);
}
EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);

+/**
+ * tracepoint_probe_register_prio_maysleep - Connect a sleepable probe to a tracepoint with priority
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ * @prio: priority of this function over other registered functions
+ *
+ * When the TRACEPOINT_MAYSLEEP flag is provided on registration, the probe
+ * callback will be called with preemption enabled, and is allowed to take
+ * page faults. Sleepable probes can only be registered on sleepable
+ * tracepoints.
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_prio_maysleep(struct tracepoint *tp, void *probe,
+ void *data, int prio)
+{
+ if (!(tp->flags & TRACEPOINT_MAYSLEEP))
+ return -EINVAL;
+
+ return __tracepoint_probe_register_prio(tp, probe, data, prio);
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_maysleep);
+
/**
* tracepoint_probe_register - Connect a probe to a tracepoint
* @tp: tracepoint
* @probe: probe handler
* @data: tracepoint data
*
+ * Non-sleepable probes can only be registered on non-sleepable tracepoints.
+ *
* Returns 0 if ok, error value on error.
* Note: if @tp is within a module, the caller is responsible for
* unregistering the probe before the module is gone. This can be
@@ -336,6 +382,29 @@ int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
}
EXPORT_SYMBOL_GPL(tracepoint_probe_register);

+/**
+ * tracepoint_probe_register_maysleep - Connect a sleepable probe to a tracepoint
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ *
+ * When the TRACEPOINT_MAYSLEEP flag is provided on registration, the probe
+ * callback will be called with preemption enabled, and is allowed to take
+ * page faults. Sleepable probes can only be registered on sleepable
+ * tracepoints.
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_maysleep(struct tracepoint *tp, void *probe, void *data)
+{
+ return tracepoint_probe_register_prio_maysleep(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_maysleep);
+
/**
* tracepoint_probe_unregister - Disconnect a probe from a tracepoint
* @tp: tracepoint
--
2.25.1

2020-10-23 23:50:55

by Michael Jeanson

Subject: [RFC PATCH 2/6] tracing: ftrace: add support for sleepable tracepoints

In preparation for converting system call enter/exit instrumentation
into sleepable tracepoints, make sure that ftrace can handle registering
to such tracepoints by explicitly disabling preemption within the ftrace
tracepoint probes to respect the current expectations within ftrace ring
buffer code.

This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to connect to sleepable
tracepoints.
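
Concretely, for an event declared with the MAYSLEEP flag, the generated
ftrace probe now brackets its body with an explicit preemption-disabled
section, along the lines of this simplified sketch of the
_DECLARE_EVENT_CLASS() expansion (the event name is hypothetical):

static notrace void trace_event_raw_event_example(void *__data, ...)
{
        preempt_disable_notrace();      /* ring buffer code expects this */
        if (trace_trigger_soft_disabled(trace_file))
                goto end;
        /* ... reserve, fill and commit the ring buffer event ... */
end:
        preempt_enable_notrace();
}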

Co-developed-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: [email protected]
---
include/trace/trace_events.h | 75 +++++++++++++++++++++++++++++++++---
kernel/trace/trace_events.c | 15 +++++++-
2 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 8b3f4068a702..b95a9c3d9405 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -80,6 +80,16 @@ TRACE_MAKE_SYSTEM_STR();
PARAMS(print)); \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));

+#undef TRACE_EVENT_MAYSLEEP
+#define TRACE_EVENT_MAYSLEEP(name, proto, args, tstruct, assign, print) \
+ DECLARE_EVENT_CLASS_MAYSLEEP(name, \
+ PARAMS(proto), \
+ PARAMS(args), \
+ PARAMS(tstruct), \
+ PARAMS(assign), \
+ PARAMS(print)); \
+ DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+

#undef __field
#define __field(type, item) type item;
@@ -118,6 +128,12 @@ TRACE_MAKE_SYSTEM_STR();
\
static struct trace_event_class event_class_##name;

+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(name, proto, args, \
+ tstruct, assign, print) \
+ DECLARE_EVENT_CLASS(name, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args) \
static struct trace_event_call __used \
@@ -141,7 +157,7 @@ TRACE_MAKE_SYSTEM_STR();
#undef TRACE_EVENT_FN_MAYSLEEP
#define TRACE_EVENT_FN_MAYSLEEP(name, proto, args, tstruct, \
assign, print, reg, unreg) \
- TRACE_EVENT(name, PARAMS(proto), PARAMS(args), \
+ TRACE_EVENT_MAYSLEEP(name, PARAMS(proto), PARAMS(args), \
PARAMS(tstruct), PARAMS(assign), PARAMS(print)) \

#undef TRACE_EVENT_FN_COND
@@ -212,6 +228,12 @@ TRACE_MAKE_SYSTEM_STR();
tstruct; \
};

+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, \
+ tstruct, assign, print) \
+ DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args)

@@ -378,6 +400,12 @@ static struct trace_event_functions trace_event_type_funcs_##call = { \
.trace = trace_raw_output_##call, \
};

+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, \
+ tstruct, assign, print) \
+ DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, call, proto, args, print) \
static notrace enum print_line_t \
@@ -448,6 +476,12 @@ static struct trace_event_fields trace_event_fields_##call[] = { \
tstruct \
{} };

+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, \
+ tstruct, func, print) \
+ DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(func), PARAMS(print))
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print)

@@ -524,6 +558,12 @@ static inline notrace int trace_event_get_offsets_##call( \
return __data_size; \
}

+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, \
+ tstruct, assign, print) \
+ DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)

/*
@@ -673,8 +713,8 @@ static inline notrace int trace_event_get_offsets_##call( \
#undef __perf_task
#define __perf_task(t) (t)

-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags) \
\
static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
@@ -685,8 +725,11 @@ trace_event_raw_event_##call(void *__data, proto) \
struct trace_event_raw_##call *entry; \
int __data_size; \
\
+ if ((tp_flags) & TRACEPOINT_MAYSLEEP) \
+ preempt_disable_notrace(); \
+ \
if (trace_trigger_soft_disabled(trace_file)) \
- return; \
+ goto end; \
\
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
@@ -694,14 +737,30 @@ trace_event_raw_event_##call(void *__data, proto) \
sizeof(*entry) + __data_size); \
\
if (!entry) \
- return; \
+ goto end; \
\
tstruct \
\
{ assign; } \
\
trace_event_buffer_commit(&fbuffer); \
+end: \
+ if ((tp_flags) & TRACEPOINT_MAYSLEEP) \
+ preempt_enable_notrace(); \
}
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+ _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), \
+ PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, tstruct, assign, print) \
+ _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), \
+ PARAMS(print), TRACEPOINT_MAYSLEEP)
+
/*
* The ftrace_test_probe is compiled out, it is only here as a build time check
* to make sure that if the tracepoint handling changes, the ftrace probe will
@@ -748,6 +807,12 @@ static struct trace_event_class __used __refdata event_class_##call = { \
_TRACE_PERF_INIT(call) \
};

+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, \
+ tstruct, assign, print) \
+ DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print))
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, call, proto, args) \
\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index a85effb2373b..058fe2834f14 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -290,9 +290,15 @@ int trace_event_reg(struct trace_event_call *call,
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
case TRACE_REG_REGISTER:
- return tracepoint_probe_register(call->tp,
+ if (call->tp->flags & TRACEPOINT_MAYSLEEP)
+ return tracepoint_probe_register_maysleep(call->tp,
call->class->probe,
file);
+ else
+ return tracepoint_probe_register(call->tp,
+ call->class->probe,
+ file);
+
case TRACE_REG_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->probe,
@@ -301,7 +307,12 @@ int trace_event_reg(struct trace_event_call *call,

#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
- return tracepoint_probe_register(call->tp,
+ if (call->tp->flags & TRACEPOINT_MAYSLEEP)
+ return tracepoint_probe_register_maysleep(call->tp,
+ call->class->perf_probe,
+ call);
+ else
+ return tracepoint_probe_register(call->tp,
call->class->perf_probe,
call);
case TRACE_REG_PERF_UNREGISTER:
--
2.25.1

2020-10-23 23:52:08

by Michael Jeanson

Subject: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

From: Mathieu Desnoyers <[email protected]>

Considering that tracer callbacks expect RCU to be watching (for
instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
no point in using SRCU anymore given that rcuidle tracepoints need to
ensure RCU is watching. Therefore, simply use sched-RCU like normal
tracepoints for rcuidle tracepoints.
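
The resulting fast path in __DO_TRACE() then orders its operations as
follows (simplified from the hunk below):

might_sleep_if(maysleep);
if (rcuidle)
        rcu_irq_enter_irqson();         /* make RCU watch */
if (maysleep)
        rcu_read_lock_trace();
else
        preempt_disable_notrace();
/* ... iterate over and invoke the registered probes ... */
if (maysleep)
        rcu_read_unlock_trace();
else
        preempt_enable_notrace();
if (rcuidle)
        rcu_irq_exit_irqson();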

Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Michael Jeanson <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: [email protected]
---
include/linux/tracepoint.h | 33 +++++++--------------------------
kernel/tracepoint.c | 25 +++++++++----------------
2 files changed, 16 insertions(+), 42 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 0386b54cbcbb..1414b11f864b 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -13,7 +13,6 @@
*/

#include <linux/smp.h>
-#include <linux/srcu.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/cpumask.h>
@@ -33,8 +32,6 @@ struct trace_eval_map {

#define TRACEPOINT_DEFAULT_PRIO 10

-extern struct srcu_struct tracepoint_srcu;
-
extern int
tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
extern int
@@ -86,7 +83,6 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
static inline void tracepoint_synchronize_unregister(void)
{
synchronize_rcu_tasks_trace();
- synchronize_srcu(&tracepoint_srcu);
synchronize_rcu();
}
#else
@@ -175,25 +171,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (!(cond)) \
return; \
\
- /* srcu can't be used from NMI */ \
- WARN_ON_ONCE(rcuidle && in_nmi()); \
- \
- if (maysleep) { \
- might_sleep(); \
+ might_sleep_if(maysleep); \
+ if (rcuidle) \
+ rcu_irq_enter_irqson(); \
+ if (maysleep) \
rcu_read_lock_trace(); \
- } else { \
- /* keep srcu and sched-rcu usage consistent */ \
+ else \
preempt_disable_notrace(); \
- } \
- \
- /* \
- * For rcuidle callers, use srcu since sched-rcu \
- * doesn't work from the idle path. \
- */ \
- if (rcuidle) { \
- __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
- rcu_irq_enter_irqson(); \
- } \
\
it_func_ptr = rcu_dereference_raw((tp)->funcs); \
\
@@ -205,15 +189,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
} while ((++it_func_ptr)->func); \
} \
\
- if (rcuidle) { \
- rcu_irq_exit_irqson(); \
- srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
- } \
- \
if (maysleep) \
rcu_read_unlock_trace(); \
else \
preempt_enable_notrace(); \
+ if (rcuidle) \
+ rcu_irq_exit_irqson(); \
} while (0)

#ifndef MODULE
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 8d8e41c5d8a5..68b4e50798b1 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -18,9 +18,6 @@
extern tracepoint_ptr_t __start___tracepoints_ptrs[];
extern tracepoint_ptr_t __stop___tracepoints_ptrs[];

-DEFINE_SRCU(tracepoint_srcu);
-EXPORT_SYMBOL_GPL(tracepoint_srcu);
-
/* Set to 1 to enable tracepoint debug output */
static const int tracepoint_debug;

@@ -65,14 +62,9 @@ static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
kfree(container_of(head, struct tp_probes, rcu));
}

-static void srcu_free_old_probes(struct rcu_head *head)
-{
- call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
-}
-
static void rcu_free_old_probes(struct rcu_head *head)
{
- call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+ call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
}

static __init int release_early_probes(void)
@@ -90,7 +82,7 @@ static __init int release_early_probes(void)
return 0;
}

-/* SRCU and Tasks Trace RCU are initialized at core_initcall */
+/* Tasks Trace RCU is initialized at core_initcall */
postcore_initcall(release_early_probes);

static inline void release_probes(struct tracepoint_func *old)
@@ -100,9 +92,8 @@ static inline void release_probes(struct tracepoint_func *old)
struct tp_probes, probes[0]);

/*
- * We can't free probes if SRCU and Tasks Trace RCU are not
- * initialized yet. Postpone the freeing till after both are
- * initialized.
+ * We can't free probes if Tasks Trace RCU is not initialized yet.
+ * Postpone the freeing till after Tasks Trace RCU is initialized.
*/
if (unlikely(!ok_to_free_tracepoints)) {
tp_probes->rcu.next = early_probes;
@@ -111,9 +102,11 @@ static inline void release_probes(struct tracepoint_func *old)
}

/*
- * Tracepoint probes are protected by sched RCU, SRCU and
- * Tasks Trace RCU by chaining the callbacks we cover all three
- * cases and wait for all three grace periods.
+ * Tracepoint probes are protected by both sched RCU and
+ * Tasks Trace RCU. By calling the Tasks Trace RCU callback in
+ * the sched RCU callback we cover both cases. So let us chain
+ * the Tasks Trace RCU and sched RCU callbacks to wait for both
+ * grace periods.
*/
call_rcu(&tp_probes->rcu, rcu_free_old_probes);
}
--
2.25.1

2020-10-24 00:32:52

by Joel Fernandes

Subject: Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
> From: Mathieu Desnoyers <[email protected]>
>
> Considering that tracer callbacks expect RCU to be watching (for
> instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
> rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
> no point in using SRCU anymore given that rcuidle tracepoints need to
> ensure RCU is watching. Therefore, simply use sched-RCU like normal
> tracepoints for rcuidle tracepoints.

High level question:

IIRC, doing this increases overhead for general tracing that does not use
perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
tracepoints. I remember adding SRCU because of this reason.

Can the 'rcuidle' information not be pushed down further, such that perf does
it because it requires RCU to be watching, so that it does not effect, say,
trace events?

thanks,

- Joel

> Signed-off-by: Mathieu Desnoyers <[email protected]>
> Cc: Michael Jeanson <[email protected]>
> Cc: Steven Rostedt (VMware) <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Yonghong Song <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: [email protected]
> ---
> include/linux/tracepoint.h | 33 +++++++--------------------------
> kernel/tracepoint.c | 25 +++++++++----------------
> 2 files changed, 16 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 0386b54cbcbb..1414b11f864b 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -13,7 +13,6 @@
> */
>
> #include <linux/smp.h>
> -#include <linux/srcu.h>
> #include <linux/errno.h>
> #include <linux/types.h>
> #include <linux/cpumask.h>
> @@ -33,8 +32,6 @@ struct trace_eval_map {
>
> #define TRACEPOINT_DEFAULT_PRIO 10
>
> -extern struct srcu_struct tracepoint_srcu;
> -
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> extern int
> @@ -86,7 +83,6 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
> static inline void tracepoint_synchronize_unregister(void)
> {
> synchronize_rcu_tasks_trace();
> - synchronize_srcu(&tracepoint_srcu);
> synchronize_rcu();
> }
> #else
> @@ -175,25 +171,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> if (!(cond)) \
> return; \
> \
> - /* srcu can't be used from NMI */ \
> - WARN_ON_ONCE(rcuidle && in_nmi()); \
> - \
> - if (maysleep) { \
> - might_sleep(); \
> + might_sleep_if(maysleep); \
> + if (rcuidle) \
> + rcu_irq_enter_irqson(); \
> + if (maysleep) \
> rcu_read_lock_trace(); \
> - } else { \
> - /* keep srcu and sched-rcu usage consistent */ \
> + else \
> preempt_disable_notrace(); \
> - } \
> - \
> - /* \
> - * For rcuidle callers, use srcu since sched-rcu \
> - * doesn't work from the idle path. \
> - */ \
> - if (rcuidle) { \
> - __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
> - rcu_irq_enter_irqson(); \
> - } \
> \
> it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> \
> @@ -205,15 +189,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> } while ((++it_func_ptr)->func); \
> } \
> \
> - if (rcuidle) { \
> - rcu_irq_exit_irqson(); \
> - srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> - } \
> - \
> if (maysleep) \
> rcu_read_unlock_trace(); \
> else \
> preempt_enable_notrace(); \
> + if (rcuidle) \
> + rcu_irq_exit_irqson(); \
> } while (0)
>
> #ifndef MODULE
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 8d8e41c5d8a5..68b4e50798b1 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -18,9 +18,6 @@
> extern tracepoint_ptr_t __start___tracepoints_ptrs[];
> extern tracepoint_ptr_t __stop___tracepoints_ptrs[];
>
> -DEFINE_SRCU(tracepoint_srcu);
> -EXPORT_SYMBOL_GPL(tracepoint_srcu);
> -
> /* Set to 1 to enable tracepoint debug output */
> static const int tracepoint_debug;
>
> @@ -65,14 +62,9 @@ static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
> kfree(container_of(head, struct tp_probes, rcu));
> }
>
> -static void srcu_free_old_probes(struct rcu_head *head)
> -{
> - call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
> -}
> -
> static void rcu_free_old_probes(struct rcu_head *head)
> {
> - call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> + call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
> }
>
> static __init int release_early_probes(void)
> @@ -90,7 +82,7 @@ static __init int release_early_probes(void)
> return 0;
> }
>
> -/* SRCU and Tasks Trace RCU are initialized at core_initcall */
> +/* Tasks Trace RCU is initialized at core_initcall */
> postcore_initcall(release_early_probes);
>
> static inline void release_probes(struct tracepoint_func *old)
> @@ -100,9 +92,8 @@ static inline void release_probes(struct tracepoint_func *old)
> struct tp_probes, probes[0]);
>
> /*
> - * We can't free probes if SRCU and Tasks Trace RCU are not
> - * initialized yet. Postpone the freeing till after both are
> - * initialized.
> + * We can't free probes if Tasks Trace RCU is not initialized yet.
> + * Postpone the freeing till after Tasks Trace RCU is initialized.
> */
> if (unlikely(!ok_to_free_tracepoints)) {
> tp_probes->rcu.next = early_probes;
> @@ -111,9 +102,11 @@ static inline void release_probes(struct tracepoint_func *old)
> }
>
> /*
> - * Tracepoint probes are protected by sched RCU, SRCU and
> - * Tasks Trace RCU. By chaining the callbacks we cover all three
> - * cases and wait for all three grace periods.
> + * Tracepoint probes are protected by both sched RCU and
> + * Tasks Trace RCU. By calling the Tasks Trace RCU callback in
> + * the sched RCU callback we cover both cases. So let us chain
> + * the Tasks Trace RCU and sched RCU callbacks to wait for both
> + * grace periods.
> */
> call_rcu(&tp_probes->rcu, rcu_free_old_probes);
> }
> --
> 2.25.1
>

2020-10-24 06:54:10

by Michael Jeanson

Subject: [RFC PATCH 4/6] tracing: perf: add support for sleepable tracepoints

In preparation for converting system call enter/exit instrumentation
into sleepable tracepoints, make sure that perf can handle registering
to such tracepoints by explicitly disabling preemption within the perf
tracepoint probes to respect the current expectations within perf ring
buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to connect to sleepable
tracepoints.
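
As with the ftrace probe, the generated perf probe for a MAYSLEEP event
wraps its body in an explicit preemption-disabled section and funnels its
early returns through a single exit label, e.g. in simplified form (see
the _DECLARE_EVENT_CLASS() change below):

preempt_disable_notrace();
head = this_cpu_ptr(event_call->perf_events);
if (!bpf_prog_array_valid(event_call) && hlist_empty(head))
        goto end;       /* nothing to do, but still re-enable preemption */
/* ... allocate the perf trace buffer, fill and submit the record ... */
end:
preempt_enable_notrace();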

Co-developed-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: [email protected]
---
include/trace/perf.h | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index dbc6c74defc3..e1d866c3a076 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -27,8 +27,8 @@
#undef __perf_task
#define __perf_task(t) (__task = (t))

-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#undef _DECLARE_EVENT_CLASS
+#define _DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print, tp_flags) \
static notrace void \
perf_trace_##call(void *__data, proto) \
{ \
@@ -43,13 +43,16 @@ perf_trace_##call(void *__data, proto) \
int __data_size; \
int rctx; \
\
+ if ((tp_flags) & TRACEPOINT_MAYSLEEP) \
+ preempt_disable_notrace(); \
+ \
__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
\
head = this_cpu_ptr(event_call->perf_events); \
if (!bpf_prog_array_valid(event_call) && \
__builtin_constant_p(!__task) && !__task && \
hlist_empty(head)) \
- return; \
+ goto end; \
\
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
sizeof(u64)); \
@@ -57,7 +60,7 @@ perf_trace_##call(void *__data, proto) \
\
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx); \
if (!entry) \
- return; \
+ goto end; \
\
perf_fetch_caller_regs(__regs); \
\
@@ -68,8 +71,23 @@ perf_trace_##call(void *__data, proto) \
perf_trace_run_bpf_submit(entry, __entry_size, rctx, \
event_call, __count, __regs, \
head, __task); \
+end: \
+ if ((tp_flags) & TRACEPOINT_MAYSLEEP) \
+ preempt_enable_notrace(); \
}

+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+ _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print), 0)
+
+#undef DECLARE_EVENT_CLASS_MAYSLEEP
+#define DECLARE_EVENT_CLASS_MAYSLEEP(call, proto, args, tstruct, \
+ assign, print) \
+ _DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), \
+ PARAMS(tstruct), PARAMS(assign), PARAMS(print), \
+ TRACEPOINT_MAYSLEEP)
+
/*
* This part is compiled out, it is only here as a build time check
* to make sure that if the tracepoint handling changes, the
--
2.25.1

2020-10-24 06:54:10

by Michael Jeanson

Subject: [RFC PATCH 5/6] tracing: convert sys_enter/exit to sleepable tracepoints

Convert the definition of the system call enter/exit tracepoints to
sleepable tracepoints, now that all upstream tracers can handle them.
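
Each converted probe now disables preemption explicitly around its body
and funnels all early returns through a single exit label, following the
pattern below (condensed from the ftrace_syscall_enter() changes in this
patch):

static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
{
        preempt_disable_notrace();

        syscall_nr = trace_get_syscall_nr(current, regs);
        if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
                goto end;
        /* ... look up metadata, reserve and commit the event ... */
end:
        preempt_enable_notrace();
}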

Co-developed-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Michael Jeanson <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: [email protected]
---
include/trace/events/syscalls.h | 4 +-
kernel/trace/trace_syscalls.c | 68 ++++++++++++++++++++-------------
2 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..fbb8d8b48f81 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@

#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS

-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_FN_MAYSLEEP(sys_enter,

TP_PROTO(struct pt_regs *regs, long id),

@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,

TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)

-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_FN_MAYSLEEP(sys_exit,

TP_PROTO(struct pt_regs *regs, long ret),

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index d85a2f0f316b..48d92d59fb92 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -304,21 +304,23 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
int syscall_nr;
int size;

+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;

/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
if (!trace_file)
- return;
+ goto end;

if (trace_trigger_soft_disabled(trace_file))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;

@@ -329,7 +331,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
event = trace_buffer_lock_reserve(buffer,
sys_data->enter_event->event.type, size, irq_flags, pc);
if (!event)
- return;
+ goto end;

entry = ring_buffer_event_data(event);
entry->nr = syscall_nr;
@@ -338,6 +340,8 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)

event_trigger_unlock_commit(trace_file, buffer, event, entry,
irq_flags, pc);
+end:
+ preempt_enable_notrace();
}

static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
@@ -352,21 +356,23 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
int pc;
int syscall_nr;

+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;

/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
if (!trace_file)
- return;
+ goto end;

if (trace_trigger_soft_disabled(trace_file))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

local_save_flags(irq_flags);
pc = preempt_count();
@@ -376,7 +382,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
sys_data->exit_event->event.type, sizeof(*entry),
irq_flags, pc);
if (!event)
- return;
+ goto end;

entry = ring_buffer_event_data(event);
entry->nr = syscall_nr;
@@ -384,6 +390,8 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)

event_trigger_unlock_commit(trace_file, buffer, event, entry,
irq_flags, pc);
+end:
+ preempt_enable_notrace();
}

static int reg_event_syscall_enter(struct trace_event_file *file,
@@ -398,7 +406,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_enter)
- ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+ ret = register_trace_maysleep_sys_enter(ftrace_syscall_enter, tr);
if (!ret) {
rcu_assign_pointer(tr->enter_syscall_files[num], file);
tr->sys_refcount_enter++;
@@ -436,7 +444,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_exit)
- ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+ ret = register_trace_maysleep_sys_exit(ftrace_syscall_exit, tr);
if (!ret) {
rcu_assign_pointer(tr->exit_syscall_files[num], file);
tr->sys_refcount_exit++;
@@ -600,20 +608,22 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
int rctx;
int size;

+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;
if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

head = this_cpu_ptr(sys_data->enter_event->perf_events);
valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
if (!valid_prog_array && hlist_empty(head))
- return;
+ goto end;

/* get the size after alignment with the u32 buffer size field */
size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
@@ -622,7 +632,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)

rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
- return;
+ goto end;

rec->nr = syscall_nr;
syscall_get_arguments(current, regs, args);
@@ -632,12 +642,14 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
!perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
- return;
+ goto end;
}

perf_trace_buf_submit(rec, size, rctx,
sys_data->enter_event->event.type, 1, regs,
head, NULL);
+end:
+ preempt_enable_notrace();
}

static int perf_sysenter_enable(struct trace_event_call *call)
@@ -649,7 +661,7 @@ static int perf_sysenter_enable(struct trace_event_call *call)

mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_enter)
- ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+ ret = register_trace_maysleep_sys_enter(perf_syscall_enter, NULL);
if (ret) {
pr_info("event trace: Could not activate syscall entry trace point");
} else {
@@ -699,20 +711,22 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
int rctx;
int size;

+ preempt_disable_notrace();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
- return;
+ goto end;
if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
- return;
+ goto end;

sys_data = syscall_nr_to_meta(syscall_nr);
if (!sys_data)
- return;
+ goto end;

head = this_cpu_ptr(sys_data->exit_event->perf_events);
valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
if (!valid_prog_array && hlist_empty(head))
- return;
+ goto end;

/* We can probably do that at build time */
size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
@@ -720,7 +734,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)

rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
- return;
+ goto end;

rec->nr = syscall_nr;
rec->ret = syscall_get_return_value(current, regs);
@@ -729,11 +743,13 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
!perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
- return;
+ goto end;
}

perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
1, regs, head, NULL);
+end:
+ preempt_enable_notrace();
}

static int perf_sysexit_enable(struct trace_event_call *call)
@@ -745,7 +761,7 @@ static int perf_sysexit_enable(struct trace_event_call *call)

mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_exit)
- ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+ ret = register_trace_maysleep_sys_exit(perf_syscall_exit, NULL);
if (ret) {
pr_info("event trace: Could not activate syscall exit trace point");
} else {
--
2.25.1

2020-10-26 11:14:33

by Peter Zijlstra

Subject: Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

On Fri, Oct 23, 2020 at 05:13:59PM -0400, Joel Fernandes wrote:
> On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
> > From: Mathieu Desnoyers <[email protected]>
> >
> > Considering that tracer callbacks expect RCU to be watching (for
> > instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
> > rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
> > no point in using SRCU anymore given that rcuidle tracepoints need to
> > ensure RCU is watching. Therefore, simply use sched-RCU like normal
> > tracepoints for rcuidle tracepoints.
>
> High level question:
>
> IIRC, doing this increases overhead for general tracing that does not use
> perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
> tracepoints. I remember adding SRCU because of this reason.
>
> Can the 'rcuidle' information not be pushed down further, such that perf does
> it because it requires RCU to be watching, so that it does not effect, say,
> trace events?

There are very few trace_.*_rcuidle() users left. We should eradicate them
and remove the option. They're bugs to begin with.

2020-10-26 13:20:12

by peter enderborg

Subject: Re: [RFC PATCH 0/6] Sleepable tracepoints

On 10/23/20 9:53 PM, Michael Jeanson wrote:
> When invoked from system call enter/exit instrumentation, accessing
> user-space data is a common use-case for tracers. However, tracepoints
> currently disable preemption around iteration on the registered
> tracepoint probes and invocation of the probe callbacks, which prevents
> tracers from handling page faults.
>
> Extend the tracepoint and trace event APIs to allow specific tracer
> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
> called from sleepable context, and convert the system call enter/exit
> instrumentation to sleepable tracepoints.

Will this not be a problem for analysis of the trace? It gets two
relevant times, one when it is called and one when it returns.

It makes things harder to correlate in what order things happen.

And what about tracing in contexts that are already not preemptible?

E.g. the same tracepoint is used in different places and contexts.


> This series only implements the tracepoint infrastructure required to
> allow tracers to handle page faults. Modifying each tracer to handle
> those page faults would be a next step after we all agree on this piece
> of instrumentation infrastructure.
>
> This patchset is based on v5.9.1.
>
> Cc: Steven Rostedt (VMware) <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Yonghong Song <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: [email protected]
>
> Mathieu Desnoyers (1):
> tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
>
> Michael Jeanson (5):
> tracing: introduce sleepable tracepoints
> tracing: ftrace: add support for sleepable tracepoints
> tracing: bpf-trace: add support for sleepable tracepoints
> tracing: perf: add support for sleepable tracepoints
> tracing: convert sys_enter/exit to sleepable tracepoints
>
> include/linux/tracepoint-defs.h | 11 ++++
> include/linux/tracepoint.h | 104 +++++++++++++++++++++-----------
> include/trace/bpf_probe.h | 23 ++++++-
> include/trace/define_trace.h | 7 +++
> include/trace/events/syscalls.h | 4 +-
> include/trace/perf.h | 26 ++++++--
> include/trace/trace_events.h | 79 ++++++++++++++++++++++--
> init/Kconfig | 1 +
> kernel/trace/bpf_trace.c | 5 +-
> kernel/trace/trace_events.c | 15 ++++-
> kernel/trace/trace_syscalls.c | 68 +++++++++++++--------
> kernel/tracepoint.c | 104 +++++++++++++++++++++++++-------
> 12 files changed, 351 insertions(+), 96 deletions(-)
>

2020-10-26 14:32:15

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

----- On Oct 26, 2020, at 4:20 AM, Peter Zijlstra [email protected] wrote:

> On Fri, Oct 23, 2020 at 05:13:59PM -0400, Joel Fernandes wrote:
>> On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
>> > From: Mathieu Desnoyers <[email protected]>
>> >
>> > Considering that tracer callbacks expect RCU to be watching (for
>> > instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
>> > rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
>> > no point in using SRCU anymore given that rcuidle tracepoints need to
>> > ensure RCU is watching. Therefore, simply use sched-RCU like normal
>> > tracepoints for rcuidle tracepoints.
>>
>> High level question:
>>
>> IIRC, doing this increases overhead for general tracing that does not use
>> perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
>> tracepoints. I remember adding SRCU because of this reason.
>>
>> Can the 'rcuidle' information not be pushed down further, such that perf does
> >> it because it requires RCU to be watching, so that it does not affect, say,
>> trace events?
>
> There are very few trace_.*_rcuidle() users left. We should eradicate them
> and remove the option. They're bugs to begin with.

I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
API and fixing all callers to ensure they trace from a context where RCU is
watching would simplify instrumentation of the Linux kernel, thus making it harder
for subtle bugs to hide and be unearthed only when tracing is enabled. This is
AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
think it is a good thing.

So if we consider this our target, and that the current state of things is that
we need to have RCU watching around callback invocation, then removing the
dependency on SRCU seems like an overall simplification which does not regress
feature-wise nor speed-wise compared with what we have upstream today. The next
steps would then be to audit all rcuidle tracepoints and make sure the context
where they are placed has RCU watching already, so we can remove the tracepoint
rcuidle API. That would effectively remove the calls to rcu_irq_{enter,exit}_irqson
from the tracepoint code.
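
For reference, a condensed sketch of the probe-invocation path this would
leave behind (simplified from the __DO_TRACE() macro; not the verbatim code):

	/* rcuidle callers first make RCU watch, then everyone uses
	 * plain sched-RCU, so SRCU is no longer needed: */
	if (rcuidle)
		rcu_irq_enter_irqson();
	preempt_disable_notrace();	/* sched-RCU read-side section */
	/* iterate the registered tracepoint_func entries and call them */
	preempt_enable_notrace();
	if (rcuidle)
		rcu_irq_exit_irqson();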

This is however beyond the scope of the proposed patch set.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-10-26 15:01:26

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Sleepable tracepoints

----- On Oct 26, 2020, at 8:05 AM, peter enderborg [email protected] wrote:

> On 10/23/20 9:53 PM, Michael Jeanson wrote:
>> When invoked from system call enter/exit instrumentation, accessing
>> user-space data is a common use-case for tracers. However, tracepoints
>> currently disable preemption around iteration on the registered
>> tracepoint probes and invocation of the probe callbacks, which prevents
>> tracers from handling page faults.
>>
>> Extend the tracepoint and trace event APIs to allow specific tracer
>> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
>> called from sleepable context, and convert the system call enter/exit
>> instrumentation to sleepable tracepoints.
>
> Will this not be a problem for analysis of the trace? It gets two
> relevant times, one when it is called and one when it returns.

It will depend on what the tracer chooses to do. If we call the side-effect
of what is being traced a "transaction" (e.g. actually opening a file
descriptor and adding it to a process' file descriptor table as the result
of an open(2) system call), we have to consider that already today the
timestamp which we get is either slightly before or after the actual
side-effect of the transaction in the kernel. That is true even without
being preemptible.

Sometimes it's not relevant to have a tracepoint before and after the
transaction, e.g. when all we care about is knowing whether the transaction
happened successfully or not.

In the case of system calls, we have sys_enter and sys_exit to mark the
beginning and end of the "transaction". Whatever side-effects are done by
the system call happen in between.

I think the question here is whether it is relevant to know that page
faults triggered by accessing system call input parameters will
happen after we trace a "system call entry" event. If the tracers care,
then it would be up to them to first trace that "system call entry" event,
and have a separate event for the argument payload. But there are other
ways to identify whether page faults happen within the system call or
from user-space, for instance by using the instruction pointer associated
with the page fault. So when observing page faults happening before sys
enter, but associated with a kernel instruction pointer, a trace analysis
tool could theoretically figure out who is to blame for that page fault,
*if* it cares.

>
> It makes it harder to correlate the order in which things happen.

The alternative is to have partial payloads like LTTng does today for
system call arguments. If reading a string from userspace (e.g. an open(2)
file name) requires taking a page fault, LTTng truncates the string. This
is pretty bad for automated analysis as well.
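
To make the contrast concrete, here is a minimal sketch of today's
truncating behaviour (the probe helper and buffer handling are hypothetical;
strncpy_from_user_nofault() is the kernel's non-faulting string copy):

#include <linux/uaccess.h>

/* Non-sleepable context: we cannot take the fault, so a non-resident
 * open(2) filename comes back as -EFAULT and is recorded truncated. */
static void record_open_filename(const char __user *filename)
{
	char buf[256];
	long ret;

	ret = strncpy_from_user_nofault(buf, filename, sizeof(buf));
	if (ret < 0)
		buf[0] = '\0';	/* page not resident: payload is lost */
	/* ... emit buf into the trace buffer ... */
}

A sleepable tracepoint would let the probe call plain strncpy_from_user()
instead and page the full string in.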

>
> And what about handling tracing of contexts that are already not preemptible?

Sleepable tracepoints are only meant to be used in contexts which can sleep.
Tracepoints placed in non-preemptible contexts should never take a page
fault to begin with.

>
> E.g. the same tracepoint is used in different places and contexts.

As far as considering that a given tracepoint "name" could be registered to
by both sleepable and non-sleepable tracer probes, I would like to see an
actual use-case for this. I don't have any.

I can envision that some tracer code will want to be allowed to work in
both sleepable and non-sleepable contexts, e.g. take page faults in
sleepable context (when called from a sleepable tracepoint), but fall back
to truncation when called from non-sleepable context. This can actually
be done by looking at the new "TRACEPOINT_MAYSLEEP" tp flag. Passing that
flag to code shared between sleepable and non-sleepable probes would allow
the callee to know whether it can take a page fault or not, as sketched below.
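
A minimal sketch of that pattern (the helper and the way tp_flags reaches it
are hypothetical; only the TRACEPOINT_MAYSLEEP flag comes from this series):

#include <linux/uaccess.h>

static void tracer_copy_string(unsigned int tp_flags,
			       const char __user *ustr, char *buf, int len)
{
	long ret;

	if (tp_flags & TRACEPOINT_MAYSLEEP)
		ret = strncpy_from_user(buf, ustr, len);	 /* may fault */
	else
		ret = strncpy_from_user_nofault(buf, ustr, len); /* truncates */
	if (ret < 0)
		buf[0] = '\0';
}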

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-10-27 00:10:13

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints

On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
> -#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags) \
> do { \
> struct tracepoint_func *it_func_ptr; \
> void *it_func; \
> void *__data; \
> int __maybe_unused __idx = 0; \
> + bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP; \
> \
> if (!(cond)) \
> return; \
> @@ -170,8 +178,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> /* srcu can't be used from NMI */ \
> WARN_ON_ONCE(rcuidle && in_nmi()); \
> \
> - /* keep srcu and sched-rcu usage consistent */ \
> - preempt_disable_notrace(); \
> + if (maysleep) { \
> + might_sleep(); \

The main purpose of the patch set is to access user memory in tracepoints, right?
In such a case I suggest using the stronger might_fault() here.
We used might_sleep() in sleepable bpf and it wasn't enough to catch
a combination where a sleepable hook was invoked while mm->mmap_lock was
taken, which may cause a deadlock.

> + rcu_read_lock_trace(); \
> + } else { \
> + /* keep srcu and sched-rcu usage consistent */ \
> + preempt_disable_notrace(); \
> + } \
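
For context, might_fault() is indeed stronger: condensed from the kernel's
__might_fault() in mm/memory.c (debug builds; slightly simplified here), it
adds a lockdep assertion that mmap_lock could be taken for read at this
point, which is exactly the deadlock class described above:

void __might_fault(const char *file, int line)
{
	/* user accesses with page faults disabled are allowed anywhere */
	if (pagefault_disabled())
		return;
	__might_sleep(file, line, 0);	/* everything might_sleep() checks */
	if (current->mm)		/* plus: mmap_lock must be takeable */
		might_lock_read(&current->mm->mmap_lock);
}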

2020-10-27 06:57:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

On Mon, 26 Oct 2020 10:28:07 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
> API and fixing all callers to ensure they trace from a context where RCU is
> watching would simplify instrumentation of the Linux kernel, thus making it harder
> for subtle bugs to hide and be unearthed only when tracing is enabled. This is

Note, the lockdep RCU checking of a tracepoint is outside of it being
enabled or disabled. So if a non-rcuidle() tracepoint is in a location where
RCU is not watching, it will complain loudly, even if you don't enable that
tracepoint.
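
For reference, this is visible in the generated trace_##name() wrapper
(condensed from include/linux/tracepoint.h around v5.9; simplified): the
lockdep-checked rcu_dereference_sched() runs outside the static-key
enablement test:

	static inline void trace_##name(proto)
	{
		if (static_key_false(&__tracepoint_##name.key))
			__DO_TRACE(...);	/* only when enabled */
		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {
			/* checked even when the tracepoint is disabled */
			rcu_read_lock_sched_notrace();
			rcu_dereference_sched(__tracepoint_##name.funcs);
			rcu_read_unlock_sched_notrace();
		}
	}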

> AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
> think it is a good thing.
>
> So if we consider this our target, and that the current state of things is that
> we need to have RCU watching around callback invocation, then removing the
> dependency on SRCU seems like an overall simplification which does not regress
> feature-wise nor speed-wise compared with what we have upstream today. The next
> steps would then be to audit all rcuidle tracepoints and make sure the context
> where they are placed has RCU watching already, so we can remove the tracepoint

Just remove the _rcuidle() from them, and lockdep will complain if they are
being called without RCU watching.

-- Steve


> rcuidle API. That would effectively remove the calls to rcu_irq_{enter,exit}_irqson
> from the tracepoint code.
>
> This is however beyond the scope of the proposed patch set.
>
> Thanks,
>
> Mathieu
>

2020-10-27 14:55:04

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints


----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov [email protected] wrote:

> On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
>> -#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
>> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags) \
>> do { \
>> struct tracepoint_func *it_func_ptr; \
>> void *it_func; \
>> void *__data; \
>> int __maybe_unused __idx = 0; \
>> + bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP; \
>> \
>> if (!(cond)) \
>> return; \
>> @@ -170,8 +178,13 @@ static inline struct tracepoint
>> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>> /* srcu can't be used from NMI */ \
>> WARN_ON_ONCE(rcuidle && in_nmi()); \
>> \
>> - /* keep srcu and sched-rcu usage consistent */ \
>> - preempt_disable_notrace(); \
>> + if (maysleep) { \
>> + might_sleep(); \
>
> The main purpose of the patch set is to access user memory in tracepoints,
> right?

Yes, exactly.

> In such a case I suggest using the stronger might_fault() here.
> We used might_sleep() in sleepable bpf and it wasn't enough to catch
> a combination where a sleepable hook was invoked while mm->mmap_lock was
> taken, which may cause a deadlock.

Good point! We will do that for the next round.

By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive?
(a "faultable" tracepoint sounds weird though)

Thanks,

Mathieu

>
>> + rcu_read_lock_trace(); \
>> + } else { \
>> + /* keep srcu and sched-rcu usage consistent */ \
>> + preempt_disable_notrace(); \
>> + } \

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-10-28 08:07:05

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

----- On Oct 26, 2020, at 4:44 PM, rostedt [email protected] wrote:

> On Mon, 26 Oct 2020 10:28:07 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
>> API and fixing all callers to ensure they trace from a context where RCU is
>> watching would simplify instrumentation of the Linux kernel, thus making it harder
>> for subtle bugs to hide and be unearthed only when tracing is enabled. This is
>
> Note, the lockdep RCU checking of a tracepoint is outside of it being
> enabled or disabled. So if a non-rcuidle() tracepoint is in a location where
> RCU is not watching, it will complain loudly, even if you don't enable that
> tracepoint.
>
>> AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
>> think it is a good thing.
>>
>> So if we consider this our target, and that the current state of things is that
>> we need to have RCU watching around callback invocation, then removing the
>> dependency on SRCU seems like an overall simplification which does not regress
>> feature-wise nor speed-wise compared with what we have upstream today. The next
>> steps would then be to audit all rcuidle tracepoints and make sure the context
>> where they are placed has RCU watching already, so we can remove the tracepoint
>
> Just remove the _rcuidle() from them, and lockdep will complain if they are
> being called without RCU watching.

That's the easy part. The patch removing rcuidle is attached to this email,
and here are the splats I get just when booting the machine (see below). The
part which takes more time is fixing those splats and figuring out how to
restructure the code. For instance, what should we do about the rcuidle
tracepoint within printk()?

Also, how do we test rcuidle tracepoints that are only triggered when printk
is called from printk warnings? We'd also need to test on arm32 boards,
because some arm architecture code uses rcuidle tracepoints as well.

As this is beyond the scope of this patch set, I can either drop this patch
entirely (it's not required for sleepable tracepoints), or keep it as an
intermediate cleanup step. Removing rcuidle tracepoints entirely is beyond
the effort Michael and I can undertake now.

=============================
WARNING: suspicious RCU usage

5.9.1+ #72 Not tainted
-----------------------------
=============================
./include/trace/events/preemptirq.h:42 suspicious rcu_dereference_check() usage!
WARNING: suspicious RCU usage

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
5.9.1+ #72 Not tainted
RCU used illegally from extended quiescent state!
no locks held by swapper/0/0.
-----------------------------
./include/trace/events/preemptirq.h:38 suspicious rcu_dereference_check() usage!

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
RCU used illegally from extended quiescent state!
no locks held by swapper/1/0.
dump_stack+0x8d/0xbb

stack backtrace:
? default_idle+0xa/0x20
trace_hardirqs_on+0xed/0xf0
default_idle+0xa/0x20
default_idle_call+0x4f/0x1e0
do_idle+0x1ef/0x2c0
cpu_startup_entry+0x19/0x20
start_kernel+0x578/0x59d
secondary_startup_64+0xa4/0xb0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.1+ #72

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/trace/events/lock.h:37 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
no locks held by swapper/0/0.

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x8d/0xbb
lock_acquire+0x346/0x3b0
? __lock_acquire+0x2e7/0x1da0
_raw_spin_lock+0x2c/0x40
? vprintk_emit+0x9f/0x410
vprintk_emit+0x9f/0x410
printk+0x52/0x6e
lockdep_rcu_suspicious+0x1b/0xe0
? default_idle+0xa/0x20
trace_hardirqs_on+0xed/0xf0
default_idle+0xa/0x20
default_idle_call+0x4f/0x1e0
do_idle+0x1ef/0x2c0
cpu_startup_entry+0x19/0x20
start_kernel+0x578/0x59d
secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/trace/events/lock.h:63 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
1 lock held by swapper/0/0:
#0: ffffffff97a80158 (logbuf_lock){-...}-{2:2}, at: vprintk_emit+0x9f/0x410

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x8d/0xbb
lock_release+0x25a/0x280
_raw_spin_unlock+0x17/0x30
vprintk_emit+0xdf/0x410
printk+0x52/0x6e
lockdep_rcu_suspicious+0x1b/0xe0
? default_idle+0xa/0x20
trace_hardirqs_on+0xed/0xf0
default_idle+0xa/0x20
default_idle_call+0x4f/0x1e0
do_idle+0x1ef/0x2c0
cpu_startup_entry+0x19/0x20
start_kernel+0x578/0x59d
secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/trace/events/printk.h:33 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
2 locks held by swapper/0/0:
#0: ffffffff97a801a0 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x126/0x410
#1: ffffffff97a7fec0 (console_owner){....}-{0:0}, at: console_unlock+0x190/0x5d0

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x8d/0xbb
console_unlock+0x5ad/0x5d0
vprintk_emit+0x133/0x410
printk+0x52/0x6e
lockdep_rcu_suspicious+0x1b/0xe0
? default_idle+0xa/0x20
trace_hardirqs_on+0xed/0xf0
default_idle+0xa/0x20
default_idle_call+0x4f/0x1e0
do_idle+0x1ef/0x2c0
cpu_startup_entry+0x19/0x20
start_kernel+0x578/0x59d
secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/linux/rcupdate.h:636 rcu_read_lock() used illegally while idle!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
4 locks held by swapper/0/0:
#0: ffffffff97a801a0 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x126/0x410
#1: ffffffff97a7fec0 (console_owner){....}-{0:0}, at: console_unlock+0x190/0x5d0
#2: ffffffff97b7d598 (printing_lock){....}-{2:2}, at: vt_console_print+0x7d/0x3e0
#3: ffffffff97a82d80 (rcu_read_lock){....}-{1:2}, at: __atomic_notifier_call_chain+0x5/0x110

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x8d/0xbb
__atomic_notifier_call_chain+0xd7/0x110
vt_console_print+0x19e/0x3e0
console_unlock+0x417/0x5d0
vprintk_emit+0x133/0x410
printk+0x52/0x6e
lockdep_rcu_suspicious+0x1b/0xe0
? default_idle+0xa/0x20
trace_hardirqs_on+0xed/0xf0
default_idle+0xa/0x20
default_idle_call+0x4f/0x1e0
do_idle+0x1ef/0x2c0
cpu_startup_entry+0x19/0x20
start_kernel+0x578/0x59d
secondary_startup_64+0xa4/0xb0

=============================
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-----------------------------
./include/linux/rcupdate.h:685 rcu_read_unlock() used illegally while idle!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
4 locks held by swapper/0/0:
#0: ffffffff97a801a0 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x126/0x410
#1: ffffffff97a7fec0 (console_owner){....}-{0:0}, at: console_unlock+0x190/0x5d0
#2: ffffffff97b7d598 (printing_lock){....}-{2:2}, at: vt_console_print+0x7d/0x3e0
#3: ffffffff97a82d80 (rcu_read_lock){....}-{1:2}, at: __atomic_notifier_call_chain+0x5/0x110

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x8d/0xbb
__atomic_notifier_call_chain+0x101/0x110
vt_console_print+0x19e/0x3e0
console_unlock+0x417/0x5d0
vprintk_emit+0x133/0x410
printk+0x52/0x6e
lockdep_rcu_suspicious+0x1b/0xe0
? default_idle+0xa/0x20
trace_hardirqs_on+0xed/0xf0
default_idle+0xa/0x20
default_idle_call+0x4f/0x1e0
do_idle+0x1ef/0x2c0
cpu_startup_entry+0x19/0x20
start_kernel+0x578/0x59d
secondary_startup_64+0xa4/0xb0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x8d/0xbb
? rcu_idle_exit+0x32/0x40
trace_hardirqs_off+0xe3/0xf0
rcu_idle_exit+0x32/0x40
default_idle_call+0x54/0x1e0
do_idle+0x1ef/0x2c0
? lockdep_hardirqs_on_prepare+0xec/0x180
cpu_startup_entry+0x19/0x20
start_secondary+0x11c/0x160
secondary_startup_64+0xa4/0xb0

Thanks,

Mathieu


>
> -- Steve
>
>
>> rcuidle API. That would effectively remove the calls to
>> rcu_irq_{enter,exit}_irqson
>> from the tracepoint code.
>>
>> This is however beyond the scope of the proposed patch set.
>>
>> Thanks,
>>
>> Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Attachments:
0001-wip-remove-rcuidle-tracepoint-API.patch (15.46 kB)

2020-10-29 02:16:37

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints

On Tue, Oct 27, 2020 at 09:37:08AM -0400, Mathieu Desnoyers wrote:
>
> ----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov [email protected] wrote:
>
> > On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
> >> -#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> >> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags) \
> >> do { \
> >> struct tracepoint_func *it_func_ptr; \
> >> void *it_func; \
> >> void *__data; \
> >> int __maybe_unused __idx = 0; \
> >> + bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP; \
> >> \
> >> if (!(cond)) \
> >> return; \
> >> @@ -170,8 +178,13 @@ static inline struct tracepoint
> >> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> >> /* srcu can't be used from NMI */ \
> >> WARN_ON_ONCE(rcuidle && in_nmi()); \
> >> \
> >> - /* keep srcu and sched-rcu usage consistent */ \
> >> - preempt_disable_notrace(); \
> >> + if (maysleep) { \
> >> + might_sleep(); \
> >
> > The main purpose of the patch set is to access user memory in tracepoints,
> > right?
>
> Yes, exactly.
>
> > In such a case I suggest using the stronger might_fault() here.
> > We used might_sleep() in sleepable bpf and it wasn't enough to catch
> > a combination where a sleepable hook was invoked while mm->mmap_lock was
> > taken, which may cause a deadlock.
>
> Good point! We will do that for the next round.
>
> By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
> but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive?
> (a "faultable" tracepoint sounds weird though)

bpf kept 'sleepable' as a name. 'faultable' is too misleading.

2020-11-02 18:45:50

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

On Mon, Oct 26, 2020 at 10:28:07AM -0400, Mathieu Desnoyers wrote:
> ----- On Oct 26, 2020, at 4:20 AM, Peter Zijlstra [email protected] wrote:
>
> > On Fri, Oct 23, 2020 at 05:13:59PM -0400, Joel Fernandes wrote:
> >> On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
> >> > From: Mathieu Desnoyers <[email protected]>
> >> >
> >> > Considering that tracer callbacks expect RCU to be watching (for
> >> > instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
> >> > rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
> >> > no point in using SRCU anymore given that rcuidle tracepoints need to
> >> > ensure RCU is watching. Therefore, simply use sched-RCU like normal
> >> > tracepoints for rcuidle tracepoints.
> >>
> >> High level question:
> >>
> >> IIRC, doing this increases overhead for general tracing that does not use
> >> perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
> >> tracepoints. I remember adding SRCU because of this reason.
> >>
> >> Can the 'rcuidle' information not be pushed down further, such that perf does
> >> it because it requires RCU to be watching, so that it does not affect, say,
> >> trace events?
> >
> > There are very few trace_.*_rcuidle() users left. We should eradicate them
> > and remove the option. They're bugs to begin with.
>
> I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
> API and fixing all callers to ensure they trace from a context where RCU is
> watching would simplify instrumentation of the Linux kernel, thus making it harder
> for subtle bugs to hide and be unearthed only when tracing is enabled. This is
> AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
> think it is a good thing.
>
> So if we consider this our target, and that the current state of things is that
> we need to have RCU watching around callback invocation, then removing the
> dependency on SRCU seems like an overall simplification which does not regress
> feature-wise nor speed-wise compared with what we have upstream today. The next
> steps would then be to audit all rcuidle tracepoints and make sure the context
> where they are placed has RCU watching already, so we can remove the tracepoint
> rcuidle API. That would effectively remove the calls to rcu_irq_{enter,exit}_irqson
> from the tracepoint code.
>
> This is however beyond the scope of the proposed patch set.

You are right, it doesn't regress speed-wise - I got confused since the code
was modified to call rcu_irq_enter_irqson() even for the rcuidle case (which I
had avoided when I added SRCU). So in the current code, SRCU is kind of
pointless. I think we should keep the patch in the series.

thanks,

- Joel

2020-11-02 18:53:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints

On Tue, Oct 27, 2020 at 09:37:08AM -0400, Mathieu Desnoyers wrote:
>
> ----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov [email protected] wrote:
>
> > On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
> >> -#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> >> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags) \
> >> do { \
> >> struct tracepoint_func *it_func_ptr; \
> >> void *it_func; \
> >> void *__data; \
> >> int __maybe_unused __idx = 0; \
> >> + bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP; \
> >> \
> >> if (!(cond)) \
> >> return; \
> >> @@ -170,8 +178,13 @@ static inline struct tracepoint
> >> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> >> /* srcu can't be used from NMI */ \
> >> WARN_ON_ONCE(rcuidle && in_nmi()); \
> >> \
> >> - /* keep srcu and sched-rcu usage consistent */ \
> >> - preempt_disable_notrace(); \
> >> + if (maysleep) { \
> >> + might_sleep(); \
> >
> > The main purpose of the patch set is to access user memory in tracepoints,
> > right?
>
> Yes, exactly.
>
> > In such a case I suggest using the stronger might_fault() here.
> > We used might_sleep() in sleepable bpf and it wasn't enough to catch
> > a combination where a sleepable hook was invoked while mm->mmap_lock was
> > taken, which may cause a deadlock.
>
> Good point! We will do that for the next round.
>
> By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
> but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive?
> (a "faultable" tracepoint sounds weird though)

What about keeping might_sleep() here and then adding might_fault() in the
probe handler? Since the probe handler knows that it may cause a page fault,
it could assert that itself.
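
A sketch of that alternative (the probe below is hypothetical; might_fault()
is the kernel primitive discussed above):

#include <linux/kernel.h>
#include <linux/ptrace.h>

static void my_sys_enter_probe(void *data, struct pt_regs *regs, long id)
{
	might_fault();	/* this probe reads user memory: assert it here */
	/* ... copy_from_user() on pointer arguments, record the event ... */
}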

One more thought: Should we make _all_ tracepoints sleepable, and then move
the preempt_disable() bit to the probe handler as needed? That could simplify
the tracepoint API as well. Steven said before that whoever registers probes
knows what they are doing, so I am OK with that.
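
And a sketch of that second thought (again with a hypothetical probe): if the
core stopped disabling preemption, the probe-list iteration would have to rely
on a sleep-tolerant RCU flavour such as tasks trace RCU, and a probe that
needs atomicity would provide it itself:

#include <linux/preempt.h>

static void my_atomic_probe(void *data, long id)
{
	preempt_disable_notrace();
	/* ... per-cpu buffer access or other non-sleepable work ... */
	preempt_enable_notrace();
}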

No strong feelings one way or the other, for either of these though.

thanks,

- Joel

>
> Thanks,
>
> Mathieu
>
> >
> >> + rcu_read_lock_trace(); \
> >> + } else { \
> >> + /* keep srcu and sched-rcu usage consistent */ \
> >> + preempt_disable_notrace(); \
> >> + } \
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

2021-02-11 19:38:35

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints

----- On Oct 28, 2020, at 5:23 PM, Alexei Starovoitov [email protected] wrote:

> On Tue, Oct 27, 2020 at 09:37:08AM -0400, Mathieu Desnoyers wrote:
>>
>> ----- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov
>> [email protected] wrote:
>>
>> > On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
>> >> -#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
>> >> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags) \
>> >> do { \
>> >> struct tracepoint_func *it_func_ptr; \
>> >> void *it_func; \
>> >> void *__data; \
>> >> int __maybe_unused __idx = 0; \
>> >> + bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP; \
>> >> \
>> >> if (!(cond)) \
>> >> return; \
>> >> @@ -170,8 +178,13 @@ static inline struct tracepoint
>> >> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>> >> /* srcu can't be used from NMI */ \
>> >> WARN_ON_ONCE(rcuidle && in_nmi()); \
>> >> \
>> >> - /* keep srcu and sched-rcu usage consistent */ \
>> >> - preempt_disable_notrace(); \
>> >> + if (maysleep) { \
>> >> + might_sleep(); \
>> >
>> > The main purpose of the patch set is to access user memory in tracepoints,
>> > right?
>>
>> Yes, exactly.
>>
>> > In such a case I suggest using the stronger might_fault() here.
>> > We used might_sleep() in sleepable bpf and it wasn't enough to catch
>> > a combination where a sleepable hook was invoked while mm->mmap_lock was
>> > taken, which may cause a deadlock.
>>
>> Good point! We will do that for the next round.
>>
>> By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
>> but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive?
>> (a "faultable" tracepoint sounds weird though)
>
> bpf kept 'sleepable' as a name. 'faultable' is too misleading.

We're working on an updated patchset for those "sleepable tracepoints", and considering
that those are really "tracepoints allowing page faults", I must admit that I am
uncomfortable with the confusion between "sleep" and "fault" in the naming here.

I am tempted to make the following changes:

- Change name from "sleepable tracepoints" to a better suited "tracepoints allowing page faults",
- Use might_fault() rather than might_sleep() in __DO_TRACE(), effectively guaranteeing that all
probes connecting to a tracepoint which allows page faults can indeed take page faults (see the sketch below),
- Change TRACEPOINT_MAYSLEEP into TRACEPOINT_MAYFAULT.
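
A condensed sketch of the resulting guard in __DO_TRACE() under this renaming
(the MAYFAULT spelling follows the proposal above; it is not merged code):

	if (tp_flags & TRACEPOINT_MAYFAULT) {
		might_fault();		/* all probes here may take page faults */
		rcu_read_lock_trace();	/* tasks-trace RCU protects the iteration */
	} else {
		/* keep srcu and sched-rcu usage consistent */
		preempt_disable_notrace();
	}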

Any objections?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com