This is the next revision of preempt/irq tracepoint centralization and
unified usage across the kernel [1].
The preempt/irq tracepoints exist but not everything in the kernel is
using it. This makes things not work simultaneously (for ex, only either
lockdep or irqsoff events can be used at a time). This series is an attempt to
solve that, and also results in a nice clean up of kernel in general.
Several ifdefs are simpler, and the design is more unified and better.
Also as a result of this, we also speeded performance all rcuidle
tracepoints since their handling is simpler.
v5:
- Fixed performance issues due to rcu-idle handling
Joel Fernandes (5):
softirq: reorder trace_softirqs_on to prevent lockdep splat
srcu: Add notrace variant of srcu_dereference
trace/irqsoff: Split reset into seperate functions
tracepoint: Make rcuidle tracepoint callers use SRCU
tracing: Centralize preemptirq tracepoints and unify their usage
Paul E. McKenney (1):
srcu: Add notrace variants of srcu_read_{lock,unlock}
include/linux/ftrace.h | 11 +-
include/linux/irqflags.h | 11 +-
include/linux/lockdep.h | 8 +-
include/linux/preempt.h | 2 +-
include/linux/srcu.h | 22 +++
include/linux/tracepoint.h | 47 +++++-
include/trace/events/preemptirq.h | 23 +--
init/main.c | 5 +-
kernel/locking/lockdep.c | 35 ++---
kernel/sched/core.c | 2 +-
kernel/softirq.c | 6 +-
kernel/trace/Kconfig | 22 ++-
kernel/trace/Makefile | 2 +-
kernel/trace/trace_irqsoff.c | 235 +++++++++---------------------
kernel/trace/trace_preemptirq.c | 71 +++++++++
kernel/tracepoint.c | 10 +-
16 files changed, 283 insertions(+), 229 deletions(-)
create mode 100644 kernel/trace/trace_preemptirq.c
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Joel Fernandes <[email protected]>
--
2.17.0.441.gb46fe60e1d-goog
In this series, we are making lockdep use an rcuidle tracepoint. For
this reason we need a notrace variant of srcu_dereference since
otherwise we get lockdep splats since lockdep hooks may not have run
yet. This patch adds the needed variant.
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Joel Fernandes <[email protected]>
---
include/linux/srcu.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 2ec618979b20..a1c4947be877 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -135,6 +135,11 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp)
*/
#define srcu_dereference(p, sp) srcu_dereference_check((p), (sp), 0)
+/**
+ * srcu_dereference_notrace - no tracing and no lockdep calls from here
+ */
+#define srcu_dereference_notrace(p, sp) srcu_dereference_check((p), (sp), 1)
+
/**
* srcu_read_lock - register a new reader for an SRCU-protected structure.
* @sp: srcu_struct in which to register the new reader.
--
2.17.0.441.gb46fe60e1d-goog
In recent tests with IRQ on/off tracepoints, a large performance
overhead ~10% is noticed when running hackbench. This is root caused to
calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
tracepoint code. Following a long discussion on the list [1] about this,
we concluded that srcu is a better alternative for use during rcu idle.
Although it does involve extra barriers, its lighter than the sched-rcu
version which has to do additional RCU calls to notify RCU idle about
entry into RCU sections.
In this patch, we change the underlying implementation of the
trace_*_rcuidle API to use SRCU. This has shown to improve performance
alot for the high frequency irq enable/disable tracepoints.
Test: Tested idle and preempt/irq tracepoints.
[1] https://patchwork.kernel.org/patch/10344297/
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Joel Fernandes <[email protected]>
---
include/linux/tracepoint.h | 46 +++++++++++++++++++++++++++++++-------
kernel/tracepoint.c | 10 ++++++++-
2 files changed, 47 insertions(+), 9 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index c94f466d57ef..4135e08fb5f1 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -15,6 +15,7 @@
*/
#include <linux/smp.h>
+#include <linux/srcu.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/cpumask.h>
@@ -33,6 +34,8 @@ struct trace_eval_map {
#define TRACEPOINT_DEFAULT_PRIO 10
+extern struct srcu_struct tracepoint_srcu;
+
extern int
tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
extern int
@@ -77,6 +80,9 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
*/
static inline void tracepoint_synchronize_unregister(void)
{
+#ifdef CONFIG_TRACEPOINTS
+ synchronize_srcu(&tracepoint_srcu);
+#endif
synchronize_sched();
}
@@ -129,18 +135,38 @@ extern void syscall_unregfunc(void);
* as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
* "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
*/
-#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
+#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
do { \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
void *__data; \
+ int __maybe_unused idx = 0; \
\
if (!(cond)) \
return; \
- if (rcucheck) \
- rcu_irq_enter_irqson(); \
- rcu_read_lock_sched_notrace(); \
- it_func_ptr = rcu_dereference_sched((tp)->funcs); \
+ \
+ /* \
+ * For rcuidle callers, use srcu since sched-rcu \
+ * doesn't work from the idle path. \
+ */ \
+ if (rcuidle) { \
+ if (in_nmi()) { \
+ WARN_ON_ONCE(1); \
+ return; /* no srcu from nmi */ \
+ } \
+ \
+ /* To keep it consistent with !rcuidle path */ \
+ preempt_disable_notrace(); \
+ \
+ idx = srcu_read_lock_notrace(&tracepoint_srcu); \
+ it_func_ptr = srcu_dereference((tp)->funcs, \
+ &tracepoint_srcu); \
+ } else { \
+ rcu_read_lock_sched_notrace(); \
+ it_func_ptr = \
+ rcu_dereference_sched((tp)->funcs); \
+ } \
+ \
if (it_func_ptr) { \
do { \
it_func = (it_func_ptr)->func; \
@@ -148,9 +174,13 @@ extern void syscall_unregfunc(void);
((void(*)(proto))(it_func))(args); \
} while ((++it_func_ptr)->func); \
} \
- rcu_read_unlock_sched_notrace(); \
- if (rcucheck) \
- rcu_irq_exit_irqson(); \
+ \
+ if (rcuidle) { \
+ srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
+ preempt_enable_notrace(); \
+ } else { \
+ rcu_read_unlock_sched_notrace(); \
+ } \
} while (0)
#ifndef MODULE
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 671b13457387..b3b1d65a2460 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -31,6 +31,9 @@
extern struct tracepoint * const __start___tracepoints_ptrs[];
extern struct tracepoint * const __stop___tracepoints_ptrs[];
+DEFINE_SRCU(tracepoint_srcu);
+EXPORT_SYMBOL_GPL(tracepoint_srcu);
+
/* Set to 1 to enable tracepoint debug output */
static const int tracepoint_debug;
@@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
return p == NULL ? NULL : p->probes;
}
-static void rcu_free_old_probes(struct rcu_head *head)
+static void srcu_free_old_probes(struct rcu_head *head)
{
kfree(container_of(head, struct tp_probes, rcu));
}
+static void rcu_free_old_probes(struct rcu_head *head)
+{
+ call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+}
+
static inline void release_probes(struct tracepoint_func *old)
{
if (old) {
--
2.17.0.441.gb46fe60e1d-goog
This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.
Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.
* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
3 users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irqs and preempt trace events
The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.
In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:
PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING
| | \ | |
\ (selects) / \ \ (selects) /
TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS
\ /
\ (depends on) /
PREEMPTIRQ_TRACEPOINTS
One note, I have to check for lockdep recursion in the code that calls
the trace events API and bail out if we're in lockdep recursion
protection to prevent something like the following case: a spin_lock is
taken. Then lockdep_acquired is called. That does a raw_local_irq_save
and then sets lockdep_recursion, and then calls __lockdep_acquired. In
this function, a call to get_lock_stats happens which calls
preempt_disable, which calls trace IRQS off somewhere which enters my
tracepoint code and sets the tracing_irq_cpu flag to prevent recursion.
This flag is then never cleared causing lockdep paths to never be
entered and thus causing splats and other bad things.
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Joel Fernandes <[email protected]>
---
include/linux/ftrace.h | 11 +-
include/linux/irqflags.h | 11 +-
include/linux/lockdep.h | 8 +-
include/linux/preempt.h | 2 +-
include/linux/tracepoint.h | 3 +-
include/trace/events/preemptirq.h | 23 ++--
init/main.c | 5 +-
kernel/locking/lockdep.c | 35 ++---
kernel/sched/core.c | 2 +-
kernel/trace/Kconfig | 22 ++-
kernel/trace/Makefile | 2 +-
kernel/trace/trace_irqsoff.c | 213 ++++++++----------------------
kernel/trace/trace_preemptirq.c | 71 ++++++++++
13 files changed, 191 insertions(+), 217 deletions(-)
create mode 100644 kernel/trace/trace_preemptirq.c
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 9c3c9a319e48..5191030af0c0 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -709,16 +709,7 @@ static inline unsigned long get_lock_parent_ip(void)
return CALLER_ADDR2;
}
-#ifdef CONFIG_IRQSOFF_TRACER
- extern void time_hardirqs_on(unsigned long a0, unsigned long a1);
- extern void time_hardirqs_off(unsigned long a0, unsigned long a1);
-#else
- static inline void time_hardirqs_on(unsigned long a0, unsigned long a1) { }
- static inline void time_hardirqs_off(unsigned long a0, unsigned long a1) { }
-#endif
-
-#if defined(CONFIG_PREEMPT_TRACER) || \
- (defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPTIRQ_EVENTS))
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
extern void trace_preempt_on(unsigned long a0, unsigned long a1);
extern void trace_preempt_off(unsigned long a0, unsigned long a1);
#else
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 9700f00bbc04..50edb9cbbd26 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -15,9 +15,16 @@
#include <linux/typecheck.h>
#include <asm/irqflags.h>
-#ifdef CONFIG_TRACE_IRQFLAGS
+/* Currently trace_softirqs_on/off is used only by lockdep */
+#ifdef CONFIG_PROVE_LOCKING
extern void trace_softirqs_on(unsigned long ip);
extern void trace_softirqs_off(unsigned long ip);
+#else
+# define trace_softirqs_on(ip) do { } while (0)
+# define trace_softirqs_off(ip) do { } while (0)
+#endif
+
+#ifdef CONFIG_TRACE_IRQFLAGS
extern void trace_hardirqs_on(void);
extern void trace_hardirqs_off(void);
# define trace_hardirq_context(p) ((p)->hardirq_context)
@@ -43,8 +50,6 @@ do { \
#else
# define trace_hardirqs_on() do { } while (0)
# define trace_hardirqs_off() do { } while (0)
-# define trace_softirqs_on(ip) do { } while (0)
-# define trace_softirqs_off(ip) do { } while (0)
# define trace_hardirq_context(p) 0
# define trace_softirq_context(p) 0
# define trace_hardirqs_enabled(p) 0
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 6fc77d4dbdcd..a8113357ceeb 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -266,7 +266,8 @@ struct held_lock {
/*
* Initialization, self-test and debugging-output methods:
*/
-extern void lockdep_info(void);
+extern void lockdep_init(void);
+extern void lockdep_init_early(void);
extern void lockdep_reset(void);
extern void lockdep_reset_lock(struct lockdep_map *lock);
extern void lockdep_free_key_range(void *start, unsigned long size);
@@ -406,7 +407,8 @@ static inline void lockdep_on(void)
# define lock_downgrade(l, i) do { } while (0)
# define lock_set_class(l, n, k, s, i) do { } while (0)
# define lock_set_subclass(l, s, i) do { } while (0)
-# define lockdep_info() do { } while (0)
+# define lockdep_init() do { } while (0)
+# define lockdep_init_early() do { } while (0)
# define lockdep_init_map(lock, name, key, sub) \
do { (void)(name); (void)(key); } while (0)
# define lockdep_set_class(lock, key) do { (void)(key); } while (0)
@@ -532,7 +534,7 @@ do { \
#endif /* CONFIG_LOCKDEP */
-#ifdef CONFIG_TRACE_IRQFLAGS
+#ifdef CONFIG_PROVE_LOCKING
extern void print_irqtrace_events(struct task_struct *curr);
#else
static inline void print_irqtrace_events(struct task_struct *curr)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 5bd3f151da78..c01813c3fbe9 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -150,7 +150,7 @@
*/
#define in_atomic_preempt_off() (preempt_count() != PREEMPT_DISABLE_OFFSET)
-#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_TRACE_PREEMPT_TOGGLE)
extern void preempt_count_add(int val);
extern void preempt_count_sub(int val);
#define preempt_count_dec_and_test() \
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 4135e08fb5f1..f73715e6235d 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -159,7 +159,8 @@ extern void syscall_unregfunc(void);
preempt_disable_notrace(); \
\
idx = srcu_read_lock_notrace(&tracepoint_srcu); \
- it_func_ptr = srcu_dereference((tp)->funcs, \
+ it_func_ptr = \
+ srcu_dereference_notrace((tp)->funcs, \
&tracepoint_srcu); \
} else { \
rcu_read_lock_sched_notrace(); \
diff --git a/include/trace/events/preemptirq.h b/include/trace/events/preemptirq.h
index 9c4eb33c5a1d..9a0d4ceeb166 100644
--- a/include/trace/events/preemptirq.h
+++ b/include/trace/events/preemptirq.h
@@ -1,4 +1,4 @@
-#ifdef CONFIG_PREEMPTIRQ_EVENTS
+#ifdef CONFIG_PREEMPTIRQ_TRACEPOINTS
#undef TRACE_SYSTEM
#define TRACE_SYSTEM preemptirq
@@ -32,7 +32,7 @@ DECLARE_EVENT_CLASS(preemptirq_template,
(void *)((unsigned long)(_stext) + __entry->parent_offs))
);
-#ifndef CONFIG_PROVE_LOCKING
+#ifdef CONFIG_TRACE_IRQFLAGS
DEFINE_EVENT(preemptirq_template, irq_disable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
@@ -40,9 +40,14 @@ DEFINE_EVENT(preemptirq_template, irq_disable,
DEFINE_EVENT(preemptirq_template, irq_enable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
+#else
+#define trace_irq_enable(...)
+#define trace_irq_disable(...)
+#define trace_irq_enable_rcuidle(...)
+#define trace_irq_disable_rcuidle(...)
#endif
-#ifdef CONFIG_DEBUG_PREEMPT
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
DEFINE_EVENT(preemptirq_template, preempt_disable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
@@ -50,22 +55,22 @@ DEFINE_EVENT(preemptirq_template, preempt_disable,
DEFINE_EVENT(preemptirq_template, preempt_enable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
+#else
+#define trace_preempt_enable(...)
+#define trace_preempt_disable(...)
+#define trace_preempt_enable_rcuidle(...)
+#define trace_preempt_disable_rcuidle(...)
#endif
#endif /* _TRACE_PREEMPTIRQ_H */
#include <trace/define_trace.h>
-#endif /* !CONFIG_PREEMPTIRQ_EVENTS */
-
-#if !defined(CONFIG_PREEMPTIRQ_EVENTS) || defined(CONFIG_PROVE_LOCKING)
+#else /* !CONFIG_PREEMPTIRQ_TRACEPOINTS */
#define trace_irq_enable(...)
#define trace_irq_disable(...)
#define trace_irq_enable_rcuidle(...)
#define trace_irq_disable_rcuidle(...)
-#endif
-
-#if !defined(CONFIG_PREEMPTIRQ_EVENTS) || !defined(CONFIG_DEBUG_PREEMPT)
#define trace_preempt_enable(...)
#define trace_preempt_disable(...)
#define trace_preempt_enable_rcuidle(...)
diff --git a/init/main.c b/init/main.c
index 21efbf6ace93..37f23ed10666 100644
--- a/init/main.c
+++ b/init/main.c
@@ -630,6 +630,9 @@ asmlinkage __visible void __init start_kernel(void)
call_function_init();
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
early_boot_irqs_disabled = false;
+
+ lockdep_init_early();
+
local_irq_enable();
kmem_cache_init_late();
@@ -644,7 +647,7 @@ asmlinkage __visible void __init start_kernel(void)
panic("Too many boot %s vars at `%s'", panic_later,
panic_param);
- lockdep_info();
+ lockdep_init();
/*
* Need to run this when irqs are enabled, because it wants
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 89b5f83f1969..1e79cf7293ef 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -55,6 +55,7 @@
#include "lockdep_internals.h"
+#include <trace/events/preemptirq.h>
#define CREATE_TRACE_POINTS
#include <trace/events/lock.h>
@@ -2841,10 +2842,9 @@ static void __trace_hardirqs_on_caller(unsigned long ip)
debug_atomic_inc(hardirqs_on_events);
}
-__visible void trace_hardirqs_on_caller(unsigned long ip)
+static void lockdep_hardirqs_on(void *none, unsigned long ignore,
+ unsigned long ip)
{
- time_hardirqs_on(CALLER_ADDR0, ip);
-
if (unlikely(!debug_locks || current->lockdep_recursion))
return;
@@ -2883,23 +2883,15 @@ __visible void trace_hardirqs_on_caller(unsigned long ip)
__trace_hardirqs_on_caller(ip);
current->lockdep_recursion = 0;
}
-EXPORT_SYMBOL(trace_hardirqs_on_caller);
-
-void trace_hardirqs_on(void)
-{
- trace_hardirqs_on_caller(CALLER_ADDR0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on);
/*
* Hardirqs were disabled:
*/
-__visible void trace_hardirqs_off_caller(unsigned long ip)
+static void lockdep_hardirqs_off(void *none, unsigned long ignore,
+ unsigned long ip)
{
struct task_struct *curr = current;
- time_hardirqs_off(CALLER_ADDR0, ip);
-
if (unlikely(!debug_locks || current->lockdep_recursion))
return;
@@ -2921,13 +2913,6 @@ __visible void trace_hardirqs_off_caller(unsigned long ip)
} else
debug_atomic_inc(redundant_hardirqs_off);
}
-EXPORT_SYMBOL(trace_hardirqs_off_caller);
-
-void trace_hardirqs_off(void)
-{
- trace_hardirqs_off_caller(CALLER_ADDR0);
-}
-EXPORT_SYMBOL(trace_hardirqs_off);
/*
* Softirqs will be enabled:
@@ -4334,7 +4319,15 @@ void lockdep_reset_lock(struct lockdep_map *lock)
raw_local_irq_restore(flags);
}
-void __init lockdep_info(void)
+void __init lockdep_init_early(void)
+{
+#ifdef CONFIG_PROVE_LOCKING
+ register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
+ register_trace_prio_irq_enable(lockdep_hardirqs_on, NULL, INT_MIN);
+#endif
+}
+
+void __init lockdep_init(void)
{
printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n");
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c94895bc5a2c..417c40c0eeef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3128,7 +3128,7 @@ u64 scheduler_tick_max_deferment(void)
#endif
#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
- defined(CONFIG_PREEMPT_TRACER))
+ defined(CONFIG_TRACE_PREEMPT_TOGGLE))
/*
* If the value passed in is equal to the current preempt count
* then we just disabled preemption. Start timing the latency.
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 0b249e2f0c3c..af027233b762 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -82,6 +82,15 @@ config RING_BUFFER_ALLOW_SWAP
Allow the use of ring_buffer_swap_cpu.
Adds a very slight overhead to tracing when enabled.
+config PREEMPTIRQ_TRACEPOINTS
+ bool
+ depends on TRACE_PREEMPT_TOGGLE || TRACE_IRQFLAGS
+ select TRACING
+ default y
+ help
+ Create preempt/irq toggle tracepoints if needed, so that other parts
+ of the kernel can use them to generate or add hooks to them.
+
# All tracer options should select GENERIC_TRACER. For those options that are
# enabled by all tracers (context switch and event tracer) they select TRACING.
# This allows those options to appear when no other tracer is selected. But the
@@ -159,18 +168,20 @@ config FUNCTION_GRAPH_TRACER
the return value. This is done by setting the current return
address on the current task structure into a stack of calls.
+config TRACE_PREEMPT_TOGGLE
+ bool
+ help
+ Enables hooks which will be called when preemption is first disabled,
+ and last enabled.
config PREEMPTIRQ_EVENTS
bool "Enable trace events for preempt and irq disable/enable"
select TRACE_IRQFLAGS
- depends on DEBUG_PREEMPT || !PROVE_LOCKING
- depends on TRACING
+ select TRACE_PREEMPT_TOGGLE if PREEMPT
+ select GENERIC_TRACER
default n
help
Enable tracing of disable and enable events for preemption and irqs.
- For tracing preempt disable/enable events, DEBUG_PREEMPT must be
- enabled. For tracing irq disable/enable events, PROVE_LOCKING must
- be disabled.
config IRQSOFF_TRACER
bool "Interrupts-off Latency Tracer"
@@ -207,6 +218,7 @@ config PREEMPT_TRACER
select RING_BUFFER_ALLOW_SWAP
select TRACER_SNAPSHOT
select TRACER_SNAPSHOT_PER_CPU_SWAP
+ select TRACE_PREEMPT_TOGGLE
help
This option measures the time spent in preemption-off critical
sections, with microsecond accuracy.
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index e2538c7638d4..84a0cb222f20 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -35,7 +35,7 @@ obj-$(CONFIG_TRACING) += trace_printk.o
obj-$(CONFIG_TRACING_MAP) += tracing_map.o
obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
obj-$(CONFIG_FUNCTION_TRACER) += trace_functions.o
-obj-$(CONFIG_PREEMPTIRQ_EVENTS) += trace_irqsoff.o
+obj-$(CONFIG_PREEMPTIRQ_TRACEPOINTS) += trace_preemptirq.o
obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index f8daa754cce2..d2d8284088f0 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -16,7 +16,6 @@
#include "trace.h"
-#define CREATE_TRACE_POINTS
#include <trace/events/preemptirq.h>
#if defined(CONFIG_IRQSOFF_TRACER) || defined(CONFIG_PREEMPT_TRACER)
@@ -450,66 +449,6 @@ void stop_critical_timings(void)
}
EXPORT_SYMBOL_GPL(stop_critical_timings);
-#ifdef CONFIG_IRQSOFF_TRACER
-#ifdef CONFIG_PROVE_LOCKING
-void time_hardirqs_on(unsigned long a0, unsigned long a1)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(a0, a1);
-}
-
-void time_hardirqs_off(unsigned long a0, unsigned long a1)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(a0, a1);
-}
-
-#else /* !CONFIG_PROVE_LOCKING */
-
-/*
- * We are only interested in hardirq on/off events:
- */
-static inline void tracer_hardirqs_on(void)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
-}
-
-static inline void tracer_hardirqs_off(void)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
-}
-
-static inline void tracer_hardirqs_on_caller(unsigned long caller_addr)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(CALLER_ADDR0, caller_addr);
-}
-
-static inline void tracer_hardirqs_off_caller(unsigned long caller_addr)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(CALLER_ADDR0, caller_addr);
-}
-
-#endif /* CONFIG_PROVE_LOCKING */
-#endif /* CONFIG_IRQSOFF_TRACER */
-
-#ifdef CONFIG_PREEMPT_TRACER
-static inline void tracer_preempt_on(unsigned long a0, unsigned long a1)
-{
- if (preempt_trace() && !irq_trace())
- stop_critical_timing(a0, a1);
-}
-
-static inline void tracer_preempt_off(unsigned long a0, unsigned long a1)
-{
- if (preempt_trace() && !irq_trace())
- start_critical_timing(a0, a1);
-}
-#endif /* CONFIG_PREEMPT_TRACER */
-
#ifdef CONFIG_FUNCTION_TRACER
static bool function_enabled;
@@ -659,15 +598,34 @@ static void irqsoff_tracer_stop(struct trace_array *tr)
}
#ifdef CONFIG_IRQSOFF_TRACER
+/*
+ * We are only interested in hardirq on/off events:
+ */
+static void tracer_hardirqs_on(void *none, unsigned long a0, unsigned long a1)
+{
+ if (!preempt_trace() && irq_trace())
+ stop_critical_timing(a0, a1);
+}
+
+static void tracer_hardirqs_off(void *none, unsigned long a0, unsigned long a1)
+{
+ if (!preempt_trace() && irq_trace())
+ start_critical_timing(a0, a1);
+}
+
static int irqsoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_IRQS_OFF;
+ register_trace_irq_disable(tracer_hardirqs_off, NULL);
+ register_trace_irq_enable(tracer_hardirqs_on, NULL);
return __irqsoff_tracer_init(tr);
}
static void irqsoff_tracer_reset(struct trace_array *tr)
{
+ unregister_trace_irq_disable(tracer_hardirqs_off, NULL);
+ unregister_trace_irq_enable(tracer_hardirqs_on, NULL);
__irqsoff_tracer_reset(tr);
}
@@ -692,19 +650,37 @@ static struct tracer irqsoff_tracer __read_mostly =
};
# define register_irqsoff(trace) register_tracer(&trace)
#else
+static inline void tracer_hardirqs_on(unsigned long a0, unsigned long a1) { }
+static inline void tracer_hardirqs_off(unsigned long a0, unsigned long a1) { }
# define register_irqsoff(trace) do { } while (0)
-#endif
+#endif /* CONFIG_IRQSOFF_TRACER */
#ifdef CONFIG_PREEMPT_TRACER
+static void tracer_preempt_on(void *none, unsigned long a0, unsigned long a1)
+{
+ if (preempt_trace() && !irq_trace())
+ stop_critical_timing(a0, a1);
+}
+
+static void tracer_preempt_off(void *none, unsigned long a0, unsigned long a1)
+{
+ if (preempt_trace() && !irq_trace())
+ start_critical_timing(a0, a1);
+}
+
static int preemptoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_PREEMPT_OFF;
+ register_trace_preempt_disable(tracer_preempt_off, NULL);
+ register_trace_preempt_enable(tracer_preempt_on, NULL);
return __irqsoff_tracer_init(tr);
}
static void preemptoff_tracer_reset(struct trace_array *tr)
{
+ unregister_trace_preempt_disable(tracer_preempt_off, NULL);
+ unregister_trace_preempt_enable(tracer_preempt_on, NULL);
__irqsoff_tracer_reset(tr);
}
@@ -729,21 +705,32 @@ static struct tracer preemptoff_tracer __read_mostly =
};
# define register_preemptoff(trace) register_tracer(&trace)
#else
+static inline void tracer_preempt_on(unsigned long a0, unsigned long a1) { }
+static inline void tracer_preempt_off(unsigned long a0, unsigned long a1) { }
# define register_preemptoff(trace) do { } while (0)
-#endif
+#endif /* CONFIG_PREEMPT_TRACER */
-#if defined(CONFIG_IRQSOFF_TRACER) && \
- defined(CONFIG_PREEMPT_TRACER)
+#if defined(CONFIG_IRQSOFF_TRACER) && defined(CONFIG_PREEMPT_TRACER)
static int preemptirqsoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_IRQS_OFF | TRACER_PREEMPT_OFF;
+ register_trace_irq_disable(tracer_hardirqs_off, NULL);
+ register_trace_irq_enable(tracer_hardirqs_on, NULL);
+ register_trace_preempt_disable(tracer_preempt_off, NULL);
+ register_trace_preempt_enable(tracer_preempt_on, NULL);
+
return __irqsoff_tracer_init(tr);
}
static void preemptirqsoff_tracer_reset(struct trace_array *tr)
{
+ unregister_trace_irq_disable(tracer_hardirqs_off, NULL);
+ unregister_trace_irq_enable(tracer_hardirqs_on, NULL);
+ unregister_trace_preempt_disable(tracer_preempt_off, NULL);
+ unregister_trace_preempt_enable(tracer_preempt_on, NULL);
+
__irqsoff_tracer_reset(tr);
}
@@ -782,99 +769,3 @@ __init static int init_irqsoff_tracer(void)
}
core_initcall(init_irqsoff_tracer);
#endif /* IRQSOFF_TRACER || PREEMPTOFF_TRACER */
-
-#ifndef CONFIG_IRQSOFF_TRACER
-static inline void tracer_hardirqs_on(void) { }
-static inline void tracer_hardirqs_off(void) { }
-static inline void tracer_hardirqs_on_caller(unsigned long caller_addr) { }
-static inline void tracer_hardirqs_off_caller(unsigned long caller_addr) { }
-#endif
-
-#ifndef CONFIG_PREEMPT_TRACER
-static inline void tracer_preempt_on(unsigned long a0, unsigned long a1) { }
-static inline void tracer_preempt_off(unsigned long a0, unsigned long a1) { }
-#endif
-
-#if defined(CONFIG_TRACE_IRQFLAGS) && !defined(CONFIG_PROVE_LOCKING)
-/* Per-cpu variable to prevent redundant calls when IRQs already off */
-static DEFINE_PER_CPU(int, tracing_irq_cpu);
-
-void trace_hardirqs_on(void)
-{
- if (!this_cpu_read(tracing_irq_cpu))
- return;
-
- trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
- tracer_hardirqs_on();
-
- this_cpu_write(tracing_irq_cpu, 0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on);
-
-void trace_hardirqs_off(void)
-{
- if (this_cpu_read(tracing_irq_cpu))
- return;
-
- this_cpu_write(tracing_irq_cpu, 1);
-
- trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
- tracer_hardirqs_off();
-}
-EXPORT_SYMBOL(trace_hardirqs_off);
-
-__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
-{
- if (!this_cpu_read(tracing_irq_cpu))
- return;
-
- trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
- tracer_hardirqs_on_caller(caller_addr);
-
- this_cpu_write(tracing_irq_cpu, 0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on_caller);
-
-__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
-{
- if (this_cpu_read(tracing_irq_cpu))
- return;
-
- this_cpu_write(tracing_irq_cpu, 1);
-
- trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
- tracer_hardirqs_off_caller(caller_addr);
-}
-EXPORT_SYMBOL(trace_hardirqs_off_caller);
-
-/*
- * Stubs:
- */
-
-void trace_softirqs_on(unsigned long ip)
-{
-}
-
-void trace_softirqs_off(unsigned long ip)
-{
-}
-
-inline void print_irqtrace_events(struct task_struct *curr)
-{
-}
-#endif
-
-#if defined(CONFIG_PREEMPT_TRACER) || \
- (defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPTIRQ_EVENTS))
-void trace_preempt_on(unsigned long a0, unsigned long a1)
-{
- trace_preempt_enable_rcuidle(a0, a1);
- tracer_preempt_on(a0, a1);
-}
-
-void trace_preempt_off(unsigned long a0, unsigned long a1)
-{
- trace_preempt_disable_rcuidle(a0, a1);
- tracer_preempt_off(a0, a1);
-}
-#endif
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
new file mode 100644
index 000000000000..14f08659d7d3
--- /dev/null
+++ b/kernel/trace/trace_preemptirq.c
@@ -0,0 +1,71 @@
+/*
+ * preemptoff and irqoff tracepoints
+ *
+ * Copyright (C) 2018 Joel Fernandes <[email protected]>
+ */
+
+#include <linux/kallsyms.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/ftrace.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/preemptirq.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+/* Per-cpu variable to prevent redundant calls when IRQs already off */
+static DEFINE_PER_CPU(int, tracing_irq_cpu);
+
+void trace_hardirqs_on(void)
+{
+ if (current->lockdep_recursion || !this_cpu_read(tracing_irq_cpu))
+ return;
+
+ trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+ this_cpu_write(tracing_irq_cpu, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_on);
+
+void trace_hardirqs_off(void)
+{
+ if (current->lockdep_recursion || this_cpu_read(tracing_irq_cpu))
+ return;
+
+ this_cpu_write(tracing_irq_cpu, 1);
+ trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+}
+EXPORT_SYMBOL(trace_hardirqs_off);
+
+__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
+{
+ if (current->lockdep_recursion || !this_cpu_read(tracing_irq_cpu))
+ return;
+
+ trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
+ this_cpu_write(tracing_irq_cpu, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_on_caller);
+
+__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
+{
+ if (current->lockdep_recursion || this_cpu_read(tracing_irq_cpu))
+ return;
+
+ this_cpu_write(tracing_irq_cpu, 1);
+ trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
+}
+EXPORT_SYMBOL(trace_hardirqs_off_caller);
+#endif /* CONFIG_TRACE_IRQFLAGS */
+
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
+
+void trace_preempt_on(unsigned long a0, unsigned long a1)
+{
+ trace_preempt_enable_rcuidle(a0, a1);
+}
+
+void trace_preempt_off(unsigned long a0, unsigned long a1)
+{
+ trace_preempt_disable_rcuidle(a0, a1);
+}
+#endif
--
2.17.0.441.gb46fe60e1d-goog
Split reset functions into seperate functions in preparation
of future patches that need to do tracer specific reset.
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Joel Fernandes <[email protected]>
---
kernel/trace/trace_irqsoff.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 03ecb4465ee4..f8daa754cce2 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -634,7 +634,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr)
return 0;
}
-static void irqsoff_tracer_reset(struct trace_array *tr)
+static void __irqsoff_tracer_reset(struct trace_array *tr)
{
int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT;
int overwrite_flag = save_flags & TRACE_ITER_OVERWRITE;
@@ -665,6 +665,12 @@ static int irqsoff_tracer_init(struct trace_array *tr)
return __irqsoff_tracer_init(tr);
}
+
+static void irqsoff_tracer_reset(struct trace_array *tr)
+{
+ __irqsoff_tracer_reset(tr);
+}
+
static struct tracer irqsoff_tracer __read_mostly =
{
.name = "irqsoff",
@@ -697,11 +703,16 @@ static int preemptoff_tracer_init(struct trace_array *tr)
return __irqsoff_tracer_init(tr);
}
+static void preemptoff_tracer_reset(struct trace_array *tr)
+{
+ __irqsoff_tracer_reset(tr);
+}
+
static struct tracer preemptoff_tracer __read_mostly =
{
.name = "preemptoff",
.init = preemptoff_tracer_init,
- .reset = irqsoff_tracer_reset,
+ .reset = preemptoff_tracer_reset,
.start = irqsoff_tracer_start,
.stop = irqsoff_tracer_stop,
.print_max = true,
@@ -731,11 +742,16 @@ static int preemptirqsoff_tracer_init(struct trace_array *tr)
return __irqsoff_tracer_init(tr);
}
+static void preemptirqsoff_tracer_reset(struct trace_array *tr)
+{
+ __irqsoff_tracer_reset(tr);
+}
+
static struct tracer preemptirqsoff_tracer __read_mostly =
{
.name = "preemptirqsoff",
.init = preemptirqsoff_tracer_init,
- .reset = irqsoff_tracer_reset,
+ .reset = preemptirqsoff_tracer_reset,
.start = irqsoff_tracer_start,
.stop = irqsoff_tracer_stop,
.print_max = true,
--
2.17.0.441.gb46fe60e1d-goog
I'm able to reproduce a lockdep splat when CONFIG_PROVE_LOCKING=y and
CONFIG_PREEMPTIRQ_EVENTS=y.
$ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Joel Fernandes <[email protected]>
---
kernel/softirq.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 24d243ef8e71..47e2f61938c0 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -139,9 +139,13 @@ static void __local_bh_enable(unsigned int cnt)
{
lockdep_assert_irqs_disabled();
+ if (preempt_count() == cnt)
+ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+
if (softirq_count() == (cnt & SOFTIRQ_MASK))
trace_softirqs_on(_RET_IP_);
- preempt_count_sub(cnt);
+
+ __preempt_count_sub(cnt);
}
/*
--
2.17.0.441.gb46fe60e1d-goog
From: "Paul E. McKenney" <[email protected]>
This is needed for a future tracepoint patch that uses srcu, and to make
sure it doesn't call into lockdep.
tracepoint code already calls notrace variants for rcu_read_lock_sched
so this patch does the same for srcu which will be used in a later
patch. Keeps it consistent with rcu-sched.
[Joel: Added commit message]
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Cc: [email protected]
Signed-off-by: Paul McKenney <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
include/linux/srcu.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 33c1c698df09..2ec618979b20 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -161,6 +161,16 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
return retval;
}
+/* Used by tracing, cannot be traced and cannot invoke lockdep. */
+static inline notrace int
+srcu_read_lock_notrace(struct srcu_struct *sp) __acquires(sp)
+{
+ int retval;
+
+ retval = __srcu_read_lock(sp);
+ return retval;
+}
+
/**
* srcu_read_unlock - unregister a old reader from an SRCU-protected structure.
* @sp: srcu_struct in which to unregister the old reader.
@@ -175,6 +185,13 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)
__srcu_read_unlock(sp, idx);
}
+/* Used by tracing, cannot be traced and cannot call lockdep. */
+static inline notrace void
+srcu_read_unlock_notrace(struct srcu_struct *sp, int idx) __releases(sp)
+{
+ __srcu_read_unlock(sp, idx);
+}
+
/**
* smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
*
--
2.17.0.441.gb46fe60e1d-goog
On Mon, Apr 30, 2018 at 6:42 PM Joel Fernandes <[email protected]> wrote:
> In recent tests with IRQ on/off tracepoints, a large performance
> overhead ~10% is noticed when running hackbench. This is root caused to
> calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> tracepoint code. Following a long discussion on the list [1] about this,
> we concluded that srcu is a better alternative for use during rcu idle.
> Although it does involve extra barriers, its lighter than the sched-rcu
> version which has to do additional RCU calls to notify RCU idle about
> entry into RCU sections.
> In this patch, we change the underlying implementation of the
> trace_*_rcuidle API to use SRCU. This has shown to improve performance
> alot for the high frequency irq enable/disable tracepoints.
> Test: Tested idle and preempt/irq tracepoints.
> [1] https://patchwork.kernel.org/patch/10344297/
> Cc: Steven Rostedt <[email protected]>
> Cc: Peter Zilstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Tom Zanussi <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Thomas Glexiner <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Fenguang Wu <[email protected]>
> Cc: Baohong Liu <[email protected]>
> Cc: Vedang Patel <[email protected]>
> Cc: [email protected]
> Signed-off-by: Joel Fernandes <[email protected]>
> ---
> include/linux/tracepoint.h | 46 +++++++++++++++++++++++++++++++-------
> kernel/tracepoint.c | 10 ++++++++-
> 2 files changed, 47 insertions(+), 9 deletions(-)
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..4135e08fb5f1 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -15,6 +15,7 @@
> */
> #include <linux/smp.h>
> +#include <linux/srcu.h>
> #include <linux/errno.h>
> #include <linux/types.h>
> #include <linux/cpumask.h>
> @@ -33,6 +34,8 @@ struct trace_eval_map {
> #define TRACEPOINT_DEFAULT_PRIO 10
> +extern struct srcu_struct tracepoint_srcu;
> +
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void
*data);
> extern int
> @@ -77,6 +80,9 @@ int unregister_tracepoint_module_notifier(struct
notifier_block *nb)
> */
> static inline void tracepoint_synchronize_unregister(void)
> {
> +#ifdef CONFIG_TRACEPOINTS
> + synchronize_srcu(&tracepoint_srcu);
> +#endif
> synchronize_sched();
> }
> @@ -129,18 +135,38 @@ extern void syscall_unregfunc(void);
> * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
> * "void *data", where as the DECLARE_TRACE() will pass in "void *data,
proto".
> */
> -#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
> +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> do { \
> struct tracepoint_func *it_func_ptr; \
> void *it_func; \
> void *__data; \
> + int __maybe_unused idx = 0; \
> \
> if (!(cond)) \
> return; \
> - if (rcucheck) \
> - rcu_irq_enter_irqson(); \
> - rcu_read_lock_sched_notrace(); \
> - it_func_ptr = rcu_dereference_sched((tp)->funcs); \
> + \
> + /* \
> + * For rcuidle callers, use srcu since sched-rcu \
> + * doesn't work from the idle path. \
> + */ \
> + if (rcuidle) { \
> + if (in_nmi()) { \
> + WARN_ON_ONCE(1); \
> + return; /* no srcu from nmi */ \
> + } \
> + \
> + /* To keep it consistent with !rcuidle path */ \
> + preempt_disable_notrace(); \
> + \
> + idx = srcu_read_lock_notrace(&tracepoint_srcu); \
> + it_func_ptr = srcu_dereference((tp)->funcs, \
> + &tracepoint_srcu); \
This last bit is supposed to be srcu_dereference_notrace. The hunk to use
that is actually in patch 6/6 , sorry about that. I've fixed it in my tree
and it means patches 5/6 and 6/6 need an update. Steve, if you want me to
repost it right away I can do that, or can wait for additional comments
and then repost.
thanks,
- Joel
On Mon, Apr 30, 2018 at 8:46 PM Randy Dunlap <[email protected]> wrote:
> On 04/30/2018 06:42 PM, Joel Fernandes wrote:
> > Split reset functions into seperate functions in preparation
> > of future patches that need to do tracer specific reset.
> >
> Hi,
> Since you are updating patches anyway, please
> s/seperate/separate/.
Thanks Randy, will do.
- Joel
On 04/30/2018 06:42 PM, Joel Fernandes wrote:
> Split reset functions into seperate functions in preparation
> of future patches that need to do tracer specific reset.
>
Hi,
Since you are updating patches anyway, please
s/seperate/separate/.
thanks,
--
~Randy
On Mon, 30 Apr 2018 18:41:59 -0700
Joel Fernandes <[email protected]> wrote:
> I'm able to reproduce a lockdep splat when CONFIG_PROVE_LOCKING=y and
> CONFIG_PREEMPTIRQ_EVENTS=y.
Needs more info in the change log. It also requires that
CONFIG_DEBUG_LOCKED=y is set.
Add this to the change log:
The issue was this:
Start with: preempt_count = 1 << SOFTIRQ_SHIFT
__local_bh_enable(cnt = 1 << SOFTIRQ_SHIFT) {
if (softirq_count() == (cnt && SOFTIRQ_MASK)) {
trace_softirqs_on() {
current->softirqs_enabled = 1;
}
}
preempt_count_sub(cnt) {
trace_preempt_on() {
tracepoint() {
rcu_read_lock_sched() {
// jumps into lockdep
Where preempt_count still has softirqs disabled, but
current->softirqs_enabled is true, and we get a splat.
This patch should be separate (as you had it before), and needs to be
submitted now because it already causes issues. We can add:
Cc: [email protected]
Fixes: d59158162e032 ("tracing: Add support for preempt and irq enable/disable events")
-- Steve
>
> $ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable
>
> Cc: Steven Rostedt <[email protected]>
> Cc: Peter Zilstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Tom Zanussi <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Thomas Glexiner <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Fenguang Wu <[email protected]>
> Cc: Baohong Liu <[email protected]>
> Cc: Vedang Patel <[email protected]>
> Cc: [email protected]
> Signed-off-by: Joel Fernandes <[email protected]>
> ---
> kernel/softirq.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 24d243ef8e71..47e2f61938c0 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -139,9 +139,13 @@ static void __local_bh_enable(unsigned int cnt)
> {
> lockdep_assert_irqs_disabled();
>
> + if (preempt_count() == cnt)
> + trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
> +
> if (softirq_count() == (cnt & SOFTIRQ_MASK))
> trace_softirqs_on(_RET_IP_);
> - preempt_count_sub(cnt);
> +
> + __preempt_count_sub(cnt);
> }
>
> /*
On Mon, 30 Apr 2018 18:42:00 -0700
Joel Fernandes <[email protected]> wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> This is needed for a future tracepoint patch that uses srcu, and to make
> sure it doesn't call into lockdep.
>
> tracepoint code already calls notrace variants for rcu_read_lock_sched
> so this patch does the same for srcu which will be used in a later
> patch. Keeps it consistent with rcu-sched.
>
> [Joel: Added commit message]
>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
-- Steve
> Cc: Steven Rostedt <[email protected]>
> Cc: Peter Zilstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Tom Zanussi <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Thomas Glexiner <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Fenguang Wu <[email protected]>
> Cc: Baohong Liu <[email protected]>
> Cc: Vedang Patel <[email protected]>
> Cc: [email protected]
> Signed-off-by: Paul McKenney <[email protected]>
> Signed-off-by: Joel Fernandes <[email protected]>
> ---
>
On Mon, 30 Apr 2018 18:42:01 -0700
Joel Fernandes <[email protected]> wrote:
> In this series, we are making lockdep use an rcuidle tracepoint. For
> this reason we need a notrace variant of srcu_dereference since
> otherwise we get lockdep splats since lockdep hooks may not have run
> yet. This patch adds the needed variant.
This change log is rather confusing. Why would lockdep use an rcuidle
tracepoint? I think we need to explain more here.
-- Steve
On Mon, Apr 30, 2018 at 06:42:01PM -0700, Joel Fernandes wrote:
> In this series, we are making lockdep use an rcuidle tracepoint. For
> this reason we need a notrace variant of srcu_dereference since
> otherwise we get lockdep splats since lockdep hooks may not have run
> yet. This patch adds the needed variant.
>
> Cc: Steven Rostedt <[email protected]>
> Cc: Peter Zilstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Tom Zanussi <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Thomas Glexiner <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Fenguang Wu <[email protected]>
> Cc: Baohong Liu <[email protected]>
> Cc: Vedang Patel <[email protected]>
> Cc: [email protected]
> Signed-off-by: Joel Fernandes <[email protected]>
> ---
> include/linux/srcu.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 2ec618979b20..a1c4947be877 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -135,6 +135,11 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp)
> */
> #define srcu_dereference(p, sp) srcu_dereference_check((p), (sp), 0)
>
> +/**
> + * srcu_dereference_notrace - no tracing and no lockdep calls from here
> + */
> +#define srcu_dereference_notrace(p, sp) srcu_dereference_check((p), (sp), 1)
Given that this is a one-liner, why not just use srcu_dereference_check()
directly, with 1 as you do above to disable lockdep?
Thanx, Paul
> +
> /**
> * srcu_read_lock - register a new reader for an SRCU-protected structure.
> * @sp: srcu_struct in which to register the new reader.
> --
> 2.17.0.441.gb46fe60e1d-goog
>
On Mon, 30 Apr 2018 18:42:03 -0700
Joel Fernandes <[email protected]> wrote:
> In recent tests with IRQ on/off tracepoints, a large performance
> overhead ~10% is noticed when running hackbench. This is root caused to
> calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> tracepoint code. Following a long discussion on the list [1] about this,
> we concluded that srcu is a better alternative for use during rcu idle.
> Although it does involve extra barriers, its lighter than the sched-rcu
> version which has to do additional RCU calls to notify RCU idle about
> entry into RCU sections.
>
> In this patch, we change the underlying implementation of the
> trace_*_rcuidle API to use SRCU. This has shown to improve performance
> alot for the high frequency irq enable/disable tracepoints.
Can you post some numbers?
>
> Test: Tested idle and preempt/irq tracepoints.
>
> [1] https://patchwork.kernel.org/patch/10344297/
>
> Cc: Steven Rostedt <[email protected]>
> Cc: Peter Zilstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Tom Zanussi <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Thomas Glexiner <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Fenguang Wu <[email protected]>
> Cc: Baohong Liu <[email protected]>
> Cc: Vedang Patel <[email protected]>
> Cc: [email protected]
> Signed-off-by: Joel Fernandes <[email protected]>
> ---
> include/linux/tracepoint.h | 46 +++++++++++++++++++++++++++++++-------
> kernel/tracepoint.c | 10 ++++++++-
> 2 files changed, 47 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..4135e08fb5f1 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -15,6 +15,7 @@
> */
>
> #include <linux/smp.h>
> +#include <linux/srcu.h>
> #include <linux/errno.h>
> #include <linux/types.h>
> #include <linux/cpumask.h>
> @@ -33,6 +34,8 @@ struct trace_eval_map {
>
> #define TRACEPOINT_DEFAULT_PRIO 10
>
> +extern struct srcu_struct tracepoint_srcu;
> +
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> extern int
> @@ -77,6 +80,9 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
> */
> static inline void tracepoint_synchronize_unregister(void)
> {
> +#ifdef CONFIG_TRACEPOINTS
> + synchronize_srcu(&tracepoint_srcu);
> +#endif
Not related to your patch, but I find it interesting that we don't make
this function a nop if CONFIG_TRACEPOINTS is not set. Is it because
something might rely on our implementation that we call
synchronize_sched here? I think that's a too tight of a coupling for
others to rely on this, especially since it's not in the comments about
this function.
Again, not related to this series, but something we should probably
consider in the future. It would require auditing users of this too.
> synchronize_sched();
> }
>
> @@ -129,18 +135,38 @@ extern void syscall_unregfunc(void);
> * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
> * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
> */
> -#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
> +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> do { \
> struct tracepoint_func *it_func_ptr; \
> void *it_func; \
> void *__data; \
> + int __maybe_unused idx = 0; \
> \
> if (!(cond)) \
> return; \
> - if (rcucheck) \
> - rcu_irq_enter_irqson(); \
> - rcu_read_lock_sched_notrace(); \
> - it_func_ptr = rcu_dereference_sched((tp)->funcs); \
> + \
> + /* \
> + * For rcuidle callers, use srcu since sched-rcu \
> + * doesn't work from the idle path. \
> + */ \
> + if (rcuidle) { \
> + if (in_nmi()) { \
> + WARN_ON_ONCE(1); \
> + return; /* no srcu from nmi */ \
> + } \
> + \
> + /* To keep it consistent with !rcuidle path */ \
> + preempt_disable_notrace(); \
Why not disable preemption after taking the srcu lock?
> + \
> + idx = srcu_read_lock_notrace(&tracepoint_srcu); \
> + it_func_ptr = srcu_dereference((tp)->funcs, \
> + &tracepoint_srcu); \
> + } else { \
> + rcu_read_lock_sched_notrace(); \
> + it_func_ptr = \
> + rcu_dereference_sched((tp)->funcs); \
> + } \
> + \
> if (it_func_ptr) { \
> do { \
> it_func = (it_func_ptr)->func; \
> @@ -148,9 +174,13 @@ extern void syscall_unregfunc(void);
> ((void(*)(proto))(it_func))(args); \
> } while ((++it_func_ptr)->func); \
> } \
> - rcu_read_unlock_sched_notrace(); \
> - if (rcucheck) \
> - rcu_irq_exit_irqson(); \
> + \
> + if (rcuidle) { \
> + srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
> + preempt_enable_notrace(); \
> + } else { \
> + rcu_read_unlock_sched_notrace(); \
> + } \
> } while (0)
>
> #ifndef MODULE
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 671b13457387..b3b1d65a2460 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -31,6 +31,9 @@
> extern struct tracepoint * const __start___tracepoints_ptrs[];
> extern struct tracepoint * const __stop___tracepoints_ptrs[];
>
> +DEFINE_SRCU(tracepoint_srcu);
> +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> +
> /* Set to 1 to enable tracepoint debug output */
> static const int tracepoint_debug;
>
> @@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
> return p == NULL ? NULL : p->probes;
> }
>
> -static void rcu_free_old_probes(struct rcu_head *head)
> +static void srcu_free_old_probes(struct rcu_head *head)
> {
> kfree(container_of(head, struct tp_probes, rcu));
> }
>
> +static void rcu_free_old_probes(struct rcu_head *head)
> +{
> + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
Hmm, is it OK to call call_srcu() from a call_rcu() callback? I guess
it would be.
I think we should add a comment to why we are doing this. Something
like:
/*
* Tracepoint probes are protected by both sched RCU and SRCU, by
* calling the SRCU callback in the sched RCU callback we cover
* both cases.
*/
Or something along those lines.
-- Steve
> +}
> +
> static inline void release_probes(struct tracepoint_func *old)
> {
> if (old) {
On Tue, 01 May 2018 01:56:01 +0000
Joel Fernandes <[email protected]> wrote:
> This last bit is supposed to be srcu_dereference_notrace. The hunk to use
> that is actually in patch 6/6 , sorry about that. I've fixed it in my tree
> and it means patches 5/6 and 6/6 need an update. Steve, if you want me to
> repost it right away I can do that, or can wait for additional comments
> and then repost.
Wait for additional comments.
-- Steve
On Tue, May 01, 2018 at 10:24:01AM -0400, Steven Rostedt wrote:
> On Mon, 30 Apr 2018 18:42:03 -0700
> Joel Fernandes <[email protected]> wrote:
>
> > In recent tests with IRQ on/off tracepoints, a large performance
> > overhead ~10% is noticed when running hackbench. This is root caused to
> > calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> > tracepoint code. Following a long discussion on the list [1] about this,
> > we concluded that srcu is a better alternative for use during rcu idle.
> > Although it does involve extra barriers, its lighter than the sched-rcu
> > version which has to do additional RCU calls to notify RCU idle about
> > entry into RCU sections.
> >
> > In this patch, we change the underlying implementation of the
> > trace_*_rcuidle API to use SRCU. This has shown to improve performance
> > alot for the high frequency irq enable/disable tracepoints.
[ . . . ]
> > --- a/kernel/tracepoint.c
> > +++ b/kernel/tracepoint.c
> > @@ -31,6 +31,9 @@
> > extern struct tracepoint * const __start___tracepoints_ptrs[];
> > extern struct tracepoint * const __stop___tracepoints_ptrs[];
> >
> > +DEFINE_SRCU(tracepoint_srcu);
> > +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> > +
> > /* Set to 1 to enable tracepoint debug output */
> > static const int tracepoint_debug;
> >
> > @@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
> > return p == NULL ? NULL : p->probes;
> > }
> >
> > -static void rcu_free_old_probes(struct rcu_head *head)
> > +static void srcu_free_old_probes(struct rcu_head *head)
> > {
> > kfree(container_of(head, struct tp_probes, rcu));
> > }
> >
> > +static void rcu_free_old_probes(struct rcu_head *head)
> > +{
> > + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
>
> Hmm, is it OK to call call_srcu() from a call_rcu() callback? I guess
> it would be.
It is perfectly legal, and quite a bit simpler than setting something
up to wait for both to complete concurrently.
Of course, if you unconditionally call call_srcu() from that same
srcu_struct's callback, SRCU will be unable to safely delete the
srcu_struct, so cleanup_srcu_struct() will react by leaking memory. ;-)
Normal RCU deals with the analogous situation by leaving at least one
callback uninvoked when the system goes down.
Thanx, Paul
> I think we should add a comment to why we are doing this. Something
> like:
>
> /*
> * Tracepoint probes are protected by both sched RCU and SRCU, by
> * calling the SRCU callback in the sched RCU callback we cover
> * both cases.
> */
>
> Or something along those lines.
>
> -- Steve
>
>
> > +}
> > +
> > static inline void release_probes(struct tracepoint_func *old)
> > {
> > if (old) {
>
On Tue, May 1, 2018 at 7:02 AM Steven Rostedt <[email protected]> wrote:
> On Mon, 30 Apr 2018 18:41:59 -0700
> Joel Fernandes <[email protected]> wrote:
> > I'm able to reproduce a lockdep splat when CONFIG_PROVE_LOCKING=y and
> > CONFIG_PREEMPTIRQ_EVENTS=y.
> Needs more info in the change log. It also requires that
> CONFIG_DEBUG_LOCKED=y is set.
> Add this to the change log:
> The issue was this:
> Start with: preempt_count = 1 << SOFTIRQ_SHIFT
> __local_bh_enable(cnt = 1 << SOFTIRQ_SHIFT) {
> if (softirq_count() == (cnt && SOFTIRQ_MASK)) {
> trace_softirqs_on() {
> current->softirqs_enabled = 1;
> }
> }
> preempt_count_sub(cnt) {
> trace_preempt_on() {
> tracepoint() {
> rcu_read_lock_sched() {
> // jumps into lockdep
> Where preempt_count still has softirqs disabled, but
> current->softirqs_enabled is true, and we get a splat.
> This patch should be separate (as you had it before), and needs to be
> submitted now because it already causes issues. We can add:
> Cc: [email protected]
> Fixes: d59158162e032 ("tracing: Add support for preempt and irq
enable/disable events")
Ok, I'll split and resubmit with your reasoning in the changelog.
Just to clarify, my changelog was based on older patches, that's why the
reason appears different but the root cause is the same. In an earlier
series, I was doing lockdep_{off,on} around the rcu_read_lock_sched call so
that stage wasn't splatting, but then after the read_lock is held, the
tracepoint probe such as those registered by preemptoff tracer was
acquiring locks and causing the same splat.
Anyway, this just for some justification of why changelog appears to
present a different scenario with the same fix. But I'll change it to yours
and resubmit with the tags, thanks a lot for looking into it,
thanks,
- Joel
On Tue, 1 May 2018 07:18:17 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index 2ec618979b20..a1c4947be877 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -135,6 +135,11 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp)
> > */
> > #define srcu_dereference(p, sp) srcu_dereference_check((p), (sp), 0)
> >
> > +/**
> > + * srcu_dereference_notrace - no tracing and no lockdep calls from here
> > + */
> > +#define srcu_dereference_notrace(p, sp) srcu_dereference_check((p), (sp), 1)
>
> Given that this is a one-liner, why not just use srcu_dereference_check()
> directly, with 1 as you do above to disable lockdep?
I prefer what Joel did. It documents why we are doing this. Open coding
the call directly will just confuse others that touch tracepoint code.
-- Steve
On Tue, May 1, 2018 at 7:06 AM Steven Rostedt <[email protected]> wrote:
> On Mon, 30 Apr 2018 18:42:01 -0700
> Joel Fernandes <[email protected]> wrote:
> > In this series, we are making lockdep use an rcuidle tracepoint. For
> > this reason we need a notrace variant of srcu_dereference since
> > otherwise we get lockdep splats since lockdep hooks may not have run
> > yet. This patch adds the needed variant.
> This change log is rather confusing. Why would lockdep use an rcuidle
> tracepoint? I think we need to explain more here.
Patch 6/6 registers lockdep onto irq_disable and irq_enable tracepoints
which use rcuidle:
https://github.com/torvalds/linux/blob/master/kernel/trace/trace_irqsoff.c#L791
I can add more details to the change log about this.
thanks,
- Joel
On Tue, May 1, 2018 at 7:34 AM Paul E. McKenney <[email protected]>
wrote:
> On Tue, May 01, 2018 at 10:24:01AM -0400, Steven Rostedt wrote:
> > On Mon, 30 Apr 2018 18:42:03 -0700
> > Joel Fernandes <[email protected]> wrote:
> >
> > > In recent tests with IRQ on/off tracepoints, a large performance
> > > overhead ~10% is noticed when running hackbench. This is root caused
to
> > > calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> > > tracepoint code. Following a long discussion on the list [1] about
this,
> > > we concluded that srcu is a better alternative for use during rcu
idle.
> > > Although it does involve extra barriers, its lighter than the
sched-rcu
> > > version which has to do additional RCU calls to notify RCU idle about
> > > entry into RCU sections.
> > >
> > > In this patch, we change the underlying implementation of the
> > > trace_*_rcuidle API to use SRCU. This has shown to improve performance
> > > alot for the high frequency irq enable/disable tracepoints.
> [ . . . ]
> > > --- a/kernel/tracepoint.c
> > > +++ b/kernel/tracepoint.c
> > > @@ -31,6 +31,9 @@
> > > extern struct tracepoint * const __start___tracepoints_ptrs[];
> > > extern struct tracepoint * const __stop___tracepoints_ptrs[];
> > >
> > > +DEFINE_SRCU(tracepoint_srcu);
> > > +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> > > +
> > > /* Set to 1 to enable tracepoint debug output */
> > > static const int tracepoint_debug;
> > >
> > > @@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
> > > return p == NULL ? NULL : p->probes;
> > > }
> > >
> > > -static void rcu_free_old_probes(struct rcu_head *head)
> > > +static void srcu_free_old_probes(struct rcu_head *head)
> > > {
> > > kfree(container_of(head, struct tp_probes, rcu));
> > > }
> > >
> > > +static void rcu_free_old_probes(struct rcu_head *head)
> > > +{
> > > + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> >
> > Hmm, is it OK to call call_srcu() from a call_rcu() callback? I guess
> > it would be.
> It is perfectly legal, and quite a bit simpler than setting something
> up to wait for both to complete concurrently.
Cool. Also in this case if we call both in sequence, then I felt there
could be a race to free the old data since both callbacks would try to do
the same thing. The same thing being freeing of the same set of old probes
which would need some synchronization between the 2 callbacks. With the
chaining, since the ordering is assured there wouldn't be a question of
such a race. I could add this reasoning to the changelog as well.
thanks,
- Joel
On Tue, May 01, 2018 at 03:16:02PM +0000, Joel Fernandes wrote:
> On Tue, May 1, 2018 at 7:34 AM Paul E. McKenney <[email protected]>
> wrote:
>
> > On Tue, May 01, 2018 at 10:24:01AM -0400, Steven Rostedt wrote:
> > > On Mon, 30 Apr 2018 18:42:03 -0700
> > > Joel Fernandes <[email protected]> wrote:
> > >
> > > > In recent tests with IRQ on/off tracepoints, a large performance
> > > > overhead ~10% is noticed when running hackbench. This is root caused
> to
> > > > calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> > > > tracepoint code. Following a long discussion on the list [1] about
> this,
> > > > we concluded that srcu is a better alternative for use during rcu
> idle.
> > > > Although it does involve extra barriers, its lighter than the
> sched-rcu
> > > > version which has to do additional RCU calls to notify RCU idle about
> > > > entry into RCU sections.
> > > >
> > > > In this patch, we change the underlying implementation of the
> > > > trace_*_rcuidle API to use SRCU. This has shown to improve performance
> > > > alot for the high frequency irq enable/disable tracepoints.
>
> > [ . . . ]
>
> > > > --- a/kernel/tracepoint.c
> > > > +++ b/kernel/tracepoint.c
> > > > @@ -31,6 +31,9 @@
> > > > extern struct tracepoint * const __start___tracepoints_ptrs[];
> > > > extern struct tracepoint * const __stop___tracepoints_ptrs[];
> > > >
> > > > +DEFINE_SRCU(tracepoint_srcu);
> > > > +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> > > > +
> > > > /* Set to 1 to enable tracepoint debug output */
> > > > static const int tracepoint_debug;
> > > >
> > > > @@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
> > > > return p == NULL ? NULL : p->probes;
> > > > }
> > > >
> > > > -static void rcu_free_old_probes(struct rcu_head *head)
> > > > +static void srcu_free_old_probes(struct rcu_head *head)
> > > > {
> > > > kfree(container_of(head, struct tp_probes, rcu));
> > > > }
> > > >
> > > > +static void rcu_free_old_probes(struct rcu_head *head)
> > > > +{
> > > > + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> > >
> > > Hmm, is it OK to call call_srcu() from a call_rcu() callback? I guess
> > > it would be.
>
> > It is perfectly legal, and quite a bit simpler than setting something
> > up to wait for both to complete concurrently.
>
> Cool. Also in this case if we call both in sequence, then I felt there
> could be a race to free the old data since both callbacks would try to do
> the same thing. The same thing being freeing of the same set of old probes
> which would need some synchronization between the 2 callbacks. With the
> chaining, since the ordering is assured there wouldn't be a question of
> such a race. I could add this reasoning to the changelog as well.
Actually, as long as you have a solid happens-before between both of the
callbacks and the freeing, you are in good shape. A release-acquire would
work fine, as would a lock acquired in both callbacks and then acquired
(and possibly released) before the free.
Thanx, Paul
On Tue, May 1, 2018 at 8:20 AM Paul E. McKenney <[email protected]>
wrote:
[...]
> > > > > --- a/kernel/tracepoint.c
> > > > > +++ b/kernel/tracepoint.c
> > > > > @@ -31,6 +31,9 @@
> > > > > extern struct tracepoint * const __start___tracepoints_ptrs[];
> > > > > extern struct tracepoint * const __stop___tracepoints_ptrs[];
> > > > >
> > > > > +DEFINE_SRCU(tracepoint_srcu);
> > > > > +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> > > > > +
> > > > > /* Set to 1 to enable tracepoint debug output */
> > > > > static const int tracepoint_debug;
> > > > >
> > > > > @@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
> > > > > return p == NULL ? NULL : p->probes;
> > > > > }
> > > > >
> > > > > -static void rcu_free_old_probes(struct rcu_head *head)
> > > > > +static void srcu_free_old_probes(struct rcu_head *head)
> > > > > {
> > > > > kfree(container_of(head, struct tp_probes, rcu));
> > > > > }
> > > > >
> > > > > +static void rcu_free_old_probes(struct rcu_head *head)
> > > > > +{
> > > > > + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> > > >
> > > > Hmm, is it OK to call call_srcu() from a call_rcu() callback? I
guess
> > > > it would be.
> >
> > > It is perfectly legal, and quite a bit simpler than setting something
> > > up to wait for both to complete concurrently.
> >
> > Cool. Also in this case if we call both in sequence, then I felt there
> > could be a race to free the old data since both callbacks would try to
do
> > the same thing. The same thing being freeing of the same set of old
probes
> > which would need some synchronization between the 2 callbacks. With the
> > chaining, since the ordering is assured there wouldn't be a question of
> > such a race. I could add this reasoning to the changelog as well.
> Actually, as long as you have a solid happens-before between both of the
> callbacks and the freeing, you are in good shape. A release-acquire would
> work fine, as would a lock acquired in both callbacks and then acquired
> (and possibly released) before the free.
Got it, thanks. For now, if its Ok with you and others, I will leave it in
the chained configuration. I also feel this is temporary since in the
future if we switch to single rcu mechanism for tracepoints (srcu), then we
could do with just a single callback.
thanks,
- Joel
On Tue, May 01, 2018 at 11:04:03AM -0400, Steven Rostedt wrote:
> On Tue, 1 May 2018 07:18:17 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
>
> > > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > > index 2ec618979b20..a1c4947be877 100644
> > > --- a/include/linux/srcu.h
> > > +++ b/include/linux/srcu.h
> > > @@ -135,6 +135,11 @@ static inline int srcu_read_lock_held(const struct srcu_struct *sp)
> > > */
> > > #define srcu_dereference(p, sp) srcu_dereference_check((p), (sp), 0)
> > >
> > > +/**
> > > + * srcu_dereference_notrace - no tracing and no lockdep calls from here
> > > + */
> > > +#define srcu_dereference_notrace(p, sp) srcu_dereference_check((p), (sp), 1)
> >
> > Given that this is a one-liner, why not just use srcu_dereference_check()
> > directly, with 1 as you do above to disable lockdep?
>
> I prefer what Joel did. It documents why we are doing this. Open coding
> the call directly will just confuse others that touch tracepoint code.
I suppose...
Reviewed-by: Paul E. McKenney <[email protected]>
On Tue, May 01, 2018 at 03:23:52PM +0000, Joel Fernandes wrote:
> On Tue, May 1, 2018 at 8:20 AM Paul E. McKenney <[email protected]>
> wrote:
> [...]
> > > > > > --- a/kernel/tracepoint.c
> > > > > > +++ b/kernel/tracepoint.c
> > > > > > @@ -31,6 +31,9 @@
> > > > > > extern struct tracepoint * const __start___tracepoints_ptrs[];
> > > > > > extern struct tracepoint * const __stop___tracepoints_ptrs[];
> > > > > >
> > > > > > +DEFINE_SRCU(tracepoint_srcu);
> > > > > > +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> > > > > > +
> > > > > > /* Set to 1 to enable tracepoint debug output */
> > > > > > static const int tracepoint_debug;
> > > > > >
> > > > > > @@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
> > > > > > return p == NULL ? NULL : p->probes;
> > > > > > }
> > > > > >
> > > > > > -static void rcu_free_old_probes(struct rcu_head *head)
> > > > > > +static void srcu_free_old_probes(struct rcu_head *head)
> > > > > > {
> > > > > > kfree(container_of(head, struct tp_probes, rcu));
> > > > > > }
> > > > > >
> > > > > > +static void rcu_free_old_probes(struct rcu_head *head)
> > > > > > +{
> > > > > > + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> > > > >
> > > > > Hmm, is it OK to call call_srcu() from a call_rcu() callback? I
> guess
> > > > > it would be.
> > >
> > > > It is perfectly legal, and quite a bit simpler than setting something
> > > > up to wait for both to complete concurrently.
> > >
> > > Cool. Also in this case if we call both in sequence, then I felt there
> > > could be a race to free the old data since both callbacks would try to
> do
> > > the same thing. The same thing being freeing of the same set of old
> probes
> > > which would need some synchronization between the 2 callbacks. With the
> > > chaining, since the ordering is assured there wouldn't be a question of
> > > such a race. I could add this reasoning to the changelog as well.
>
> > Actually, as long as you have a solid happens-before between both of the
> > callbacks and the freeing, you are in good shape. A release-acquire would
> > work fine, as would a lock acquired in both callbacks and then acquired
> > (and possibly released) before the free.
>
> Got it, thanks. For now, if its Ok with you and others, I will leave it in
> the chained configuration. I also feel this is temporary since in the
> future if we switch to single rcu mechanism for tracepoints (srcu), then we
> could do with just a single callback.
I have no problem with chained callbacks. Heck, I even test them in
rcutorture. ;-)
Thanx, Paul
Missed replying to some comments..
On Tue, May 1, 2018 at 7:24 AM Steven Rostedt <[email protected]> wrote:
> On Mon, 30 Apr 2018 18:42:03 -0700
> Joel Fernandes <[email protected]> wrote:
> > In recent tests with IRQ on/off tracepoints, a large performance
> > overhead ~10% is noticed when running hackbench. This is root caused to
> > calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> > tracepoint code. Following a long discussion on the list [1] about this,
> > we concluded that srcu is a better alternative for use during rcu idle.
> > Although it does involve extra barriers, its lighter than the sched-rcu
> > version which has to do additional RCU calls to notify RCU idle about
> > entry into RCU sections.
> >
> > In this patch, we change the underlying implementation of the
> > trace_*_rcuidle API to use SRCU. This has shown to improve performance
> > alot for the high frequency irq enable/disable tracepoints.
> Can you post some numbers?
Sure, I will post them in the next revision.
> > Test: Tested idle and preempt/irq tracepoints.
> >
> > [1] https://patchwork.kernel.org/patch/10344297/
> > [...]
> > include/linux/tracepoint.h | 46 +++++++++++++++++++++++++++++++-------
> > kernel/tracepoint.c | 10 ++++++++-
> > 2 files changed, 47 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> > index c94f466d57ef..4135e08fb5f1 100644
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -15,6 +15,7 @@
> > */
> >
> > #include <linux/smp.h>
> > +#include <linux/srcu.h>
> > #include <linux/errno.h>
> > #include <linux/types.h>
> > #include <linux/cpumask.h>
> > @@ -33,6 +34,8 @@ struct trace_eval_map {
> >
> > #define TRACEPOINT_DEFAULT_PRIO 10
> >
> > +extern struct srcu_struct tracepoint_srcu;
> > +
> > extern int
> > tracepoint_probe_register(struct tracepoint *tp, void *probe, void
*data);
> > extern int
> > @@ -77,6 +80,9 @@ int unregister_tracepoint_module_notifier(struct
notifier_block *nb)
> > */
> > static inline void tracepoint_synchronize_unregister(void)
> > {
> > +#ifdef CONFIG_TRACEPOINTS
> > + synchronize_srcu(&tracepoint_srcu);
> > +#endif
> Not related to your patch, but I find it interesting that we don't make
> this function a nop if CONFIG_TRACEPOINTS is not set. Is it because
> something might rely on our implementation that we call
> synchronize_sched here? I think that's a too tight of a coupling for
> others to rely on this, especially since it's not in the comments about
> this function.
If there's no CONFIG_TRACEPOINTS, then nothing should be replying on the
implementation?
Basically, if !TRACEPOINTS, then there shouldn't be any active rcu read
sections calling probes.
> Again, not related to this series, but something we should probably
> consider in the future. It would require auditing users of this too.
Yes, probably could be a noop in the future.
> > synchronize_sched();
> > }
> >
> > @@ -129,18 +135,38 @@ extern void syscall_unregfunc(void);
> > * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
> > * "void *data", where as the DECLARE_TRACE() will pass in "void
*data, proto".
> > */
> > -#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
> > +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> > do { \
> > struct tracepoint_func *it_func_ptr; \
> > void *it_func; \
> > void *__data; \
> > + int __maybe_unused idx = 0; \
> > \
> > if (!(cond)) \
> > return; \
> > - if (rcucheck) \
> > - rcu_irq_enter_irqson(); \
> > - rcu_read_lock_sched_notrace(); \
> > - it_func_ptr = rcu_dereference_sched((tp)->funcs); \
> > + \
> > + /* \
> > + * For rcuidle callers, use srcu since sched-rcu \
> > + * doesn't work from the idle path. \
> > + */ \
> > + if (rcuidle) { \
> > + if (in_nmi()) { \
> > + WARN_ON_ONCE(1); \
> > + return; /* no srcu from nmi */ \
> > + } \
> > + \
> > + /* To keep it consistent with !rcuidle path */ \
> > + preempt_disable_notrace(); \
> Why not disable preemption after taking the srcu lock?
Sure. I don't have a strong preference for either way so I could disable it
after.
[...]
> > #ifndef MODULE
> > diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> > index 671b13457387..b3b1d65a2460 100644
> > --- a/kernel/tracepoint.c
> > +++ b/kernel/tracepoint.c
> > @@ -31,6 +31,9 @@
> > extern struct tracepoint * const __start___tracepoints_ptrs[];
> > extern struct tracepoint * const __stop___tracepoints_ptrs[];
> >
> > +DEFINE_SRCU(tracepoint_srcu);
> > +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> > +
> > /* Set to 1 to enable tracepoint debug output */
> > static const int tracepoint_debug;
> >
> > @@ -67,11 +70,16 @@ static inline void *allocate_probes(int count)
> > return p == NULL ? NULL : p->probes;
> > }
> >
> > -static void rcu_free_old_probes(struct rcu_head *head)
> > +static void srcu_free_old_probes(struct rcu_head *head)
> > {
> > kfree(container_of(head, struct tp_probes, rcu));
> > }
> >
> > +static void rcu_free_old_probes(struct rcu_head *head)
> > +{
> > + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> Hmm, is it OK to call call_srcu() from a call_rcu() callback? I guess
> it would be.
> I think we should add a comment to why we are doing this. Something
> like:
> /*
> * Tracepoint probes are protected by both sched RCU and SRCU, by
> * calling the SRCU callback in the sched RCU callback we cover
> * both cases.
> */
> Or something along those lines.
Ok I'll add these. Thanks,
- Joel
On Tue, May 1, 2018 at 7:24 AM Steven Rostedt <[email protected]> wrote:
> On Mon, 30 Apr 2018 18:42:03 -0700
> Joel Fernandes <[email protected]> wrote:
> > In recent tests with IRQ on/off tracepoints, a large performance
> > overhead ~10% is noticed when running hackbench. This is root caused to
> > calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> > tracepoint code. Following a long discussion on the list [1] about this,
> > we concluded that srcu is a better alternative for use during rcu idle.
> > Although it does involve extra barriers, its lighter than the sched-rcu
> > version which has to do additional RCU calls to notify RCU idle about
> > entry into RCU sections.
> >
> > In this patch, we change the underlying implementation of the
> > trace_*_rcuidle API to use SRCU. This has shown to improve performance
> > alot for the high frequency irq enable/disable tracepoints.
> Can you post some numbers?
Here are some numbers (which I'll also include in the next series spin):
With a run of the following 30 times on a single core x86 Qemu instance
with 1GB memory:
hackbench -g 4 -f 2 -l 3000
Completion times in seconds. CONFIG_PROVE_LOCKING=y.
No patches (without this series)
Mean: 3.048
Median: 3.025
Std Dev: 0.064
With Lockdep using irq tracepoints with RCU implementation:
Mean: 3.451 (-11.66 %)
Median: 3.447 (-12.22%)
Std Dev: 0.049
With Lockdep using irq tracepoints with SRCU implementation:
Mean: 3.020 (I would consider the improvement against the "without this
series" case as just noise).
Median: 3.013
Std Dev: 0.033
thanks,
- Joel