2018-04-17 04:11:16

by Joel Fernandes

[permalink] [raw]
Subject: [RFC v4 0/4] Centralize and unify usage of preempt/irq tracepoints

Hello,
This is the next revision of preempt/irq tracepoint centralization and
unified usage across the kernel [1].
The preempt/irq tracepoints exist but not everything in the kernel is
using it. This makes things not work simultaneously (for ex, only
lockdep or irqsoff events can be used). This series is an attempt to
solve that.

I fixed several issues since last rev and I'm posting it slightly early
than I wanted to for some feedback especially for the new patches 1-3 in
this series. Also any testing is deeply appreciated.

Status:
Currently I'm noticing a performance issue with "enabling" of the
preemptoff tracer (not during its running but the enabling/startup of
it). I narrowed this down to rcu_irq_enter_irqson and
rcu_irq_exit_irqson calls in tracepoint.h. Commenting these out makes
the perf issue go away. I also notice that disabling function tracer
also makes the issue go away. Note that I ran hackbench and don't notice
any performance issues with my patch itself per-se, however the
preemptoff tracer startup seems slow and I'm debugging that. If you have
a clue why this might be happening, please let me know.

Note that I skipped v3 because I used that number already for some
internal reviews of these patches. Also I changed the subject back to
RFC in light of the lockdep issues I fixed this time and added some more
folks for review. Thanks much for any review or testing.

[1] https://www.spinics.net/lists/kernel/msg2750826.html

Joel Fernandes (4):
tracepoint: Add API to not do lockdep checks during RCU ops
softirq: reorder trace_softirqs_on to prevent lockdep splat
irqflags: Avoid unnecessary calls to trace_ if you can
tracing: Centralize preemptirq tracepoints and unify their usage

include/linux/ftrace.h | 11 +-
include/linux/irqflags.h | 32 +++--
include/linux/lockdep.h | 8 +-
include/linux/preempt.h | 2 +-
include/linux/tracepoint.h | 32 ++++-
include/trace/events/preemptirq.h | 23 ++--
init/main.c | 5 +-
kernel/locking/lockdep.c | 35 ++---
kernel/sched/core.c | 2 +-
kernel/softirq.c | 6 +-
kernel/trace/Kconfig | 22 +++-
kernel/trace/Makefile | 2 +-
kernel/trace/trace_irqsoff.c | 206 +++++++-----------------------
kernel/trace/trace_preemptirq.c | 73 +++++++++++
14 files changed, 229 insertions(+), 230 deletions(-)
create mode 100644 kernel/trace/trace_preemptirq.c

Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
--
2.17.0.484.g0c8726318c-goog



2018-04-17 04:09:47

by Joel Fernandes

[permalink] [raw]
Subject: [RFC v4 2/4] softirq: reorder trace_softirqs_on to prevent lockdep splat

Currently there's an issue which is not always reproducible, where the
softirqs annotation in lockdep goes out of sync with the reality of the
world and causes a lockdep splat. This is because the preemptoff tracer
calls into lockdep which can cause a softirqs invalid annotation splat.

The same issue was fixed in local_bh_disable_ip.
/*
* The preempt tracer hooks into preempt_count_add and will break
* lockdep because it calls back into lockdep after SOFTIRQ_OFFSET
* is set and before current->softirq_enabled is cleared.
* We must manually increment preempt_count here and manually
* call the trace_preempt_off later.
*/

I am not fully sure why the issue is reproducible with my patches easily,
but it seems like the same ordering issue can exist so we ought to fix
it.

Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
kernel/softirq.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 177de3640c78..8a040bcaa033 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -139,9 +139,13 @@ static void __local_bh_enable(unsigned int cnt)
{
lockdep_assert_irqs_disabled();

+ if (preempt_count() == cnt)
+ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+
if (softirq_count() == (cnt & SOFTIRQ_MASK))
trace_softirqs_on(_RET_IP_);
- preempt_count_sub(cnt);
+
+ __preempt_count_sub(cnt);
}

/*
--
2.17.0.484.g0c8726318c-goog


2018-04-17 04:10:00

by Joel Fernandes

[permalink] [raw]
Subject: [RFC v4 4/4] tracing: Centralize preemptirq tracepoints and unify their usage

This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.

Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.

* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
3 users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irqs and preempt trace events

The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.

In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:

PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING
| | \ | |
\ (selects) / \ \ (selects) /
TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS
\ /
\ (depends on) /
PREEMPTIRQ_TRACEPOINTS

One note, I have to check for lockdep recursion in the code that calls
the trace events API and bail out if we're in lockdep recursion
protection to prevent something like the following case: a spin_lock is
taken. Then lockdep_acquired is called. That does a raw_local_irq_save
and then sets lockdep_recursion, and then calls __lockdep_acquired. In
this function, a call to get_lock_stats happens which calls
preempt_disable, which calls trace IRQS off somewhere which enters my
tracepoint code and sets the tracing_irq_cpu flag to prevent recursion.

Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
include/linux/ftrace.h | 11 +-
include/linux/irqflags.h | 11 +-
include/linux/lockdep.h | 8 +-
include/linux/preempt.h | 2 +-
include/trace/events/preemptirq.h | 23 ++--
init/main.c | 5 +-
kernel/locking/lockdep.c | 35 ++---
kernel/sched/core.c | 2 +-
kernel/trace/Kconfig | 22 +++-
kernel/trace/Makefile | 2 +-
kernel/trace/trace_irqsoff.c | 206 +++++++-----------------------
kernel/trace/trace_preemptirq.c | 73 +++++++++++
12 files changed, 184 insertions(+), 216 deletions(-)
create mode 100644 kernel/trace/trace_preemptirq.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 9c3c9a319e48..5191030af0c0 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -709,16 +709,7 @@ static inline unsigned long get_lock_parent_ip(void)
return CALLER_ADDR2;
}

-#ifdef CONFIG_IRQSOFF_TRACER
- extern void time_hardirqs_on(unsigned long a0, unsigned long a1);
- extern void time_hardirqs_off(unsigned long a0, unsigned long a1);
-#else
- static inline void time_hardirqs_on(unsigned long a0, unsigned long a1) { }
- static inline void time_hardirqs_off(unsigned long a0, unsigned long a1) { }
-#endif
-
-#if defined(CONFIG_PREEMPT_TRACER) || \
- (defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPTIRQ_EVENTS))
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
extern void trace_preempt_on(unsigned long a0, unsigned long a1);
extern void trace_preempt_off(unsigned long a0, unsigned long a1);
#else
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index ea8df0ac57d3..26db5540cb64 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -15,9 +15,16 @@
#include <linux/typecheck.h>
#include <asm/irqflags.h>

-#ifdef CONFIG_TRACE_IRQFLAGS
+/* Currently trace_softirqs_on/off is used only by lockdep */
+#ifdef CONFIG_PROVE_LOCKING
extern void trace_softirqs_on(unsigned long ip);
extern void trace_softirqs_off(unsigned long ip);
+#else
+# define trace_softirqs_on(ip) do { } while (0)
+# define trace_softirqs_off(ip) do { } while (0)
+#endif
+
+#ifdef CONFIG_TRACE_IRQFLAGS
extern void trace_hardirqs_on(void);
extern void trace_hardirqs_off(void);
# define trace_hardirq_context(p) ((p)->hardirq_context)
@@ -43,8 +50,6 @@ do { \
#else
# define trace_hardirqs_on() do { } while (0)
# define trace_hardirqs_off() do { } while (0)
-# define trace_softirqs_on(ip) do { } while (0)
-# define trace_softirqs_off(ip) do { } while (0)
# define trace_hardirq_context(p) 0
# define trace_softirq_context(p) 0
# define trace_hardirqs_enabled(p) 0
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 6fc77d4dbdcd..a8113357ceeb 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -266,7 +266,8 @@ struct held_lock {
/*
* Initialization, self-test and debugging-output methods:
*/
-extern void lockdep_info(void);
+extern void lockdep_init(void);
+extern void lockdep_init_early(void);
extern void lockdep_reset(void);
extern void lockdep_reset_lock(struct lockdep_map *lock);
extern void lockdep_free_key_range(void *start, unsigned long size);
@@ -406,7 +407,8 @@ static inline void lockdep_on(void)
# define lock_downgrade(l, i) do { } while (0)
# define lock_set_class(l, n, k, s, i) do { } while (0)
# define lock_set_subclass(l, s, i) do { } while (0)
-# define lockdep_info() do { } while (0)
+# define lockdep_init() do { } while (0)
+# define lockdep_init_early() do { } while (0)
# define lockdep_init_map(lock, name, key, sub) \
do { (void)(name); (void)(key); } while (0)
# define lockdep_set_class(lock, key) do { (void)(key); } while (0)
@@ -532,7 +534,7 @@ do { \

#endif /* CONFIG_LOCKDEP */

-#ifdef CONFIG_TRACE_IRQFLAGS
+#ifdef CONFIG_PROVE_LOCKING
extern void print_irqtrace_events(struct task_struct *curr);
#else
static inline void print_irqtrace_events(struct task_struct *curr)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 5bd3f151da78..c01813c3fbe9 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -150,7 +150,7 @@
*/
#define in_atomic_preempt_off() (preempt_count() != PREEMPT_DISABLE_OFFSET)

-#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_TRACE_PREEMPT_TOGGLE)
extern void preempt_count_add(int val);
extern void preempt_count_sub(int val);
#define preempt_count_dec_and_test() \
diff --git a/include/trace/events/preemptirq.h b/include/trace/events/preemptirq.h
index 9c4eb33c5a1d..9a0d4ceeb166 100644
--- a/include/trace/events/preemptirq.h
+++ b/include/trace/events/preemptirq.h
@@ -1,4 +1,4 @@
-#ifdef CONFIG_PREEMPTIRQ_EVENTS
+#ifdef CONFIG_PREEMPTIRQ_TRACEPOINTS

#undef TRACE_SYSTEM
#define TRACE_SYSTEM preemptirq
@@ -32,7 +32,7 @@ DECLARE_EVENT_CLASS(preemptirq_template,
(void *)((unsigned long)(_stext) + __entry->parent_offs))
);

-#ifndef CONFIG_PROVE_LOCKING
+#ifdef CONFIG_TRACE_IRQFLAGS
DEFINE_EVENT(preemptirq_template, irq_disable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
@@ -40,9 +40,14 @@ DEFINE_EVENT(preemptirq_template, irq_disable,
DEFINE_EVENT(preemptirq_template, irq_enable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
+#else
+#define trace_irq_enable(...)
+#define trace_irq_disable(...)
+#define trace_irq_enable_rcuidle(...)
+#define trace_irq_disable_rcuidle(...)
#endif

-#ifdef CONFIG_DEBUG_PREEMPT
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
DEFINE_EVENT(preemptirq_template, preempt_disable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
@@ -50,22 +55,22 @@ DEFINE_EVENT(preemptirq_template, preempt_disable,
DEFINE_EVENT(preemptirq_template, preempt_enable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
+#else
+#define trace_preempt_enable(...)
+#define trace_preempt_disable(...)
+#define trace_preempt_enable_rcuidle(...)
+#define trace_preempt_disable_rcuidle(...)
#endif

#endif /* _TRACE_PREEMPTIRQ_H */

#include <trace/define_trace.h>

-#endif /* !CONFIG_PREEMPTIRQ_EVENTS */
-
-#if !defined(CONFIG_PREEMPTIRQ_EVENTS) || defined(CONFIG_PROVE_LOCKING)
+#else /* !CONFIG_PREEMPTIRQ_TRACEPOINTS */
#define trace_irq_enable(...)
#define trace_irq_disable(...)
#define trace_irq_enable_rcuidle(...)
#define trace_irq_disable_rcuidle(...)
-#endif
-
-#if !defined(CONFIG_PREEMPTIRQ_EVENTS) || !defined(CONFIG_DEBUG_PREEMPT)
#define trace_preempt_enable(...)
#define trace_preempt_disable(...)
#define trace_preempt_enable_rcuidle(...)
diff --git a/init/main.c b/init/main.c
index e4a3160991ea..34823072ef9e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -630,6 +630,9 @@ asmlinkage __visible void __init start_kernel(void)
call_function_init();
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
early_boot_irqs_disabled = false;
+
+ lockdep_init_early();
+
local_irq_enable();

kmem_cache_init_late();
@@ -644,7 +647,7 @@ asmlinkage __visible void __init start_kernel(void)
panic("Too many boot %s vars at `%s'", panic_later,
panic_param);

- lockdep_info();
+ lockdep_init();

/*
* Need to run this when irqs are enabled, because it wants
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 023386338269..871a42232858 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -55,6 +55,7 @@

#include "lockdep_internals.h"

+#include <trace/events/preemptirq.h>
#define CREATE_TRACE_POINTS
#include <trace/events/lock.h>

@@ -2841,10 +2842,9 @@ static void __trace_hardirqs_on_caller(unsigned long ip)
debug_atomic_inc(hardirqs_on_events);
}

-__visible void trace_hardirqs_on_caller(unsigned long ip)
+static void lockdep_hardirqs_on(void *none, unsigned long ignore,
+ unsigned long ip)
{
- time_hardirqs_on(CALLER_ADDR0, ip);
-
if (unlikely(!debug_locks || current->lockdep_recursion))
return;

@@ -2883,23 +2883,15 @@ __visible void trace_hardirqs_on_caller(unsigned long ip)
__trace_hardirqs_on_caller(ip);
current->lockdep_recursion = 0;
}
-EXPORT_SYMBOL(trace_hardirqs_on_caller);
-
-void trace_hardirqs_on(void)
-{
- trace_hardirqs_on_caller(CALLER_ADDR0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on);

/*
* Hardirqs were disabled:
*/
-__visible void trace_hardirqs_off_caller(unsigned long ip)
+static void lockdep_hardirqs_off(void *none, unsigned long ignore,
+ unsigned long ip)
{
struct task_struct *curr = current;

- time_hardirqs_off(CALLER_ADDR0, ip);
-
if (unlikely(!debug_locks || current->lockdep_recursion))
return;

@@ -2921,13 +2913,6 @@ __visible void trace_hardirqs_off_caller(unsigned long ip)
} else
debug_atomic_inc(redundant_hardirqs_off);
}
-EXPORT_SYMBOL(trace_hardirqs_off_caller);
-
-void trace_hardirqs_off(void)
-{
- trace_hardirqs_off_caller(CALLER_ADDR0);
-}
-EXPORT_SYMBOL(trace_hardirqs_off);

/*
* Softirqs will be enabled:
@@ -4334,7 +4319,15 @@ void lockdep_reset_lock(struct lockdep_map *lock)
raw_local_irq_restore(flags);
}

-void __init lockdep_info(void)
+void __init lockdep_init_early(void)
+{
+#ifdef CONFIG_PROVE_LOCKING
+ register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
+ register_trace_prio_irq_enable(lockdep_hardirqs_on, NULL, INT_MIN);
+#endif
+}
+
+void __init lockdep_init(void)
{
printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n");

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8afd6086f23..7271f6984f41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3178,7 +3178,7 @@ static inline void sched_tick_stop(int cpu) { }
#endif

#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
- defined(CONFIG_PREEMPT_TRACER))
+ defined(CONFIG_TRACE_PREEMPT_TOGGLE))
/*
* If the value passed in is equal to the current preempt count
* then we just disabled preemption. Start timing the latency.
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 0b249e2f0c3c..af027233b762 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -82,6 +82,15 @@ config RING_BUFFER_ALLOW_SWAP
Allow the use of ring_buffer_swap_cpu.
Adds a very slight overhead to tracing when enabled.

+config PREEMPTIRQ_TRACEPOINTS
+ bool
+ depends on TRACE_PREEMPT_TOGGLE || TRACE_IRQFLAGS
+ select TRACING
+ default y
+ help
+ Create preempt/irq toggle tracepoints if needed, so that other parts
+ of the kernel can use them to generate or add hooks to them.
+
# All tracer options should select GENERIC_TRACER. For those options that are
# enabled by all tracers (context switch and event tracer) they select TRACING.
# This allows those options to appear when no other tracer is selected. But the
@@ -159,18 +168,20 @@ config FUNCTION_GRAPH_TRACER
the return value. This is done by setting the current return
address on the current task structure into a stack of calls.

+config TRACE_PREEMPT_TOGGLE
+ bool
+ help
+ Enables hooks which will be called when preemption is first disabled,
+ and last enabled.

config PREEMPTIRQ_EVENTS
bool "Enable trace events for preempt and irq disable/enable"
select TRACE_IRQFLAGS
- depends on DEBUG_PREEMPT || !PROVE_LOCKING
- depends on TRACING
+ select TRACE_PREEMPT_TOGGLE if PREEMPT
+ select GENERIC_TRACER
default n
help
Enable tracing of disable and enable events for preemption and irqs.
- For tracing preempt disable/enable events, DEBUG_PREEMPT must be
- enabled. For tracing irq disable/enable events, PROVE_LOCKING must
- be disabled.

config IRQSOFF_TRACER
bool "Interrupts-off Latency Tracer"
@@ -207,6 +218,7 @@ config PREEMPT_TRACER
select RING_BUFFER_ALLOW_SWAP
select TRACER_SNAPSHOT
select TRACER_SNAPSHOT_PER_CPU_SWAP
+ select TRACE_PREEMPT_TOGGLE
help
This option measures the time spent in preemption-off critical
sections, with microsecond accuracy.
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index e2538c7638d4..84a0cb222f20 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -35,7 +35,7 @@ obj-$(CONFIG_TRACING) += trace_printk.o
obj-$(CONFIG_TRACING_MAP) += tracing_map.o
obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
obj-$(CONFIG_FUNCTION_TRACER) += trace_functions.o
-obj-$(CONFIG_PREEMPTIRQ_EVENTS) += trace_irqsoff.o
+obj-$(CONFIG_PREEMPTIRQ_TRACEPOINTS) += trace_preemptirq.o
obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 03ecb4465ee4..e3cec13dd935 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -16,7 +16,6 @@

#include "trace.h"

-#define CREATE_TRACE_POINTS
#include <trace/events/preemptirq.h>

#if defined(CONFIG_IRQSOFF_TRACER) || defined(CONFIG_PREEMPT_TRACER)
@@ -450,66 +449,6 @@ void stop_critical_timings(void)
}
EXPORT_SYMBOL_GPL(stop_critical_timings);

-#ifdef CONFIG_IRQSOFF_TRACER
-#ifdef CONFIG_PROVE_LOCKING
-void time_hardirqs_on(unsigned long a0, unsigned long a1)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(a0, a1);
-}
-
-void time_hardirqs_off(unsigned long a0, unsigned long a1)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(a0, a1);
-}
-
-#else /* !CONFIG_PROVE_LOCKING */
-
-/*
- * We are only interested in hardirq on/off events:
- */
-static inline void tracer_hardirqs_on(void)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
-}
-
-static inline void tracer_hardirqs_off(void)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
-}
-
-static inline void tracer_hardirqs_on_caller(unsigned long caller_addr)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(CALLER_ADDR0, caller_addr);
-}
-
-static inline void tracer_hardirqs_off_caller(unsigned long caller_addr)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(CALLER_ADDR0, caller_addr);
-}
-
-#endif /* CONFIG_PROVE_LOCKING */
-#endif /* CONFIG_IRQSOFF_TRACER */
-
-#ifdef CONFIG_PREEMPT_TRACER
-static inline void tracer_preempt_on(unsigned long a0, unsigned long a1)
-{
- if (preempt_trace() && !irq_trace())
- stop_critical_timing(a0, a1);
-}
-
-static inline void tracer_preempt_off(unsigned long a0, unsigned long a1)
-{
- if (preempt_trace() && !irq_trace())
- start_critical_timing(a0, a1);
-}
-#endif /* CONFIG_PREEMPT_TRACER */
-
#ifdef CONFIG_FUNCTION_TRACER
static bool function_enabled;

@@ -659,10 +598,28 @@ static void irqsoff_tracer_stop(struct trace_array *tr)
}

#ifdef CONFIG_IRQSOFF_TRACER
+/*
+ * We are only interested in hardirq on/off events:
+ */
+static void tracer_hardirqs_on(void *none, unsigned long a0, unsigned long a1)
+{
+ if (!preempt_trace() && irq_trace())
+ stop_critical_timing(a0, a1);
+}
+
+static void tracer_hardirqs_off(void *none, unsigned long a0, unsigned long a1)
+{
+ if (!preempt_trace() && irq_trace())
+ start_critical_timing(a0, a1);
+}
+
static int irqsoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_IRQS_OFF;

+ register_trace_irq_disable(tracer_hardirqs_off, NULL);
+ register_trace_irq_enable(tracer_hardirqs_on, NULL);
+
return __irqsoff_tracer_init(tr);
}
static struct tracer irqsoff_tracer __read_mostly =
@@ -686,14 +643,31 @@ static struct tracer irqsoff_tracer __read_mostly =
};
# define register_irqsoff(trace) register_tracer(&trace)
#else
+static inline void tracer_hardirqs_on(unsigned long a0, unsigned long a1) { }
+static inline void tracer_hardirqs_off(unsigned long a0, unsigned long a1) { }
# define register_irqsoff(trace) do { } while (0)
-#endif
+#endif /* CONFIG_IRQSOFF_TRACER */

#ifdef CONFIG_PREEMPT_TRACER
+static void tracer_preempt_on(void *none, unsigned long a0, unsigned long a1)
+{
+ if (preempt_trace() && !irq_trace())
+ stop_critical_timing(a0, a1);
+}
+
+static void tracer_preempt_off(void *none, unsigned long a0, unsigned long a1)
+{
+ if (preempt_trace() && !irq_trace())
+ start_critical_timing(a0, a1);
+}
+
static int preemptoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_PREEMPT_OFF;

+ register_trace_preempt_disable(tracer_preempt_off, NULL);
+ register_trace_preempt_enable(tracer_preempt_on, NULL);
+
return __irqsoff_tracer_init(tr);
}

@@ -718,16 +692,22 @@ static struct tracer preemptoff_tracer __read_mostly =
};
# define register_preemptoff(trace) register_tracer(&trace)
#else
+static inline void tracer_preempt_on(unsigned long a0, unsigned long a1) { }
+static inline void tracer_preempt_off(unsigned long a0, unsigned long a1) { }
# define register_preemptoff(trace) do { } while (0)
-#endif
+#endif /* CONFIG_PREEMPT_TRACER */

-#if defined(CONFIG_IRQSOFF_TRACER) && \
- defined(CONFIG_PREEMPT_TRACER)
+#if defined(CONFIG_IRQSOFF_TRACER) && defined(CONFIG_PREEMPT_TRACER)

static int preemptirqsoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_IRQS_OFF | TRACER_PREEMPT_OFF;

+ register_trace_irq_disable(tracer_hardirqs_off, NULL);
+ register_trace_irq_enable(tracer_hardirqs_on, NULL);
+ register_trace_preempt_disable(tracer_preempt_off, NULL);
+ register_trace_preempt_enable(tracer_preempt_on, NULL);
+
return __irqsoff_tracer_init(tr);
}

@@ -766,99 +746,3 @@ __init static int init_irqsoff_tracer(void)
}
core_initcall(init_irqsoff_tracer);
#endif /* IRQSOFF_TRACER || PREEMPTOFF_TRACER */
-
-#ifndef CONFIG_IRQSOFF_TRACER
-static inline void tracer_hardirqs_on(void) { }
-static inline void tracer_hardirqs_off(void) { }
-static inline void tracer_hardirqs_on_caller(unsigned long caller_addr) { }
-static inline void tracer_hardirqs_off_caller(unsigned long caller_addr) { }
-#endif
-
-#ifndef CONFIG_PREEMPT_TRACER
-static inline void tracer_preempt_on(unsigned long a0, unsigned long a1) { }
-static inline void tracer_preempt_off(unsigned long a0, unsigned long a1) { }
-#endif
-
-#if defined(CONFIG_TRACE_IRQFLAGS) && !defined(CONFIG_PROVE_LOCKING)
-/* Per-cpu variable to prevent redundant calls when IRQs already off */
-static DEFINE_PER_CPU(int, tracing_irq_cpu);
-
-void trace_hardirqs_on(void)
-{
- if (!this_cpu_read(tracing_irq_cpu))
- return;
-
- trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
- tracer_hardirqs_on();
-
- this_cpu_write(tracing_irq_cpu, 0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on);
-
-void trace_hardirqs_off(void)
-{
- if (this_cpu_read(tracing_irq_cpu))
- return;
-
- this_cpu_write(tracing_irq_cpu, 1);
-
- trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
- tracer_hardirqs_off();
-}
-EXPORT_SYMBOL(trace_hardirqs_off);
-
-__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
-{
- if (!this_cpu_read(tracing_irq_cpu))
- return;
-
- trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
- tracer_hardirqs_on_caller(caller_addr);
-
- this_cpu_write(tracing_irq_cpu, 0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on_caller);
-
-__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
-{
- if (this_cpu_read(tracing_irq_cpu))
- return;
-
- this_cpu_write(tracing_irq_cpu, 1);
-
- trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
- tracer_hardirqs_off_caller(caller_addr);
-}
-EXPORT_SYMBOL(trace_hardirqs_off_caller);
-
-/*
- * Stubs:
- */
-
-void trace_softirqs_on(unsigned long ip)
-{
-}
-
-void trace_softirqs_off(unsigned long ip)
-{
-}
-
-inline void print_irqtrace_events(struct task_struct *curr)
-{
-}
-#endif
-
-#if defined(CONFIG_PREEMPT_TRACER) || \
- (defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPTIRQ_EVENTS))
-void trace_preempt_on(unsigned long a0, unsigned long a1)
-{
- trace_preempt_enable_rcuidle(a0, a1);
- tracer_preempt_on(a0, a1);
-}
-
-void trace_preempt_off(unsigned long a0, unsigned long a1)
-{
- trace_preempt_disable_rcuidle(a0, a1);
- tracer_preempt_off(a0, a1);
-}
-#endif
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
new file mode 100644
index 000000000000..82eb1974a163
--- /dev/null
+++ b/kernel/trace/trace_preemptirq.c
@@ -0,0 +1,73 @@
+/*
+ * preemptoff and irqoff tracepoints
+ *
+ * Copyright (C) 2017 Joel Fernandes <[email protected]>
+ */
+
+#include <linux/kallsyms.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/ftrace.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/preemptirq.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+/* Per-cpu variable to prevent redundant calls when IRQs already off */
+static DEFINE_PER_CPU(int, tracing_irq_cpu);
+
+#define traceirqs_recursion()
+
+void trace_hardirqs_on(void)
+{
+ if (current->lockdep_recursion || !this_cpu_read(tracing_irq_cpu))
+ return;
+
+ trace_irq_enable_rcuidle_nolockdep(CALLER_ADDR0, CALLER_ADDR1);
+ this_cpu_write(tracing_irq_cpu, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_on);
+
+void trace_hardirqs_off(void)
+{
+ if (current->lockdep_recursion || this_cpu_read(tracing_irq_cpu))
+ return;
+
+ this_cpu_write(tracing_irq_cpu, 1);
+ trace_irq_disable_rcuidle_nolockdep(CALLER_ADDR0, CALLER_ADDR1);
+}
+EXPORT_SYMBOL(trace_hardirqs_off);
+
+__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
+{
+ if (current->lockdep_recursion || !this_cpu_read(tracing_irq_cpu))
+ return;
+
+ trace_irq_enable_rcuidle_nolockdep(CALLER_ADDR0, caller_addr);
+ this_cpu_write(tracing_irq_cpu, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_on_caller);
+
+__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
+{
+ if (current->lockdep_recursion || this_cpu_read(tracing_irq_cpu))
+ return;
+
+ this_cpu_write(tracing_irq_cpu, 1);
+ trace_irq_disable_rcuidle_nolockdep(CALLER_ADDR0, caller_addr);
+}
+EXPORT_SYMBOL(trace_hardirqs_off_caller);
+#endif /* CONFIG_TRACE_IRQFLAGS */
+
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
+
+void trace_preempt_on(unsigned long a0, unsigned long a1)
+{
+ trace_preempt_enable_rcuidle_nolockdep(a0, a1);
+}
+
+void trace_preempt_off(unsigned long a0, unsigned long a1)
+{
+ trace_preempt_disable_rcuidle_nolockdep(a0, a1);
+}
+#endif
--
2.17.0.484.g0c8726318c-goog


2018-04-17 04:10:30

by Joel Fernandes

[permalink] [raw]
Subject: [RFC v4 1/4] tracepoint: Add API to not do lockdep checks during RCU ops

This series makes lockdeps use the trace irqs on/off tracepoints. Lets
introduce an API where lockdep checks are not done so that we don't get
false lockdep splats. This API introduction will not affect any of the
existing users of the tracepoint API and the new call will only be used
by the IRQs on/off tracepoints.

Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
include/linux/tracepoint.h | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index c94f466d57ef..0b6d7d1a16e2 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -119,6 +119,14 @@ extern void syscall_unregfunc(void);

#ifdef TRACEPOINTS_ENABLED

+#define LOCKDEP_CHECKS_ON(lockdep) \
+ if (IS_ENABLED(CONFIG_LOCKDEP) && !lockdep) \
+ lockdep_on();
+
+#define LOCKDEP_CHECKS_OFF(lockdep) \
+ if (IS_ENABLED(CONFIG_LOCKDEP) && !lockdep) \
+ lockdep_off();
+
/*
* it_func[0] is never NULL because there is at least one element in the array
* when the array itself is non NULL.
@@ -129,7 +137,7 @@ extern void syscall_unregfunc(void);
* as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
* "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
*/
-#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
+#define __DO_TRACE(tp, proto, args, cond, rcucheck, lockdep) \
do { \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
@@ -137,10 +145,14 @@ extern void syscall_unregfunc(void);
\
if (!(cond)) \
return; \
+ \
+ LOCKDEP_CHECKS_OFF(lockdep) \
if (rcucheck) \
rcu_irq_enter_irqson(); \
rcu_read_lock_sched_notrace(); \
it_func_ptr = rcu_dereference_sched((tp)->funcs); \
+ LOCKDEP_CHECKS_ON(lockdep) \
+ \
if (it_func_ptr) { \
do { \
it_func = (it_func_ptr)->func; \
@@ -148,9 +160,12 @@ extern void syscall_unregfunc(void);
((void(*)(proto))(it_func))(args); \
} while ((++it_func_ptr)->func); \
} \
+ \
+ LOCKDEP_CHECKS_OFF(lockdep) \
rcu_read_unlock_sched_notrace(); \
if (rcucheck) \
rcu_irq_exit_irqson(); \
+ LOCKDEP_CHECKS_ON(lockdep) \
} while (0)

#ifndef MODULE
@@ -161,7 +176,16 @@ extern void syscall_unregfunc(void);
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
- TP_CONDITION(cond), 1); \
+ TP_CONDITION(cond), 1, 1); \
+ } \
+ \
+ static inline void trace_##name##_rcuidle_nolockdep(proto) \
+ { \
+ if (static_key_false(&__tracepoint_##name.key)) \
+ __DO_TRACE(&__tracepoint_##name, \
+ TP_PROTO(data_proto), \
+ TP_ARGS(data_args), \
+ TP_CONDITION(cond), 1, 0); \
}
#else
#define __DECLARE_TRACE_RCU(name, proto, args, cond, data_proto, data_args)
@@ -187,7 +211,7 @@ extern void syscall_unregfunc(void);
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
- TP_CONDITION(cond), 0); \
+ TP_CONDITION(cond), 0, 1); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
rcu_read_lock_sched_notrace(); \
rcu_dereference_sched(__tracepoint_##name.funcs);\
@@ -254,6 +278,8 @@ extern void syscall_unregfunc(void);
{ } \
static inline void trace_##name##_rcuidle(proto) \
{ } \
+ static inline void trace_##name##_rcuidle_nolockdep(proto) \
+ { } \
static inline int \
register_trace_##name(void (*probe)(data_proto), \
void *data) \
--
2.17.0.484.g0c8726318c-goog


2018-04-17 04:10:39

by Joel Fernandes

[permalink] [raw]
Subject: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
to if local_irq_restore or local_irq_save didn't actually do anything.

This gives around a 4% improvement in performance when doing the
following command: "time find / > /dev/null"

Also its best to avoid these calls where possible, since in this series,
the RCU code in tracepoint.h seems to be call these quite a bit and I'd
like to keep this overhead low.

Cc: Steven Rostedt <[email protected]>
Cc: Peter Zilstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Glexiner <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fenguang Wu <[email protected]>
Cc: Baohong Liu <[email protected]>
Cc: Vedang Patel <[email protected]>
Signed-off-by: Joel Fernandes <[email protected]>
---
include/linux/irqflags.h | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 9700f00bbc04..ea8df0ac57d3 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -104,19 +104,20 @@ do { \
#define local_irq_save(flags) \
do { \
raw_local_irq_save(flags); \
- trace_hardirqs_off(); \
+ if (!raw_irqs_disabled_flags(flags)) \
+ trace_hardirqs_off(); \
} while (0)


-#define local_irq_restore(flags) \
- do { \
- if (raw_irqs_disabled_flags(flags)) { \
- raw_local_irq_restore(flags); \
- trace_hardirqs_off(); \
- } else { \
- trace_hardirqs_on(); \
- raw_local_irq_restore(flags); \
- } \
+#define local_irq_restore(flags) \
+ do { \
+ if (raw_irqs_disabled_flags(flags) && !irqs_disabled()) { \
+ raw_local_irq_restore(flags); \
+ trace_hardirqs_off(); \
+ } else if (!raw_irqs_disabled_flags(flags) && irqs_disabled()) {\
+ trace_hardirqs_on(); \
+ raw_local_irq_restore(flags); \
+ } \
} while (0)

#define safe_halt() \
--
2.17.0.484.g0c8726318c-goog


2018-04-18 09:04:20

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Mon, 16 Apr 2018 21:07:47 -0700
Joel Fernandes <[email protected]> wrote:

> With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
> to if local_irq_restore or local_irq_save didn't actually do anything.
>
> This gives around a 4% improvement in performance when doing the
> following command: "time find / > /dev/null"
>
> Also its best to avoid these calls where possible, since in this series,
> the RCU code in tracepoint.h seems to be call these quite a bit and I'd
> like to keep this overhead low.

Can we assume that the "flags" has only 1 bit irq-disable flag?
Since it skips calling raw_local_irq_restore(flags); too,
if there is any state in the flags on any arch, it may change the
result. In that case, we can do it as below (just skipping trace_hardirqs_*)

int disabled = irqs_disabled();

if (!raw_irqs_disabled_flags(flags) && disabled)
trace_hardirqs_on();

raw_local_irq_restore(flags);

if (raw_irqs_disabled_flags(flags) && !disabled)
trace_hardirqs_off();

Thank you,

>
> Cc: Steven Rostedt <[email protected]>
> Cc: Peter Zilstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Tom Zanussi <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Thomas Glexiner <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Fenguang Wu <[email protected]>
> Cc: Baohong Liu <[email protected]>
> Cc: Vedang Patel <[email protected]>
> Signed-off-by: Joel Fernandes <[email protected]>
> ---
> include/linux/irqflags.h | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
> index 9700f00bbc04..ea8df0ac57d3 100644
> --- a/include/linux/irqflags.h
> +++ b/include/linux/irqflags.h
> @@ -104,19 +104,20 @@ do { \
> #define local_irq_save(flags) \
> do { \
> raw_local_irq_save(flags); \
> - trace_hardirqs_off(); \
> + if (!raw_irqs_disabled_flags(flags)) \
> + trace_hardirqs_off(); \
> } while (0)
>
>
> -#define local_irq_restore(flags) \
> - do { \
> - if (raw_irqs_disabled_flags(flags)) { \
> - raw_local_irq_restore(flags); \
> - trace_hardirqs_off(); \
> - } else { \
> - trace_hardirqs_on(); \
> - raw_local_irq_restore(flags); \
> - } \
> +#define local_irq_restore(flags) \
> + do { \
> + if (raw_irqs_disabled_flags(flags) && !irqs_disabled()) { \
> + raw_local_irq_restore(flags); \
> + trace_hardirqs_off(); \
> + } else if (!raw_irqs_disabled_flags(flags) && irqs_disabled()) {\
> + trace_hardirqs_on(); \
> + raw_local_irq_restore(flags); \
> + } \
> } while (0)
>
> #define safe_halt() \
> --
> 2.17.0.484.g0c8726318c-goog
>


--
Masami Hiramatsu <[email protected]>

2018-04-19 05:44:28

by Namhyung Kim

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Wed, Apr 18, 2018 at 06:02:50PM +0900, Masami Hiramatsu wrote:
> On Mon, 16 Apr 2018 21:07:47 -0700
> Joel Fernandes <[email protected]> wrote:
>
> > With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
> > to if local_irq_restore or local_irq_save didn't actually do anything.
> >
> > This gives around a 4% improvement in performance when doing the
> > following command: "time find / > /dev/null"
> >
> > Also its best to avoid these calls where possible, since in this series,
> > the RCU code in tracepoint.h seems to be call these quite a bit and I'd
> > like to keep this overhead low.
>
> Can we assume that the "flags" has only 1 bit irq-disable flag?
> Since it skips calling raw_local_irq_restore(flags); too,

I don't know how many it impacts on performance but maybe we can have
an arch-specific config option something like below?


> if there is any state in the flags on any arch, it may change the
> result. In that case, we can do it as below (just skipping trace_hardirqs_*)
>
> int disabled = irqs_disabled();

if (disabled == raw_irqs_disabled_flags(flags)) {
#ifndef CONFIG_ARCH_CAN_SKIP_NESTED_IRQ_RESTORE
raw_local_irq_restore(flags);
#endif
return;
}

>
> if (!raw_irqs_disabled_flags(flags) && disabled)
> trace_hardirqs_on();
>
> raw_local_irq_restore(flags);
>
> if (raw_irqs_disabled_flags(flags) && !disabled)
> trace_hardirqs_off();

Thanks,
Namhyung

2018-04-20 07:08:47

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

Hi,

Thanks Matsami and Namhyung for the suggestions!

On Wed, Apr 18, 2018 at 10:43 PM, Namhyung Kim <[email protected]> wrote:
> On Wed, Apr 18, 2018 at 06:02:50PM +0900, Masami Hiramatsu wrote:
>> On Mon, 16 Apr 2018 21:07:47 -0700
>> Joel Fernandes <[email protected]> wrote:
>>
>> > With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
>> > to if local_irq_restore or local_irq_save didn't actually do anything.
>> >
>> > This gives around a 4% improvement in performance when doing the
>> > following command: "time find / > /dev/null"
>> >
>> > Also its best to avoid these calls where possible, since in this series,
>> > the RCU code in tracepoint.h seems to be call these quite a bit and I'd
>> > like to keep this overhead low.
>>
>> Can we assume that the "flags" has only 1 bit irq-disable flag?
>> Since it skips calling raw_local_irq_restore(flags); too,
>
> I don't know how many it impacts on performance but maybe we can have
> an arch-specific config option something like below?

The flags restoration I am hoping is "cheap" but I haven't measured
specifically the cost of this though.

>
>
>> if there is any state in the flags on any arch, it may change the
>> result. In that case, we can do it as below (just skipping trace_hardirqs_*)
>>
>> int disabled = irqs_disabled();
>
> if (disabled == raw_irqs_disabled_flags(flags)) {
> #ifndef CONFIG_ARCH_CAN_SKIP_NESTED_IRQ_RESTORE
> raw_local_irq_restore(flags);
> #endif
> return;
> }

Hmm, somehow I feel this part should be written generically enough
that it applies to all architectures (as a first step).

>
>>
>> if (!raw_irqs_disabled_flags(flags) && disabled)
>> trace_hardirqs_on();
>>
>> raw_local_irq_restore(flags);
>>
>> if (raw_irqs_disabled_flags(flags) && !disabled)
>> trace_hardirqs_off();

I like this idea since its a good thing to do the flag restoration
just to be safe and preserve the current behaviors. Also my goal was
to reduce the trace_ calls in this series, so its probably better I
just do as you're suggesting. I will do some experiments and make the
changes for the next series.

thanks,

- Joel

2018-04-23 01:16:51

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Fri, Apr 20, 2018 at 12:07 AM, Joel Fernandes <[email protected]> wrote:
> Hi,
>
> Thanks Matsami and Namhyung for the suggestions!
>
> On Wed, Apr 18, 2018 at 10:43 PM, Namhyung Kim <[email protected]> wrote:
>> On Wed, Apr 18, 2018 at 06:02:50PM +0900, Masami Hiramatsu wrote:
>>> On Mon, 16 Apr 2018 21:07:47 -0700
>>> Joel Fernandes <[email protected]> wrote:
>>>
>>> > With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
>>> > to if local_irq_restore or local_irq_save didn't actually do anything.
>>> >
>>> > This gives around a 4% improvement in performance when doing the
>>> > following command: "time find / > /dev/null"
>>> >
>>> > Also its best to avoid these calls where possible, since in this series,
>>> > the RCU code in tracepoint.h seems to be call these quite a bit and I'd
>>> > like to keep this overhead low.
>>>
>>> Can we assume that the "flags" has only 1 bit irq-disable flag?
>>> Since it skips calling raw_local_irq_restore(flags); too,
>>
>> I don't know how many it impacts on performance but maybe we can have
>> an arch-specific config option something like below?
>
> The flags restoration I am hoping is "cheap" but I haven't measured
> specifically the cost of this though.
>
>>
>>
>>> if there is any state in the flags on any arch, it may change the
>>> result. In that case, we can do it as below (just skipping trace_hardirqs_*)
>>>
>>> int disabled = irqs_disabled();
>>
>> if (disabled == raw_irqs_disabled_flags(flags)) {
>> #ifndef CONFIG_ARCH_CAN_SKIP_NESTED_IRQ_RESTORE
>> raw_local_irq_restore(flags);
>> #endif
>> return;
>> }
>
> Hmm, somehow I feel this part should be written generically enough
> that it applies to all architectures (as a first step).
>
>>
>>>
>>> if (!raw_irqs_disabled_flags(flags) && disabled)
>>> trace_hardirqs_on();
>>>
>>> raw_local_irq_restore(flags);
>>>
>>> if (raw_irqs_disabled_flags(flags) && !disabled)
>>> trace_hardirqs_off();
>
> I like this idea since its a good thing to do the flag restoration
> just to be safe and preserve the current behaviors. Also my goal was
> to reduce the trace_ calls in this series, so its probably better I
> just do as you're suggesting. I will do some experiments and make the
> changes for the next series.

So about performance of this series..

lockdep hooking into tracepoint code is a bit heavy, compared to
without this series. That's because of the design approach of
IRQ on/off -> Trace point -> lockdep

Versus without this series which does
IRQ on/off -> lockdep

So we lose performance because of that.

This particular patch improves the situation, as such so this
particular patch is probably good to merge once we can test
performance of Matsami's suggestion as well.

However, patch 4/4 which makes lockdep use the tracepoint causes a
performance hit of around 8% of mean time when I run:
hackbench -g 4 -f 2 -l 30000

I narrowed the performance hit down to the call to
rcu_irq_enter_irqson() and rcu_irq_exit_irqson() in __DO_TRACE.
Commenting these 2 functions brings the perf level back.

I was thinking about RCU usage here, and really we never change this
particular performance-sensitive tracepoint's function table 99.9% of
the time, so it seems there's quite in a win if we just had another
read-mostly synchronization mechanism that doesn't do all the RCU
tracking that's currently done here and such a mechanism can be
simpler..

If I understand correctly, RCU also adds other complications such as
that it can't be used from the idle path, that's why the
rcu_irq_enter_* was added in the first place. Would be nice if we can
just avoid these RCU calls for the preempt/irq tracepoints... Any
thoughts about this or any other ideas to solve this?

Meanwhile I'll also do some performance testing with Matsami's idea as well..

thanks,

- Joel

2018-04-23 03:19:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Sun, Apr 22, 2018 at 06:14:18PM -0700, Joel Fernandes wrote:
> On Fri, Apr 20, 2018 at 12:07 AM, Joel Fernandes <[email protected]> wrote:
> > Hi,
> >
> > Thanks Matsami and Namhyung for the suggestions!
> >
> > On Wed, Apr 18, 2018 at 10:43 PM, Namhyung Kim <[email protected]> wrote:
> >> On Wed, Apr 18, 2018 at 06:02:50PM +0900, Masami Hiramatsu wrote:
> >>> On Mon, 16 Apr 2018 21:07:47 -0700
> >>> Joel Fernandes <[email protected]> wrote:
> >>>
> >>> > With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
> >>> > to if local_irq_restore or local_irq_save didn't actually do anything.
> >>> >
> >>> > This gives around a 4% improvement in performance when doing the
> >>> > following command: "time find / > /dev/null"
> >>> >
> >>> > Also its best to avoid these calls where possible, since in this series,
> >>> > the RCU code in tracepoint.h seems to be call these quite a bit and I'd
> >>> > like to keep this overhead low.
> >>>
> >>> Can we assume that the "flags" has only 1 bit irq-disable flag?
> >>> Since it skips calling raw_local_irq_restore(flags); too,
> >>
> >> I don't know how many it impacts on performance but maybe we can have
> >> an arch-specific config option something like below?
> >
> > The flags restoration I am hoping is "cheap" but I haven't measured
> > specifically the cost of this though.
> >
> >>
> >>
> >>> if there is any state in the flags on any arch, it may change the
> >>> result. In that case, we can do it as below (just skipping trace_hardirqs_*)
> >>>
> >>> int disabled = irqs_disabled();
> >>
> >> if (disabled == raw_irqs_disabled_flags(flags)) {
> >> #ifndef CONFIG_ARCH_CAN_SKIP_NESTED_IRQ_RESTORE
> >> raw_local_irq_restore(flags);
> >> #endif
> >> return;
> >> }
> >
> > Hmm, somehow I feel this part should be written generically enough
> > that it applies to all architectures (as a first step).
> >
> >>
> >>>
> >>> if (!raw_irqs_disabled_flags(flags) && disabled)
> >>> trace_hardirqs_on();
> >>>
> >>> raw_local_irq_restore(flags);
> >>>
> >>> if (raw_irqs_disabled_flags(flags) && !disabled)
> >>> trace_hardirqs_off();
> >
> > I like this idea since its a good thing to do the flag restoration
> > just to be safe and preserve the current behaviors. Also my goal was
> > to reduce the trace_ calls in this series, so its probably better I
> > just do as you're suggesting. I will do some experiments and make the
> > changes for the next series.
>
> So about performance of this series..
>
> lockdep hooking into tracepoint code is a bit heavy, compared to
> without this series. That's because of the design approach of
> IRQ on/off -> Trace point -> lockdep
>
> Versus without this series which does
> IRQ on/off -> lockdep
>
> So we lose performance because of that.
>
> This particular patch improves the situation, as such so this
> particular patch is probably good to merge once we can test
> performance of Matsami's suggestion as well.
>
> However, patch 4/4 which makes lockdep use the tracepoint causes a
> performance hit of around 8% of mean time when I run:
> hackbench -g 4 -f 2 -l 30000
>
> I narrowed the performance hit down to the call to
> rcu_irq_enter_irqson() and rcu_irq_exit_irqson() in __DO_TRACE.
> Commenting these 2 functions brings the perf level back.
>
> I was thinking about RCU usage here, and really we never change this
> particular performance-sensitive tracepoint's function table 99.9% of
> the time, so it seems there's quite in a win if we just had another
> read-mostly synchronization mechanism that doesn't do all the RCU
> tracking that's currently done here and such a mechanism can be
> simpler..
>
> If I understand correctly, RCU also adds other complications such as
> that it can't be used from the idle path, that's why the
> rcu_irq_enter_* was added in the first place. Would be nice if we can
> just avoid these RCU calls for the preempt/irq tracepoints... Any
> thoughts about this or any other ideas to solve this?

In theory, the tracepoint code could use SRCU instead of RCU, given that
SRCU readers can be in the idle loop, although at the expense of a couple
of smp_mb() calls in each tracepoint. In practice, I must defer to the
people who know the tracepoint code better than I.

Thanx, Paul

> Meanwhile I'll also do some performance testing with Matsami's idea as well..
>
> thanks,
>
> - Joel
>


2018-04-23 14:33:21

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 22, 2018, at 11:19 PM, Paul E. McKenney [email protected] wrote:

> On Sun, Apr 22, 2018 at 06:14:18PM -0700, Joel Fernandes wrote:
>> On Fri, Apr 20, 2018 at 12:07 AM, Joel Fernandes <[email protected]> wrote:
>> > Hi,
>> >
>> > Thanks Matsami and Namhyung for the suggestions!
>> >
>> > On Wed, Apr 18, 2018 at 10:43 PM, Namhyung Kim <[email protected]> wrote:
>> >> On Wed, Apr 18, 2018 at 06:02:50PM +0900, Masami Hiramatsu wrote:
>> >>> On Mon, 16 Apr 2018 21:07:47 -0700
>> >>> Joel Fernandes <[email protected]> wrote:
>> >>>
>> >>> > With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
>> >>> > to if local_irq_restore or local_irq_save didn't actually do anything.
>> >>> >
>> >>> > This gives around a 4% improvement in performance when doing the
>> >>> > following command: "time find / > /dev/null"
>> >>> >
>> >>> > Also its best to avoid these calls where possible, since in this series,
>> >>> > the RCU code in tracepoint.h seems to be call these quite a bit and I'd
>> >>> > like to keep this overhead low.
>> >>>
>> >>> Can we assume that the "flags" has only 1 bit irq-disable flag?
>> >>> Since it skips calling raw_local_irq_restore(flags); too,
>> >>
>> >> I don't know how many it impacts on performance but maybe we can have
>> >> an arch-specific config option something like below?
>> >
>> > The flags restoration I am hoping is "cheap" but I haven't measured
>> > specifically the cost of this though.
>> >
>> >>
>> >>
>> >>> if there is any state in the flags on any arch, it may change the
>> >>> result. In that case, we can do it as below (just skipping trace_hardirqs_*)
>> >>>
>> >>> int disabled = irqs_disabled();
>> >>
>> >> if (disabled == raw_irqs_disabled_flags(flags)) {
>> >> #ifndef CONFIG_ARCH_CAN_SKIP_NESTED_IRQ_RESTORE
>> >> raw_local_irq_restore(flags);
>> >> #endif
>> >> return;
>> >> }
>> >
>> > Hmm, somehow I feel this part should be written generically enough
>> > that it applies to all architectures (as a first step).
>> >
>> >>
>> >>>
>> >>> if (!raw_irqs_disabled_flags(flags) && disabled)
>> >>> trace_hardirqs_on();
>> >>>
>> >>> raw_local_irq_restore(flags);
>> >>>
>> >>> if (raw_irqs_disabled_flags(flags) && !disabled)
>> >>> trace_hardirqs_off();
>> >
>> > I like this idea since its a good thing to do the flag restoration
>> > just to be safe and preserve the current behaviors. Also my goal was
>> > to reduce the trace_ calls in this series, so its probably better I
>> > just do as you're suggesting. I will do some experiments and make the
>> > changes for the next series.
>>
>> So about performance of this series..
>>
>> lockdep hooking into tracepoint code is a bit heavy, compared to
>> without this series. That's because of the design approach of
>> IRQ on/off -> Trace point -> lockdep
>>
>> Versus without this series which does
>> IRQ on/off -> lockdep
>>
>> So we lose performance because of that.
>>
>> This particular patch improves the situation, as such so this
>> particular patch is probably good to merge once we can test
>> performance of Matsami's suggestion as well.
>>
>> However, patch 4/4 which makes lockdep use the tracepoint causes a
>> performance hit of around 8% of mean time when I run:
>> hackbench -g 4 -f 2 -l 30000
>>
>> I narrowed the performance hit down to the call to
>> rcu_irq_enter_irqson() and rcu_irq_exit_irqson() in __DO_TRACE.
>> Commenting these 2 functions brings the perf level back.
>>
>> I was thinking about RCU usage here, and really we never change this
>> particular performance-sensitive tracepoint's function table 99.9% of
>> the time, so it seems there's quite in a win if we just had another
>> read-mostly synchronization mechanism that doesn't do all the RCU
>> tracking that's currently done here and such a mechanism can be
>> simpler..
>>
>> If I understand correctly, RCU also adds other complications such as
>> that it can't be used from the idle path, that's why the
>> rcu_irq_enter_* was added in the first place. Would be nice if we can
>> just avoid these RCU calls for the preempt/irq tracepoints... Any
>> thoughts about this or any other ideas to solve this?
>
> In theory, the tracepoint code could use SRCU instead of RCU, given that
> SRCU readers can be in the idle loop, although at the expense of a couple
> of smp_mb() calls in each tracepoint. In practice, I must defer to the
> people who know the tracepoint code better than I.

I've been wanting to introduce an alternative tracepoint instrumentation
"flavor" for e.g. system call entry/exit which rely on SRCU rather than
sched-rcu (preempt-off). This would allow taking faults within the instrumentation
probe, which makes lots of things easier when fetching data from user-space
upon system call entry/exit. This could also be used to cleanly instrument
the idle loop.

I would be tempted to proceed carefully and introduce a new kind of SRCU
tracepoint rather than changing all existing ones from sched-rcu to SRCU
though.

So the lockdep stuff could use the SRCU tracepoint flavor, which I guess
would be faster than the rcu_irq_enter_*().

Thanks,

Mathieu


>
> Thanx, Paul
>
>> Meanwhile I'll also do some performance testing with Matsami's idea as well..
>>
>> thanks,
>>
>> - Joel

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-04-23 14:55:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Mon, 23 Apr 2018 10:31:28 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> I've been wanting to introduce an alternative tracepoint instrumentation
> "flavor" for e.g. system call entry/exit which rely on SRCU rather than
> sched-rcu (preempt-off). This would allow taking faults within the instrumentation
> probe, which makes lots of things easier when fetching data from user-space
> upon system call entry/exit. This could also be used to cleanly instrument
> the idle loop.

I'd be OK with such an approach. And I don't think it would be that
hard to implement. It could be similar to the rcu_idle() tracepoints,
where each flavor simply passes in what protection it uses for
DO_TRACE(). We could do linker tricks to tell the tracepoint.c code how
the tracepoint is protected (add section code, that could be read to
update flags in the tracepoint). Of course modules that have
tracepoints could only use the standard preempt ones.

That is, if trace_##event##_srcu(trace_##event##_sp, PARAMS), is used,
then the trace_##event##_sp would need to be created somewhere. The use
of trace_##event##_srcu() would create a section entry, and on boot up
we can see that the use of this tracepoint requires srcu protection
with a pointer to the trace_##event##_sp srcu_struct. This could be
used to make sure that trace_#event() call isn't done multiple times
that uses two different protection flavors.

I'm just brain storming the idea, and I'm sure I screwed up something
above, but I do believe it is feasible.

>
> I would be tempted to proceed carefully and introduce a new kind of SRCU
> tracepoint rather than changing all existing ones from sched-rcu to SRCU
> though.

Agreed.

>
> So the lockdep stuff could use the SRCU tracepoint flavor, which I guess
> would be faster than the rcu_irq_enter_*().

Also agree.

-- Steve

2018-04-23 15:03:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 23, 2018, at 10:53 AM, rostedt [email protected] wrote:

> On Mon, 23 Apr 2018 10:31:28 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> I've been wanting to introduce an alternative tracepoint instrumentation
>> "flavor" for e.g. system call entry/exit which rely on SRCU rather than
>> sched-rcu (preempt-off). This would allow taking faults within the
>> instrumentation
>> probe, which makes lots of things easier when fetching data from user-space
>> upon system call entry/exit. This could also be used to cleanly instrument
>> the idle loop.
>
> I'd be OK with such an approach. And I don't think it would be that
> hard to implement. It could be similar to the rcu_idle() tracepoints,
> where each flavor simply passes in what protection it uses for
> DO_TRACE(). We could do linker tricks to tell the tracepoint.c code how
> the tracepoint is protected (add section code, that could be read to
> update flags in the tracepoint). Of course modules that have
> tracepoints could only use the standard preempt ones.
>
> That is, if trace_##event##_srcu(trace_##event##_sp, PARAMS), is used,
> then the trace_##event##_sp would need to be created somewhere. The use
> of trace_##event##_srcu() would create a section entry, and on boot up
> we can see that the use of this tracepoint requires srcu protection
> with a pointer to the trace_##event##_sp srcu_struct. This could be
> used to make sure that trace_#event() call isn't done multiple times
> that uses two different protection flavors.
>
> I'm just brain storming the idea, and I'm sure I screwed up something
> above, but I do believe it is feasible.

The main open question here is whether we want one SRCU grace period
domain per SRCU tracepoint definition, or just one SRCU domain for all
SRCU tracepoints would be fine.

I'm not sure what we would gain by having the extra granularity provided
by one SRCU grace period domain per tracepoint, and having a single SRCU
domain for all SRCU tracepoints makes it easy to batch grace period after
bulk tracepoint modifications.

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-04-23 15:13:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Mon, Apr 23, 2018 at 10:59:43AM -0400, Mathieu Desnoyers wrote:
> ----- On Apr 23, 2018, at 10:53 AM, rostedt [email protected] wrote:
>
> > On Mon, 23 Apr 2018 10:31:28 -0400 (EDT)
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> >> I've been wanting to introduce an alternative tracepoint instrumentation
> >> "flavor" for e.g. system call entry/exit which rely on SRCU rather than
> >> sched-rcu (preempt-off). This would allow taking faults within the
> >> instrumentation
> >> probe, which makes lots of things easier when fetching data from user-space
> >> upon system call entry/exit. This could also be used to cleanly instrument
> >> the idle loop.
> >
> > I'd be OK with such an approach. And I don't think it would be that
> > hard to implement. It could be similar to the rcu_idle() tracepoints,
> > where each flavor simply passes in what protection it uses for
> > DO_TRACE(). We could do linker tricks to tell the tracepoint.c code how
> > the tracepoint is protected (add section code, that could be read to
> > update flags in the tracepoint). Of course modules that have
> > tracepoints could only use the standard preempt ones.
> >
> > That is, if trace_##event##_srcu(trace_##event##_sp, PARAMS), is used,
> > then the trace_##event##_sp would need to be created somewhere. The use
> > of trace_##event##_srcu() would create a section entry, and on boot up
> > we can see that the use of this tracepoint requires srcu protection
> > with a pointer to the trace_##event##_sp srcu_struct. This could be
> > used to make sure that trace_#event() call isn't done multiple times
> > that uses two different protection flavors.
> >
> > I'm just brain storming the idea, and I'm sure I screwed up something
> > above, but I do believe it is feasible.
>
> The main open question here is whether we want one SRCU grace period
> domain per SRCU tracepoint definition, or just one SRCU domain for all
> SRCU tracepoints would be fine.
>
> I'm not sure what we would gain by having the extra granularity provided
> by one SRCU grace period domain per tracepoint, and having a single SRCU
> domain for all SRCU tracepoints makes it easy to batch grace period after
> bulk tracepoint modifications.

I don't see how having multiple SRCU domains would help anything, but
perhaps I am missing something basic.

thanx, Paul


2018-04-23 15:52:24

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

Hi,

On Mon, Apr 23, 2018 at 7:53 AM, Steven Rostedt <[email protected]> wrote:
> On Mon, 23 Apr 2018 10:31:28 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
[..]
>> I would be tempted to proceed carefully and introduce a new kind of SRCU
>> tracepoint rather than changing all existing ones from sched-rcu to SRCU
>> though.
>
> Agreed.
>
>>
>> So the lockdep stuff could use the SRCU tracepoint flavor, which I guess
>> would be faster than the rcu_irq_enter_*().
>
> Also agree.

I also agree about careful proceeding with separate SRCU flavor and
using that for few places first. Perhaps, if it works well once we can
benchmark it then we can slowly convert the _rcuidle API users to
this, since it seems from my testing those will benefit the most.

thanks,

- Joel

2018-04-23 16:19:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Mon, 23 Apr 2018 10:59:43 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> The main open question here is whether we want one SRCU grace period
> domain per SRCU tracepoint definition, or just one SRCU domain for all
> SRCU tracepoints would be fine.
>
> I'm not sure what we would gain by having the extra granularity provided
> by one SRCU grace period domain per tracepoint, and having a single SRCU
> domain for all SRCU tracepoints makes it easy to batch grace period after
> bulk tracepoint modifications.
>
> Thoughts ?

I didn't think too much depth in this. It was more of just a brain
storming idea. Yeah, one singe RCU domain may be good enough. I was
thinking more of how to know when a tracepoint required the SRCU domain
as supposed to a preempt disabled domain, and wanted to just suggest
the linker script approach.

This is how I detect if trace_printk() is used anywhere in the kernel
(and do that big warning if it is). That way the trace events don't
need to be created any special way. You just use the trace_##event##_X
flavor and it automatically detects what to do. But we need to make
sure the same event isn't used for multiple flavors (SRCU vs schedule),
or maybe we can, and any change would have to do both synchronizations.

-- Steve

2018-04-23 17:14:13

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 23, 2018, at 12:18 PM, rostedt [email protected] wrote:

> On Mon, 23 Apr 2018 10:59:43 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> The main open question here is whether we want one SRCU grace period
>> domain per SRCU tracepoint definition, or just one SRCU domain for all
>> SRCU tracepoints would be fine.
>>
>> I'm not sure what we would gain by having the extra granularity provided
>> by one SRCU grace period domain per tracepoint, and having a single SRCU
>> domain for all SRCU tracepoints makes it easy to batch grace period after
>> bulk tracepoint modifications.
>>
>> Thoughts ?
>
> I didn't think too much depth in this. It was more of just a brain
> storming idea. Yeah, one singe RCU domain may be good enough. I was
> thinking more of how to know when a tracepoint required the SRCU domain
> as supposed to a preempt disabled domain, and wanted to just suggest
> the linker script approach.
>
> This is how I detect if trace_printk() is used anywhere in the kernel
> (and do that big warning if it is). That way the trace events don't
> need to be created any special way. You just use the trace_##event##_X
> flavor and it automatically detects what to do. But we need to make
> sure the same event isn't used for multiple flavors (SRCU vs schedule),
> or maybe we can, and any change would have to do both synchronizations.

The approach I used for synchronize rcu a few years ago when I did a srcu
tracepoint prototype [1] was simply this:

static inline void tracepoint_synchronize_unregister(void)
{
+ synchronize_srcu(&tracepoint_srcu);
synchronize_sched();
}

So whenever we synchronize after tracepoint unregistration, the tracepoint
code always issue both synchronize_sched() and SRCU synchronize. This way,
tracepoint API users don't have to care about the kind of tracepoint they
are updating.

I'm inclined to explicitly declare the tracepoints with their given
synchronization method. Tracepoint probe callback functions for currently
existing tracepoints expect to have preemption disabled when invoked.
This assumption will not be true anymore for srcu-tracepoints.

Thanks,

Mathieu

[1] https://github.com/compudj/linux-dev/commits/tracepoint-srcu

>
> -- Steve

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-04-23 17:25:35

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

Hi Mathieu,

On Mon, Apr 23, 2018 at 10:12 AM, Mathieu Desnoyers
<[email protected]> wrote:
> ----- On Apr 23, 2018, at 12:18 PM, rostedt [email protected] wrote:
>
>> On Mon, 23 Apr 2018 10:59:43 -0400 (EDT)
>> Mathieu Desnoyers <[email protected]> wrote:
>>
>>> The main open question here is whether we want one SRCU grace period
>>> domain per SRCU tracepoint definition, or just one SRCU domain for all
>>> SRCU tracepoints would be fine.
>>>
>>> I'm not sure what we would gain by having the extra granularity provided
>>> by one SRCU grace period domain per tracepoint, and having a single SRCU
>>> domain for all SRCU tracepoints makes it easy to batch grace period after
>>> bulk tracepoint modifications.
>>>
>>> Thoughts ?
>>
>> I didn't think too much depth in this. It was more of just a brain
>> storming idea. Yeah, one singe RCU domain may be good enough. I was
>> thinking more of how to know when a tracepoint required the SRCU domain
>> as supposed to a preempt disabled domain, and wanted to just suggest
>> the linker script approach.
>>
>> This is how I detect if trace_printk() is used anywhere in the kernel
>> (and do that big warning if it is). That way the trace events don't
>> need to be created any special way. You just use the trace_##event##_X
>> flavor and it automatically detects what to do. But we need to make
>> sure the same event isn't used for multiple flavors (SRCU vs schedule),
>> or maybe we can, and any change would have to do both synchronizations.
>
> The approach I used for synchronize rcu a few years ago when I did a srcu
> tracepoint prototype [1] was simply this:
>
> static inline void tracepoint_synchronize_unregister(void)
> {
> + synchronize_srcu(&tracepoint_srcu);
> synchronize_sched();
> }
>
> So whenever we synchronize after tracepoint unregistration, the tracepoint
> code always issue both synchronize_sched() and SRCU synchronize. This way,
> tracepoint API users don't have to care about the kind of tracepoint they
> are updating.
>
> I'm inclined to explicitly declare the tracepoints with their given
> synchronization method. Tracepoint probe callback functions for currently
> existing tracepoints expect to have preemption disabled when invoked.
> This assumption will not be true anymore for srcu-tracepoints.

Nice thing about your patch is it defines at declaration time that its
an srcu-tracepoint. The API users don't need care about which
synchronization method to use which will probably prevent bugs like,
if someone forget to use the _rcuidle variable for a tracepoint or
something.

I carved out some of my time to test this patch today :-)

thanks,

- Joel

2018-04-23 21:24:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:


> I'm inclined to explicitly declare the tracepoints with their given
> synchronization method. Tracepoint probe callback functions for currently
> existing tracepoints expect to have preemption disabled when invoked.
> This assumption will not be true anymore for srcu-tracepoints.

Actually, why not have a flag attached to the tracepoint_func that
states if it expects preemption to be enabled or not? If a
trace_##event##_srcu() is called, then simply disable preemption before
calling the callbacks for it. That way if a callback is fine for use
with srcu, then it would require calling

register_trace_##event##_may_sleep();

Then if someone uses this on a tracepoint where preemption is disabled,
we simply do not call it.

-- Steve

2018-04-24 15:57:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>
> > I'm inclined to explicitly declare the tracepoints with their given
> > synchronization method. Tracepoint probe callback functions for currently
> > existing tracepoints expect to have preemption disabled when invoked.
> > This assumption will not be true anymore for srcu-tracepoints.
>
> Actually, why not have a flag attached to the tracepoint_func that
> states if it expects preemption to be enabled or not? If a
> trace_##event##_srcu() is called, then simply disable preemption before
> calling the callbacks for it. That way if a callback is fine for use
> with srcu, then it would require calling
>
> register_trace_##event##_may_sleep();
>
> Then if someone uses this on a tracepoint where preemption is disabled,
> we simply do not call it.

One more stupid question... If we are having to trace so much stuff
in the idle loop, are we perhaps grossly overstating the extent of that
"idle" loop? For being called "idle", this code seems quite busy!

Thanx, Paul


2018-04-24 16:03:28

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
<[email protected]> wrote:
> On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
>> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
>> Mathieu Desnoyers <[email protected]> wrote:
>>
>>
>> > I'm inclined to explicitly declare the tracepoints with their given
>> > synchronization method. Tracepoint probe callback functions for currently
>> > existing tracepoints expect to have preemption disabled when invoked.
>> > This assumption will not be true anymore for srcu-tracepoints.
>>
>> Actually, why not have a flag attached to the tracepoint_func that
>> states if it expects preemption to be enabled or not? If a
>> trace_##event##_srcu() is called, then simply disable preemption before
>> calling the callbacks for it. That way if a callback is fine for use
>> with srcu, then it would require calling
>>
>> register_trace_##event##_may_sleep();
>>
>> Then if someone uses this on a tracepoint where preemption is disabled,
>> we simply do not call it.
>
> One more stupid question... If we are having to trace so much stuff
> in the idle loop, are we perhaps grossly overstating the extent of that
> "idle" loop? For being called "idle", this code seems quite busy!

;-)
The performance hit I am observing is when running a heavy workload,
like hackbench or something like that. That's what I am trying to
correct.
By the way is there any limitation on using SRCU too early during
boot? I backported Mathieu's srcu tracepoint patches but the kernel
hangs pretty early in the boot. I register lockdep probes in
start_kernel. I am hoping that's not why.

I could also have just screwed up the backporting... may be for my
testing, I will just replace the rcu API with the srcu instead of all
of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
do right now is measure the performance of my patches with SRCU.

thanks,

- Joel

2018-04-24 17:27:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
> On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
> <[email protected]> wrote:
> > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
> >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
> >> Mathieu Desnoyers <[email protected]> wrote:
> >>
> >>
> >> > I'm inclined to explicitly declare the tracepoints with their given
> >> > synchronization method. Tracepoint probe callback functions for currently
> >> > existing tracepoints expect to have preemption disabled when invoked.
> >> > This assumption will not be true anymore for srcu-tracepoints.
> >>
> >> Actually, why not have a flag attached to the tracepoint_func that
> >> states if it expects preemption to be enabled or not? If a
> >> trace_##event##_srcu() is called, then simply disable preemption before
> >> calling the callbacks for it. That way if a callback is fine for use
> >> with srcu, then it would require calling
> >>
> >> register_trace_##event##_may_sleep();
> >>
> >> Then if someone uses this on a tracepoint where preemption is disabled,
> >> we simply do not call it.
> >
> > One more stupid question... If we are having to trace so much stuff
> > in the idle loop, are we perhaps grossly overstating the extent of that
> > "idle" loop? For being called "idle", this code seems quite busy!
>
> ;-)
> The performance hit I am observing is when running a heavy workload,
> like hackbench or something like that. That's what I am trying to
> correct.
> By the way is there any limitation on using SRCU too early during
> boot? I backported Mathieu's srcu tracepoint patches but the kernel
> hangs pretty early in the boot. I register lockdep probes in
> start_kernel. I am hoping that's not why.
>
> I could also have just screwed up the backporting... may be for my
> testing, I will just replace the rcu API with the srcu instead of all
> of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
> do right now is measure the performance of my patches with SRCU.

Gah, yes, there is an entry on my capacious todo list on making SRCU
grace periods work during early boot and mid-boot. Let me see what
I can do...

Thanx, Paul


2018-04-24 18:25:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
> > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
> > <[email protected]> wrote:
> > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
> > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
> > >> Mathieu Desnoyers <[email protected]> wrote:
> > >>
> > >>
> > >> > I'm inclined to explicitly declare the tracepoints with their given
> > >> > synchronization method. Tracepoint probe callback functions for currently
> > >> > existing tracepoints expect to have preemption disabled when invoked.
> > >> > This assumption will not be true anymore for srcu-tracepoints.
> > >>
> > >> Actually, why not have a flag attached to the tracepoint_func that
> > >> states if it expects preemption to be enabled or not? If a
> > >> trace_##event##_srcu() is called, then simply disable preemption before
> > >> calling the callbacks for it. That way if a callback is fine for use
> > >> with srcu, then it would require calling
> > >>
> > >> register_trace_##event##_may_sleep();
> > >>
> > >> Then if someone uses this on a tracepoint where preemption is disabled,
> > >> we simply do not call it.
> > >
> > > One more stupid question... If we are having to trace so much stuff
> > > in the idle loop, are we perhaps grossly overstating the extent of that
> > > "idle" loop? For being called "idle", this code seems quite busy!
> >
> > ;-)
> > The performance hit I am observing is when running a heavy workload,
> > like hackbench or something like that. That's what I am trying to
> > correct.
> > By the way is there any limitation on using SRCU too early during
> > boot? I backported Mathieu's srcu tracepoint patches but the kernel
> > hangs pretty early in the boot. I register lockdep probes in
> > start_kernel. I am hoping that's not why.
> >
> > I could also have just screwed up the backporting... may be for my
> > testing, I will just replace the rcu API with the srcu instead of all
> > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
> > do right now is measure the performance of my patches with SRCU.
>
> Gah, yes, there is an entry on my capacious todo list on making SRCU
> grace periods work during early boot and mid-boot. Let me see what
> I can do...

OK, just need to verify that you are OK with call_srcu()'s callbacks
not being invoked until sometime during core_initcall() time. (If you
really do need them to be invoked before that, in theory it is possible,
but in practice it is weird, even for RCU.)

Thanx, Paul


2018-04-24 18:27:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
> > > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
> > > <[email protected]> wrote:
> > > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
> > > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
> > > >> Mathieu Desnoyers <[email protected]> wrote:
> > > >>
> > > >>
> > > >> > I'm inclined to explicitly declare the tracepoints with their given
> > > >> > synchronization method. Tracepoint probe callback functions for currently
> > > >> > existing tracepoints expect to have preemption disabled when invoked.
> > > >> > This assumption will not be true anymore for srcu-tracepoints.
> > > >>
> > > >> Actually, why not have a flag attached to the tracepoint_func that
> > > >> states if it expects preemption to be enabled or not? If a
> > > >> trace_##event##_srcu() is called, then simply disable preemption before
> > > >> calling the callbacks for it. That way if a callback is fine for use
> > > >> with srcu, then it would require calling
> > > >>
> > > >> register_trace_##event##_may_sleep();
> > > >>
> > > >> Then if someone uses this on a tracepoint where preemption is disabled,
> > > >> we simply do not call it.
> > > >
> > > > One more stupid question... If we are having to trace so much stuff
> > > > in the idle loop, are we perhaps grossly overstating the extent of that
> > > > "idle" loop? For being called "idle", this code seems quite busy!
> > >
> > > ;-)
> > > The performance hit I am observing is when running a heavy workload,
> > > like hackbench or something like that. That's what I am trying to
> > > correct.
> > > By the way is there any limitation on using SRCU too early during
> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
> > > hangs pretty early in the boot. I register lockdep probes in
> > > start_kernel. I am hoping that's not why.
> > >
> > > I could also have just screwed up the backporting... may be for my
> > > testing, I will just replace the rcu API with the srcu instead of all
> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
> > > do right now is measure the performance of my patches with SRCU.
> >
> > Gah, yes, there is an entry on my capacious todo list on making SRCU
> > grace periods work during early boot and mid-boot. Let me see what
> > I can do...
>
> OK, just need to verify that you are OK with call_srcu()'s callbacks
> not being invoked until sometime during core_initcall() time. (If you
> really do need them to be invoked before that, in theory it is possible,
> but in practice it is weird, even for RCU.)

Oh, and that early at boot, you will need to use DEFINE_SRCU() or
DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.

Thanx, Paul


2018-04-24 19:01:02

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

Hi Paul,

On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
<[email protected]> wrote:
> On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
>> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
>> > On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
>> > > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
>> > > <[email protected]> wrote:
>> > > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
>> > > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
>> > > >> Mathieu Desnoyers <[email protected]> wrote:
>> > > >>
>> > > >>
>> > > >> > I'm inclined to explicitly declare the tracepoints with their given
>> > > >> > synchronization method. Tracepoint probe callback functions for currently
>> > > >> > existing tracepoints expect to have preemption disabled when invoked.
>> > > >> > This assumption will not be true anymore for srcu-tracepoints.
>> > > >>
>> > > >> Actually, why not have a flag attached to the tracepoint_func that
>> > > >> states if it expects preemption to be enabled or not? If a
>> > > >> trace_##event##_srcu() is called, then simply disable preemption before
>> > > >> calling the callbacks for it. That way if a callback is fine for use
>> > > >> with srcu, then it would require calling
>> > > >>
>> > > >> register_trace_##event##_may_sleep();
>> > > >>
>> > > >> Then if someone uses this on a tracepoint where preemption is disabled,
>> > > >> we simply do not call it.
>> > > >
>> > > > One more stupid question... If we are having to trace so much stuff
>> > > > in the idle loop, are we perhaps grossly overstating the extent of that
>> > > > "idle" loop? For being called "idle", this code seems quite busy!
>> > >
>> > > ;-)
>> > > The performance hit I am observing is when running a heavy workload,
>> > > like hackbench or something like that. That's what I am trying to
>> > > correct.
>> > > By the way is there any limitation on using SRCU too early during
>> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
>> > > hangs pretty early in the boot. I register lockdep probes in
>> > > start_kernel. I am hoping that's not why.
>> > >
>> > > I could also have just screwed up the backporting... may be for my
>> > > testing, I will just replace the rcu API with the srcu instead of all
>> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
>> > > do right now is measure the performance of my patches with SRCU.
>> >
>> > Gah, yes, there is an entry on my capacious todo list on making SRCU
>> > grace periods work during early boot and mid-boot. Let me see what
>> > I can do...
>>
>> OK, just need to verify that you are OK with call_srcu()'s callbacks
>> not being invoked until sometime during core_initcall() time. (If you
>> really do need them to be invoked before that, in theory it is possible,
>> but in practice it is weird, even for RCU.)
>
> Oh, and that early at boot, you will need to use DEFINE_SRCU() or
> DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>
> Thanx, Paul
>

Oh ok.

About call_rcu, calling it later may be an issue since we register the
probes in start_kernel, for the first probe call_rcu will be sched,
but for the second one I think it'll try to call_rcu to get rid of the
first one.

This is the relevant code that gets called when probes are added:

static inline void release_probes(struct tracepoint_func *old)
{
if (old) {
struct tp_probes *tp_probes = container_of(old,
struct tp_probes, probes[0]);
call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
}
}

Maybe we can somehow defer the call_srcu until later? Would that be possible?

also Mathieu, you didn't modify the call_rcu_sched in your prototype
to be changed to use call_srcu, should you be doing that?

thanks,

- Joel

2018-04-24 19:02:44

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 11:59 AM, Joel Fernandes <[email protected]> wrote:
[...]
>>> > > By the way is there any limitation on using SRCU too early during
>>> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
>>> > > hangs pretty early in the boot. I register lockdep probes in
>>> > > start_kernel. I am hoping that's not why.
>>> > >
>>> > > I could also have just screwed up the backporting... may be for my
>>> > > testing, I will just replace the rcu API with the srcu instead of all
>>> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
>>> > > do right now is measure the performance of my patches with SRCU.
>>> >
>>> > Gah, yes, there is an entry on my capacious todo list on making SRCU
>>> > grace periods work during early boot and mid-boot. Let me see what
>>> > I can do...
>>>
>>> OK, just need to verify that you are OK with call_srcu()'s callbacks
>>> not being invoked until sometime during core_initcall() time. (If you
>>> really do need them to be invoked before that, in theory it is possible,
>>> but in practice it is weird, even for RCU.)
>>
>> Oh, and that early at boot, you will need to use DEFINE_SRCU() or
>> DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>>
>> Thanx, Paul
>>
>
> Oh ok.
>
> About call_rcu, calling it later may be an issue since we register the
> probes in start_kernel, for the first probe call_rcu will be sched,

Sorry, I need to start proof reading my emails. s/sched/skipped/

thanks,

- Joel

2018-04-24 19:10:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 11:59:32AM -0700, Joel Fernandes wrote:
> Hi Paul,
>
> On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
> <[email protected]> wrote:
> > On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
> >> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
> >> > On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
> >> > > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
> >> > > <[email protected]> wrote:
> >> > > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
> >> > > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
> >> > > >> Mathieu Desnoyers <[email protected]> wrote:
> >> > > >>
> >> > > >>
> >> > > >> > I'm inclined to explicitly declare the tracepoints with their given
> >> > > >> > synchronization method. Tracepoint probe callback functions for currently
> >> > > >> > existing tracepoints expect to have preemption disabled when invoked.
> >> > > >> > This assumption will not be true anymore for srcu-tracepoints.
> >> > > >>
> >> > > >> Actually, why not have a flag attached to the tracepoint_func that
> >> > > >> states if it expects preemption to be enabled or not? If a
> >> > > >> trace_##event##_srcu() is called, then simply disable preemption before
> >> > > >> calling the callbacks for it. That way if a callback is fine for use
> >> > > >> with srcu, then it would require calling
> >> > > >>
> >> > > >> register_trace_##event##_may_sleep();
> >> > > >>
> >> > > >> Then if someone uses this on a tracepoint where preemption is disabled,
> >> > > >> we simply do not call it.
> >> > > >
> >> > > > One more stupid question... If we are having to trace so much stuff
> >> > > > in the idle loop, are we perhaps grossly overstating the extent of that
> >> > > > "idle" loop? For being called "idle", this code seems quite busy!
> >> > >
> >> > > ;-)
> >> > > The performance hit I am observing is when running a heavy workload,
> >> > > like hackbench or something like that. That's what I am trying to
> >> > > correct.
> >> > > By the way is there any limitation on using SRCU too early during
> >> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
> >> > > hangs pretty early in the boot. I register lockdep probes in
> >> > > start_kernel. I am hoping that's not why.
> >> > >
> >> > > I could also have just screwed up the backporting... may be for my
> >> > > testing, I will just replace the rcu API with the srcu instead of all
> >> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
> >> > > do right now is measure the performance of my patches with SRCU.
> >> >
> >> > Gah, yes, there is an entry on my capacious todo list on making SRCU
> >> > grace periods work during early boot and mid-boot. Let me see what
> >> > I can do...
> >>
> >> OK, just need to verify that you are OK with call_srcu()'s callbacks
> >> not being invoked until sometime during core_initcall() time. (If you
> >> really do need them to be invoked before that, in theory it is possible,
> >> but in practice it is weird, even for RCU.)
> >
> > Oh, and that early at boot, you will need to use DEFINE_SRCU() or
> > DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>
> Oh ok.
>
> About call_rcu, calling it later may be an issue since we register the
> probes in start_kernel, for the first probe call_rcu will be sched,
> but for the second one I think it'll try to call_rcu to get rid of the
> first one.
>
> This is the relevant code that gets called when probes are added:
>
> static inline void release_probes(struct tracepoint_func *old)
> {
> if (old) {
> struct tp_probes *tp_probes = container_of(old,
> struct tp_probes, probes[0]);
> call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
> }
> }
>
> Maybe we can somehow defer the call_srcu until later? Would that be possible?

You will be able to invoke call_srcu() early if you wish, it is just that
the specified SRCU callback won't be invoked until core_initcall() time.

Thanx, Paul

> also Mathieu, you didn't modify the call_rcu_sched in your prototype
> to be changed to use call_srcu, should you be doing that?
>
> thanks,
>
> - Joel
>


2018-04-24 19:18:02

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 12:09 PM, Paul E. McKenney
<[email protected]> wrote:
> On Tue, Apr 24, 2018 at 11:59:32AM -0700, Joel Fernandes wrote:
>> Hi Paul,
>>
>> On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
>> >> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
>> >> > On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
>> >> > > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
>> >> > > <[email protected]> wrote:
>> >> > > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
>> >> > > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
>> >> > > >> Mathieu Desnoyers <[email protected]> wrote:
>> >> > > >>
>> >> > > >>
>> >> > > >> > I'm inclined to explicitly declare the tracepoints with their given
>> >> > > >> > synchronization method. Tracepoint probe callback functions for currently
>> >> > > >> > existing tracepoints expect to have preemption disabled when invoked.
>> >> > > >> > This assumption will not be true anymore for srcu-tracepoints.
>> >> > > >>
>> >> > > >> Actually, why not have a flag attached to the tracepoint_func that
>> >> > > >> states if it expects preemption to be enabled or not? If a
>> >> > > >> trace_##event##_srcu() is called, then simply disable preemption before
>> >> > > >> calling the callbacks for it. That way if a callback is fine for use
>> >> > > >> with srcu, then it would require calling
>> >> > > >>
>> >> > > >> register_trace_##event##_may_sleep();
>> >> > > >>
>> >> > > >> Then if someone uses this on a tracepoint where preemption is disabled,
>> >> > > >> we simply do not call it.
>> >> > > >
>> >> > > > One more stupid question... If we are having to trace so much stuff
>> >> > > > in the idle loop, are we perhaps grossly overstating the extent of that
>> >> > > > "idle" loop? For being called "idle", this code seems quite busy!
>> >> > >
>> >> > > ;-)
>> >> > > The performance hit I am observing is when running a heavy workload,
>> >> > > like hackbench or something like that. That's what I am trying to
>> >> > > correct.
>> >> > > By the way is there any limitation on using SRCU too early during
>> >> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
>> >> > > hangs pretty early in the boot. I register lockdep probes in
>> >> > > start_kernel. I am hoping that's not why.
>> >> > >
>> >> > > I could also have just screwed up the backporting... may be for my
>> >> > > testing, I will just replace the rcu API with the srcu instead of all
>> >> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
>> >> > > do right now is measure the performance of my patches with SRCU.
>> >> >
>> >> > Gah, yes, there is an entry on my capacious todo list on making SRCU
>> >> > grace periods work during early boot and mid-boot. Let me see what
>> >> > I can do...
>> >>
>> >> OK, just need to verify that you are OK with call_srcu()'s callbacks
>> >> not being invoked until sometime during core_initcall() time. (If you
>> >> really do need them to be invoked before that, in theory it is possible,
>> >> but in practice it is weird, even for RCU.)
>> >
>> > Oh, and that early at boot, you will need to use DEFINE_SRCU() or
>> > DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>>
>> Oh ok.
>>
>> About call_rcu, calling it later may be an issue since we register the
>> probes in start_kernel, for the first probe call_rcu will be sched,
>> but for the second one I think it'll try to call_rcu to get rid of the
>> first one.
>>
>> This is the relevant code that gets called when probes are added:
>>
>> static inline void release_probes(struct tracepoint_func *old)
>> {
>> if (old) {
>> struct tp_probes *tp_probes = container_of(old,
>> struct tp_probes, probes[0]);
>> call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
>> }
>> }
>>
>> Maybe we can somehow defer the call_srcu until later? Would that be possible?
>
> You will be able to invoke call_srcu() early if you wish, it is just that
> the specified SRCU callback won't be invoked until core_initcall() time.
>

Yes, that should be fine then.

Also I think I see no issue with static initialization and allocation
as you were suggesting in your earlier email so that's also Ok.

Let me know when you have anything I can test, thanks a lot. I'll
pause my testing of srcu for now then and focus on the other parts of
my patchset.

thanks,

- Joel

2018-04-24 23:22:39

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 24, 2018, at 2:59 PM, Joel Fernandes [email protected] wrote:

> Hi Paul,
>
> On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
> <[email protected]> wrote:
>> On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
>>> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
>>> > On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
>>> > > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
>>> > > <[email protected]> wrote:
>>> > > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
>>> > > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
>>> > > >> Mathieu Desnoyers <[email protected]> wrote:
>>> > > >>
>>> > > >>
>>> > > >> > I'm inclined to explicitly declare the tracepoints with their given
>>> > > >> > synchronization method. Tracepoint probe callback functions for currently
>>> > > >> > existing tracepoints expect to have preemption disabled when invoked.
>>> > > >> > This assumption will not be true anymore for srcu-tracepoints.
>>> > > >>
>>> > > >> Actually, why not have a flag attached to the tracepoint_func that
>>> > > >> states if it expects preemption to be enabled or not? If a
>>> > > >> trace_##event##_srcu() is called, then simply disable preemption before
>>> > > >> calling the callbacks for it. That way if a callback is fine for use
>>> > > >> with srcu, then it would require calling
>>> > > >>
>>> > > >> register_trace_##event##_may_sleep();
>>> > > >>
>>> > > >> Then if someone uses this on a tracepoint where preemption is disabled,
>>> > > >> we simply do not call it.
>>> > > >
>>> > > > One more stupid question... If we are having to trace so much stuff
>>> > > > in the idle loop, are we perhaps grossly overstating the extent of that
>>> > > > "idle" loop? For being called "idle", this code seems quite busy!
>>> > >
>>> > > ;-)
>>> > > The performance hit I am observing is when running a heavy workload,
>>> > > like hackbench or something like that. That's what I am trying to
>>> > > correct.
>>> > > By the way is there any limitation on using SRCU too early during
>>> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
>>> > > hangs pretty early in the boot. I register lockdep probes in
>>> > > start_kernel. I am hoping that's not why.
>>> > >
>>> > > I could also have just screwed up the backporting... may be for my
>>> > > testing, I will just replace the rcu API with the srcu instead of all
>>> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
>>> > > do right now is measure the performance of my patches with SRCU.
>>> >
>>> > Gah, yes, there is an entry on my capacious todo list on making SRCU
>>> > grace periods work during early boot and mid-boot. Let me see what
>>> > I can do...
>>>
>>> OK, just need to verify that you are OK with call_srcu()'s callbacks
>>> not being invoked until sometime during core_initcall() time. (If you
>>> really do need them to be invoked before that, in theory it is possible,
>>> but in practice it is weird, even for RCU.)
>>
>> Oh, and that early at boot, you will need to use DEFINE_SRCU() or
>> DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>>
>> Thanx, Paul
>>
>
> Oh ok.
>
> About call_rcu, calling it later may be an issue since we register the
> probes in start_kernel, for the first probe call_rcu will be sched,
> but for the second one I think it'll try to call_rcu to get rid of the
> first one.
>
> This is the relevant code that gets called when probes are added:
>
> static inline void release_probes(struct tracepoint_func *old)
> {
> if (old) {
> struct tp_probes *tp_probes = container_of(old,
> struct tp_probes, probes[0]);
> call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
> }
> }
>
> Maybe we can somehow defer the call_srcu until later? Would that be possible?
>
> also Mathieu, you didn't modify the call_rcu_sched in your prototype
> to be changed to use call_srcu, should you be doing that?

You're right, I think I should have introduced a call_srcu in there.
It's missing in my prototype.

However, in the prototype I did, we need to wait for *both* sched-rcu
and SRCU grace periods, because we don't track which site is using which
rcu flavor.

So you could achieve this relatively easily by means of two chained
RCU callbacks, e.g.:

release_probes() calls call_rcu_sched(... , rcu_free_old_probes)

and then in rcu_free_old_probes() do:

call_srcu(... , srcu_free_old_probes)

and perform kfree(container_of(head, struct tp_probes, rcu));
within srcu_free_old_probes.

It is somewhat a hack, but should work.

Thanks,

Mathieu

>
> thanks,
>
> - Joel

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-04-24 23:48:28

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can



On 04/24/2018 04:21 PM, Mathieu Desnoyers wrote:
> ----- On Apr 24, 2018, at 2:59 PM, Joel Fernandes [email protected] wrote:
>> On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
>> <[email protected]> wrote:
>>> On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
>>>> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
>>>>> On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
>>>>>> On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
>>>>>> <[email protected]> wrote:
>>>>>>> On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
>>>>>>>> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
>>>>>>>> Mathieu Desnoyers <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'm inclined to explicitly declare the tracepoints with their given
>>>>>>>>> synchronization method. Tracepoint probe callback functions for currently
>>>>>>>>> existing tracepoints expect to have preemption disabled when invoked.
>>>>>>>>> This assumption will not be true anymore for srcu-tracepoints.
>>>>>>>>
>>>>>>>> Actually, why not have a flag attached to the tracepoint_func that
>>>>>>>> states if it expects preemption to be enabled or not? If a
>>>>>>>> trace_##event##_srcu() is called, then simply disable preemption before
>>>>>>>> calling the callbacks for it. That way if a callback is fine for use
>>>>>>>> with srcu, then it would require calling
>>>>>>>>
>>>>>>>> register_trace_##event##_may_sleep();
>>>>>>>>
>>>>>>>> Then if someone uses this on a tracepoint where preemption is disabled,
>>>>>>>> we simply do not call it.
>>>>>>>
>>>>>>> One more stupid question... If we are having to trace so much stuff
>>>>>>> in the idle loop, are we perhaps grossly overstating the extent of that
>>>>>>> "idle" loop? For being called "idle", this code seems quite busy!
>>>>>>
>>>>>> ;-)
>>>>>> The performance hit I am observing is when running a heavy workload,
>>>>>> like hackbench or something like that. That's what I am trying to
>>>>>> correct.
>>>>>> By the way is there any limitation on using SRCU too early during
>>>>>> boot? I backported Mathieu's srcu tracepoint patches but the kernel
>>>>>> hangs pretty early in the boot. I register lockdep probes in
>>>>>> start_kernel. I am hoping that's not why.
>>>>>>
>>>>>> I could also have just screwed up the backporting... may be for my
>>>>>> testing, I will just replace the rcu API with the srcu instead of all
>>>>>> of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
>>>>>> do right now is measure the performance of my patches with SRCU.
>>>>>
>>>>> Gah, yes, there is an entry on my capacious todo list on making SRCU
>>>>> grace periods work during early boot and mid-boot. Let me see what
>>>>> I can do...
>>>>
>>>> OK, just need to verify that you are OK with call_srcu()'s callbacks
>>>> not being invoked until sometime during core_initcall() time. (If you
>>>> really do need them to be invoked before that, in theory it is possible,
>>>> but in practice it is weird, even for RCU.)
>>>
>>> Oh, and that early at boot, you will need to use DEFINE_SRCU() or
>>> DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>>>
>>> Thanx, Paul
>>>
>>
>> Oh ok.
>>
>> About call_rcu, calling it later may be an issue since we register the
>> probes in start_kernel, for the first probe call_rcu will be sched,
>> but for the second one I think it'll try to call_rcu to get rid of the
>> first one.
>>
>> This is the relevant code that gets called when probes are added:
>>
>> static inline void release_probes(struct tracepoint_func *old)
>> {
>> if (old) {
>> struct tp_probes *tp_probes = container_of(old,
>> struct tp_probes, probes[0]);
>> call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
>> }
>> }
>>
>> Maybe we can somehow defer the call_srcu until later? Would that be possible?
>>
>> also Mathieu, you didn't modify the call_rcu_sched in your prototype
>> to be changed to use call_srcu, should you be doing that?
>
> You're right, I think I should have introduced a call_srcu in there.
> It's missing in my prototype.
>
> However, in the prototype I did, we need to wait for *both* sched-rcu
> and SRCU grace periods, because we don't track which site is using which
> rcu flavor.
>
> So you could achieve this relatively easily by means of two chained
> RCU callbacks, e.g.:
>
> release_probes() calls call_rcu_sched(... , rcu_free_old_probes)
>
> and then in rcu_free_old_probes() do:
>
> call_srcu(... , srcu_free_old_probes)
>
> and perform kfree(container_of(head, struct tp_probes, rcu));
> within srcu_free_old_probes.
>
> It is somewhat a hack, but should work.

Sounds good, thanks.

Also I found the reason for my boot issue. It was because the
init_srcu_struct in the prototype was being done in an initcall.
Instead if I do it in start_kernel before the tracepoint is used, it
fixes it (although I don't know if this is dangerous to do like this but
I can get it to boot atleast.. Let me know if this isn't the right way
to do it, or if something else could go wrong)

diff --git a/init/main.c b/init/main.c
index 34823072ef9e..ecc88319c6da 100644
--- a/init/main.c
+++ b/init/main.c
@@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
early_boot_irqs_disabled = false;

+ init_srcu_struct(&tracepoint_srcu);
lockdep_init_early();

local_irq_enable();
--

I benchmarked it and the performance also looks quite good compared to
the rcu tracepoint version.

If you, Paul and other think doing the init_srcu_struct like this should
be Ok, then I can try to work more on your srcu prototype and roll into
my series and post them in the next RFC series (or let me know if you
wanted to work your srcu stuff in a separate series..).

thanks,

- Joel


2018-04-25 00:11:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 04:46:54PM -0700, Joel Fernandes wrote:
>
>
> On 04/24/2018 04:21 PM, Mathieu Desnoyers wrote:
> >----- On Apr 24, 2018, at 2:59 PM, Joel Fernandes [email protected] wrote:
> >>On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
> >><[email protected]> wrote:
> >>>On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
> >>>>On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
> >>>>>On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
> >>>>>>On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
> >>>>>><[email protected]> wrote:
> >>>>>>>On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
> >>>>>>>>On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
> >>>>>>>>Mathieu Desnoyers <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>I'm inclined to explicitly declare the tracepoints with their given
> >>>>>>>>>synchronization method. Tracepoint probe callback functions for currently
> >>>>>>>>>existing tracepoints expect to have preemption disabled when invoked.
> >>>>>>>>>This assumption will not be true anymore for srcu-tracepoints.
> >>>>>>>>
> >>>>>>>>Actually, why not have a flag attached to the tracepoint_func that
> >>>>>>>>states if it expects preemption to be enabled or not? If a
> >>>>>>>>trace_##event##_srcu() is called, then simply disable preemption before
> >>>>>>>>calling the callbacks for it. That way if a callback is fine for use
> >>>>>>>>with srcu, then it would require calling
> >>>>>>>>
> >>>>>>>> register_trace_##event##_may_sleep();
> >>>>>>>>
> >>>>>>>>Then if someone uses this on a tracepoint where preemption is disabled,
> >>>>>>>>we simply do not call it.
> >>>>>>>
> >>>>>>>One more stupid question... If we are having to trace so much stuff
> >>>>>>>in the idle loop, are we perhaps grossly overstating the extent of that
> >>>>>>>"idle" loop? For being called "idle", this code seems quite busy!
> >>>>>>
> >>>>>>;-)
> >>>>>>The performance hit I am observing is when running a heavy workload,
> >>>>>>like hackbench or something like that. That's what I am trying to
> >>>>>>correct.
> >>>>>>By the way is there any limitation on using SRCU too early during
> >>>>>>boot? I backported Mathieu's srcu tracepoint patches but the kernel
> >>>>>>hangs pretty early in the boot. I register lockdep probes in
> >>>>>>start_kernel. I am hoping that's not why.
> >>>>>>
> >>>>>>I could also have just screwed up the backporting... may be for my
> >>>>>>testing, I will just replace the rcu API with the srcu instead of all
> >>>>>>of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
> >>>>>>do right now is measure the performance of my patches with SRCU.
> >>>>>
> >>>>>Gah, yes, there is an entry on my capacious todo list on making SRCU
> >>>>>grace periods work during early boot and mid-boot. Let me see what
> >>>>>I can do...
> >>>>
> >>>>OK, just need to verify that you are OK with call_srcu()'s callbacks
> >>>>not being invoked until sometime during core_initcall() time. (If you
> >>>>really do need them to be invoked before that, in theory it is possible,
> >>>>but in practice it is weird, even for RCU.)
> >>>
> >>>Oh, and that early at boot, you will need to use DEFINE_SRCU() or
> >>>DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
> >>>
> >>> Thanx, Paul
> >>>
> >>
> >>Oh ok.
> >>
> >>About call_rcu, calling it later may be an issue since we register the
> >>probes in start_kernel, for the first probe call_rcu will be sched,
> >>but for the second one I think it'll try to call_rcu to get rid of the
> >>first one.
> >>
> >>This is the relevant code that gets called when probes are added:
> >>
> >>static inline void release_probes(struct tracepoint_func *old)
> >>{
> >> if (old) {
> >> struct tp_probes *tp_probes = container_of(old,
> >> struct tp_probes, probes[0]);
> >> call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
> >> }
> >>}
> >>
> >>Maybe we can somehow defer the call_srcu until later? Would that be possible?
> >>
> >>also Mathieu, you didn't modify the call_rcu_sched in your prototype
> >>to be changed to use call_srcu, should you be doing that?
> >
> >You're right, I think I should have introduced a call_srcu in there.
> >It's missing in my prototype.
> >
> >However, in the prototype I did, we need to wait for *both* sched-rcu
> >and SRCU grace periods, because we don't track which site is using which
> >rcu flavor.
> >
> >So you could achieve this relatively easily by means of two chained
> >RCU callbacks, e.g.:
> >
> >release_probes() calls call_rcu_sched(... , rcu_free_old_probes)
> >
> >and then in rcu_free_old_probes() do:
> >
> >call_srcu(... , srcu_free_old_probes)
> >
> >and perform kfree(container_of(head, struct tp_probes, rcu));
> >within srcu_free_old_probes.
> >
> >It is somewhat a hack, but should work.
>
> Sounds good, thanks.
>
> Also I found the reason for my boot issue. It was because the
> init_srcu_struct in the prototype was being done in an initcall.
> Instead if I do it in start_kernel before the tracepoint is used, it
> fixes it (although I don't know if this is dangerous to do like this
> but I can get it to boot atleast.. Let me know if this isn't the
> right way to do it, or if something else could go wrong)
>
> diff --git a/init/main.c b/init/main.c
> index 34823072ef9e..ecc88319c6da 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
> WARN(!irqs_disabled(), "Interrupts were enabled early\n");
> early_boot_irqs_disabled = false;
>
> + init_srcu_struct(&tracepoint_srcu);
> lockdep_init_early();
>
> local_irq_enable();
> --
>
> I benchmarked it and the performance also looks quite good compared
> to the rcu tracepoint version.
>
> If you, Paul and other think doing the init_srcu_struct like this
> should be Ok, then I can try to work more on your srcu prototype and
> roll into my series and post them in the next RFC series (or let me
> know if you wanted to work your srcu stuff in a separate series..).

That is definitely not what I was expecting, but let's see if it works
anyway... ;-)

But first, I was instead expecting something like this:

DEFINE_SRCU(tracepoint_srcu);

With this approach, some of the initialization happens at compile time
and the rest happens at the first call_srcu().

This will work -only- if the first call_srcu() doesn't happen until after
workqueue_init_early() has been invoked. Which I believe must have been
the case in your testing, because otherwise it looks like __call_srcu()
would have complained bitterly.

On the other hand, if you need to invoke call_srcu() before the call
to workqueue_init_early(), then you need the patch that I am beating
into shape. Plus you would need to use DEFINE_SRCU() and to avoid
invoking init_srcu_struct().

Thanx, Paul


2018-04-25 04:21:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 05:10:49PM -0700, Paul E. McKenney wrote:
> On Tue, Apr 24, 2018 at 04:46:54PM -0700, Joel Fernandes wrote:
> >
> >
> > On 04/24/2018 04:21 PM, Mathieu Desnoyers wrote:
> > >----- On Apr 24, 2018, at 2:59 PM, Joel Fernandes [email protected] wrote:
> > >>On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
> > >><[email protected]> wrote:
> > >>>On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
> > >>>>On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
> > >>>>>On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
> > >>>>>>On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
> > >>>>>><[email protected]> wrote:
> > >>>>>>>On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
> > >>>>>>>>On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
> > >>>>>>>>Mathieu Desnoyers <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>>I'm inclined to explicitly declare the tracepoints with their given
> > >>>>>>>>>synchronization method. Tracepoint probe callback functions for currently
> > >>>>>>>>>existing tracepoints expect to have preemption disabled when invoked.
> > >>>>>>>>>This assumption will not be true anymore for srcu-tracepoints.
> > >>>>>>>>
> > >>>>>>>>Actually, why not have a flag attached to the tracepoint_func that
> > >>>>>>>>states if it expects preemption to be enabled or not? If a
> > >>>>>>>>trace_##event##_srcu() is called, then simply disable preemption before
> > >>>>>>>>calling the callbacks for it. That way if a callback is fine for use
> > >>>>>>>>with srcu, then it would require calling
> > >>>>>>>>
> > >>>>>>>> register_trace_##event##_may_sleep();
> > >>>>>>>>
> > >>>>>>>>Then if someone uses this on a tracepoint where preemption is disabled,
> > >>>>>>>>we simply do not call it.
> > >>>>>>>
> > >>>>>>>One more stupid question... If we are having to trace so much stuff
> > >>>>>>>in the idle loop, are we perhaps grossly overstating the extent of that
> > >>>>>>>"idle" loop? For being called "idle", this code seems quite busy!
> > >>>>>>
> > >>>>>>;-)
> > >>>>>>The performance hit I am observing is when running a heavy workload,
> > >>>>>>like hackbench or something like that. That's what I am trying to
> > >>>>>>correct.
> > >>>>>>By the way is there any limitation on using SRCU too early during
> > >>>>>>boot? I backported Mathieu's srcu tracepoint patches but the kernel
> > >>>>>>hangs pretty early in the boot. I register lockdep probes in
> > >>>>>>start_kernel. I am hoping that's not why.
> > >>>>>>
> > >>>>>>I could also have just screwed up the backporting... may be for my
> > >>>>>>testing, I will just replace the rcu API with the srcu instead of all
> > >>>>>>of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
> > >>>>>>do right now is measure the performance of my patches with SRCU.
> > >>>>>
> > >>>>>Gah, yes, there is an entry on my capacious todo list on making SRCU
> > >>>>>grace periods work during early boot and mid-boot. Let me see what
> > >>>>>I can do...
> > >>>>
> > >>>>OK, just need to verify that you are OK with call_srcu()'s callbacks
> > >>>>not being invoked until sometime during core_initcall() time. (If you
> > >>>>really do need them to be invoked before that, in theory it is possible,
> > >>>>but in practice it is weird, even for RCU.)
> > >>>
> > >>>Oh, and that early at boot, you will need to use DEFINE_SRCU() or
> > >>>DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
> > >>>
> > >>> Thanx, Paul
> > >>>
> > >>
> > >>Oh ok.
> > >>
> > >>About call_rcu, calling it later may be an issue since we register the
> > >>probes in start_kernel, for the first probe call_rcu will be sched,
> > >>but for the second one I think it'll try to call_rcu to get rid of the
> > >>first one.
> > >>
> > >>This is the relevant code that gets called when probes are added:
> > >>
> > >>static inline void release_probes(struct tracepoint_func *old)
> > >>{
> > >> if (old) {
> > >> struct tp_probes *tp_probes = container_of(old,
> > >> struct tp_probes, probes[0]);
> > >> call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
> > >> }
> > >>}
> > >>
> > >>Maybe we can somehow defer the call_srcu until later? Would that be possible?
> > >>
> > >>also Mathieu, you didn't modify the call_rcu_sched in your prototype
> > >>to be changed to use call_srcu, should you be doing that?
> > >
> > >You're right, I think I should have introduced a call_srcu in there.
> > >It's missing in my prototype.
> > >
> > >However, in the prototype I did, we need to wait for *both* sched-rcu
> > >and SRCU grace periods, because we don't track which site is using which
> > >rcu flavor.
> > >
> > >So you could achieve this relatively easily by means of two chained
> > >RCU callbacks, e.g.:
> > >
> > >release_probes() calls call_rcu_sched(... , rcu_free_old_probes)
> > >
> > >and then in rcu_free_old_probes() do:
> > >
> > >call_srcu(... , srcu_free_old_probes)
> > >
> > >and perform kfree(container_of(head, struct tp_probes, rcu));
> > >within srcu_free_old_probes.
> > >
> > >It is somewhat a hack, but should work.
> >
> > Sounds good, thanks.
> >
> > Also I found the reason for my boot issue. It was because the
> > init_srcu_struct in the prototype was being done in an initcall.
> > Instead if I do it in start_kernel before the tracepoint is used, it
> > fixes it (although I don't know if this is dangerous to do like this
> > but I can get it to boot atleast.. Let me know if this isn't the
> > right way to do it, or if something else could go wrong)
> >
> > diff --git a/init/main.c b/init/main.c
> > index 34823072ef9e..ecc88319c6da 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
> > early_boot_irqs_disabled = false;
> >
> > + init_srcu_struct(&tracepoint_srcu);
> > lockdep_init_early();
> >
> > local_irq_enable();
> > --
> >
> > I benchmarked it and the performance also looks quite good compared
> > to the rcu tracepoint version.
> >
> > If you, Paul and other think doing the init_srcu_struct like this
> > should be Ok, then I can try to work more on your srcu prototype and
> > roll into my series and post them in the next RFC series (or let me
> > know if you wanted to work your srcu stuff in a separate series..).
>
> That is definitely not what I was expecting, but let's see if it works
> anyway... ;-)
>
> But first, I was instead expecting something like this:
>
> DEFINE_SRCU(tracepoint_srcu);
>
> With this approach, some of the initialization happens at compile time
> and the rest happens at the first call_srcu().
>
> This will work -only- if the first call_srcu() doesn't happen until after
> workqueue_init_early() has been invoked. Which I believe must have been
> the case in your testing, because otherwise it looks like __call_srcu()
> would have complained bitterly.
>
> On the other hand, if you need to invoke call_srcu() before the call
> to workqueue_init_early(), then you need the patch that I am beating
> into shape. Plus you would need to use DEFINE_SRCU() and to avoid
> invoking init_srcu_struct().

And here is the patch. I do not intend to send it upstream unless it
actually proves necessary, and it appears that current SRCU does what
you need.

You would only need this patch if you wanted to invoke call_srcu()
before workqueue_init_early() was called, which does not seem likely.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 11fc28ecdb6d..c24eb9c1656c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4047,6 +4047,9 @@
expediting. Set to zero to disable automatic
expediting.

+ srcutree.srcu_self_test [KNL]
+ Run the SRCU early boot self tests.
+
stack_guard_gap= [MM]
override the default stack gap protection. The value
is in page units and it defines how many pages prior
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 4eda108abee0..b3b94c082da0 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -94,6 +94,7 @@ struct srcu_struct {
/* callback for the barrier */
/* operation. */
struct delayed_work work;
+ struct list_head srcu_boot; /* Boottime call_srcu()? */
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
@@ -109,6 +110,7 @@ struct srcu_struct {
.sda = &name##_srcu_data, \
.lock = __SPIN_LOCK_UNLOCKED(name.lock), \
.srcu_gp_seq_needed = 0 - 1, \
+ .srcu_boot = LIST_HEAD_INIT(name.srcu_boot), \
__SRCU_DEP_MAP_INIT(name) \
}

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index b4123d7a2cec..ec612895d0ec 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -80,6 +80,9 @@ do { \
#define spin_unlock_irqrestore_rcu_node(p, flags) \
spin_unlock_irqrestore(&ACCESS_PRIVATE(p, lock), flags) \

+LIST_HEAD(srcu_boot_list); /* List and lock for early boot call_srcu(). */
+DEFINE_SPINLOCK(srcu_boot_lock);
+
/*
* Initialize SRCU combining tree. Note that statically allocated
* srcu_struct structures might already have srcu_read_lock() and
@@ -180,8 +183,10 @@ static int init_srcu_struct_fields(struct srcu_struct *sp, bool is_static)
mutex_init(&sp->srcu_barrier_mutex);
atomic_set(&sp->srcu_barrier_cpu_cnt, 0);
INIT_DELAYED_WORK(&sp->work, process_srcu);
- if (!is_static)
+ if (!is_static) {
sp->sda = alloc_percpu(struct srcu_data);
+ sp->srcu_boot.next = NULL; /* No early boot call_srcu()! */
+ }
init_srcu_struct_nodes(sp, is_static);
sp->srcu_gp_seq_needed_exp = 0;
sp->srcu_last_gp_end = ktime_get_mono_fast_ns();
@@ -691,6 +696,21 @@ static void srcu_funnel_gp_start(struct srcu_struct *sp, struct srcu_data *sdp,
sp->srcu_gp_seq_needed_exp = s;

/* If grace period not already done and none in progress, start it. */
+ if (list_empty(&sp->srcu_boot) &&
+ rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
+ spin_unlock_irqrestore_rcu_node(sp, flags);
+ spin_lock_irqsave(&srcu_boot_lock, flags);
+ if (list_empty(&sp->srcu_boot) &&
+ rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
+ /* No early boot call_srcu() on dynamic srcu_struct. */
+ WARN_ON(sp->srcu_boot.next == NULL);
+ list_add(&sp->srcu_boot, &srcu_boot_list);
+ spin_unlock_irqrestore(&srcu_boot_lock, flags);
+ return;
+ }
+ spin_unlock_irqrestore(&srcu_boot_lock, flags);
+ spin_lock_irqsave_rcu_node(sp, flags);
+ }
if (!rcu_seq_done(&sp->srcu_gp_seq, s) &&
rcu_seq_state(sp->srcu_gp_seq) == SRCU_STATE_IDLE) {
WARN_ON_ONCE(ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed));
@@ -823,17 +843,17 @@ static void srcu_leak_callback(struct rcu_head *rhp)
* more than one CPU, this means that when "func()" is invoked, each CPU
* is guaranteed to have executed a full memory barrier since the end of
* its last corresponding SRCU read-side critical section whose beginning
- * preceded the call to call_rcu(). It also means that each CPU executing
+ * preceded the call to call_srcu(). It also means that each CPU executing
* an SRCU read-side critical section that continues beyond the start of
- * "func()" must have executed a memory barrier after the call_rcu()
+ * "func()" must have executed a memory barrier after the call_srcu()
* but before the beginning of that SRCU read-side critical section.
* Note that these guarantees include CPUs that are offline, idle, or
* executing in user mode, as well as CPUs that are executing in the kernel.
*
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * Furthermore, if CPU A invoked call_srcu() and CPU B invoked the
* resulting SRCU callback function "func()", then both CPU A and CPU
* B are guaranteed to execute a full memory barrier during the time
- * interval between the call to call_rcu() and the invocation of "func()".
+ * interval between the call to call_srcu() and the invocation of "func()".
* This guarantee applies even if CPU A and CPU B are the same CPU (but
* again only if the system has more than one CPU).
*
@@ -1301,3 +1321,68 @@ static int __init srcu_bootup_announce(void)
return 0;
}
early_initcall(srcu_bootup_announce);
+
+/*
+ * Support for early boot call_srcu(). The trick is to create a list
+ * of all srcu_struct structures that were subjected to an early boot
+ * call_srcu(), and later in boot post a callback to each structure in
+ * the list to kick off the needed grace periods.
+ *
+ * Note that synchronize_srcu() and synchronize_srcu_expedited()
+ * remain unsupported during the mid-boot dead zone, which starts after
+ * the scheduler is running multiple processes and ends during the
+ * core_initcall() timeframe.
+ */
+
+/*
+ * Post an SRCU callback to each srcu_struct structure that was
+ * subjected to call_srcu() during early boot.
+ */
+void __init srcu_boot_cleanup(void)
+{
+ unsigned long flags;
+ struct srcu_struct *sp;
+
+ spin_lock_irqsave(&srcu_boot_lock, flags);
+ if (list_empty(&srcu_boot_list)) {
+ spin_unlock_irqrestore(&srcu_boot_lock, flags);
+ return; /* No early boot call_srcu(), nothing to do. */
+ }
+ list_for_each_entry(sp, &srcu_boot_list, srcu_boot)
+ srcu_reschedule(sp, 0);
+ spin_unlock_irqrestore(&srcu_boot_lock, flags);
+}
+
+#ifdef CONFIG_PROVE_RCU
+
+static bool srcu_self_test;
+module_param(srcu_self_test, bool, 0444);
+static bool srcu_self_test_happened;
+DEFINE_SRCU(srcu_self_test_struct);
+
+static void srcu_test_callback(struct rcu_head *rhp)
+{
+ srcu_self_test_happened = true;
+ pr_info("SRCU test callback executed\n");
+}
+
+void __init srcu_early_boot_test(void)
+{
+ static struct rcu_head rh;
+
+ if (!srcu_self_test)
+ return;
+ call_srcu(&srcu_self_test_struct, &rh, srcu_test_callback);
+ pr_info("SRCU test callback registered\n");
+}
+
+static int srcu_verify_early_boot_test(void)
+{
+ if (!srcu_self_test)
+ return 0;
+ WARN_ON(!srcu_self_test_happened);
+ return 0;
+}
+late_initcall(srcu_verify_early_boot_test);
+
+#endif /* #ifdef CONFIG_PROVE_RCU */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a38803a0342f..129c02b6d6b0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4129,6 +4129,10 @@ static void __init rcu_dump_rcu_node_tree(struct rcu_state *rsp)
pr_cont("\n");
}

+void __weak __init srcu_early_boot_test(void)
+{
+}
+
struct workqueue_struct *rcu_gp_wq;
struct workqueue_struct *rcu_par_gp_wq;

@@ -4137,6 +4141,7 @@ void __init rcu_init(void)
int cpu;

rcu_early_boot_tests();
+ srcu_early_boot_test();

rcu_bootup_announce();
rcu_init_geometry();
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 4c230a60ece4..810ec36e1449 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -212,6 +212,10 @@ void rcu_test_sync_prims(void)

#if !defined(CONFIG_TINY_RCU) || defined(CONFIG_SRCU)

+void __weak __init srcu_boot_cleanup(void)
+{
+}
+
/*
* Switch to run-time mode once RCU has fully initialized.
*/
@@ -220,6 +224,7 @@ static int __init rcu_set_runtime_mode(void)
rcu_test_sync_prims();
rcu_scheduler_active = RCU_SCHEDULER_RUNNING;
rcu_test_sync_prims();
+ srcu_boot_cleanup();
return 0;
}
core_initcall(rcu_set_runtime_mode);


2018-04-25 21:28:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
<[email protected]> wrote:
[..]
>> >
>> > Sounds good, thanks.
>> >
>> > Also I found the reason for my boot issue. It was because the
>> > init_srcu_struct in the prototype was being done in an initcall.
>> > Instead if I do it in start_kernel before the tracepoint is used, it
>> > fixes it (although I don't know if this is dangerous to do like this
>> > but I can get it to boot atleast.. Let me know if this isn't the
>> > right way to do it, or if something else could go wrong)
>> >
>> > diff --git a/init/main.c b/init/main.c
>> > index 34823072ef9e..ecc88319c6da 100644
>> > --- a/init/main.c
>> > +++ b/init/main.c
>> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
>> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
>> > early_boot_irqs_disabled = false;
>> >
>> > + init_srcu_struct(&tracepoint_srcu);
>> > lockdep_init_early();
>> >
>> > local_irq_enable();
>> > --
>> >
>> > I benchmarked it and the performance also looks quite good compared
>> > to the rcu tracepoint version.
>> >
>> > If you, Paul and other think doing the init_srcu_struct like this
>> > should be Ok, then I can try to work more on your srcu prototype and
>> > roll into my series and post them in the next RFC series (or let me
>> > know if you wanted to work your srcu stuff in a separate series..).
>>
>> That is definitely not what I was expecting, but let's see if it works
>> anyway... ;-)
>>
>> But first, I was instead expecting something like this:
>>
>> DEFINE_SRCU(tracepoint_srcu);
>>
>> With this approach, some of the initialization happens at compile time
>> and the rest happens at the first call_srcu().
>>
>> This will work -only- if the first call_srcu() doesn't happen until after
>> workqueue_init_early() has been invoked. Which I believe must have been
>> the case in your testing, because otherwise it looks like __call_srcu()
>> would have complained bitterly.
>>
>> On the other hand, if you need to invoke call_srcu() before the call
>> to workqueue_init_early(), then you need the patch that I am beating
>> into shape. Plus you would need to use DEFINE_SRCU() and to avoid
>> invoking init_srcu_struct().
>
> And here is the patch. I do not intend to send it upstream unless it
> actually proves necessary, and it appears that current SRCU does what
> you need.
>
> You would only need this patch if you wanted to invoke call_srcu()
> before workqueue_init_early() was called, which does not seem likely.

Cool. So I was chatting with Paul and just to update everyone as well,
I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
can make it past boot too (thanks Paul!). Also I don't see a reason we
need the RCU callback to execute early and its fine if it runs later.

Also, I was thinking of introducing a separate trace_*event*_srcu API
as a replacement to the _rcuidle API. Then I can make use of it for my
tracepoints, and then later can use it for the other tracepoints
needing _rcuidle. After that we can finally get rid of the _rcuidle
API if there are no other users of it. This is just a rough plan, but
let me know if there's any issue with this plan that you can think
off.
IMO, I believe its simpler if the caller worries about whether it can
tolerate if tracepoint probes can block or not, than making it a
property of the tracepoint. That would also simplify the patch to
introduce srcu and keep the tracepoint creation API simple and less
confusing, but let me know if I'm missing something about this.

Thanks,

- Joel

2018-04-25 21:35:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Wed, Apr 25, 2018 at 02:27:08PM -0700, Joel Fernandes wrote:
> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
> <[email protected]> wrote:
> [..]
> >> >
> >> > Sounds good, thanks.
> >> >
> >> > Also I found the reason for my boot issue. It was because the
> >> > init_srcu_struct in the prototype was being done in an initcall.
> >> > Instead if I do it in start_kernel before the tracepoint is used, it
> >> > fixes it (although I don't know if this is dangerous to do like this
> >> > but I can get it to boot atleast.. Let me know if this isn't the
> >> > right way to do it, or if something else could go wrong)
> >> >
> >> > diff --git a/init/main.c b/init/main.c
> >> > index 34823072ef9e..ecc88319c6da 100644
> >> > --- a/init/main.c
> >> > +++ b/init/main.c
> >> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
> >> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
> >> > early_boot_irqs_disabled = false;
> >> >
> >> > + init_srcu_struct(&tracepoint_srcu);
> >> > lockdep_init_early();
> >> >
> >> > local_irq_enable();
> >> > --
> >> >
> >> > I benchmarked it and the performance also looks quite good compared
> >> > to the rcu tracepoint version.
> >> >
> >> > If you, Paul and other think doing the init_srcu_struct like this
> >> > should be Ok, then I can try to work more on your srcu prototype and
> >> > roll into my series and post them in the next RFC series (or let me
> >> > know if you wanted to work your srcu stuff in a separate series..).
> >>
> >> That is definitely not what I was expecting, but let's see if it works
> >> anyway... ;-)
> >>
> >> But first, I was instead expecting something like this:
> >>
> >> DEFINE_SRCU(tracepoint_srcu);
> >>
> >> With this approach, some of the initialization happens at compile time
> >> and the rest happens at the first call_srcu().
> >>
> >> This will work -only- if the first call_srcu() doesn't happen until after
> >> workqueue_init_early() has been invoked. Which I believe must have been
> >> the case in your testing, because otherwise it looks like __call_srcu()
> >> would have complained bitterly.
> >>
> >> On the other hand, if you need to invoke call_srcu() before the call
> >> to workqueue_init_early(), then you need the patch that I am beating
> >> into shape. Plus you would need to use DEFINE_SRCU() and to avoid
> >> invoking init_srcu_struct().
> >
> > And here is the patch. I do not intend to send it upstream unless it
> > actually proves necessary, and it appears that current SRCU does what
> > you need.
> >
> > You would only need this patch if you wanted to invoke call_srcu()
> > before workqueue_init_early() was called, which does not seem likely.
>
> Cool. So I was chatting with Paul and just to update everyone as well,
> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
> can make it past boot too (thanks Paul!). Also I don't see a reason we
> need the RCU callback to execute early and its fine if it runs later.

Very good, thank you!

> Also, I was thinking of introducing a separate trace_*event*_srcu API
> as a replacement to the _rcuidle API. Then I can make use of it for my
> tracepoints, and then later can use it for the other tracepoints
> needing _rcuidle. After that we can finally get rid of the _rcuidle
> API if there are no other users of it. This is just a rough plan, but
> let me know if there's any issue with this plan that you can think
> off.

You mean make _rcuidle use SRCU instead of RCU? Sounds reasonable to me.

> IMO, I believe its simpler if the caller worries about whether it can
> tolerate if tracepoint probes can block or not, than making it a
> property of the tracepoint. That would also simplify the patch to
> introduce srcu and keep the tracepoint creation API simple and less
> confusing, but let me know if I'm missing something about this.

If it helps, you can use synchronize_rcu_mult() to wait for several
different types of RCU grace periods concurrently. Of course,
if it is fast enough to just do a synchronize_rcu() followed by a
synchronize_srcu(), why worry?

Thanx, Paul


2018-04-25 21:42:24

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 25, 2018, at 5:27 PM, Joel Fernandes [email protected] wrote:

> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
> <[email protected]> wrote:
> [..]
>>> >
>>> > Sounds good, thanks.
>>> >
>>> > Also I found the reason for my boot issue. It was because the
>>> > init_srcu_struct in the prototype was being done in an initcall.
>>> > Instead if I do it in start_kernel before the tracepoint is used, it
>>> > fixes it (although I don't know if this is dangerous to do like this
>>> > but I can get it to boot atleast.. Let me know if this isn't the
>>> > right way to do it, or if something else could go wrong)
>>> >
>>> > diff --git a/init/main.c b/init/main.c
>>> > index 34823072ef9e..ecc88319c6da 100644
>>> > --- a/init/main.c
>>> > +++ b/init/main.c
>>> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
>>> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
>>> > early_boot_irqs_disabled = false;
>>> >
>>> > + init_srcu_struct(&tracepoint_srcu);
>>> > lockdep_init_early();
>>> >
>>> > local_irq_enable();
>>> > --
>>> >
>>> > I benchmarked it and the performance also looks quite good compared
>>> > to the rcu tracepoint version.
>>> >
>>> > If you, Paul and other think doing the init_srcu_struct like this
>>> > should be Ok, then I can try to work more on your srcu prototype and
>>> > roll into my series and post them in the next RFC series (or let me
>>> > know if you wanted to work your srcu stuff in a separate series..).
>>>
>>> That is definitely not what I was expecting, but let's see if it works
>>> anyway... ;-)
>>>
>>> But first, I was instead expecting something like this:
>>>
>>> DEFINE_SRCU(tracepoint_srcu);
>>>
>>> With this approach, some of the initialization happens at compile time
>>> and the rest happens at the first call_srcu().
>>>
>>> This will work -only- if the first call_srcu() doesn't happen until after
>>> workqueue_init_early() has been invoked. Which I believe must have been
>>> the case in your testing, because otherwise it looks like __call_srcu()
>>> would have complained bitterly.
>>>
>>> On the other hand, if you need to invoke call_srcu() before the call
>>> to workqueue_init_early(), then you need the patch that I am beating
>>> into shape. Plus you would need to use DEFINE_SRCU() and to avoid
>>> invoking init_srcu_struct().
>>
>> And here is the patch. I do not intend to send it upstream unless it
>> actually proves necessary, and it appears that current SRCU does what
>> you need.
>>
>> You would only need this patch if you wanted to invoke call_srcu()
>> before workqueue_init_early() was called, which does not seem likely.
>
> Cool. So I was chatting with Paul and just to update everyone as well,
> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
> can make it past boot too (thanks Paul!). Also I don't see a reason we
> need the RCU callback to execute early and its fine if it runs later.
>
> Also, I was thinking of introducing a separate trace_*event*_srcu API
> as a replacement to the _rcuidle API. Then I can make use of it for my
> tracepoints, and then later can use it for the other tracepoints
> needing _rcuidle. After that we can finally get rid of the _rcuidle
> API if there are no other users of it. This is just a rough plan, but
> let me know if there's any issue with this plan that you can think
> off.
> IMO, I believe its simpler if the caller worries about whether it can
> tolerate if tracepoint probes can block or not, than making it a
> property of the tracepoint. That would also simplify the patch to
> introduce srcu and keep the tracepoint creation API simple and less
> confusing, but let me know if I'm missing something about this.

One problem with your approach is that you can have multiple callers
for the same tracepoint name, where some could be non-preemptible and
others blocking. Also, there is then no clear way for the callback
registration API to enforce whether the callback expects the tracepoint
to be blocking or non-preemptible. This can introduce hard to diagnose
issues in a kernel without debug options enabled.

Regarding the name, I'm OK with having something along the lines of
trace_*event*_blocking or such. Please don't use "srcu" or other naming
that is explicitly tied to the underlying mechanism used internally
however: what we want to convey is that this specific tracepoint probe
can be preempted and block. The underlying implementation could move to
a different RCU flavor brand in the future, and it should not impact
users of the tracepoint APIs.

In order to ensure that probes that may block only register themselves
to tracepoints that allow blocking, we should introduce new tracepoint
declaration/definition *and* registration APIs also contain the
"BLOCKING/blocking" keywords (or such), so we can ensure that a
tracepoint probe being registered to a "blocking" tracepoint is indeed
allowed to block.

Thanks,

Mathieu


>
> Thanks,
>
> - Joel

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-04-25 22:53:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> One problem with your approach is that you can have multiple callers
> for the same tracepoint name, where some could be non-preemptible and
> others blocking. Also, there is then no clear way for the callback
> registration API to enforce whether the callback expects the tracepoint
> to be blocking or non-preemptible. This can introduce hard to diagnose
> issues in a kernel without debug options enabled.

I agree that it should not be tied to an implementation name. But
"blocking" is confusing. I would say "can_sleep" or some such name that
states that the trace point caller is indeed something that can sleep.

>
> Regarding the name, I'm OK with having something along the lines of
> trace_*event*_blocking or such. Please don't use "srcu" or other naming
> that is explicitly tied to the underlying mechanism used internally
> however: what we want to convey is that this specific tracepoint probe
> can be preempted and block. The underlying implementation could move to
> a different RCU flavor brand in the future, and it should not impact
> users of the tracepoint APIs.
>
> In order to ensure that probes that may block only register themselves
> to tracepoints that allow blocking, we should introduce new tracepoint
> declaration/definition *and* registration APIs also contain the
> "BLOCKING/blocking" keywords (or such), so we can ensure that a
> tracepoint probe being registered to a "blocking" tracepoint is indeed
> allowed to block.

I'd really don't want to add more declaration/definitions, as we
already have too many as is, and with different meanings and the number
is of incarnations is n! in growth.

I'd say we just stick with a trace_<event>_can_sleep() call, and make
sure that if that is used that no trace_<event>() call is also used, and
enforce this with linker or compiler tricks.

-- Steve

2018-04-25 23:15:02

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

Hi Mathieu,

On Wed, Apr 25, 2018 at 2:40 PM, Mathieu Desnoyers
<[email protected]> wrote:
> ----- On Apr 25, 2018, at 5:27 PM, Joel Fernandes [email protected] wrote:
>
>> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> [..]
>>>> >
>>>> > Sounds good, thanks.
>>>> >
>>>> > Also I found the reason for my boot issue. It was because the
>>>> > init_srcu_struct in the prototype was being done in an initcall.
>>>> > Instead if I do it in start_kernel before the tracepoint is used, it
>>>> > fixes it (although I don't know if this is dangerous to do like this
>>>> > but I can get it to boot atleast.. Let me know if this isn't the
>>>> > right way to do it, or if something else could go wrong)
>>>> >
>>>> > diff --git a/init/main.c b/init/main.c
>>>> > index 34823072ef9e..ecc88319c6da 100644
>>>> > --- a/init/main.c
>>>> > +++ b/init/main.c
>>>> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
>>>> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
>>>> > early_boot_irqs_disabled = false;
>>>> >
>>>> > + init_srcu_struct(&tracepoint_srcu);
>>>> > lockdep_init_early();
>>>> >
>>>> > local_irq_enable();
>>>> > --
>>>> >
>>>> > I benchmarked it and the performance also looks quite good compared
>>>> > to the rcu tracepoint version.
>>>> >
>>>> > If you, Paul and other think doing the init_srcu_struct like this
>>>> > should be Ok, then I can try to work more on your srcu prototype and
>>>> > roll into my series and post them in the next RFC series (or let me
>>>> > know if you wanted to work your srcu stuff in a separate series..).
>>>>
>>>> That is definitely not what I was expecting, but let's see if it works
>>>> anyway... ;-)
>>>>
>>>> But first, I was instead expecting something like this:
>>>>
>>>> DEFINE_SRCU(tracepoint_srcu);
>>>>
>>>> With this approach, some of the initialization happens at compile time
>>>> and the rest happens at the first call_srcu().
>>>>
>>>> This will work -only- if the first call_srcu() doesn't happen until after
>>>> workqueue_init_early() has been invoked. Which I believe must have been
>>>> the case in your testing, because otherwise it looks like __call_srcu()
>>>> would have complained bitterly.
>>>>
>>>> On the other hand, if you need to invoke call_srcu() before the call
>>>> to workqueue_init_early(), then you need the patch that I am beating
>>>> into shape. Plus you would need to use DEFINE_SRCU() and to avoid
>>>> invoking init_srcu_struct().
>>>
>>> And here is the patch. I do not intend to send it upstream unless it
>>> actually proves necessary, and it appears that current SRCU does what
>>> you need.
>>>
>>> You would only need this patch if you wanted to invoke call_srcu()
>>> before workqueue_init_early() was called, which does not seem likely.
>>
>> Cool. So I was chatting with Paul and just to update everyone as well,
>> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
>> can make it past boot too (thanks Paul!). Also I don't see a reason we
>> need the RCU callback to execute early and its fine if it runs later.
>>
>> Also, I was thinking of introducing a separate trace_*event*_srcu API
>> as a replacement to the _rcuidle API. Then I can make use of it for my
>> tracepoints, and then later can use it for the other tracepoints
>> needing _rcuidle. After that we can finally get rid of the _rcuidle
>> API if there are no other users of it. This is just a rough plan, but
>> let me know if there's any issue with this plan that you can think
>> off.
>> IMO, I believe its simpler if the caller worries about whether it can
>> tolerate if tracepoint probes can block or not, than making it a
>> property of the tracepoint. That would also simplify the patch to
>> introduce srcu and keep the tracepoint creation API simple and less
>> confusing, but let me know if I'm missing something about this.
>
> One problem with your approach is that you can have multiple callers
> for the same tracepoint name, where some could be non-preemptible and
> others blocking. Also, there is then no clear way for the callback

Shouldn't it be responsibility of the caller to make sure it calls
correct API? So if you're wanting to allow probes to block, then you'd
call trace*blocking, if not then you don't. So the caller side can
just always do the right thing. That's a caller side issue.

>
> Regarding the name, I'm OK with having something along the lines of
> trace_*event*_blocking or such. Please don't use "srcu" or other naming
> that is explicitly tied to the underlying mechanism used internally
> however: what we want to convey is that this specific tracepoint probe

Problem is that _blocking isn't the right word either. In my IRQ trace
point case, it will look something like this then:

local_irq_disable();
// IRQs are now off.
trace_irq_disable_blocking(..);

This wouldn't make sense. What we really want is to use the SRCU
implementation so that its low overhead...

So it would be something like:

local_irq_disable();
// IRQs are now off.
trace_irq_disable_srcu(..);

I also Ok if, as Paul was saying in his last email, that just for
_rcuidle, we use SRCU so that we don't have to do the rcu_enter_irq
stuff. Or we kill the _rcuidle API completely and use _srcu for those
users instead. We already have 1 implementation specific name anyway
(rcuidle), we're just replacing it with another one. If in the future,
if we want to change that name we can always do so (Also if you will,
correcting the existing already bad naming is a different problem and
we're not making it any worse tbh).

> can be preempted and block. The underlying implementation could move to
> a different RCU flavor brand in the future, and it should not impact
> users of the tracepoint APIs.
>
> In order to ensure that probes that may block only register themselves
> to tracepoints that allow blocking, we should introduce new tracepoint
> declaration/definition *and* registration APIs also contain the
> "BLOCKING/blocking" keywords (or such), so we can ensure that a
> tracepoint probe being registered to a "blocking" tracepoint is indeed
> allowed to block.

I feel this problem you're describing is slightly out of the scope of
the issues we're talking about, I think. Even right now, someone can
write a callback that blocks and then bad things will happen. If I
understand correctly, all callbacks right now will execute in a
preempt disabled section because of rcu_read_lock_sched. So we already
have a problem (without the SRCU changes) that if a callback blocks,
then we'll have hard to diagnose sleeping while atomic issues. Sorry
if I missed your point.

thanks,

- Joel

2018-04-26 02:19:35

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Sun, Apr 22, 2018 at 8:19 PM, Paul E. McKenney
<[email protected]> wrote:
> On Sun, Apr 22, 2018 at 06:14:18PM -0700, Joel Fernandes wrote:
[...]
>> I narrowed the performance hit down to the call to
>> rcu_irq_enter_irqson() and rcu_irq_exit_irqson() in __DO_TRACE.
>> Commenting these 2 functions brings the perf level back.
>>
>> I was thinking about RCU usage here, and really we never change this
>> particular performance-sensitive tracepoint's function table 99.9% of
>> the time, so it seems there's quite in a win if we just had another
>> read-mostly synchronization mechanism that doesn't do all the RCU
>> tracking that's currently done here and such a mechanism can be
>> simpler..
>>
>> If I understand correctly, RCU also adds other complications such as
>> that it can't be used from the idle path, that's why the
>> rcu_irq_enter_* was added in the first place. Would be nice if we can
>> just avoid these RCU calls for the preempt/irq tracepoints... Any
>> thoughts about this or any other ideas to solve this?
>
> In theory, the tracepoint code could use SRCU instead of RCU, given that
> SRCU readers can be in the idle loop, although at the expense of a couple
> of smp_mb() calls in each tracepoint. In practice, I must defer to the
> people who know the tracepoint code better than I.

Paul and me were chatting about handling of tracing from an NMI. If
the tracepoint's implementation were to be switched to using SRCU
instead of RCU, a complication could arise due to the use of
this_cpu_inc from srcu_read_lock.

int __srcu_read_lock(struct srcu_struct *sp)
{
int idx;

idx = READ_ONCE(sp->srcu_idx) & 0x1;
this_cpu_inc(sp->sda->srcu_lock_count[idx]);
smp_mb(); /* B */ /* Avoid leaking the critical section. */
return idx;
}
EXPORT_SYMBOL_GPL(__srcu_read_lock);

What could happen is if an NMI preempts the this_cpu_inc, and also
happens to call a tracepoint from the NMI handler, then this could
result in a lost-update issue on architectures that don't support
add-to-memory instructions. Paul said he wouldn't want to use atomics
to resolve this inorder to keep the srcu overhead low.

One way we discussed to resolve this could be to use a different
srcu_struct for NMI invocations, so that the above lost update doesn't
occur. We could use in_nmi() and switch the srcu_read_lock to use the
NMI version of the srcu_struct. Another way could be to just warn for
now if the srcu version of the trace_ API was used from NMI. This
could be fragile if some code path indirect results in a tracepoint
call so we should probably handle it by detecting and using the
correct srcu_struct for the srcu_read_lock.

thanks,

- Joel

2018-04-26 15:04:50

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 25, 2018, at 6:51 PM, rostedt [email protected] wrote:

> On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> One problem with your approach is that you can have multiple callers
>> for the same tracepoint name, where some could be non-preemptible and
>> others blocking. Also, there is then no clear way for the callback
>> registration API to enforce whether the callback expects the tracepoint
>> to be blocking or non-preemptible. This can introduce hard to diagnose
>> issues in a kernel without debug options enabled.
>
> I agree that it should not be tied to an implementation name. But
> "blocking" is confusing. I would say "can_sleep" or some such name that
> states that the trace point caller is indeed something that can sleep.

"trace_*event*_{can,might,may}_sleep" are all acceptable candidates for
me.

>
>>
>> Regarding the name, I'm OK with having something along the lines of
>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>> that is explicitly tied to the underlying mechanism used internally
>> however: what we want to convey is that this specific tracepoint probe
>> can be preempted and block. The underlying implementation could move to
>> a different RCU flavor brand in the future, and it should not impact
>> users of the tracepoint APIs.
>>
>> In order to ensure that probes that may block only register themselves
>> to tracepoints that allow blocking, we should introduce new tracepoint
>> declaration/definition *and* registration APIs also contain the
>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>> allowed to block.
>
> I'd really don't want to add more declaration/definitions, as we
> already have too many as is, and with different meanings and the number
> is of incarnations is n! in growth.
>
> I'd say we just stick with a trace_<event>_can_sleep() call, and make
> sure that if that is used that no trace_<event>() call is also used, and
> enforce this with linker or compiler tricks.

My main concern is not about having both trace_<event>_can_sleep() mixed
with trace_<event>() calls. It's more about having a registration API allowing
modules registering probes that may need to sleep to explicitly declare it,
and enforce that tracepoint never connects a probe that may need to sleep
with an instrumentation site which cannot sleep.

I'm unsure what's the best way to achieve this goal though. We could possibly
extend the tracepoint_probe_register_* APIs to introduce e.g.
tracepoint_probe_register_prio_flags() and provide a TRACEPOINT_PROBE_CAN_SLEEP
as parameter upon registration. If this flag is provided, then we could figure out
an way to iterate on all callers, and ensure they are all "can_sleep" type of
callers.

Thoughts ?

Thanks,

Mathieu



>
> -- Steve

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-04-26 15:14:55

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 25, 2018, at 7:13 PM, Joel Fernandes [email protected] wrote:

> Hi Mathieu,
>
> On Wed, Apr 25, 2018 at 2:40 PM, Mathieu Desnoyers
> <[email protected]> wrote:
>> ----- On Apr 25, 2018, at 5:27 PM, Joel Fernandes [email protected] wrote:
>>
>>> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
>>> <[email protected]> wrote:
>>> [..]
>>>>> >
>>>>> > Sounds good, thanks.
>>>>> >
>>>>> > Also I found the reason for my boot issue. It was because the
>>>>> > init_srcu_struct in the prototype was being done in an initcall.
>>>>> > Instead if I do it in start_kernel before the tracepoint is used, it
>>>>> > fixes it (although I don't know if this is dangerous to do like this
>>>>> > but I can get it to boot atleast.. Let me know if this isn't the
>>>>> > right way to do it, or if something else could go wrong)
>>>>> >
>>>>> > diff --git a/init/main.c b/init/main.c
>>>>> > index 34823072ef9e..ecc88319c6da 100644
>>>>> > --- a/init/main.c
>>>>> > +++ b/init/main.c
>>>>> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
>>>>> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
>>>>> > early_boot_irqs_disabled = false;
>>>>> >
>>>>> > + init_srcu_struct(&tracepoint_srcu);
>>>>> > lockdep_init_early();
>>>>> >
>>>>> > local_irq_enable();
>>>>> > --
>>>>> >
>>>>> > I benchmarked it and the performance also looks quite good compared
>>>>> > to the rcu tracepoint version.
>>>>> >
>>>>> > If you, Paul and other think doing the init_srcu_struct like this
>>>>> > should be Ok, then I can try to work more on your srcu prototype and
>>>>> > roll into my series and post them in the next RFC series (or let me
>>>>> > know if you wanted to work your srcu stuff in a separate series..).
>>>>>
>>>>> That is definitely not what I was expecting, but let's see if it works
>>>>> anyway... ;-)
>>>>>
>>>>> But first, I was instead expecting something like this:
>>>>>
>>>>> DEFINE_SRCU(tracepoint_srcu);
>>>>>
>>>>> With this approach, some of the initialization happens at compile time
>>>>> and the rest happens at the first call_srcu().
>>>>>
>>>>> This will work -only- if the first call_srcu() doesn't happen until after
>>>>> workqueue_init_early() has been invoked. Which I believe must have been
>>>>> the case in your testing, because otherwise it looks like __call_srcu()
>>>>> would have complained bitterly.
>>>>>
>>>>> On the other hand, if you need to invoke call_srcu() before the call
>>>>> to workqueue_init_early(), then you need the patch that I am beating
>>>>> into shape. Plus you would need to use DEFINE_SRCU() and to avoid
>>>>> invoking init_srcu_struct().
>>>>
>>>> And here is the patch. I do not intend to send it upstream unless it
>>>> actually proves necessary, and it appears that current SRCU does what
>>>> you need.
>>>>
>>>> You would only need this patch if you wanted to invoke call_srcu()
>>>> before workqueue_init_early() was called, which does not seem likely.
>>>
>>> Cool. So I was chatting with Paul and just to update everyone as well,
>>> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
>>> can make it past boot too (thanks Paul!). Also I don't see a reason we
>>> need the RCU callback to execute early and its fine if it runs later.
>>>
>>> Also, I was thinking of introducing a separate trace_*event*_srcu API
>>> as a replacement to the _rcuidle API. Then I can make use of it for my
>>> tracepoints, and then later can use it for the other tracepoints
>>> needing _rcuidle. After that we can finally get rid of the _rcuidle
>>> API if there are no other users of it. This is just a rough plan, but
>>> let me know if there's any issue with this plan that you can think
>>> off.
>>> IMO, I believe its simpler if the caller worries about whether it can
>>> tolerate if tracepoint probes can block or not, than making it a
>>> property of the tracepoint. That would also simplify the patch to
>>> introduce srcu and keep the tracepoint creation API simple and less
>>> confusing, but let me know if I'm missing something about this.
>>
>> One problem with your approach is that you can have multiple callers
>> for the same tracepoint name, where some could be non-preemptible and
>> others blocking. Also, there is then no clear way for the callback
>
> Shouldn't it be responsibility of the caller to make sure it calls
> correct API? So if you're wanting to allow probes to block, then you'd
> call trace*blocking, if not then you don't. So the caller side can
> just always do the right thing. That's a caller side issue.

The issue there is that tracepoint.c has APIs both for instrumentation
and for registration of probe providers (callbacks). I want tracepoint.c
to provide guarantees that it won't connect incompatible probes and
callsites together.

>
>>
>> Regarding the name, I'm OK with having something along the lines of
>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>> that is explicitly tied to the underlying mechanism used internally
>> however: what we want to convey is that this specific tracepoint probe
>
> Problem is that _blocking isn't the right word either. In my IRQ trace
> point case, it will look something like this then:
>
> local_irq_disable();
> // IRQs are now off.
> trace_irq_disable_blocking(..);
>
> This wouldn't make sense. What we really want is to use the SRCU
> implementation so that its low overhead...
>
> So it would be something like:
>
> local_irq_disable();
> // IRQs are now off.
> trace_irq_disable_srcu(..);
>
> I also Ok if, as Paul was saying in his last email, that just for
> _rcuidle, we use SRCU so that we don't have to do the rcu_enter_irq
> stuff. Or we kill the _rcuidle API completely and use _srcu for those
> users instead. We already have 1 implementation specific name anyway
> (rcuidle), we're just replacing it with another one. If in the future,
> if we want to change that name we can always do so (Also if you will,
> correcting the existing already bad naming is a different problem and
> we're not making it any worse tbh).

Using SRCU rather than the sched-rcu tracepoint synchronization in your
use-case it caused by a limitation of sched-rcu: it cannot be efficiently
used within idle code. So you don't care about the "can_sleep" property
of SRCU. You could event mix SRCU and sched-rcu callsites for the same
probe name, and it would be perfectly valid.

So even though both "can_sleep" and "rcuidle" caller variants would end
up using SRCU under the hood, each can have its own caller API, e.g.:

* trace_<event>() -> only non-sleeping probes can register to those.
Uses sched-rcu under the hood.

* trace_<event>_can_sleep() -> both sleeping and non-sleeping probes can
register to those. Uses SRCU under the hood.

* trace_<event>_rcuidle() -> only non-sleeping probes can register to those,
uses SRCU under the hood.

>
>> can be preempted and block. The underlying implementation could move to
>> a different RCU flavor brand in the future, and it should not impact
>> users of the tracepoint APIs.
>>
>> In order to ensure that probes that may block only register themselves
>> to tracepoints that allow blocking, we should introduce new tracepoint
>> declaration/definition *and* registration APIs also contain the
>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>> allowed to block.
>
> I feel this problem you're describing is slightly out of the scope of
> the issues we're talking about, I think. Even right now, someone can
> write a callback that blocks and then bad things will happen. If I
> understand correctly, all callbacks right now will execute in a
> preempt disabled section because of rcu_read_lock_sched. So we already
> have a problem (without the SRCU changes) that if a callback blocks,
> then we'll have hard to diagnose sleeping while atomic issues. Sorry
> if I missed your point.

The current situation is that no callback whatsoever can sleep. If we
introduce an API allowing some callbacks to sleep, I want to make sure
we don't end up registering sleepable callbacks to non-preemptible callsites.
Considering that the callback can be provided by a kernel module whereas the
callsite is within the kernel, having this kind of correctness validation
within tracepoint.c appears important.

Thanks!

Mathieu


>
> thanks,
>
> - Joel

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-04-26 15:21:50

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Thu, Apr 26, 2018 at 8:13 AM, Mathieu Desnoyers
<[email protected]> wrote:
[...]
>>> Regarding the name, I'm OK with having something along the lines of
>>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>>> that is explicitly tied to the underlying mechanism used internally
>>> however: what we want to convey is that this specific tracepoint probe
>>
>> Problem is that _blocking isn't the right word either. In my IRQ trace
>> point case, it will look something like this then:
>>
>> local_irq_disable();
>> // IRQs are now off.
>> trace_irq_disable_blocking(..);
>>
>> This wouldn't make sense. What we really want is to use the SRCU
>> implementation so that its low overhead...
>>
>> So it would be something like:
>>
>> local_irq_disable();
>> // IRQs are now off.
>> trace_irq_disable_srcu(..);
>>
>> I also Ok if, as Paul was saying in his last email, that just for
>> _rcuidle, we use SRCU so that we don't have to do the rcu_enter_irq
>> stuff. Or we kill the _rcuidle API completely and use _srcu for those
>> users instead. We already have 1 implementation specific name anyway
>> (rcuidle), we're just replacing it with another one. If in the future,
>> if we want to change that name we can always do so (Also if you will,
>> correcting the existing already bad naming is a different problem and
>> we're not making it any worse tbh).
>
> Using SRCU rather than the sched-rcu tracepoint synchronization in your
> use-case it caused by a limitation of sched-rcu: it cannot be efficiently
> used within idle code. So you don't care about the "can_sleep" property
> of SRCU. You could event mix SRCU and sched-rcu callsites for the same
> probe name, and it would be perfectly valid.
>
> So even though both "can_sleep" and "rcuidle" caller variants would end
> up using SRCU under the hood, each can have its own caller API, e.g.:
>
> * trace_<event>() -> only non-sleeping probes can register to those.
> Uses sched-rcu under the hood.
>
> * trace_<event>_can_sleep() -> both sleeping and non-sleeping probes can
> register to those. Uses SRCU under the hood.
>
> * trace_<event>_rcuidle() -> only non-sleeping probes can register to those,
> uses SRCU under the hood.
>

Cool, sounds good to me!
For starters I was thinking of changing the _rcuidle underlying
implementation as you pointed. This should be simple enough and needs
no further additional APIs. I'll work on a patch along these lines and
send it out soon. Also would love to work on your sleeping callback
case in the future as well incase you wanted to spend your cycles
working on other things.

thanks,

- Joel

2018-04-26 15:50:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Thu, Apr 26, 2018 at 11:13:16AM -0400, Mathieu Desnoyers wrote:
> ----- On Apr 25, 2018, at 7:13 PM, Joel Fernandes [email protected] wrote:
>
> > Hi Mathieu,
> >
> > On Wed, Apr 25, 2018 at 2:40 PM, Mathieu Desnoyers
> > <[email protected]> wrote:
> >> ----- On Apr 25, 2018, at 5:27 PM, Joel Fernandes [email protected] wrote:
> >>
> >>> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
> >>> <[email protected]> wrote:
> >>> [..]
> >>>>> >
> >>>>> > Sounds good, thanks.
> >>>>> >
> >>>>> > Also I found the reason for my boot issue. It was because the
> >>>>> > init_srcu_struct in the prototype was being done in an initcall.
> >>>>> > Instead if I do it in start_kernel before the tracepoint is used, it
> >>>>> > fixes it (although I don't know if this is dangerous to do like this
> >>>>> > but I can get it to boot atleast.. Let me know if this isn't the
> >>>>> > right way to do it, or if something else could go wrong)
> >>>>> >
> >>>>> > diff --git a/init/main.c b/init/main.c
> >>>>> > index 34823072ef9e..ecc88319c6da 100644
> >>>>> > --- a/init/main.c
> >>>>> > +++ b/init/main.c
> >>>>> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
> >>>>> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
> >>>>> > early_boot_irqs_disabled = false;
> >>>>> >
> >>>>> > + init_srcu_struct(&tracepoint_srcu);
> >>>>> > lockdep_init_early();
> >>>>> >
> >>>>> > local_irq_enable();
> >>>>> > --
> >>>>> >
> >>>>> > I benchmarked it and the performance also looks quite good compared
> >>>>> > to the rcu tracepoint version.
> >>>>> >
> >>>>> > If you, Paul and other think doing the init_srcu_struct like this
> >>>>> > should be Ok, then I can try to work more on your srcu prototype and
> >>>>> > roll into my series and post them in the next RFC series (or let me
> >>>>> > know if you wanted to work your srcu stuff in a separate series..).
> >>>>>
> >>>>> That is definitely not what I was expecting, but let's see if it works
> >>>>> anyway... ;-)
> >>>>>
> >>>>> But first, I was instead expecting something like this:
> >>>>>
> >>>>> DEFINE_SRCU(tracepoint_srcu);
> >>>>>
> >>>>> With this approach, some of the initialization happens at compile time
> >>>>> and the rest happens at the first call_srcu().
> >>>>>
> >>>>> This will work -only- if the first call_srcu() doesn't happen until after
> >>>>> workqueue_init_early() has been invoked. Which I believe must have been
> >>>>> the case in your testing, because otherwise it looks like __call_srcu()
> >>>>> would have complained bitterly.
> >>>>>
> >>>>> On the other hand, if you need to invoke call_srcu() before the call
> >>>>> to workqueue_init_early(), then you need the patch that I am beating
> >>>>> into shape. Plus you would need to use DEFINE_SRCU() and to avoid
> >>>>> invoking init_srcu_struct().
> >>>>
> >>>> And here is the patch. I do not intend to send it upstream unless it
> >>>> actually proves necessary, and it appears that current SRCU does what
> >>>> you need.
> >>>>
> >>>> You would only need this patch if you wanted to invoke call_srcu()
> >>>> before workqueue_init_early() was called, which does not seem likely.
> >>>
> >>> Cool. So I was chatting with Paul and just to update everyone as well,
> >>> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
> >>> can make it past boot too (thanks Paul!). Also I don't see a reason we
> >>> need the RCU callback to execute early and its fine if it runs later.
> >>>
> >>> Also, I was thinking of introducing a separate trace_*event*_srcu API
> >>> as a replacement to the _rcuidle API. Then I can make use of it for my
> >>> tracepoints, and then later can use it for the other tracepoints
> >>> needing _rcuidle. After that we can finally get rid of the _rcuidle
> >>> API if there are no other users of it. This is just a rough plan, but
> >>> let me know if there's any issue with this plan that you can think
> >>> off.
> >>> IMO, I believe its simpler if the caller worries about whether it can
> >>> tolerate if tracepoint probes can block or not, than making it a
> >>> property of the tracepoint. That would also simplify the patch to
> >>> introduce srcu and keep the tracepoint creation API simple and less
> >>> confusing, but let me know if I'm missing something about this.
> >>
> >> One problem with your approach is that you can have multiple callers
> >> for the same tracepoint name, where some could be non-preemptible and
> >> others blocking. Also, there is then no clear way for the callback
> >
> > Shouldn't it be responsibility of the caller to make sure it calls
> > correct API? So if you're wanting to allow probes to block, then you'd
> > call trace*blocking, if not then you don't. So the caller side can
> > just always do the right thing. That's a caller side issue.
>
> The issue there is that tracepoint.c has APIs both for instrumentation
> and for registration of probe providers (callbacks). I want tracepoint.c
> to provide guarantees that it won't connect incompatible probes and
> callsites together.
>
> >
> >>
> >> Regarding the name, I'm OK with having something along the lines of
> >> trace_*event*_blocking or such. Please don't use "srcu" or other naming
> >> that is explicitly tied to the underlying mechanism used internally
> >> however: what we want to convey is that this specific tracepoint probe
> >
> > Problem is that _blocking isn't the right word either. In my IRQ trace
> > point case, it will look something like this then:
> >
> > local_irq_disable();
> > // IRQs are now off.
> > trace_irq_disable_blocking(..);
> >
> > This wouldn't make sense. What we really want is to use the SRCU
> > implementation so that its low overhead...
> >
> > So it would be something like:
> >
> > local_irq_disable();
> > // IRQs are now off.
> > trace_irq_disable_srcu(..);
> >
> > I also Ok if, as Paul was saying in his last email, that just for
> > _rcuidle, we use SRCU so that we don't have to do the rcu_enter_irq
> > stuff. Or we kill the _rcuidle API completely and use _srcu for those
> > users instead. We already have 1 implementation specific name anyway
> > (rcuidle), we're just replacing it with another one. If in the future,
> > if we want to change that name we can always do so (Also if you will,
> > correcting the existing already bad naming is a different problem and
> > we're not making it any worse tbh).
>
> Using SRCU rather than the sched-rcu tracepoint synchronization in your
> use-case it caused by a limitation of sched-rcu: it cannot be efficiently
> used within idle code. So you don't care about the "can_sleep" property
> of SRCU. You could event mix SRCU and sched-rcu callsites for the same
> probe name, and it would be perfectly valid.
>
> So even though both "can_sleep" and "rcuidle" caller variants would end
> up using SRCU under the hood, each can have its own caller API, e.g.:
>
> * trace_<event>() -> only non-sleeping probes can register to those.
> Uses sched-rcu under the hood.
>
> * trace_<event>_can_sleep() -> both sleeping and non-sleeping probes can
> register to those. Uses SRCU under the hood.
>
> * trace_<event>_rcuidle() -> only non-sleeping probes can register to those,
> uses SRCU under the hood.

Of these, only trace_<event>() may be used from NMI handlers. We should
have a WARN_ON_ONCE(in_nmi()) in any of the ones using SRCU under the
hood. Yes, I would be surprised if there are tracepoints in functions
invoked both from idle and from NMI handlers, but I have been surprised
before. And it is much nicer to get surprised by a splat than by weird
hangs or silent data corruption. ;-)

Thanx, Paul

> >> can be preempted and block. The underlying implementation could move to
> >> a different RCU flavor brand in the future, and it should not impact
> >> users of the tracepoint APIs.
> >>
> >> In order to ensure that probes that may block only register themselves
> >> to tracepoints that allow blocking, we should introduce new tracepoint
> >> declaration/definition *and* registration APIs also contain the
> >> "BLOCKING/blocking" keywords (or such), so we can ensure that a
> >> tracepoint probe being registered to a "blocking" tracepoint is indeed
> >> allowed to block.
> >
> > I feel this problem you're describing is slightly out of the scope of
> > the issues we're talking about, I think. Even right now, someone can
> > write a callback that blocks and then bad things will happen. If I
> > understand correctly, all callbacks right now will execute in a
> > preempt disabled section because of rcu_read_lock_sched. So we already
> > have a problem (without the SRCU changes) that if a callback blocks,
> > then we'll have hard to diagnose sleeping while atomic issues. Sorry
> > if I missed your point.
>
> The current situation is that no callback whatsoever can sleep. If we
> introduce an API allowing some callbacks to sleep, I want to make sure
> we don't end up registering sleepable callbacks to non-preemptible callsites.
> Considering that the callback can be provided by a kernel module whereas the
> callsite is within the kernel, having this kind of correctness validation
> within tracepoint.c appears important.
>
> Thanks!
>
> Mathieu
>
>
> >
> > thanks,
> >
> > - Joel
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>


2018-04-26 16:09:40

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

----- On Apr 26, 2018, at 11:03 AM, Mathieu Desnoyers [email protected] wrote:

> ----- On Apr 25, 2018, at 6:51 PM, rostedt [email protected] wrote:
>
>> On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
>> Mathieu Desnoyers <[email protected]> wrote:
>>
>>> One problem with your approach is that you can have multiple callers
>>> for the same tracepoint name, where some could be non-preemptible and
>>> others blocking. Also, there is then no clear way for the callback
>>> registration API to enforce whether the callback expects the tracepoint
>>> to be blocking or non-preemptible. This can introduce hard to diagnose
>>> issues in a kernel without debug options enabled.
>>
>> I agree that it should not be tied to an implementation name. But
>> "blocking" is confusing. I would say "can_sleep" or some such name that
>> states that the trace point caller is indeed something that can sleep.
>
> "trace_*event*_{can,might,may}_sleep" are all acceptable candidates for
> me.
>
>>
>>>
>>> Regarding the name, I'm OK with having something along the lines of
>>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>>> that is explicitly tied to the underlying mechanism used internally
>>> however: what we want to convey is that this specific tracepoint probe
>>> can be preempted and block. The underlying implementation could move to
>>> a different RCU flavor brand in the future, and it should not impact
>>> users of the tracepoint APIs.
>>>
>>> In order to ensure that probes that may block only register themselves
>>> to tracepoints that allow blocking, we should introduce new tracepoint
>>> declaration/definition *and* registration APIs also contain the
>>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>>> allowed to block.
>>
>> I'd really don't want to add more declaration/definitions, as we
>> already have too many as is, and with different meanings and the number
>> is of incarnations is n! in growth.
>>
>> I'd say we just stick with a trace_<event>_can_sleep() call, and make
>> sure that if that is used that no trace_<event>() call is also used, and
>> enforce this with linker or compiler tricks.
>
> My main concern is not about having both trace_<event>_can_sleep() mixed
> with trace_<event>() calls. It's more about having a registration API allowing
> modules registering probes that may need to sleep to explicitly declare it,
> and enforce that tracepoint never connects a probe that may need to sleep
> with an instrumentation site which cannot sleep.
>
> I'm unsure what's the best way to achieve this goal though. We could possibly
> extend the tracepoint_probe_register_* APIs to introduce e.g.
> tracepoint_probe_register_prio_flags() and provide a TRACEPOINT_PROBE_CAN_SLEEP
> as parameter upon registration. If this flag is provided, then we could figure
> out
> an way to iterate on all callers, and ensure they are all "can_sleep" type of
> callers.

Iteration on all callers would require that we add some separate section data
for each caller, which we don't have currently. At the moment, the only data
we need is at the tracepoint definition. If we have tons of callers for a given
tracepoint (which might be the case for lockdep), we'd end up consuming a lot of
useless space.

This is one reason why I would prefer to have separate tracepoint definitions
for each of rcuidle, can_sleep, and nonpreemptible (nmi-safe) tracepoints.

Thanks,

Mathieu


>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
>
>
>>
>> -- Steve
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-05-01 01:21:00

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

On Wed, Apr 18, 2018 at 2:02 AM Masami Hiramatsu <[email protected]>
wrote:

> On Mon, 16 Apr 2018 21:07:47 -0700
> Joel Fernandes <[email protected]> wrote:

> > With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
> > to if local_irq_restore or local_irq_save didn't actually do anything.
> >
> > This gives around a 4% improvement in performance when doing the
> > following command: "time find / > /dev/null"
> >
> > Also its best to avoid these calls where possible, since in this series,
> > the RCU code in tracepoint.h seems to be call these quite a bit and I'd
> > like to keep this overhead low.

> Can we assume that the "flags" has only 1 bit irq-disable flag?
> Since it skips calling raw_local_irq_restore(flags); too,
> if there is any state in the flags on any arch, it may change the
> result. In that case, we can do it as below (just skipping
trace_hardirqs_*)

> int disabled = irqs_disabled();

> if (!raw_irqs_disabled_flags(flags) && disabled)
> trace_hardirqs_on();

> raw_local_irq_restore(flags);

> if (raw_irqs_disabled_flags(flags) && !disabled)
> trace_hardirqs_off();

With changes to the tracepoint implementation which uses srcu instead of
rcu, I'm not able to see a performance improvement with the above patch.
For this reason, I will drop this patch from next series. thanks,

- Joel