2018-07-30 22:25:46

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

From: "Joel Fernandes (Google)" <[email protected]>

This series contains the last 2 patches with minor changes suggested by Peter
and Steven, and an additional clean up of get_lock_stats as suggested by Peter.

The preempt/irq tracepoints exist, but not everything in the kernel is using
them whenever it needs to be notified that a preempt disable/enable or an irq
disable/enable has occurred. This means the different users cannot all work
simultaneously (for example, only either lockdep or the irqsoff trace events
can be used at a time).

This is particularly painful to deal with, since turning on lockdep breaks
tracers that install probes on IRQ events, such as the BCC atomic critical
section tracer [1]. This constraint also makes it impossible to use synthetic
events to trace irqsoff sections with lockdep simultaneously turned on.

This series solves that, and also results in a nice clean-up of the relevant
parts of the kernel. Several ifdefs are simpler, and the design is more
unified and better. As a result, all rcuidle tracepoints are also faster
since their handling is simpler.

[1] https://github.com/iovisor/bcc/pull/1801/

v11->v12:
- minor corrections to changelog
- Added PeterZ's Acks
- Squashed in fix for get_lock_stats

v10->v11:
- Dropped extra unneeded else statement since
rcu_read_lock_sched_notrace is the same as preempt_disable_notrace (PeterZ)

v9->v10:
- Dropped first 3 and last 2 patches that were applied previously
- Folded SPDX license into the main patch introducing trace_preemptirq.c (Steve)
- Dropped lockdep_recursing & simplified the get_cpu_var usage instead (PeterZ)
- Simplify __DO_TRACE and use rcu_dereference_raw for both RCU and SRCU (PeterZ)

v8->v9:
- Small style changes to tracepoint code (Mathieu)
- Minor style fix to use PTR_ERR_OR_ZERO (0-day bot)
- Minor fix to test_atomic_sections to use unsigned long.
- Added Namhyung's, Mathieu's Reviewed-by to some patches.

v7->v8:
- Refactored irqsoff tracer probe defines (Namhyung)

v6->v7:
- Added a module to simulate an atomic section, and a kselftest to load
and trigger it, which verifies the preempt tracer and this series.

- Fixed a new early-boot warning that appeared after I rebased; this is because
early_boot_irqs_disabled was set too early, so I moved it after the lockdep
initialization.

- Added back the softirq fix since it appears it wasn't picked up.

- Ran Ingo's locking API selftest suite, which passes with this
series.

- Mathieu suggested ifdef'ing the tracepoint_synchronize_unregister
function in case tracepoints aren't enabled; did that.

Joel Fernandes (Google) (3):
lockdep: use this_cpu_ptr instead of get_cpu_var stats
tracepoint: Make rcuidle tracepoint callers use SRCU
tracing: Centralize preemptirq tracepoints and unify their usage

include/linux/ftrace.h | 11 +-
include/linux/irqflags.h | 11 +-
include/linux/lockdep.h | 8 +-
include/linux/preempt.h | 2 +-
include/linux/tracepoint.h | 41 ++++--
include/trace/events/preemptirq.h | 23 +--
init/main.c | 5 +-
kernel/locking/lockdep.c | 45 ++----
kernel/sched/core.c | 2 +-
kernel/trace/Kconfig | 22 ++-
kernel/trace/Makefile | 2 +-
kernel/trace/trace_irqsoff.c | 231 ++++++++----------------------
kernel/trace/trace_preemptirq.c | 72 ++++++++++
kernel/tracepoint.c | 16 ++-
14 files changed, 244 insertions(+), 247 deletions(-)
create mode 100644 kernel/trace/trace_preemptirq.c

--
2.18.0.345.g5c9ce644c3-goog


2018-07-30 22:25:58

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH v12 1/3] lockdep: use this_cpu_ptr instead of get_cpu_var stats

From: "Joel Fernandes (Google)" <[email protected]>

get_cpu_var() disables preemption, which has the potential to call into the
preemption-disable tracepoints and cause some complications. There's also
no need to disable preemption in the callers of get_lock_stats() anyway,
since preemption is already disabled there. So let's simplify the code.
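
For context, get_cpu_var()/put_cpu_var() are essentially a
preempt_disable()/preempt_enable() pair wrapped around a this_cpu_ptr()
access. A simplified sketch of the generic helpers (not the exact upstream
macros) shows where the recursion into the preempt tracepoints comes from:

/* Simplified sketch of the generic per-cpu helpers (not the exact macros). */
#define get_cpu_var(var)                                        \
(*({                                                            \
        preempt_disable();      /* can recurse into the preempt tracepoints */ \
        this_cpu_ptr(&var);                                     \
}))

#define put_cpu_var(var)                                        \
do {                                                            \
        (void)&(var);                                           \
        preempt_enable();                                       \
} while (0)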

Suggested-by: Peter Zijlstra <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
kernel/locking/lockdep.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 5fa4d3138bf1..fbbb79d5cfa0 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -248,12 +248,7 @@ void clear_lock_stats(struct lock_class *class)

static struct lock_class_stats *get_lock_stats(struct lock_class *class)
{
- return &get_cpu_var(cpu_lock_stats)[class - lock_classes];
-}
-
-static void put_lock_stats(struct lock_class_stats *stats)
-{
- put_cpu_var(cpu_lock_stats);
+ return &this_cpu_ptr(cpu_lock_stats)[class - lock_classes];
}

static void lock_release_holdtime(struct held_lock *hlock)
@@ -271,7 +266,6 @@ static void lock_release_holdtime(struct held_lock *hlock)
lock_time_inc(&stats->read_holdtime, holdtime);
else
lock_time_inc(&stats->write_holdtime, holdtime);
- put_lock_stats(stats);
}
#else
static inline void lock_release_holdtime(struct held_lock *hlock)
@@ -4090,7 +4084,6 @@ __lock_contended(struct lockdep_map *lock, unsigned long ip)
stats->contending_point[contending_point]++;
if (lock->cpu != smp_processor_id())
stats->bounces[bounce_contended + !!hlock->read]++;
- put_lock_stats(stats);
}

static void
@@ -4138,7 +4131,6 @@ __lock_acquired(struct lockdep_map *lock, unsigned long ip)
}
if (lock->cpu != cpu)
stats->bounces[bounce_acquired + !!hlock->read]++;
- put_lock_stats(stats);

lock->cpu = cpu;
lock->ip = ip;
--
2.18.0.345.g5c9ce644c3-goog


2018-07-30 22:25:58

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

From: "Joel Fernandes (Google)" <[email protected]>

This patch detaches the preemptirq tracepoints from the tracers and
keeps them separate.

Advantages:
* Lockdep and the irqsoff trace events can now run in parallel since they
no longer need their own separate calls.

* This unifies the use case of adding hooks to an irqs-off and irqs-on
event, and a preempt-off and preempt-on event.
Three users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irq and preempt trace events

The unification cleans up several ifdefs and makes the code in the preempt
tracer and irqsoff tracer simpler. It gets rid of all the horrific
ifdeffery around PROVE_LOCKING and makes configuration of the different
users of the tracepoints easier and more understandable. It also gets rid
of the time_* function calls from the lockdep hooks that were used to call
into the preemptirq tracer, which are not needed anymore. The negative
delta in lines of code in this patch is quite large too.
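
For illustration, with the centralized tracepoints a user such as the irqsoff
tracer simply registers probes onto them. This excerpt is taken from the
trace_irqsoff.c changes further below:

/* Probe registered onto the shared irq_disable tracepoint. */
static void tracer_hardirqs_off(void *none, unsigned long a0, unsigned long a1)
{
        if (!preempt_trace() && irq_trace())
                start_critical_timing(a0, a1);
}

static int irqsoff_tracer_init(struct trace_array *tr)
{
        trace_type = TRACER_IRQS_OFF;

        register_trace_irq_disable(tracer_hardirqs_off, NULL);
        register_trace_irq_enable(tracer_hardirqs_on, NULL);
        return __irqsoff_tracer_init(tr);
}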

In the patch we introduce a new CONFIG option, PREEMPTIRQ_TRACEPOINTS,
as a single point for registering probes onto the tracepoints. With this,
the web of config options for the preempt/irq toggle tracepoints and their
users becomes:

       PREEMPT_TRACER   PREEMPTIRQ_EVENTS   IRQSOFF_TRACER   PROVE_LOCKING
             |                |      \            |               |
              \   (selects)  /        \            \  (selects)  /
               TRACE_PREEMPT_TOGGLE    ---->        TRACE_IRQFLAGS
                          \                            /
                           \      (depends on)        /
                            PREEMPTIRQ_TRACEPOINTS

Other than the performance tests mentioned in the previous patch, I also
ran the locking API test suite and verified that all test cases pass.

I also injected issues by not registering the lockdep probes onto the
tracepoints, and I see failures, confirming that the probes are indeed
working.

This series + lockdep probes not registered (just to inject errors):
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok |
[ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |

With this series + lockdep probes registered, all locking tests pass:

[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |

Acked-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Namhyung Kim <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/ftrace.h | 11 +-
include/linux/irqflags.h | 11 +-
include/linux/lockdep.h | 8 +-
include/linux/preempt.h | 2 +-
include/trace/events/preemptirq.h | 23 +--
init/main.c | 5 +-
kernel/locking/lockdep.c | 35 ++---
kernel/sched/core.c | 2 +-
kernel/trace/Kconfig | 22 ++-
kernel/trace/Makefile | 2 +-
kernel/trace/trace_irqsoff.c | 231 ++++++++----------------------
kernel/trace/trace_preemptirq.c | 72 ++++++++++
12 files changed, 195 insertions(+), 229 deletions(-)
create mode 100644 kernel/trace/trace_preemptirq.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 63af5eb0ff46..a397907e8d72 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -701,16 +701,7 @@ static inline unsigned long get_lock_parent_ip(void)
return CALLER_ADDR2;
}

-#ifdef CONFIG_IRQSOFF_TRACER
- extern void time_hardirqs_on(unsigned long a0, unsigned long a1);
- extern void time_hardirqs_off(unsigned long a0, unsigned long a1);
-#else
- static inline void time_hardirqs_on(unsigned long a0, unsigned long a1) { }
- static inline void time_hardirqs_off(unsigned long a0, unsigned long a1) { }
-#endif
-
-#if defined(CONFIG_PREEMPT_TRACER) || \
- (defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPTIRQ_EVENTS))
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
extern void trace_preempt_on(unsigned long a0, unsigned long a1);
extern void trace_preempt_off(unsigned long a0, unsigned long a1);
#else
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 9700f00bbc04..50edb9cbbd26 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -15,9 +15,16 @@
#include <linux/typecheck.h>
#include <asm/irqflags.h>

-#ifdef CONFIG_TRACE_IRQFLAGS
+/* Currently trace_softirqs_on/off is used only by lockdep */
+#ifdef CONFIG_PROVE_LOCKING
extern void trace_softirqs_on(unsigned long ip);
extern void trace_softirqs_off(unsigned long ip);
+#else
+# define trace_softirqs_on(ip) do { } while (0)
+# define trace_softirqs_off(ip) do { } while (0)
+#endif
+
+#ifdef CONFIG_TRACE_IRQFLAGS
extern void trace_hardirqs_on(void);
extern void trace_hardirqs_off(void);
# define trace_hardirq_context(p) ((p)->hardirq_context)
@@ -43,8 +50,6 @@ do { \
#else
# define trace_hardirqs_on() do { } while (0)
# define trace_hardirqs_off() do { } while (0)
-# define trace_softirqs_on(ip) do { } while (0)
-# define trace_softirqs_off(ip) do { } while (0)
# define trace_hardirq_context(p) 0
# define trace_softirq_context(p) 0
# define trace_hardirqs_enabled(p) 0
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 6fc77d4dbdcd..a8113357ceeb 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -266,7 +266,8 @@ struct held_lock {
/*
* Initialization, self-test and debugging-output methods:
*/
-extern void lockdep_info(void);
+extern void lockdep_init(void);
+extern void lockdep_init_early(void);
extern void lockdep_reset(void);
extern void lockdep_reset_lock(struct lockdep_map *lock);
extern void lockdep_free_key_range(void *start, unsigned long size);
@@ -406,7 +407,8 @@ static inline void lockdep_on(void)
# define lock_downgrade(l, i) do { } while (0)
# define lock_set_class(l, n, k, s, i) do { } while (0)
# define lock_set_subclass(l, s, i) do { } while (0)
-# define lockdep_info() do { } while (0)
+# define lockdep_init() do { } while (0)
+# define lockdep_init_early() do { } while (0)
# define lockdep_init_map(lock, name, key, sub) \
do { (void)(name); (void)(key); } while (0)
# define lockdep_set_class(lock, key) do { (void)(key); } while (0)
@@ -532,7 +534,7 @@ do { \

#endif /* CONFIG_LOCKDEP */

-#ifdef CONFIG_TRACE_IRQFLAGS
+#ifdef CONFIG_PROVE_LOCKING
extern void print_irqtrace_events(struct task_struct *curr);
#else
static inline void print_irqtrace_events(struct task_struct *curr)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 5bd3f151da78..c01813c3fbe9 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -150,7 +150,7 @@
*/
#define in_atomic_preempt_off() (preempt_count() != PREEMPT_DISABLE_OFFSET)

-#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_TRACE_PREEMPT_TOGGLE)
extern void preempt_count_add(int val);
extern void preempt_count_sub(int val);
#define preempt_count_dec_and_test() \
diff --git a/include/trace/events/preemptirq.h b/include/trace/events/preemptirq.h
index 9c4eb33c5a1d..9a0d4ceeb166 100644
--- a/include/trace/events/preemptirq.h
+++ b/include/trace/events/preemptirq.h
@@ -1,4 +1,4 @@
-#ifdef CONFIG_PREEMPTIRQ_EVENTS
+#ifdef CONFIG_PREEMPTIRQ_TRACEPOINTS

#undef TRACE_SYSTEM
#define TRACE_SYSTEM preemptirq
@@ -32,7 +32,7 @@ DECLARE_EVENT_CLASS(preemptirq_template,
(void *)((unsigned long)(_stext) + __entry->parent_offs))
);

-#ifndef CONFIG_PROVE_LOCKING
+#ifdef CONFIG_TRACE_IRQFLAGS
DEFINE_EVENT(preemptirq_template, irq_disable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
@@ -40,9 +40,14 @@ DEFINE_EVENT(preemptirq_template, irq_disable,
DEFINE_EVENT(preemptirq_template, irq_enable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
+#else
+#define trace_irq_enable(...)
+#define trace_irq_disable(...)
+#define trace_irq_enable_rcuidle(...)
+#define trace_irq_disable_rcuidle(...)
#endif

-#ifdef CONFIG_DEBUG_PREEMPT
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
DEFINE_EVENT(preemptirq_template, preempt_disable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
@@ -50,22 +55,22 @@ DEFINE_EVENT(preemptirq_template, preempt_disable,
DEFINE_EVENT(preemptirq_template, preempt_enable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
+#else
+#define trace_preempt_enable(...)
+#define trace_preempt_disable(...)
+#define trace_preempt_enable_rcuidle(...)
+#define trace_preempt_disable_rcuidle(...)
#endif

#endif /* _TRACE_PREEMPTIRQ_H */

#include <trace/define_trace.h>

-#endif /* !CONFIG_PREEMPTIRQ_EVENTS */
-
-#if !defined(CONFIG_PREEMPTIRQ_EVENTS) || defined(CONFIG_PROVE_LOCKING)
+#else /* !CONFIG_PREEMPTIRQ_TRACEPOINTS */
#define trace_irq_enable(...)
#define trace_irq_disable(...)
#define trace_irq_enable_rcuidle(...)
#define trace_irq_disable_rcuidle(...)
-#endif
-
-#if !defined(CONFIG_PREEMPTIRQ_EVENTS) || !defined(CONFIG_DEBUG_PREEMPT)
#define trace_preempt_enable(...)
#define trace_preempt_disable(...)
#define trace_preempt_enable_rcuidle(...)
diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..44fe43be84c1 100644
--- a/init/main.c
+++ b/init/main.c
@@ -648,6 +648,9 @@ asmlinkage __visible void __init start_kernel(void)
profile_init();
call_function_init();
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
+
+ lockdep_init_early();
+
early_boot_irqs_disabled = false;
local_irq_enable();

@@ -663,7 +666,7 @@ asmlinkage __visible void __init start_kernel(void)
panic("Too many boot %s vars at `%s'", panic_later,
panic_param);

- lockdep_info();
+ lockdep_init();

/*
* Need to run this when irqs are enabled, because it wants
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index fbbb79d5cfa0..03bfaeb9f4e6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -55,6 +55,7 @@

#include "lockdep_internals.h"

+#include <trace/events/preemptirq.h>
#define CREATE_TRACE_POINTS
#include <trace/events/lock.h>

@@ -2839,10 +2840,9 @@ static void __trace_hardirqs_on_caller(unsigned long ip)
debug_atomic_inc(hardirqs_on_events);
}

-__visible void trace_hardirqs_on_caller(unsigned long ip)
+static void lockdep_hardirqs_on(void *none, unsigned long ignore,
+ unsigned long ip)
{
- time_hardirqs_on(CALLER_ADDR0, ip);
-
if (unlikely(!debug_locks || current->lockdep_recursion))
return;

@@ -2881,23 +2881,15 @@ __visible void trace_hardirqs_on_caller(unsigned long ip)
__trace_hardirqs_on_caller(ip);
current->lockdep_recursion = 0;
}
-EXPORT_SYMBOL(trace_hardirqs_on_caller);
-
-void trace_hardirqs_on(void)
-{
- trace_hardirqs_on_caller(CALLER_ADDR0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on);

/*
* Hardirqs were disabled:
*/
-__visible void trace_hardirqs_off_caller(unsigned long ip)
+static void lockdep_hardirqs_off(void *none, unsigned long ignore,
+ unsigned long ip)
{
struct task_struct *curr = current;

- time_hardirqs_off(CALLER_ADDR0, ip);
-
if (unlikely(!debug_locks || current->lockdep_recursion))
return;

@@ -2919,13 +2911,6 @@ __visible void trace_hardirqs_off_caller(unsigned long ip)
} else
debug_atomic_inc(redundant_hardirqs_off);
}
-EXPORT_SYMBOL(trace_hardirqs_off_caller);
-
-void trace_hardirqs_off(void)
-{
- trace_hardirqs_off_caller(CALLER_ADDR0);
-}
-EXPORT_SYMBOL(trace_hardirqs_off);

/*
* Softirqs will be enabled:
@@ -4330,7 +4315,15 @@ void lockdep_reset_lock(struct lockdep_map *lock)
raw_local_irq_restore(flags);
}

-void __init lockdep_info(void)
+void __init lockdep_init_early(void)
+{
+#ifdef CONFIG_PROVE_LOCKING
+ register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
+ register_trace_prio_irq_enable(lockdep_hardirqs_on, NULL, INT_MIN);
+#endif
+}
+
+void __init lockdep_init(void)
{
printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n");

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..5de1a4343424 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3189,7 +3189,7 @@ static inline void sched_tick_stop(int cpu) { }
#endif

#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
- defined(CONFIG_PREEMPT_TRACER))
+ defined(CONFIG_TRACE_PREEMPT_TOGGLE))
/*
* If the value passed in is equal to the current preempt count
* then we just disabled preemption. Start timing the latency.
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e15fadbe5dfb..eb5ab6b511e2 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -82,6 +82,15 @@ config RING_BUFFER_ALLOW_SWAP
Allow the use of ring_buffer_swap_cpu.
Adds a very slight overhead to tracing when enabled.

+config PREEMPTIRQ_TRACEPOINTS
+ bool
+ depends on TRACE_PREEMPT_TOGGLE || TRACE_IRQFLAGS
+ select TRACING
+ default y
+ help
+ Create preempt/irq toggle tracepoints if needed, so that other parts
+ of the kernel can use them to generate or add hooks to them.
+
# All tracer options should select GENERIC_TRACER. For those options that are
# enabled by all tracers (context switch and event tracer) they select TRACING.
# This allows those options to appear when no other tracer is selected. But the
@@ -155,18 +164,20 @@ config FUNCTION_GRAPH_TRACER
the return value. This is done by setting the current return
address on the current task structure into a stack of calls.

+config TRACE_PREEMPT_TOGGLE
+ bool
+ help
+ Enables hooks which will be called when preemption is first disabled,
+ and last enabled.

config PREEMPTIRQ_EVENTS
bool "Enable trace events for preempt and irq disable/enable"
select TRACE_IRQFLAGS
- depends on DEBUG_PREEMPT || !PROVE_LOCKING
- depends on TRACING
+ select TRACE_PREEMPT_TOGGLE if PREEMPT
+ select GENERIC_TRACER
default n
help
Enable tracing of disable and enable events for preemption and irqs.
- For tracing preempt disable/enable events, DEBUG_PREEMPT must be
- enabled. For tracing irq disable/enable events, PROVE_LOCKING must
- be disabled.

config IRQSOFF_TRACER
bool "Interrupts-off Latency Tracer"
@@ -203,6 +214,7 @@ config PREEMPT_TRACER
select RING_BUFFER_ALLOW_SWAP
select TRACER_SNAPSHOT
select TRACER_SNAPSHOT_PER_CPU_SWAP
+ select TRACE_PREEMPT_TOGGLE
help
This option measures the time spent in preemption-off critical
sections, with microsecond accuracy.
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 31c6b524c260..b916db076561 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_TRACING_MAP) += tracing_map.o
obj-$(CONFIG_PREEMPTIRQ_DELAY_TEST) += preemptirq_delay_test.o
obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
obj-$(CONFIG_FUNCTION_TRACER) += trace_functions.o
-obj-$(CONFIG_PREEMPTIRQ_EVENTS) += trace_irqsoff.o
+obj-$(CONFIG_PREEMPTIRQ_TRACEPOINTS) += trace_preemptirq.o
obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index f8daa754cce2..770cd30cda40 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -16,7 +16,6 @@

#include "trace.h"

-#define CREATE_TRACE_POINTS
#include <trace/events/preemptirq.h>

#if defined(CONFIG_IRQSOFF_TRACER) || defined(CONFIG_PREEMPT_TRACER)
@@ -450,66 +449,6 @@ void stop_critical_timings(void)
}
EXPORT_SYMBOL_GPL(stop_critical_timings);

-#ifdef CONFIG_IRQSOFF_TRACER
-#ifdef CONFIG_PROVE_LOCKING
-void time_hardirqs_on(unsigned long a0, unsigned long a1)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(a0, a1);
-}
-
-void time_hardirqs_off(unsigned long a0, unsigned long a1)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(a0, a1);
-}
-
-#else /* !CONFIG_PROVE_LOCKING */
-
-/*
- * We are only interested in hardirq on/off events:
- */
-static inline void tracer_hardirqs_on(void)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
-}
-
-static inline void tracer_hardirqs_off(void)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(CALLER_ADDR0, CALLER_ADDR1);
-}
-
-static inline void tracer_hardirqs_on_caller(unsigned long caller_addr)
-{
- if (!preempt_trace() && irq_trace())
- stop_critical_timing(CALLER_ADDR0, caller_addr);
-}
-
-static inline void tracer_hardirqs_off_caller(unsigned long caller_addr)
-{
- if (!preempt_trace() && irq_trace())
- start_critical_timing(CALLER_ADDR0, caller_addr);
-}
-
-#endif /* CONFIG_PROVE_LOCKING */
-#endif /* CONFIG_IRQSOFF_TRACER */
-
-#ifdef CONFIG_PREEMPT_TRACER
-static inline void tracer_preempt_on(unsigned long a0, unsigned long a1)
-{
- if (preempt_trace() && !irq_trace())
- stop_critical_timing(a0, a1);
-}
-
-static inline void tracer_preempt_off(unsigned long a0, unsigned long a1)
-{
- if (preempt_trace() && !irq_trace())
- start_critical_timing(a0, a1);
-}
-#endif /* CONFIG_PREEMPT_TRACER */
-
#ifdef CONFIG_FUNCTION_TRACER
static bool function_enabled;

@@ -659,15 +598,34 @@ static void irqsoff_tracer_stop(struct trace_array *tr)
}

#ifdef CONFIG_IRQSOFF_TRACER
+/*
+ * We are only interested in hardirq on/off events:
+ */
+static void tracer_hardirqs_on(void *none, unsigned long a0, unsigned long a1)
+{
+ if (!preempt_trace() && irq_trace())
+ stop_critical_timing(a0, a1);
+}
+
+static void tracer_hardirqs_off(void *none, unsigned long a0, unsigned long a1)
+{
+ if (!preempt_trace() && irq_trace())
+ start_critical_timing(a0, a1);
+}
+
static int irqsoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_IRQS_OFF;

+ register_trace_irq_disable(tracer_hardirqs_off, NULL);
+ register_trace_irq_enable(tracer_hardirqs_on, NULL);
return __irqsoff_tracer_init(tr);
}

static void irqsoff_tracer_reset(struct trace_array *tr)
{
+ unregister_trace_irq_disable(tracer_hardirqs_off, NULL);
+ unregister_trace_irq_enable(tracer_hardirqs_on, NULL);
__irqsoff_tracer_reset(tr);
}

@@ -690,21 +648,34 @@ static struct tracer irqsoff_tracer __read_mostly =
.allow_instances = true,
.use_max_tr = true,
};
-# define register_irqsoff(trace) register_tracer(&trace)
-#else
-# define register_irqsoff(trace) do { } while (0)
-#endif
+#endif /* CONFIG_IRQSOFF_TRACER */

#ifdef CONFIG_PREEMPT_TRACER
+static void tracer_preempt_on(void *none, unsigned long a0, unsigned long a1)
+{
+ if (preempt_trace() && !irq_trace())
+ stop_critical_timing(a0, a1);
+}
+
+static void tracer_preempt_off(void *none, unsigned long a0, unsigned long a1)
+{
+ if (preempt_trace() && !irq_trace())
+ start_critical_timing(a0, a1);
+}
+
static int preemptoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_PREEMPT_OFF;

+ register_trace_preempt_disable(tracer_preempt_off, NULL);
+ register_trace_preempt_enable(tracer_preempt_on, NULL);
return __irqsoff_tracer_init(tr);
}

static void preemptoff_tracer_reset(struct trace_array *tr)
{
+ unregister_trace_preempt_disable(tracer_preempt_off, NULL);
+ unregister_trace_preempt_enable(tracer_preempt_on, NULL);
__irqsoff_tracer_reset(tr);
}

@@ -727,23 +698,29 @@ static struct tracer preemptoff_tracer __read_mostly =
.allow_instances = true,
.use_max_tr = true,
};
-# define register_preemptoff(trace) register_tracer(&trace)
-#else
-# define register_preemptoff(trace) do { } while (0)
-#endif
+#endif /* CONFIG_PREEMPT_TRACER */

-#if defined(CONFIG_IRQSOFF_TRACER) && \
- defined(CONFIG_PREEMPT_TRACER)
+#if defined(CONFIG_IRQSOFF_TRACER) && defined(CONFIG_PREEMPT_TRACER)

static int preemptirqsoff_tracer_init(struct trace_array *tr)
{
trace_type = TRACER_IRQS_OFF | TRACER_PREEMPT_OFF;

+ register_trace_irq_disable(tracer_hardirqs_off, NULL);
+ register_trace_irq_enable(tracer_hardirqs_on, NULL);
+ register_trace_preempt_disable(tracer_preempt_off, NULL);
+ register_trace_preempt_enable(tracer_preempt_on, NULL);
+
return __irqsoff_tracer_init(tr);
}

static void preemptirqsoff_tracer_reset(struct trace_array *tr)
{
+ unregister_trace_irq_disable(tracer_hardirqs_off, NULL);
+ unregister_trace_irq_enable(tracer_hardirqs_on, NULL);
+ unregister_trace_preempt_disable(tracer_preempt_off, NULL);
+ unregister_trace_preempt_enable(tracer_preempt_on, NULL);
+
__irqsoff_tracer_reset(tr);
}

@@ -766,115 +743,21 @@ static struct tracer preemptirqsoff_tracer __read_mostly =
.allow_instances = true,
.use_max_tr = true,
};
-
-# define register_preemptirqsoff(trace) register_tracer(&trace)
-#else
-# define register_preemptirqsoff(trace) do { } while (0)
#endif

__init static int init_irqsoff_tracer(void)
{
- register_irqsoff(irqsoff_tracer);
- register_preemptoff(preemptoff_tracer);
- register_preemptirqsoff(preemptirqsoff_tracer);
-
- return 0;
-}
-core_initcall(init_irqsoff_tracer);
-#endif /* IRQSOFF_TRACER || PREEMPTOFF_TRACER */
-
-#ifndef CONFIG_IRQSOFF_TRACER
-static inline void tracer_hardirqs_on(void) { }
-static inline void tracer_hardirqs_off(void) { }
-static inline void tracer_hardirqs_on_caller(unsigned long caller_addr) { }
-static inline void tracer_hardirqs_off_caller(unsigned long caller_addr) { }
+#ifdef CONFIG_IRQSOFF_TRACER
+ register_tracer(&irqsoff_tracer);
#endif
-
-#ifndef CONFIG_PREEMPT_TRACER
-static inline void tracer_preempt_on(unsigned long a0, unsigned long a1) { }
-static inline void tracer_preempt_off(unsigned long a0, unsigned long a1) { }
+#ifdef CONFIG_PREEMPT_TRACER
+ register_tracer(&preemptoff_tracer);
#endif
-
-#if defined(CONFIG_TRACE_IRQFLAGS) && !defined(CONFIG_PROVE_LOCKING)
-/* Per-cpu variable to prevent redundant calls when IRQs already off */
-static DEFINE_PER_CPU(int, tracing_irq_cpu);
-
-void trace_hardirqs_on(void)
-{
- if (!this_cpu_read(tracing_irq_cpu))
- return;
-
- trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
- tracer_hardirqs_on();
-
- this_cpu_write(tracing_irq_cpu, 0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on);
-
-void trace_hardirqs_off(void)
-{
- if (this_cpu_read(tracing_irq_cpu))
- return;
-
- this_cpu_write(tracing_irq_cpu, 1);
-
- trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
- tracer_hardirqs_off();
-}
-EXPORT_SYMBOL(trace_hardirqs_off);
-
-__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
-{
- if (!this_cpu_read(tracing_irq_cpu))
- return;
-
- trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
- tracer_hardirqs_on_caller(caller_addr);
-
- this_cpu_write(tracing_irq_cpu, 0);
-}
-EXPORT_SYMBOL(trace_hardirqs_on_caller);
-
-__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
-{
- if (this_cpu_read(tracing_irq_cpu))
- return;
-
- this_cpu_write(tracing_irq_cpu, 1);
-
- trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
- tracer_hardirqs_off_caller(caller_addr);
-}
-EXPORT_SYMBOL(trace_hardirqs_off_caller);
-
-/*
- * Stubs:
- */
-
-void trace_softirqs_on(unsigned long ip)
-{
-}
-
-void trace_softirqs_off(unsigned long ip)
-{
-}
-
-inline void print_irqtrace_events(struct task_struct *curr)
-{
-}
+#if defined(CONFIG_IRQSOFF_TRACER) && defined(CONFIG_PREEMPT_TRACER)
+ register_tracer(&preemptirqsoff_tracer);
#endif

-#if defined(CONFIG_PREEMPT_TRACER) || \
- (defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPTIRQ_EVENTS))
-void trace_preempt_on(unsigned long a0, unsigned long a1)
-{
- trace_preempt_enable_rcuidle(a0, a1);
- tracer_preempt_on(a0, a1);
-}
-
-void trace_preempt_off(unsigned long a0, unsigned long a1)
-{
- trace_preempt_disable_rcuidle(a0, a1);
- tracer_preempt_off(a0, a1);
+ return 0;
}
-#endif
+core_initcall(init_irqsoff_tracer);
+#endif /* IRQSOFF_TRACER || PREEMPTOFF_TRACER */
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
new file mode 100644
index 000000000000..e76b78bf258e
--- /dev/null
+++ b/kernel/trace/trace_preemptirq.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * preemptoff and irqoff tracepoints
+ *
+ * Copyright (C) Joel Fernandes (Google) <[email protected]>
+ */
+
+#include <linux/kallsyms.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/ftrace.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/preemptirq.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+/* Per-cpu variable to prevent redundant calls when IRQs already off */
+static DEFINE_PER_CPU(int, tracing_irq_cpu);
+
+void trace_hardirqs_on(void)
+{
+ if (!this_cpu_read(tracing_irq_cpu))
+ return;
+
+ trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+ this_cpu_write(tracing_irq_cpu, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_on);
+
+void trace_hardirqs_off(void)
+{
+ if (this_cpu_read(tracing_irq_cpu))
+ return;
+
+ this_cpu_write(tracing_irq_cpu, 1);
+ trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+}
+EXPORT_SYMBOL(trace_hardirqs_off);
+
+__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
+{
+ if (!this_cpu_read(tracing_irq_cpu))
+ return;
+
+ trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
+ this_cpu_write(tracing_irq_cpu, 0);
+}
+EXPORT_SYMBOL(trace_hardirqs_on_caller);
+
+__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
+{
+ if (this_cpu_read(tracing_irq_cpu))
+ return;
+
+ this_cpu_write(tracing_irq_cpu, 1);
+ trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
+}
+EXPORT_SYMBOL(trace_hardirqs_off_caller);
+#endif /* CONFIG_TRACE_IRQFLAGS */
+
+#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
+
+void trace_preempt_on(unsigned long a0, unsigned long a1)
+{
+ trace_preempt_enable_rcuidle(a0, a1);
+}
+
+void trace_preempt_off(unsigned long a0, unsigned long a1)
+{
+ trace_preempt_disable_rcuidle(a0, a1);
+}
+#endif
--
2.18.0.345.g5c9ce644c3-goog


2018-07-30 22:26:15

by Joel Fernandes

[permalink] [raw]
Subject: [PATCH v12 2/3] tracepoint: Make rcuidle tracepoint callers use SRCU

From: "Joel Fernandes (Google)" <[email protected]>

In recent tests with IRQ on/off tracepoints, a large performance
overhead of ~10% was noticed when running hackbench. This was root caused
to the calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
tracepoint code. Following a long discussion on the list [1] about this,
we concluded that SRCU is a better alternative for use during rcu idle.
Although it does involve extra barriers, it is lighter than the sched-rcu
version, which has to make additional RCU calls to notify RCU, while idle,
about entry into RCU read-side sections.

In this patch, we change the underlying implementation of the
trace_*_rcuidle API to use SRCU. This has been shown to improve performance
a lot for the high-frequency irq enable/disable tracepoints.
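
Conceptually, the rcuidle variant now brackets the probe calls in an SRCU
read-side critical section instead of waking RCU up from idle. A simplified
sketch of the new path (the real change is the __DO_TRACE() hunk in the diff
below):

/* Simplified sketch of the rcuidle path, condensed from __DO_TRACE(). */
int idx = 0;

preempt_disable_notrace();      /* keep srcu and sched-rcu usage consistent */
if (rcuidle)                    /* sched-rcu doesn't work from the idle path */
        idx = srcu_read_lock_notrace(&tracepoint_srcu);

it_func_ptr = rcu_dereference_raw((tp)->funcs);
/* ... iterate it_func_ptr and call each probe ... */

if (rcuidle)
        srcu_read_unlock_notrace(&tracepoint_srcu, idx);
preempt_enable_notrace();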

Test: Tested idle and preempt/irq tracepoints.

Here are some performance numbers:

With a run of the following 30 times on a single core x86 Qemu instance
with 1GB memory:
hackbench -g 4 -f 2 -l 3000

Completion times in seconds. CONFIG_PROVE_LOCKING=y.

No patches (without this series)
Mean: 3.048
Median: 3.025
Std Dev: 0.064

With Lockdep using irq tracepoints with RCU implementation:
Mean: 3.451 (-11.66 %)
Median: 3.447 (-12.22%)
Std Dev: 0.049

With Lockdep using irq tracepoints with SRCU implementation (this series):
Mean: 3.020 (I would consider the improvement against the "without
this series" case as just noise).
Median: 3.013
Std Dev: 0.033

[1] https://patchwork.kernel.org/patch/10344297/

[remove rcu_read_lock_sched_notrace as it's the equivalent of
preempt_disable_notrace and is unnecessary to call in tracepoint code]
Cleaned-up-by: Peter Zijlstra <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
include/linux/tracepoint.h | 41 ++++++++++++++++++++++++++++++--------
kernel/tracepoint.c | 16 ++++++++++++++-
2 files changed, 48 insertions(+), 9 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 19a690b559ca..6e7bc6ebfcd8 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -15,6 +15,7 @@
*/

#include <linux/smp.h>
+#include <linux/srcu.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/cpumask.h>
@@ -33,6 +34,8 @@ struct trace_eval_map {

#define TRACEPOINT_DEFAULT_PRIO 10

+extern struct srcu_struct tracepoint_srcu;
+
extern int
tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
extern int
@@ -75,10 +78,16 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
* probe unregistration and the end of module exit to make sure there is no
* caller executing a probe when it is freed.
*/
+#ifdef CONFIG_TRACEPOINTS
static inline void tracepoint_synchronize_unregister(void)
{
+ synchronize_srcu(&tracepoint_srcu);
synchronize_sched();
}
+#else
+static inline void tracepoint_synchronize_unregister(void)
+{ }
+#endif

#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
extern int syscall_regfunc(void);
@@ -129,18 +138,32 @@ extern void syscall_unregfunc(void);
* as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
* "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
*/
-#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
+#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
do { \
struct tracepoint_func *it_func_ptr; \
void *it_func; \
void *__data; \
+ int __maybe_unused idx = 0; \
\
if (!(cond)) \
return; \
- if (rcucheck) \
- rcu_irq_enter_irqson(); \
- rcu_read_lock_sched_notrace(); \
- it_func_ptr = rcu_dereference_sched((tp)->funcs); \
+ \
+ /* srcu can't be used from NMI */ \
+ if (rcuidle && in_nmi()) \
+ WARN_ON_ONCE(1); \
+ \
+ /* keep srcu and sched-rcu usage consistent */ \
+ preempt_disable_notrace(); \
+ \
+ /* \
+ * For rcuidle callers, use srcu since sched-rcu \
+ * doesn't work from the idle path. \
+ */ \
+ if (rcuidle) \
+ idx = srcu_read_lock_notrace(&tracepoint_srcu); \
+ \
+ it_func_ptr = rcu_dereference_raw((tp)->funcs); \
+ \
if (it_func_ptr) { \
do { \
it_func = (it_func_ptr)->func; \
@@ -148,9 +171,11 @@ extern void syscall_unregfunc(void);
((void(*)(proto))(it_func))(args); \
} while ((++it_func_ptr)->func); \
} \
- rcu_read_unlock_sched_notrace(); \
- if (rcucheck) \
- rcu_irq_exit_irqson(); \
+ \
+ if (rcuidle) \
+ srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
+ \
+ preempt_enable_notrace(); \
} while (0)

#ifndef MODULE
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 6dc6356c3327..955148d91b74 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -31,6 +31,9 @@
extern struct tracepoint * const __start___tracepoints_ptrs[];
extern struct tracepoint * const __stop___tracepoints_ptrs[];

+DEFINE_SRCU(tracepoint_srcu);
+EXPORT_SYMBOL_GPL(tracepoint_srcu);
+
/* Set to 1 to enable tracepoint debug output */
static const int tracepoint_debug;

@@ -67,16 +70,27 @@ static inline void *allocate_probes(int count)
return p == NULL ? NULL : p->probes;
}

-static void rcu_free_old_probes(struct rcu_head *head)
+static void srcu_free_old_probes(struct rcu_head *head)
{
kfree(container_of(head, struct tp_probes, rcu));
}

+static void rcu_free_old_probes(struct rcu_head *head)
+{
+ call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+}
+
static inline void release_probes(struct tracepoint_func *old)
{
if (old) {
struct tp_probes *tp_probes = container_of(old,
struct tp_probes, probes[0]);
+ /*
+ * Tracepoint probes are protected by both sched RCU and SRCU,
+ * by calling the SRCU callback in the sched RCU callback we
+ * cover both cases. So let us chain the SRCU and sched RCU
+ * callbacks to wait for both grace periods.
+ */
call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
}
}
--
2.18.0.345.g5c9ce644c3-goog


2018-07-30 23:11:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 2/3] tracepoint: Make rcuidle tracepoint callers use SRCU

On Mon, 30 Jul 2018 15:24:22 -0700
Joel Fernandes <[email protected]> wrote:

> + \
> + /* srcu can't be used from NMI */ \
> + if (rcuidle && in_nmi()) \
> + WARN_ON_ONCE(1); \
> + \

Heh, you still kept this. I'll fix it again.

-- Steve

2018-08-02 14:56:58

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

Hi Joel,

I found this caused several issues when testing ftrace.

#1) ftrace boottest (FTRACE_STARTUP_TEST) fails
#2) mmiotrace reports "IRQs not enabled as expected" error
#3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)
#4) ftracetest test.d/event/toplevel-enable.tc causes "suspicious RCU usage" warning

#1-#3 were resolved if I reverted this [3/3] patch.
#4 is resolved if I reverted this [2/3] patch.

See attached logs for details.
I also attached my kernel .config.
I guess the first one comes from PREEMPTIRQ_TRACEPOINTS=n,
so it should also disable the preemptirq tracer.
But we should look into #2-#4.
Could you help us solve these issues?

Thank you,


On Mon, 30 Jul 2018 15:24:20 -0700
Joel Fernandes <[email protected]> wrote:

> From: "Joel Fernandes (Google)" <[email protected]>
>
> This series contains the last 2 patches with minor changes suggested by Peter
> and Steven, and an additional clean up of get_lock_stats as suggested by Peter.
>
> The preempt/irq tracepoints exist but not everything in the kernel is using it
> whenever they need to be notified that a preempt disable/enable or an irq
> disable/enable has occurred. This makes things not work simultaneously (for
> example, only either lockdep or irqsoff trace-events can be used at a time).
>
> This is particularly painful to deal with, since turning on lockdep breaks
> tracers that install probes on IRQ events, such as the BCC atomic critical
> section tracer [1]. This constraint also makes it not possible to use synthetic
> events to trace irqsoff sections with lockdep simulataneously turned on.
>
> This series solves that, and also results in a nice clean up of relevant parts
> of the kernel. Several ifdefs are simpler, and the design is more unified and
> better. Also as a result of this, we also speeded performance all rcuidle
> tracepoints since their handling is simpler.
>
> [1] https://github.com/iovisor/bcc/pull/1801/
>
> v11->v12:
> - minor corrections to changelog
> - Added PeterZ's Acks
> - Squashed in fix for get_lock_stats
>
> v10->v11:
> - Dropped extra unneeded else statement since
> rcu_read_lock_sched_notrace is same as prempt_disable_notrace (PeterZ)
>
> v9->v10:
> - Dropped first 3 and last 2 patches that were applied previously
> - Folded SPDK license into the main patch introducing trace_preemptirq.c (Steve)
> - Dropped lockdep_recursing & use simplify get_cpu_var instead (PeterZ)
> - Simplify __DO_TRACE and use rcu_dereference_raw for both RCU and SRCU (PeterZ)
>
> v8->v9:
> - Small style changes to tracepoint code (Mathieu)
> - Minor style fix to use PTR_ERR_OR_ZERO (0-day bot)
> - Minor fix to test_atomic_sections to use unsigned long.
> - Added Namhyung's, Mathieu's Reviewed-by to some patches.
>
> v7->v8:
> - Refactored irqsoff tracer probe defines (Namhyung)
>
> v6->v7:
> - Added a module to simulate an atomic section, a kselftest to load and
> and trigger it which verifies the preempt-tracer and this series.
>
> - Fixed a new warning after I rebased in early boot, this is because
> early_boot_irqs_disabled was set too early, I moved it after the lockdep
> initialization.
>
> - added back the softirq fix since it appears it wasn't picked up.
>
> - Ran Ingo's locking API selftest suite which are passing with this
> series.
>
> - Mathieu suggested ifdef'ing the tracepoint_synchronize_unregister
> function incase tracepoints aren't enabled, did that.
>
> Joel Fernandes (Google) (3):
> lockdep: use this_cpu_ptr instead of get_cpu_var stats
> tracepoint: Make rcuidle tracepoint callers use SRCU
> tracing: Centralize preemptirq tracepoints and unify their usage
>
> include/linux/ftrace.h | 11 +-
> include/linux/irqflags.h | 11 +-
> include/linux/lockdep.h | 8 +-
> include/linux/preempt.h | 2 +-
> include/linux/tracepoint.h | 41 ++++--
> include/trace/events/preemptirq.h | 23 +--
> init/main.c | 5 +-
> kernel/locking/lockdep.c | 45 ++----
> kernel/sched/core.c | 2 +-
> kernel/trace/Kconfig | 22 ++-
> kernel/trace/Makefile | 2 +-
> kernel/trace/trace_irqsoff.c | 231 ++++++++----------------------
> kernel/trace/trace_preemptirq.c | 72 ++++++++++
> kernel/tracepoint.c | 16 ++-
> 14 files changed, 244 insertions(+), 247 deletions(-)
> create mode 100644 kernel/trace/trace_preemptirq.c
>
> --
> 2.18.0.345.g5c9ce644c3-goog


--
Masami Hiramatsu <[email protected]>


Attachments:
bug_no1_boot_test.txt (2.32 kB)
bug_no2_mmiotrace.txt (3.17 kB)
bug_no3_lock_event_boottest.txt (14.27 kB)
bug_no4_toplevel-kselftest.txt (7.13 kB)
.config (74.27 kB)

2018-08-03 02:58:18

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

Hi Masami,

On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <[email protected]> wrote:
> Hi Joel,
>
> I found this caused several issues when testing ftrace.
>
> #1) ftrace boottest (FTRACE_STARTUP_TEST) fails

This sadly appears to be a real issue. The startup test for
"preemptirqsoff" tracer fails, however it passes for only preemptoff
or only irqsoff. I tested only the last 2 tracers, not the first one,
that's why I didn't catch it. I need to debug this more.

> #2) mmiotrace reports "IRQs not enabled as expected" error
> #3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)

Could you try the below patch and let me know if you still see the
issue? In the v11 I removed the lockdep_recursing() check since it
seemed unnecessary. But I'd like to rule out that there's still some
issue lurking there. Thanks and I appreciate your help, diff is
attached to this email.

> #4) ftracetest test.d/event/toplevel-enable.tc causes "suspicious RCU usage" warning

I sent a patch fixing this with you on CC. I tested that it fixes the
issue you're reporting.


Attachments:
fixup-for-patch-3-3.diff (1.28 kB)

2018-08-03 07:25:29

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

Hi Joel,

Thank you for trying to fix that.

On Thu, 2 Aug 2018 19:57:09 -0700
Joel Fernandes <[email protected]> wrote:

> Hi Masami,
>
> On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <[email protected]> wrote:
> > Hi Joel,
> >
> > I found this caused several issues when testing ftrace.
> >
> > #1) ftrace boottest (FTRACE_STARTUP_TEST) fails
>
> This sadly appears to be a real issue. The startup test for
> "preemptirqsoff" tracer fails, however it passes for only preemptoff
> or only irqsoff. I tested only the last 2 tracers, not the first one,
> that's why I didn't catch it. I need to debug this more.
>
> > #2) mmiotrace reports "IRQs not enabled as expected" error
> > #3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)
>
> Could you try the below patch and let me know if you still see the
> issue? In the v11 I removed the lockdep_recursing() check since it
> seemed unnecessary. But I'd like to rule out that there's still some
> issue lurking there. Thanks and I appreciate your help, diff is
> attached to this email.

I tried the attached patch. But unfortunately I still see #1 and #2.
(#3 does not always occur, I'll continue to test)

> > #4) ftracetest test.d/event/toplevel-enable.tc causes "suspicious RCU usage" warning
>
> I sent a patch fixing this with you on CC. I tested that it fixes the
> issue you're reporting.

OK, I'll check it.

Thanks!

--
Masami Hiramatsu <[email protected]>

2018-08-03 07:36:09

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Thu, 2 Aug 2018 19:57:09 -0700
Joel Fernandes <[email protected]> wrote:

> Hi Masami,
>
> On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <[email protected]> wrote:
> > Hi Joel,
> >
> > I found this caused several issues when testing ftrace.
> >
> > #1) ftrace boottest (FTRACE_STARTUP_TEST) fails
>
> This sadly appears to be a real issue. The startup test for
> "preemptirqsoff" tracer fails, however it passes for only preemptoff
> or only irqsoff. I tested only the last 2 tracers, not the first one,
> that's why I didn't catch it. I need to debug this more.

I found that this does not always happen, but it happens 4 times out of 5.

Thank you,

--
Masami Hiramatsu <[email protected]>

2018-08-04 04:53:09

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

Hi Masami,

On Fri, Aug 3, 2018 at 12:23 AM, Masami Hiramatsu <[email protected]> wrote:
> Hi Joel,
>
> Thank you for trying to fix that.
>
> On Thu, 2 Aug 2018 19:57:09 -0700
> Joel Fernandes <[email protected]> wrote:
>
>> Hi Masami,
>>
>> On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <[email protected]> wrote:
>> > Hi Joel,
>> >
>> > I found this caused several issues when testing ftrace.
>> >
>> > #1) ftrace boottest (FTRACE_STARTUP_TEST) fails
>>
>> This sadly appears to be a real issue. The startup test for
>> "preemptirqsoff" tracer fails, however it passes for only preemptoff
>> or only irqsoff. I tested only the last 2 tracers, not the first one,
>> that's why I didn't catch it. I need to debug this more.
>>
>> > #2) mmiotrace reports "IRQs not enabled as expected" error
>> > #3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)
>>
>> Could you try the below patch and let me know if you still see the
>> issue? In the v11 I removed the lockdep_recursing() check since it
>> seemed unnecessary. But I'd like to rule out that there's still some
>> issue lurking there. Thanks and I appreciate your help, diff is
>> attached to this email.
>
> I tried the attached patch. But unfortunately I still see #1 and #2.
> (#3 does not always occurred, I'll continue to test)
>

I found that #2 is because of the interaction between CPU hotplug and
tracepoints. I can reproduce the same issue if I manually offline a CPU with:
echo 0 > /sys/devices/system/cpu/cpuX/online

mmiotrace also does the same thing.

It's because the tracepoints become inactive when we hotplug. This
prevents the lockdep probes that were registered from getting called.

Basically the cond parameter in DECLARE_TRACE is whether the cpu is online:
#define DECLARE_TRACE(name, proto, args)                        \
        __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),      \
                        cpu_online(raw_smp_processor_id()),     \
                        PARAMS(void *__data, proto),            \
                        PARAMS(__data, args))

which prevents the probes from getting called.

Steve, any ideas on how we can fix this? Basically we need the
tracepoints to work until the CPU is really offline... Could you also
let me know why we need to check if CPU is online or not before
calling the tracepoint probes?

I'll try to look into #1 and #2 tomorrow as well (I'm on vacation +
weekend, so time is a bit short, but this is quite important so I'll
make time). Thanks!

- Joel

2018-08-05 16:48:00

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

Hi Masami,

On Fri, Aug 3, 2018 at 9:51 PM, Joel Fernandes <[email protected]> wrote:
[...]
>> On Thu, 2 Aug 2018 19:57:09 -0700
>> Joel Fernandes <[email protected]> wrote:
>>
>>> Hi Masami,
>>>
>>> On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <[email protected]> wrote:
>>> > Hi Joel,
>>> >
>>> > I found this caused several issues when testing ftrace.
>>> >
>>> > #1) ftrace boottest (FTRACE_STARTUP_TEST) fails
>>>
>>> This sadly appears to be a real issue. The startup test for
>>> "preemptirqsoff" tracer fails, however it passes for only preemptoff
>>> or only irqsoff. I tested only the last 2 tracers, not the first one,
>>> that's why I didn't catch it. I need to debug this more.

I figured out this one too. It's because I need to account for
preempt_count() in tracer_hardirqs_off, since the tracer probe is now
called with an additional level of preempt disabled from the
tracepoint code. Without that accounting, stop_critical_timings may
not be called, causing an empty trace buffer. That should be easy to
fix; I'm on vacation though and back on the 13th, so I can most likely
look at it only then (the week after the next).

>>> > #2) mmiotrace reports "IRQs not enabled as expected" error
>>> > #3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)

The only thing left to figure out is #3 ("lock subsystem event
boottest"). Could you let me know how to run this test?

Thanks for reporting these,

- Joel

2018-08-06 02:08:39

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Sun, 5 Aug 2018 09:46:56 -0700
Joel Fernandes <[email protected]> wrote:

> Hi Masami,
>
> On Fri, Aug 3, 2018 at 9:51 PM, Joel Fernandes <[email protected]> wrote:
> [...]
> >> On Thu, 2 Aug 2018 19:57:09 -0700
> >> Joel Fernandes <[email protected]> wrote:
> >>
> >>> Hi Masami,
> >>>
> >>> On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <[email protected]> wrote:
> >>> > Hi Joel,
> >>> >
> >>> > I found this caused several issues when testing ftrace.
> >>> >
> >>> > #1) ftrace boottest (FTRACE_STARTUP_TEST) fails
> >>>
> >>> This sadly appears to be a real issue. The startup test for
> >>> "preemptirqsoff" tracer fails, however it passes for only preemptoff
> >>> or only irqsoff. I tested only the last 2 tracers, not the first one,
> >>> that's why I didn't catch it. I need to debug this more.
>
> I figured out this one too. Its because I need to account for
> preempt_count() in tracer_hardirqs_off since the tracer probe is now
> called with an additional level of preempt disabled from the
> tracepoint code. Without that accounting, stop_critical_timings may
> not be called causing an empty trace buffer. That should be easy to
> fix, I'm on vacation though and back on 13th so can most likely look
> at it only then (the week after the next).

Nice! Thank you for identifying the root cause!
BTW, I'm also on vacation this week :)

> >>> > #2) mmiotrace reports "IRQs not enabled as expected" error
> >>> > #3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)
>
> The only thing left to figure out is #3 ("lock subsystem event
> boottest"). Could you let me know how to run this test?

The #3 issue is not always reproducible; I just enabled these 2 configs

CONFIG_FTRACE_STARTUP_TEST=y and CONFIG_EVENT_TRACE_TEST_SYSCALLS=y

and booted on QEMU. I have seen this issue on about 10% of system boots.

Thank you,


--
Masami Hiramatsu <[email protected]>

2018-08-06 15:25:56

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Sun, Aug 5, 2018 at 7:07 PM, Masami Hiramatsu <[email protected]> wrote:
> On Sun, 5 Aug 2018 09:46:56 -0700
> Joel Fernandes <[email protected]> wrote:
>
>> Hi Masami,
>>
>> On Fri, Aug 3, 2018 at 9:51 PM, Joel Fernandes <[email protected]> wrote:
>> [...]
>> >> On Thu, 2 Aug 2018 19:57:09 -0700
>> >> Joel Fernandes <[email protected]> wrote:
>> >>
>> >>> Hi Masami,
>> >>>
>> >>> On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <[email protected]> wrote:
>> >>> > Hi Joel,
>> >>> >
>> >>> > I found this caused several issues when testing ftrace.
>> >>> >
>> >>> > #1) ftrace boottest (FTRACE_STARTUP_TEST) fails
>> >>>
>> >>> This sadly appears to be a real issue. The startup test for
>> >>> "preemptirqsoff" tracer fails, however it passes for only preemptoff
>> >>> or only irqsoff. I tested only the last 2 tracers, not the first one,
>> >>> that's why I didn't catch it. I need to debug this more.
>>
>> I figured out this one too. Its because I need to account for
>> preempt_count() in tracer_hardirqs_off since the tracer probe is now
>> called with an additional level of preempt disabled from the
>> tracepoint code. Without that accounting, stop_critical_timings may
>> not be called causing an empty trace buffer. That should be easy to
>> fix, I'm on vacation though and back on 13th so can most likely look
>> at it only then (the week after the next).
>
> Nice! Thank you for identifying the root cause!

No problem :) Thanks a lot for your help!

>> >>> > #2) mmiotrace reports "IRQs not enabled as expected" error

I fixed #2 too now; the attached patch based on ftrace/core should fix
it (sorry, I couldn't post the patch inline because I'm not at a sane
email client right now).

The patch is still in RFC shape, but I verified it fixes the issue.

Basically, I do this in the patch:
- switch to using SRCU for all tracepoints, and use a separate SRCU node
for NMI context (Paul, Mathieu, would love to get your thoughts here).
This is to remove the dependence on sched-RCU. A rough sketch of the
idea is below.
- use tracepoints even after CPUs are offline
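
A rough, purely hypothetical sketch of the SRCU-node selection idea (the
tracepoint_srcu_nmi name is illustrative only; the attached patch is
authoritative):

/* Hypothetical sketch: pick a separate SRCU domain in NMI context so
 * tracepoints no longer depend on sched-RCU and can also fire from NMIs.
 * The tracepoint_srcu_nmi name is not taken from the patch text.
 */
struct srcu_struct *ss = in_nmi() ? &tracepoint_srcu_nmi : &tracepoint_srcu;
int idx = srcu_read_lock_notrace(ss);
/* ... call the tracepoint probes ... */
srcu_read_unlock_notrace(ss, idx);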

>> >>> > #3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)
>> The only thing left to figure out is #3 ("lock subsystem event
>> boottest"). Could you let me know how to run this test?
>
> The #3 is not always reproducible, I just enabled these 2 configs
>
> CONFIG_FTRACE_STARTUP_TEST=y and CONFIG_EVENT_TRACE_TEST_SYSCALLS=y

Thanks, I am also able to see this now some of the time. I have
certain builds where this is much faster to reproduce, so I'll use
those.

- Joel


Attachments:
0001-tracepoint-Run-tracepoints-even-after-CPU-is-offline.patch (5.98 kB)

2018-08-06 21:23:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage


With this patch applied, I'm constantly getting lockdep errors. Instead
of doing a full revert of the patch, I did this, which makes all those
errors go away. I may apply this for now, and we can revisit having
lockdep use the tracepoint code. But since it's currently always
enabled, I'm thinking of just leaving this as is. The macros are still
clean from Joel's patch.

Thoughts?

-- Steve

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 50edb9cbbd26..a93476c6c954 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -19,6 +19,8 @@
#ifdef CONFIG_PROVE_LOCKING
extern void trace_softirqs_on(unsigned long ip);
extern void trace_softirqs_off(unsigned long ip);
+ extern void lockdep_hardirqs_on(unsigned long ip);
+ extern void lockdep_hardirqs_off(unsigned long ip);
#else
# define trace_softirqs_on(ip) do { } while (0)
# define trace_softirqs_off(ip) do { } while (0)
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index a8113357ceeb..30d0eb857f2e 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -267,7 +267,6 @@ struct held_lock {
* Initialization, self-test and debugging-output methods:
*/
extern void lockdep_init(void);
-extern void lockdep_init_early(void);
extern void lockdep_reset(void);
extern void lockdep_reset_lock(struct lockdep_map *lock);
extern void lockdep_free_key_range(void *start, unsigned long size);
@@ -408,7 +407,6 @@ static inline void lockdep_on(void)
# define lock_set_class(l, n, k, s, i) do { } while (0)
# define lock_set_subclass(l, s, i) do { } while (0)
# define lockdep_init() do { } while (0)
-# define lockdep_init_early() do { } while (0)
# define lockdep_init_map(lock, name, key, sub) \
do { (void)(name); (void)(key); } while (0)
# define lockdep_set_class(lock, key) do { (void)(key); } while (0)
diff --git a/init/main.c b/init/main.c
index 44fe43be84c1..5d42e577643a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -649,8 +649,6 @@ asmlinkage __visible void __init start_kernel(void)
call_function_init();
WARN(!irqs_disabled(), "Interrupts were enabled early\n");

- lockdep_init_early();
-
early_boot_irqs_disabled = false;
local_irq_enable();

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 03bfaeb9f4e6..e406c5fdb41e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2840,8 +2840,7 @@ static void __trace_hardirqs_on_caller(unsigned long ip)
debug_atomic_inc(hardirqs_on_events);
}

-static void lockdep_hardirqs_on(void *none, unsigned long ignore,
- unsigned long ip)
+void lockdep_hardirqs_on(unsigned long ip)
{
if (unlikely(!debug_locks || current->lockdep_recursion))
return;
@@ -2885,8 +2884,7 @@ static void lockdep_hardirqs_on(void *none, unsigned long ignore,
/*
* Hardirqs were disabled:
*/
-static void lockdep_hardirqs_off(void *none, unsigned long ignore,
- unsigned long ip)
+void lockdep_hardirqs_off(unsigned long ip)
{
struct task_struct *curr = current;

@@ -4315,14 +4313,6 @@ void lockdep_reset_lock(struct lockdep_map *lock)
raw_local_irq_restore(flags);
}

-void __init lockdep_init_early(void)
-{
-#ifdef CONFIG_PROVE_LOCKING
- register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
- register_trace_prio_irq_enable(lockdep_hardirqs_on, NULL, INT_MIN);
-#endif
-}
-
void __init lockdep_init(void)
{
printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n");
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index e76b78bf258e..8326adcab0dc 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -20,40 +20,52 @@ static DEFINE_PER_CPU(int, tracing_irq_cpu);
void trace_hardirqs_on(void)
{
if (!this_cpu_read(tracing_irq_cpu))
- return;
+ goto out;

trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
this_cpu_write(tracing_irq_cpu, 0);
+
+ out:
+ lockdep_hardirqs_on(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_on);

void trace_hardirqs_off(void)
{
if (this_cpu_read(tracing_irq_cpu))
- return;
+ goto out;

this_cpu_write(tracing_irq_cpu, 1);
trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+
+ out:
+ lockdep_hardirqs_off(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);

__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
{
if (!this_cpu_read(tracing_irq_cpu))
- return;
+ goto out;

trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
this_cpu_write(tracing_irq_cpu, 0);
+
+ out:
+ lockdep_hardirqs_on(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_on_caller);

__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
{
if (this_cpu_read(tracing_irq_cpu))
- return;
+ goto out;

this_cpu_write(tracing_irq_cpu, 1);
trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
+
+ out:
+ lockdep_hardirqs_off(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off_caller);
#endif /* CONFIG_TRACE_IRQFLAGS */

2018-08-07 00:48:57

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Mon, Aug 6, 2018 at 12:50 PM, Steven Rostedt <[email protected]> wrote:
>
> With this patch applied, I'm constantly getting lockdep errors. Instead
> of doing a full revert of the patch, I did this, which makes all those
> errors go away. I may apply this for now, and we can revisit having
> lockdep use the tracepoint code. But since it's currently always
> enabled, I'm thinking of just leaving this as is. The macros are still
> clean from Joel's patch.
>
> Thoughts?

I like your patch. Thanks a lot for doing this. It keeps most of the
benefits of my patch while avoiding the issues with lockdep. I agree
we can look at the remaining lockdep issue afterwards. There were several
lockdep issues with this patch that I fixed over the months, but
there's still the one that Masami reported that I believe you're also
seeing. Once I'm back I'll work on figuring that one out.

Could you pull in the fixes to the other issues I posted though? With
that we should be good.
https://lore.kernel.org/patchwork/patch/971104/
https://lore.kernel.org/patchwork/patch/971829/

thanks,

- Joel

2018-08-07 01:48:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Mon, 6 Aug 2018 17:43:19 -0700
Joel Fernandes <[email protected]> wrote:

> On Mon, Aug 6, 2018 at 12:50 PM, Steven Rostedt <[email protected]> wrote:
> >
> > With this patch applied, I'm constantly getting lockdep errors. Instead
> > of doing a full revert of the patch, I did this, which makes all those
> > errors go away. I may apply this for now, and we can revisit having
> > lockdep use the tracepoint code. But since it's currently always
> > enabled, I'm thinking of just leaving this as is. The macros are still
> > clean from Joel's patch.
> >
> > Thoughts?
>
> I like your patch. Thanks a lot for doing this.. It keeps most of the
> benefits of my patch while avoiding the issues with lockdep. I agree
> we can look at the remaining lockdep issue after. There were several
> lockdep issues with this patch that I fixed over the the months, but
> there's still the one that Masami reported that I believe you're also
> seeing. Once I'm back I'll work on figuring that one out.
>
> Could you pull in the fixes to the other issues I posted though? With
> that we should be good.
> https://lore.kernel.org/patchwork/patch/971104/
> https://lore.kernel.org/patchwork/patch/971829/
>

I already had these applied when I created this patch ;-)

Thanks, I'll add it.

-- Steve

2018-08-07 14:35:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 07 Aug 2018 10:10:59 -0400
Joel Fernandes <[email protected]> wrote:

> On August 7, 2018 9:49:54 AM EDT, Steven Rostedt <[email protected]> wrote:
> >On Tue, 7 Aug 2018 06:33:35 -0700
> >Joel Fernandes <[email protected]> wrote:
> >
> >> Thanks, also one more thing I noticed in your patch,
> >> lockdep_hardirqs_off needs to be called before all other probes but
> >> you're calling it after. This is why I registered it with INT_MAX:
> >>
> >> register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
> >>
> >> Without it you may get annotation warnings. Thanks,
> >
> >Interesting. I was following the old way where we called the tracing
> >code before calling the lockdep code (all hard coded and not from
> >trace events). Is this have something to do with calling the code from
> >a tracepoint?
> >
> >Do you have an example that could trigger the warning?
> >
>
> I remember the warnings but can't remember now how I triggered them.
> I think I saw them with the irqsoff tracer or irq trace events
> running, with lockdep turned on.

I'll see if I can trigger it. I'll run this all through my tests.

>
> Also an irq disable probe that does a lockdep assert that irqs are
> disabled could trigger it?
>

You mean if someone adds a tracepoint callback to the irq disable
tracepoint, and does a lockdep assert to make sure interrupts are
disabled?
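
Such a callback could be as simple as the following (hypothetical example;
the probe itself is made up, register_trace_irq_disable() is the
registration helper generated for the irq_disable tracepoint, and
lockdep_assert_irqs_disabled() is lockdep's existing assert):

/* Hypothetical probe: assert IRQs are already off when irq_disable fires. */
static void assert_irqs_off_probe(void *data, unsigned long ip,
                                  unsigned long parent_ip)
{
        lockdep_assert_irqs_disabled();
}

static int __init assert_probe_init(void)
{
        /*
         * Registered at default priority: if lockdep_hardirqs_off() is
         * ordered after this probe, lockdep's bookkeeping still says
         * IRQs are enabled when the assert runs, and it warns.
         */
        return register_trace_irq_disable(assert_irqs_off_probe, NULL);
}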

-- Steve


2018-08-07 14:56:01

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage



On August 7, 2018 9:49:54 AM EDT, Steven Rostedt <[email protected]> wrote:
>On Tue, 7 Aug 2018 06:33:35 -0700
>Joel Fernandes <[email protected]> wrote:
>
>> Thanks, also one more thing I noticed in your patch,
>> lockdep_hardirqs_off needs to be called before all other probes but
>> you're calling it after. This is why I registered it with INT_MAX:
>>
>> register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
>>
>> Without it you may get annotation warnings. Thanks,
>
>Interesting. I was following the old way where we called the tracing
>code before calling the lockdep code (all hard coded and not from
>trace events). Is this have something to do with calling the code from
>a tracepoint?
>
>Do you have an example that could trigger the warning?
>

I remember the warnings but can't remember now how I triggered them. I think I saw them with the irqsoff tracer or irq trace events running, with lockdep turned on.

Also an irq disable probe that does a lockdep assert that irqs are disabled could trigger it?

thanks,

- Joel


>-- Steve

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-08-07 15:31:35

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage



On August 7, 2018 11:09:06 AM EDT, Steven Rostedt <[email protected]> wrote:
>On Tue, 07 Aug 2018 10:48:05 -0400
>Joel Fernandes <[email protected]> wrote:
>
>> >You mean if someone add a tracepoint callback to the irq disable
>> >tracepoint, and did a lockdep assert to make sure interrupts are
>> >disabled?
>>
>> Yes that's what I meant.
>
>That sounds like a "Doctor, it hurts me when I do this" problem ;-)

Haha, yes, true. But just to clarify, I didn't do this to see the problem; I noticed it just by turning on existing things. :-) But I see your point...

Thanks,

- Joel

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-08-07 15:57:53

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Mon, Aug 6, 2018 at 6:43 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, 6 Aug 2018 17:43:19 -0700
> Joel Fernandes <[email protected]> wrote:
>
>> On Mon, Aug 6, 2018 at 12:50 PM, Steven Rostedt <[email protected]> wrote:
>> >
>> > With this patch applied, I'm constantly getting lockdep errors. Instead
>> > of doing a full revert of the patch, I did this, which makes all those
>> > errors go away. I may apply this for now, and we can revisit having
>> > lockdep use the tracepoint code. But since it's currently always
>> > enabled, I'm thinking of just leaving this as is. The macros are still
>> > clean from Joel's patch.
>> >
>> > Thoughts?
>>
>> I like your patch. Thanks a lot for doing this.. It keeps most of the
>> benefits of my patch while avoiding the issues with lockdep. I agree
>> we can look at the remaining lockdep issue after. There were several
>> lockdep issues with this patch that I fixed over the the months, but
>> there's still the one that Masami reported that I believe you're also
>> seeing. Once I'm back I'll work on figuring that one out.
>>
>> Could you pull in the fixes to the other issues I posted though? With
>> that we should be good.
>> https://lore.kernel.org/patchwork/patch/971104/
>> https://lore.kernel.org/patchwork/patch/971829/
>>
>
> I already had these applied when I created this patch ;-)
>
> Thanks, I'll add it.

Thanks, also one more thing I noticed in your patch,
lockdep_hardirqs_off needs to be called before all other probes but
you're calling it after. This is why I registered it with INT_MAX:

register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);

Without it you may get annotation warnings. Thanks,
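
For reference, these are the two calls from the lockdep_init_early() hunk
removed by the partial revert above; INT_MAX makes the lockdep probe run
before all other irq_disable probes, and INT_MIN presumably makes the
enable-side probe run after them:

        register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
        register_trace_prio_irq_enable(lockdep_hardirqs_on, NULL, INT_MIN);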

- Joel

2018-08-07 16:05:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 7 Aug 2018 06:33:35 -0700
Joel Fernandes <[email protected]> wrote:

> Thanks, also one more thing I noticed in your patch,
> lockdep_hardirqs_off needs to be called before all other probes but
> you're calling it after. This is why I registered it with INT_MAX:
>
> register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
>
> Without it you may get annotation warnings. Thanks,

Interesting. I was following the old way where we called the tracing
code before calling the lockdep code (all hard coded and not from
trace events). Does this have something to do with calling the code from
a tracepoint?

Do you have an example that could trigger the warning?

-- Steve

2018-08-07 16:34:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 07 Aug 2018 10:48:05 -0400
Joel Fernandes <[email protected]> wrote:

> >You mean if someone add a tracepoint callback to the irq disable
> >tracepoint, and did a lockdep assert to make sure interrupts are
> >disabled?
>
> Yes that's what I meant.

That sounds like a "Doctor, it hurts me when I do this" problem ;-)

-- Steve


2018-08-07 17:31:34

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage



On August 7, 2018 10:34:10 AM EDT, Steven Rostedt <[email protected]> wrote:
>On Tue, 07 Aug 2018 10:10:59 -0400
>Joel Fernandes <[email protected]> wrote:
>
>> On August 7, 2018 9:49:54 AM EDT, Steven Rostedt
><[email protected]> wrote:
>> >On Tue, 7 Aug 2018 06:33:35 -0700
>> >Joel Fernandes <[email protected]> wrote:
>> >
>> >> Thanks, also one more thing I noticed in your patch,
>> >> lockdep_hardirqs_off needs to be called before all other probes
>but
>> >> you're calling it after. This is why I registered it with INT_MAX:
>> >>
>> >> register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL,
>INT_MAX);
>> >>
>> >> Without it you may get annotation warnings. Thanks,
>> >
>> >Interesting. I was following the old way where we called the tracing
>> >code before calling the lockdep code (all hard coded and not from
>> >trace events). Is this have something to do with calling the code
>from
>> >a tracepoint?
>> >
>> >Do you have an example that could trigger the warning?
>> >
>>
>> I remember the warnings but can't remember now how I triggered them.
>> I think I saw them with the irqsoff tracer or irq trace events
>> running, with lockdep turned on.
>
>I'll see if I can trigger it. I'll run this all through my tests.

Ok.

>> Also an irq disable probe that does a lockdep assert that irqs are
>> disabled could trigger it?
>>
>
>You mean if someone add a tracepoint callback to the irq disable
>tracepoint, and did a lockdep assert to make sure interrupts are
>disabled?

Yes that's what I meant.

Thanks,

- Joel

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-08-07 23:46:28

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 07 Aug 2018 11:24:13 -0400
Joel Fernandes <[email protected]> wrote:

> On August 7, 2018 11:09:06 AM EDT, Steven Rostedt <[email protected]> wrote:
> >On Tue, 07 Aug 2018 10:48:05 -0400
> >Joel Fernandes <[email protected]> wrote:
> >
> >> >You mean if someone add a tracepoint callback to the irq disable
> >> >tracepoint, and did a lockdep assert to make sure interrupts are
> >> >disabled?
> >>
> >> Yes that's what I meant.
> >
> >That sounds like a "Doctor, it hurts me when I do this" problem ;-)
>
> Haha, yes true. But just to clarify, I didn't do this to see the problem but noticed it with turning on existing things. :-) but I see your point...
>

OK, I hit this bug, but it's not because of the partial revert. This
bug seems to need another partial revert. I like your movement of
the code, but I'm starting to doubt that we can use a trace event as a
hook for critical areas yet. Well, not until we can use srcu in NMI.

#define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
        do {                                                            \
                struct tracepoint_func *it_func_ptr;                    \
                void *it_func;                                          \
                void *__data;                                           \
                int __maybe_unused idx = 0;                             \
                                                                        \
                if (!(cond))                                            \
                        return;                                         \
                                                                        \
                /* srcu can't be used from NMI */                       \

                WARN_ON_ONCE(rcuidle && in_nmi());      <== WARN_ON_ONCE hit!

WARNING: CPU: 3 PID: 3727 at /work/build/trace/nobackup/linux-test.git/include/trace/events/preemptirq.h:38 trace_irq_disable_rcuidle+0x2a/0x6c
Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt r8169 ppdev parport_pc parport
CPU: 3 PID: 3727 Comm: trace-cmd Not tainted 4.18.0-rc6-test+ #14
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: trace_irq_disable_rcuidle+0x2a/0x6c
Code: e9 01 00 00 00 c3 64 8b 0d 24 e1 50 c1 0f a3 0d e0 f1 3f c1 73 55 55 89 e5 57 56 53 51 64 8b 0d cc 37 51 c1 0f ba e1 14 73 02 <0f> 0b 89 d7 89 c6 b8 e0 d8 2e c1 e8 8e 5b fa ff 8b 1d 9c 27 3d c1
EAX: c0401509 EBX: efa43680 ECX: 80110000 EDX: c0dc81f3
ESI: ed823d44 EDI: efa43680 EBP: ed823cd0 ESP: ed823cc0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010047
CR0: 80050033 CR2: b7f06000 CR3: 30513b00 CR4: 001406f0
Call Trace:
trace_hardirqs_off_caller+0x23/0x2d
trace_hardirqs_off_thunk+0xc/0x10
EIP: default_do_nmi+0x1/0x157


-- Steve

2018-08-07 23:55:24

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage



On August 7, 2018 7:45:15 PM EDT, Steven Rostedt <[email protected]> wrote:
>On Tue, 07 Aug 2018 11:24:13 -0400
>Joel Fernandes <[email protected]> wrote:
>
>> On August 7, 2018 11:09:06 AM EDT, Steven Rostedt
><[email protected]> wrote:
>> >On Tue, 07 Aug 2018 10:48:05 -0400
>> >Joel Fernandes <[email protected]> wrote:
>> >
>> >> >You mean if someone add a tracepoint callback to the irq disable
>> >> >tracepoint, and did a lockdep assert to make sure interrupts are
>> >> >disabled?
>> >>
>> >> Yes that's what I meant.
>> >
>> >That sounds like a "Doctor, it hurts me when I do this" problem ;-)
>
>>
>> Haha, yes true. But just to clarify, I didn't do this to see the
>problem but noticed it with turning on existing things. :-) but I see
>your point...
>>
>
>OK, I hit this bug, but it's not because of the partial revert. This
>bug seems it needs to be another partial revert. I like you movement of
>the code, but I'm starting to doubt that we can use a trace event as a
>hook for critical areas yet. Well, not until we can use srcu in NMI.
>

I sent a patch to use SRCU for all tracepoints including NMI. That patch also removes this warning and fixes the other issue Masami reported (hotplug causing a warning).

If Paul and Mathieu can confirm SRCU works on offline CPUs that would be great.

Is it just this one warning, or is there anything else that makes you feel tracepoints for critical paths aren't feasible?

Thanks.




>#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> do { \
> struct tracepoint_func *it_func_ptr; \
> void *it_func; \
> void *__data; \
> int __maybe_unused idx = 0; \
> \
> if (!(cond)) \
> return; \
> \
> /* srcu can't be used from NMI */ \
>
> WARN_ON_ONCE(rcuidle && in_nmi()); <== WARN_ON_ONCE hit!
>
>WARNING: CPU: 3 PID: 3727 at
>/work/build/trace/nobackup/linux-test.git/include/trace/events/preemptirq.h:38
>trace_irq_disable_rcuidle+0x2a/0x6c
>Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
>nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt r8169 ppdev
>parport_pc parport
>CPU: 3 PID: 3727 Comm: trace-cmd Not tainted 4.18.0-rc6-test+ #14
>Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
>EIP: trace_irq_disable_rcuidle+0x2a/0x6c
>Code: e9 01 00 00 00 c3 64 8b 0d 24 e1 50 c1 0f a3 0d e0 f1 3f c1 73 55
>55 89 e5 57 56 53 51 64 8b 0d cc 37 51 c1 0f ba e1 14 73 02 <0f> 0b 89
>d7 89 c6 b8 e0 d8 2e c1 e8 8e 5b fa ff 8b 1d 9c 27 3d c1
>EAX: c0401509 EBX: efa43680 ECX: 80110000 EDX: c0dc81f3
>ESI: ed823d44 EDI: efa43680 EBP: ed823cd0 ESP: ed823cc0
>DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010047
>CR0: 80050033 CR2: b7f06000 CR3: 30513b00 CR4: 001406f0
>Call Trace:
> trace_hardirqs_off_caller+0x23/0x2d
> trace_hardirqs_off_thunk+0xc/0x10
>EIP: default_do_nmi+0x1/0x157
>
>
>-- Steve

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-08-08 00:49:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 07 Aug 2018 19:54:13 -0400
Joel Fernandes <[email protected]> wrote:


> >OK, I hit this bug, but it's not because of the partial revert. This
> >bug seems it needs to be another partial revert. I like you movement of
> >the code, but I'm starting to doubt that we can use a trace event as a
> >hook for critical areas yet. Well, not until we can use srcu in NMI.
> >
>
> I sent a patch to use srcu for all tracepoints including nmi. That
> patch also removes this warning and fixes the one other issue masami
> reported (hot plug causing a warning).

Is it one I can take, or does Paul have it? I'll need it to continue
with this as is, because I can't pass my tests without triggering that
warning (and that fails the test).

>
> If Paul and Mathieu can confirm SRCU works on offline CPUs that would
> be great.

Yes please.

>
> It's just this one warning or anything else that makes you feel
> tracepoints for critical paths isn't feasible?

Currently I'm down to this, but I can't get past my first test in my
ktest suite, because it fails here.

-- Steve


2018-08-08 01:18:49

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, Aug 7, 2018 at 5:48 PM, Steven Rostedt <[email protected]> wrote:
> On Tue, 07 Aug 2018 19:54:13 -0400
> Joel Fernandes <[email protected]> wrote:
>
>
>> >OK, I hit this bug, but it's not because of the partial revert. This
>> >bug seems it needs to be another partial revert. I like you movement of
>> >the code, but I'm starting to doubt that we can use a trace event as a
>> >hook for critical areas yet. Well, not until we can use srcu in NMI.
>> >
>>
>> I sent a patch to use srcu for all tracepoints including nmi. That
>> patch also removes this warning and fixes the one other issue masami
>> reported (hot plug causing a warning).
>
> Is it one I can take, or does Paul have it? I'll need it to continue
> with this as is, because I can't pass my tests without triggering that
> warning (and that fails the test).

I think you could take it if you're OK with it; it's purely in the
tracing code and I tested it yesterday morning. It's attached to this
email. I verified that it fixes the mmio trace (hotplug related) issue.

>>
>> If Paul and Mathieu can confirm SRCU works on offline CPUs that would
>> be great.
>
> Yes please.
>

BTW, I found this in Paul's RCU Requirements docs: "SRCU is insensitive
to whether or not a CPU is online", so I think it will work.

The paragraph was:
Also unlike other RCU flavors, SRCU's callbacks-wait function
srcu_barrier() may be invoked from CPU-hotplug notifiers,
though this is not necessarily a good idea.
The reason that this is possible is that SRCU is insensitive
to whether or not a CPU is online, which means that srcu_barrier()
need not exclude CPU-hotplug operations.

>>
>> It's just this one warning or anything else that makes you feel
>> tracepoints for critical paths isn't feasible?
>
> Currently I'm down to this, but I can't get past my first test in my
> ktest suite, because it fails here.

Ok hopefully this will get you past that. Sorry for the frustration.


Attachments:
0001-tracepoint-Run-tracepoints-even-after-CPU-is-offline.patch (5.98 kB)

2018-08-08 01:56:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 7 Aug 2018 18:17:42 -0700
Joel Fernandes <[email protected]> wrote:

> From 6986af946ceb04fc9ddc6d5b45fc559b6807e465 Mon Sep 17 00:00:00 2001
> From: "Joel Fernandes (Google)" <[email protected]>
> Date: Sun, 5 Aug 2018 20:17:41 -0700
> Subject: [PATCH] tracepoint: Run tracepoints even after CPU is offline
>
> Commit f37755490fe9 ("tracepoints: Do not trace when cpu is offline")
> causes a problem for lockdep using tracepoint code. Once a CPU is
> offline, tracepoints donot get called, however this causes a big problem
> for lockdep probes that need to run so that IRQ annotations are marked
> correctly.
>
> A race is possible where while the CPU is going offline, an interrupt
> can come in and then a lockdep assert causes an annotation warning:
>
> [ 106.551354] IRQs not enabled as expected
> [ 106.551785] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:982
> tick_nohz_idle_enter+0x99/0xb0
> [ 106.552964] Modules linked in:
> [ 106.553299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W
>
> We need tracepoints to run as late as possible. This commit fixes the
> issue by removing the cpu_online check in tracepoint code that was
> introduced in the mentioned commit, however we now switch to using SRCU
> for all tracepoints and special handle calling tracepoints from NMI so
> that we don't run into issues that result from using sched-RCU when the
> CPUs are marked to be offline.
>
> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and
> unify their usage")
> Reported-by: Masami Hiramatsu <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>


The above change log doesn't look like it matches the NMI patch.

Can you resend with just the NMI changes? I already handled the cpu
offline ones.

But I may still have concerns with this patch.

---
> include/linux/tracepoint.h | 27 +++++++++++----------------
> kernel/tracepoint.c | 10 ++++++----
> 2 files changed, 17 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index d9a084c72541..5733502bb3ce 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -35,6 +35,7 @@ struct trace_eval_map {
> #define TRACEPOINT_DEFAULT_PRIO 10
>
> extern struct srcu_struct tracepoint_srcu;
> +extern struct srcu_struct tracepoint_srcu_nmi;
>
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> @@ -144,13 +145,11 @@ extern void syscall_unregfunc(void);
> void *it_func; \
> void *__data; \
> int __maybe_unused idx = 0; \
> + struct srcu_struct *ss; \
> \
> if (!(cond)) \
> return; \
> \
> - /* srcu can't be used from NMI */ \
> - WARN_ON_ONCE(rcuidle && in_nmi()); \
> - \
> /* keep srcu and sched-rcu usage consistent */ \
> preempt_disable_notrace(); \
> \
> @@ -159,7 +158,11 @@ extern void syscall_unregfunc(void);
> * doesn't work from the idle path. \
> */ \
> if (rcuidle) \
> - idx = srcu_read_lock_notrace(&tracepoint_srcu); \
> + ss = &tracepoint_srcu_nmi; \
> + else \
> + ss = &tracepoint_srcu; \
> + \
> + idx = srcu_read_lock_notrace(ss); \
> \
> it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> \
> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
> } while ((++it_func_ptr)->func); \
> } \
> \
> - if (rcuidle) \
> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
> + srcu_read_unlock_notrace(ss, idx); \

Hmm, why do we have the two different srcu handles?

Thinking about this, if this can be called by the "thunk" code, then
there might be an issue. I think the "thunk" code can be called before
in_nmi() is set, so we don't know if we are in an NMI or not. I need to
look at that code to make sure. If in_nmi() still works, then we should
use the _nmi srcu handle (if that's required).

But I'm not sure using SRCU for all tracepoints is needed at this
moment. I'll have to look deeper into this tomorrow. But it's getting
close to the merge window, and this needs to be settled quick. Another
"partial revert" may be needed until this gets settled.

-- Steve

> \
> preempt_enable_notrace(); \
> } while (0)
> @@ -212,11 +214,6 @@ extern void syscall_unregfunc(void);
> TP_PROTO(data_proto), \
> TP_ARGS(data_args), \
> TP_CONDITION(cond), 0); \
> - if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> - rcu_read_lock_sched_notrace(); \
> - rcu_dereference_sched(__tracepoint_##name.funcs);\
> - rcu_read_unlock_sched_notrace(); \
> - } \
> } \
> __DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
> PARAMS(cond), PARAMS(data_proto), PARAMS(data_args)) \
> @@ -365,19 +362,17 @@ extern void syscall_unregfunc(void);
> * "void *__data, proto" as the callback prototype.
> */
> #define DECLARE_TRACE_NOARGS(name) \
> - __DECLARE_TRACE(name, void, , \
> - cpu_online(raw_smp_processor_id()), \
> + __DECLARE_TRACE(name, void, , 1, \
> void *__data, __data)
>
> #define DECLARE_TRACE(name, proto, args) \
> - __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
> - cpu_online(raw_smp_processor_id()), \
> + __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), 1, \
> PARAMS(void *__data, proto), \
> PARAMS(__data, args))
>
> #define DECLARE_TRACE_CONDITION(name, proto, args, cond) \
> __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
> - cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
> + PARAMS(cond), \
> PARAMS(void *__data, proto), \
> PARAMS(__data, args))
>
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 955148d91b74..769d74b2f90e 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -32,7 +32,9 @@ extern struct tracepoint * const __start___tracepoints_ptrs[];
> extern struct tracepoint * const __stop___tracepoints_ptrs[];
>
> DEFINE_SRCU(tracepoint_srcu);
> +DEFINE_SRCU(tracepoint_srcu_nmi);
> EXPORT_SYMBOL_GPL(tracepoint_srcu);
> +EXPORT_SYMBOL_GPL(tracepoint_srcu_nmi);
>
> /* Set to 1 to enable tracepoint debug output */
> static const int tracepoint_debug;
> @@ -70,14 +72,14 @@ static inline void *allocate_probes(int count)
> return p == NULL ? NULL : p->probes;
> }
>
> -static void srcu_free_old_probes(struct rcu_head *head)
> +static void srcu_free_old_probes_nmi(struct rcu_head *head)
> {
> kfree(container_of(head, struct tp_probes, rcu));
> }
>
> -static void rcu_free_old_probes(struct rcu_head *head)
> +static void srcu_free_old_probes(struct rcu_head *head)
> {
> - call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> + call_srcu(&tracepoint_srcu_nmi, head, srcu_free_old_probes_nmi);
> }
>
> static inline void release_probes(struct tracepoint_func *old)
> @@ -91,7 +93,7 @@ static inline void release_probes(struct tracepoint_func *old)
> * cover both cases. So let us chain the SRCU and sched RCU
> * callbacks to wait for both grace periods.
> */
> - call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
> + call_srcu(&tracepoint_srcu, &tp_probes->rcu, srcu_free_old_probes);
> }
> }
>


2018-08-08 02:14:37

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, Aug 7, 2018 at 6:55 PM, Steven Rostedt <[email protected]> wrote:
> On Tue, 7 Aug 2018 18:17:42 -0700
> Joel Fernandes <[email protected]> wrote:
>
>> From 6986af946ceb04fc9ddc6d5b45fc559b6807e465 Mon Sep 17 00:00:00 2001
>> From: "Joel Fernandes (Google)" <[email protected]>
>> Date: Sun, 5 Aug 2018 20:17:41 -0700
>> Subject: [PATCH] tracepoint: Run tracepoints even after CPU is offline
>>
>> Commit f37755490fe9 ("tracepoints: Do not trace when cpu is offline")
>> causes a problem for lockdep using tracepoint code. Once a CPU is
>> offline, tracepoints donot get called, however this causes a big problem
>> for lockdep probes that need to run so that IRQ annotations are marked
>> correctly.
>>
>> A race is possible where while the CPU is going offline, an interrupt
>> can come in and then a lockdep assert causes an annotation warning:
>>
>> [ 106.551354] IRQs not enabled as expected
>> [ 106.551785] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:982
>> tick_nohz_idle_enter+0x99/0xb0
>> [ 106.552964] Modules linked in:
>> [ 106.553299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W
>>
>> We need tracepoints to run as late as possible. This commit fixes the
>> issue by removing the cpu_online check in tracepoint code that was
>> introduced in the mentioned commit, however we now switch to using SRCU
>> for all tracepoints and special handle calling tracepoints from NMI so
>> that we don't run into issues that result from using sched-RCU when the
>> CPUs are marked to be offline.
>>
>> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and
>> unify their usage")
>> Reported-by: Masami Hiramatsu <[email protected]>
>> Signed-off-by: Joel Fernandes (Google) <[email protected]>
>
>
> The above change log doesn't look like it matches the NMI patch.
>
> Can you resend with just the NMI changes? I already handled the cpu
> offline ones.

Ok, sent with "cpu offline" changes dropped, here it is:
https://lore.kernel.org/patchwork/patch/972657/

If you could add your Reported-by to it, that would be great as well.

>
> But I may still have concerns with this patch.

Ok let me know what you think.

Thanks!

- Joel

2018-08-08 02:30:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 7 Aug 2018 19:13:32 -0700
Joel Fernandes <[email protected]> wrote:

> On Tue, Aug 7, 2018 at 6:55 PM, Steven Rostedt <[email protected]> wrote:
> > On Tue, 7 Aug 2018 18:17:42 -0700
> > Joel Fernandes <[email protected]> wrote:
> >
> >> From 6986af946ceb04fc9ddc6d5b45fc559b6807e465 Mon Sep 17 00:00:00 2001
> >> From: "Joel Fernandes (Google)" <[email protected]>
> >> Date: Sun, 5 Aug 2018 20:17:41 -0700
> >> Subject: [PATCH] tracepoint: Run tracepoints even after CPU is offline
> >>
> >> Commit f37755490fe9 ("tracepoints: Do not trace when cpu is offline")
> >> causes a problem for lockdep using tracepoint code. Once a CPU is
> >> offline, tracepoints donot get called, however this causes a big problem
> >> for lockdep probes that need to run so that IRQ annotations are marked
> >> correctly.
> >>
> >> A race is possible where while the CPU is going offline, an interrupt
> >> can come in and then a lockdep assert causes an annotation warning:
> >>
> >> [ 106.551354] IRQs not enabled as expected
> >> [ 106.551785] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:982
> >> tick_nohz_idle_enter+0x99/0xb0
> >> [ 106.552964] Modules linked in:
> >> [ 106.553299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W
> >>
> >> We need tracepoints to run as late as possible. This commit fixes the
> >> issue by removing the cpu_online check in tracepoint code that was
> >> introduced in the mentioned commit, however we now switch to using SRCU
> >> for all tracepoints and special handle calling tracepoints from NMI so
> >> that we don't run into issues that result from using sched-RCU when the
> >> CPUs are marked to be offline.
> >>
> >> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and
> >> unify their usage")
> >> Reported-by: Masami Hiramatsu <[email protected]>
> >> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> >
> >
> > The above change log doesn't look like it matches the NMI patch.
> >
> > Can you resend with just the NMI changes? I already handled the cpu
> > offline ones.
>
> Ok, sent with "cpu offline" changes dropped, here it is:
> https://lore.kernel.org/patchwork/patch/972657/
>
> If you could add your Reported-by to it, that would be great as well.
>
> >
> > But I may still have concerns with this patch.
>
> Ok let me know what you think.
>

Not sure you saw this part of my reply:

> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
> } while ((++it_func_ptr)->func); \
> } \
> \
> - if (rcuidle) \
> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
> + srcu_read_unlock_notrace(ss, idx); \

Hmm, why do we have the two different srcu handles?

Thinking about this, if this can be called by the "thunk" code, then
there might be an issue. I think the "thunk" code can be called before
in_nmi() is set, so we don't know if we are in an NMI or not. I need to
look at that code to make sure. If in_nmi() still works, then we should
use the _nmi srcu handle (if that's required).

But I'm not sure using SRCU for all tracepoints is needed at this
moment. I'll have to look deeper into this tomorrow. But it's getting
close to the merge window, and this needs to be settled quick. Another
"partial revert" may be needed until this gets settled.


-- Steve

2018-08-08 03:45:30

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

Hi Steve,

On Tue, Aug 7, 2018 at 7:28 PM, Steven Rostedt <[email protected]> wrote:
> On Tue, 7 Aug 2018 19:13:32 -0700
> Joel Fernandes <[email protected]> wrote:
>> >
>> >> From 6986af946ceb04fc9ddc6d5b45fc559b6807e465 Mon Sep 17 00:00:00 2001
>> >> From: "Joel Fernandes (Google)" <[email protected]>
>> >> Date: Sun, 5 Aug 2018 20:17:41 -0700
>> >> Subject: [PATCH] tracepoint: Run tracepoints even after CPU is offline
>> >>
>> >> Commit f37755490fe9 ("tracepoints: Do not trace when cpu is offline")
>> >> causes a problem for lockdep using tracepoint code. Once a CPU is
>> >> offline, tracepoints donot get called, however this causes a big problem
>> >> for lockdep probes that need to run so that IRQ annotations are marked
>> >> correctly.
>> >>
>> >> A race is possible where while the CPU is going offline, an interrupt
>> >> can come in and then a lockdep assert causes an annotation warning:
>> >>
>> >> [ 106.551354] IRQs not enabled as expected
>> >> [ 106.551785] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:982
>> >> tick_nohz_idle_enter+0x99/0xb0
>> >> [ 106.552964] Modules linked in:
>> >> [ 106.553299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W
>> >>
>> >> We need tracepoints to run as late as possible. This commit fixes the
>> >> issue by removing the cpu_online check in tracepoint code that was
>> >> introduced in the mentioned commit, however we now switch to using SRCU
>> >> for all tracepoints and special handle calling tracepoints from NMI so
>> >> that we don't run into issues that result from using sched-RCU when the
>> >> CPUs are marked to be offline.
>> >>
>> >> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and
>> >> unify their usage")
>> >> Reported-by: Masami Hiramatsu <[email protected]>
>> >> Signed-off-by: Joel Fernandes (Google) <[email protected]>
>> >
>> >
>> > The above change log doesn't look like it matches the NMI patch.
>> >
>> > Can you resend with just the NMI changes? I already handled the cpu
>> > offline ones.
>>
>> Ok, sent with "cpu offline" changes dropped, here it is:
>> https://lore.kernel.org/patchwork/patch/972657/
>>
>> If you could add your Reported-by to it, that would be great as well.
>>
>> >
>> > But I may still have concerns with this patch.
>>
>> Ok let me know what you think.
>>
>
> Not sure you saw this part of my reply:

I missed the part on the srcu handles, sorry.

>
>> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
>> } while ((++it_func_ptr)->func); \
>> } \
>> \
>> - if (rcuidle) \
>> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
>> + srcu_read_unlock_notrace(ss, idx); \
>
> Hmm, why do we have the two different srcu handles?

Because if the memory operations happening on the normal SRCU handle
(during srcu_read_lock) are interrupted by an NMI, then the other handle
(devoted to NMI) could be used instead without disturbing the interrupted
handle. Does that make sense?

When I talked to Paul a few months ago about SRCU from NMI context, he
mentioned that the per-cpu memory operations during srcu_read_lock can be
NMI-interrupted; that's why we added that warning.

> Thinking about this, if this can be called by the "thunk" code, then
> there might be an issue. I think the "thunk" code can be called before
> in_nmi() is set, so we don't know if we are in an NMI or not. I need to
> look at that code to make sure. If in_nmi() still works, then we should
> use the _nmi srcu handle (if that's required).
>
> But I'm not sure using SRCU for all tracepoints is needed at this
> moment. I'll have to look deeper into this tomorrow. But it's getting
> close to the merge window, and this needs to be settled quick. Another
> "partial revert" may be needed until this gets settled.

I did read this part; yes, I'm not sure either. You mentioned you would
confirm that in the morning; I look forward to it. I hope the in_nmi()
function is reliable to use from here.
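
If in_nmi() does turn out to be reliable at these call sites, the handle
selection could presumably key off it directly instead of off rcuidle
(hypothetical variant of the attached patch):

        /* Hypothetical: only valid if in_nmi() is already set this early. */
        if (in_nmi())
                ss = &tracepoint_srcu_nmi;
        else
                ss = &tracepoint_srcu;

        idx = srcu_read_lock_notrace(ss);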

2018-08-08 03:54:58

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, Aug 7, 2018 at 8:44 PM, Joel Fernandes <[email protected]> wrote:
> Hi Steve,
>
> On Tue, Aug 7, 2018 at 7:28 PM, Steven Rostedt <[email protected]> wrote:
[...]
>>> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
>>> } while ((++it_func_ptr)->func); \
>>> } \
>>> \
>>> - if (rcuidle) \
>>> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
>>> + srcu_read_unlock_notrace(ss, idx); \
>>
>> Hmm, why do we have the two different srcu handles?
>
> Because if the memory operations happening on the normal SRCU handle
> (during srcu_read_lock) is interrupted by NMI, then the other handle
> (devoted to NMI) could be used instead and not bother the interrupted
> handle. Does that makes sense?
>
> When I talked to Paul few months ago about SRCU from NMI context, he
> mentioned the per-cpu memory operations during srcu_read_lock can be
> NMI interrupted, that's why we added that warning.

So I looked more closely; __srcu_read_lock on 2 different handles may
still be doing a this_cpu_inc on the same location
(sp->sda->srcu_lock_count). :-(

Paul, any ideas on how to solve this?

It does start to seem like a show stopper :-(

2018-08-08 05:08:01

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, Aug 7, 2018 at 8:53 PM, Joel Fernandes <[email protected]> wrote:
> On Tue, Aug 7, 2018 at 8:44 PM, Joel Fernandes <[email protected]> wrote:
>> Hi Steve,
>>
>> On Tue, Aug 7, 2018 at 7:28 PM, Steven Rostedt <[email protected]> wrote:
> [...]
>>>> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
>>>> } while ((++it_func_ptr)->func); \
>>>> } \
>>>> \
>>>> - if (rcuidle) \
>>>> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
>>>> + srcu_read_unlock_notrace(ss, idx); \
>>>
>>> Hmm, why do we have the two different srcu handles?
>>
>> Because if the memory operations happening on the normal SRCU handle
>> (during srcu_read_lock) is interrupted by NMI, then the other handle
>> (devoted to NMI) could be used instead and not bother the interrupted
>> handle. Does that makes sense?
>>
>> When I talked to Paul few months ago about SRCU from NMI context, he
>> mentioned the per-cpu memory operations during srcu_read_lock can be
>> NMI interrupted, that's why we added that warning.
>
> So I looked more closely, __srcu_read_lock on 2 different handles may
> still be doing a this_cpu_inc on the same location..
> (sp->sda->srcu_lock_count). :-(
>
> Paul any ideas on how to solve this?
>
> It does start to seem like a show stopper :-(

Steve, another thing we could do is drop "tracepoint: Make rcuidle
tracepoint callers use SRCU" and just call into irqsoff and preemptoff
tracer hooks directly.

The reason I added "tracepoint: Make rcuidle tracepoint callers use
SRCU" is because lockdep's performance dropped with the existing
tracepoint code and SRCU improved that. But now that you're calling
directly into lockdep, that shouldn't be an issue.

That should resolve your NMI issues and keep my macro and other cleanups
from the original patch. What do you think?
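
One way to read that suggestion, mirroring how the partial revert already
calls lockdep directly (sketch only; tracer_hardirqs_off() here stands in
for whatever hook the irqsoff tracer would export, so its name and
signature are assumptions):

void trace_hardirqs_off(void)
{
        if (!this_cpu_read(tracing_irq_cpu)) {
                this_cpu_write(tracing_irq_cpu, 1);
                /* call the irqsoff tracer directly, not via the rcuidle tracepoint */
                tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
        }
        lockdep_hardirqs_off(CALLER_ADDR0);
}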

I am out in the morning, but I could write this up in the evening when
I get some time (unless you do it before me).

thanks,

- Joel

2018-08-08 12:48:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, 7 Aug 2018 20:53:54 -0700
Joel Fernandes <[email protected]> wrote:


> > When I talked to Paul few months ago about SRCU from NMI context, he
> > mentioned the per-cpu memory operations during srcu_read_lock can be
> > NMI interrupted, that's why we added that warning.
>
> So I looked more closely, __srcu_read_lock on 2 different handles may
> still be doing a this_cpu_inc on the same location..
> (sp->sda->srcu_lock_count). :-(
>
> Paul any ideas on how to solve this?
>
> It does start to seem like a show stopper :-(

What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
it won't be atomic for the capture of the idx. But I also don't see
interrupts being disabled, thus an NMI is no different than any
interrupt doing the same thing, right?

-- Steve

2018-08-08 13:02:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Tue, Aug 07, 2018 at 08:53:54PM -0700, Joel Fernandes wrote:
> On Tue, Aug 7, 2018 at 8:44 PM, Joel Fernandes <[email protected]> wrote:
> > Hi Steve,
> >
> > On Tue, Aug 7, 2018 at 7:28 PM, Steven Rostedt <[email protected]> wrote:
> [...]
> >>> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
> >>> } while ((++it_func_ptr)->func); \
> >>> } \
> >>> \
> >>> - if (rcuidle) \
> >>> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
> >>> + srcu_read_unlock_notrace(ss, idx); \
> >>
> >> Hmm, why do we have the two different srcu handles?
> >
> > Because if the memory operations happening on the normal SRCU handle
> > (during srcu_read_lock) is interrupted by NMI, then the other handle
> > (devoted to NMI) could be used instead and not bother the interrupted
> > handle. Does that makes sense?
> >
> > When I talked to Paul few months ago about SRCU from NMI context, he
> > mentioned the per-cpu memory operations during srcu_read_lock can be
> > NMI interrupted, that's why we added that warning.
>
> So I looked more closely, __srcu_read_lock on 2 different handles may
> still be doing a this_cpu_inc on the same location..
> (sp->sda->srcu_lock_count). :-(
>
> Paul any ideas on how to solve this?

You lost me on this one. When you said "2 different handles", I assumed
that you meant two different values of "sp", which would have two
different addresses for &sp->sda->srcu_lock_count. What am I missing?

> It does start to seem like a show stopper :-(

I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
whether this would be fast enough to be useful, but easy to provide:

int __srcu_read_lock_nmi(struct srcu_struct *sp)  /* UNTESTED. */
{
        int idx;

        idx = READ_ONCE(sp->srcu_idx) & 0x1;
        atomic_inc(&sp->sda->srcu_lock_count[idx]);
        smp_mb__after_atomic(); /* B */  /* Avoid leaking critical section. */
        return idx;
}

void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
{
        smp_mb__before_atomic(); /* C */  /* Avoid leaking critical section. */
        atomic_inc(&sp->sda->srcu_unlock_count[idx]);
}

With appropriate adjustments to allow Tiny RCU to also work.

Note that you have to use _nmi() everywhere, not just in NMI handlers.
In fact, the NMI handlers are the one place you -don't- need to use
_nmi(), strangely enough.

Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
some architectures, for example.

Thanx, Paul


2018-08-08 13:04:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 08:46:29AM -0400, Steven Rostedt wrote:
> On Tue, 7 Aug 2018 20:53:54 -0700
> Joel Fernandes <[email protected]> wrote:
>
>
> > > When I talked to Paul few months ago about SRCU from NMI context, he
> > > mentioned the per-cpu memory operations during srcu_read_lock can be
> > > NMI interrupted, that's why we added that warning.
> >
> > So I looked more closely, __srcu_read_lock on 2 different handles may
> > still be doing a this_cpu_inc on the same location..
> > (sp->sda->srcu_lock_count). :-(
> >
> > Paul any ideas on how to solve this?
> >
> > It does start to seem like a show stopper :-(
>
> What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
> it wont be atomic for the capture of the idx. But I also don't see
> interrupts being disabled, thus an NMI is no different than any
> interrupt doing the same thing, right?

On architectures without increment-memory instructions, if you take an NMI
between the load from sp->sda->srcu_lock_count and the later store, you
lose a count. Note that both __srcu_read_lock() and __srcu_read_unlock()
do increments of different locations, so you cannot rely on the usual
"NMI fixes up before exit" semantics you get when incrementing and
decrementing the same location.
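
Concretely, the window looks something like this (illustrative sketch only,
not real kernel code; with the generic this_cpu_*() implementation the
increment decomposes into a plain load and store):

static void srcu_lock_count_window(struct srcu_struct *sp, int idx)
{
        unsigned long cnt;

        /* reads N */
        cnt = __this_cpu_read(sp->sda->srcu_lock_count[idx]);
        /* <-- an NMI here does its own __srcu_read_lock(), storing N + 1 */

        /* stores N + 1: the NMI's increment is lost */
        __this_cpu_write(sp->sda->srcu_lock_count[idx], cnt + 1);
}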

Thanx, Paul


2018-08-08 13:08:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, 8 Aug 2018 06:03:02 -0700
"Paul E. McKenney" <[email protected]> wrote:

> What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
> > it wont be atomic for the capture of the idx. But I also don't see
> > interrupts being disabled, thus an NMI is no different than any
> > interrupt doing the same thing, right?
>
> On architectures without increment-memory instructions, if you take an NMI
> between the load from sp->sda->srcu_lock_count and the later store, you
> lose a count. Note that both __srcu_read_lock() and __srcu_read_unlock()
> do increments of different locations, so you cannot rely on the usual
> "NMI fixes up before exit" semantics you get when incrementing and
> decrementing the same location.

And how is this handled in the interrupt case? Interrupts are not
disabled here.

I would also argue that architectures without increment-memory
instructions shouldn't have NMIs ;-)

-- Steve

2018-08-08 14:12:43

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 8, 2018 at 6:00 AM, Paul E. McKenney
<[email protected]> wrote:
> On Tue, Aug 07, 2018 at 08:53:54PM -0700, Joel Fernandes wrote:
>> On Tue, Aug 7, 2018 at 8:44 PM, Joel Fernandes <[email protected]> wrote:
>> > Hi Steve,
>> >
>> > On Tue, Aug 7, 2018 at 7:28 PM, Steven Rostedt <[email protected]> wrote:
>> [...]
>> >>> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
>> >>> } while ((++it_func_ptr)->func); \
>> >>> } \
>> >>> \
>> >>> - if (rcuidle) \
>> >>> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
>> >>> + srcu_read_unlock_notrace(ss, idx); \
>> >>
>> >> Hmm, why do we have the two different srcu handles?
>> >
>> > Because if the memory operations happening on the normal SRCU handle
>> > (during srcu_read_lock) is interrupted by NMI, then the other handle
>> > (devoted to NMI) could be used instead and not bother the interrupted
>> > handle. Does that makes sense?
>> >
>> > When I talked to Paul few months ago about SRCU from NMI context, he
>> > mentioned the per-cpu memory operations during srcu_read_lock can be
>> > NMI interrupted, that's why we added that warning.
>>
>> So I looked more closely, __srcu_read_lock on 2 different handles may
>> still be doing a this_cpu_inc on the same location..
>> (sp->sda->srcu_lock_count). :-(
>>
>> Paul any ideas on how to solve this?
>
> You lost me on this one. When you said "2 different handles", I assumed
> that you meant two different values of "sp", which would have two
> different addresses for &sp->sda->srcu_lock_count. What am I missing?

Thanks a lot for the reply.
I thought "sda" is the same for different srcu_struct(s). May be it
was too late for me in the night, that's why I thought so? Which makes
no sense now that I think of it.

In that case based on what you're saying, the patch I sent to using
different srcu_struct for NMI is still good I guess...

>> It does start to seem like a show stopper :-(
>
> I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
> be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
> whether this would be fast enough to be useful, but easy to provide:
>
> int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> {
> int idx;
>
> idx = READ_ONCE(sp->srcu_idx) & 0x1;
> atomic_inc(&sp->sda->srcu_lock_count[idx]);
> smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> return idx;
> }
>
> void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> {
> smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> atomic_inc(&sp->sda->srcu_unlock_count[idx]);
> }
>
> With appropriate adjustments to also allow Tiny RCU to also work.
>
> Note that you have to use _nmi() everywhere, not just in NMI handlers.
> In fact, the NMI handlers are the one place you -don't- need to use
> _nmi(), strangely enough.
>
> Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
> some architectures, for example.

Continuing Steve's question on regular interrupts, do we need to use
this atomic_inc API for regular interrupts as well?

Thanks!

2018-08-08 14:28:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, 8 Aug 2018 06:00:41 -0700
"Paul E. McKenney" <[email protected]> wrote:
>
> I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
> be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
> whether this would be fast enough to be useful, but easy to provide:
>
> int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> {
> int idx;
>
> idx = READ_ONCE(sp->srcu_idx) & 0x1;
> atomic_inc(&sp->sda->srcu_lock_count[idx]);
> smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> return idx;
> }
>
> void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> {
> smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> atomic_inc(&sp->sda->srcu_unlock_count[idx]);
> }
>
> With appropriate adjustments to also allow Tiny RCU to also work.
>
> Note that you have to use _nmi() everywhere, not just in NMI handlers.
> In fact, the NMI handlers are the one place you -don't- need to use
> _nmi(), strangely enough.
>
> Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
> some architectures, for example.

Note this would kill the performance that srcu gives, which Joel wants.
Switching from a this_cpu_inc() to an atomic_inc() would be a huge
impact.

There's also a local_inc() if you are using per-cpu pointers, which is
supposed to guarantee atomicity for single-CPU operations. That's what
the ftrace ring buffer uses.

-- Steve

2018-08-08 14:34:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 09:07:24AM -0400, Steven Rostedt wrote:
> On Wed, 8 Aug 2018 06:03:02 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
> > > it wont be atomic for the capture of the idx. But I also don't see
> > > interrupts being disabled, thus an NMI is no different than any
> > > interrupt doing the same thing, right?
> >
> > On architectures without increment-memory instructions, if you take an NMI
> > between the load from sp->sda->srcu_lock_count and the later store, you
> > lose a count. Note that both __srcu_read_lock() and __srcu_read_unlock()
> > do increments of different locations, so you cannot rely on the usual
> > "NMI fixes up before exit" semantics you get when incrementing and
> > decrementing the same location.
>
> And how is this handled in the interrupt case? Interrupts are not
> disabled here.

Actually, on most architectures interrupts are in fact disabled:

#define this_cpu_generic_to_op(pcp, val, op)                            \
do {                                                                    \
        unsigned long __flags;                                          \
        raw_local_irq_save(__flags);                                    \
        raw_cpu_generic_to_op(pcp, val, op);                            \
        raw_local_irq_restore(__flags);                                 \
} while (0)

NMIs, not so much.

> I would also argue that architectures without increment-memory
> instructions shouldn't have NMIs ;-)

I would also argue a lot of things, but objective reality does not take my
opinions into account all that often. Which might be a good thing. ;-)

Thanx, Paul


2018-08-08 14:43:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 10:27:00AM -0400, Steven Rostedt wrote:
> On Wed, 8 Aug 2018 06:00:41 -0700
> "Paul E. McKenney" <[email protected]> wrote:
> >
> > I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
> > be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
> > whether this would be fast enough to be useful, but easy to provide:
> >
> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> > {
> > int idx;
> >
> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
> > atomic_inc(&sp->sda->srcu_lock_count[idx]);
> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> > return idx;
> > }
> >
> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> > {
> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> > atomic_inc(&sp->sda->srcu_unlock_count[idx]);
> > }
> >
> > With appropriate adjustments to also allow Tiny RCU to also work.
> >
> > Note that you have to use _nmi() everywhere, not just in NMI handlers.
> > In fact, the NMI handlers are the one place you -don't- need to use
> > _nmi(), strangely enough.
> >
> > Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
> > some architectures, for example.
>
> Note this would kill the performance that srcu gives that Joel wants.
> Switching from a this_cpu_inc() to a atomic_inc() would be a huge
> impact.

I don't know how huge it would be, but that concern is exactly why I am
proposing adding _nmi() interfaces rather than just directly changing
the stock __srcu_read_lock() and __srcu_read_unlock() functions.

> There's also a local_inc() if you are using per cpu pointers, that is
> suppose to guarantee atomicity for single cpu operations. That's what
> the ftrace ring buffer uses.

Good point, that becomes atomic_long_inc() or equivalent on most
architectures, but an incl instruction (not locked) on x86. So updating
my earlier still-untested thought:

int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
{
int idx;

idx = READ_ONCE(sp->srcu_idx) & 0x1;
local_inc(&sp->sda->srcu_lock_count[idx]);
smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
return idx;
}

void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
{
smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
local_inc(&sp->sda->srcu_unlock_count[idx]);
}

Would that work, or is there a better way to handle this?

Thanx, Paul


2018-08-08 14:50:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, 8 Aug 2018 07:33:10 -0700
"Paul E. McKenney" <[email protected]> wrote:

> On Wed, Aug 08, 2018 at 09:07:24AM -0400, Steven Rostedt wrote:
> > On Wed, 8 Aug 2018 06:03:02 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
> > > > it wont be atomic for the capture of the idx. But I also don't see
> > > > interrupts being disabled, thus an NMI is no different than any
> > > > interrupt doing the same thing, right?
> > >
> > > On architectures without increment-memory instructions, if you take an NMI
> > > between the load from sp->sda->srcu_lock_count and the later store, you
> > > lose a count. Note that both __srcu_read_lock() and __srcu_read_unlock()
> > > do increments of different locations, so you cannot rely on the usual
> > > "NMI fixes up before exit" semantics you get when incrementing and
> > > decrementing the same location.
> >
> > And how is this handled in the interrupt case? Interrupts are not
> > disabled here.
>
> Actually, on most architectures interrupts are in fact disabled:
>
> #define this_cpu_generic_to_op(pcp, val, op) \
> do { \
> unsigned long __flags; \
> raw_local_irq_save(__flags); \
> raw_cpu_generic_to_op(pcp, val, op); \
> raw_local_irq_restore(__flags); \
> } while (0)
>
> NMIs, not so much.

And do these archs have NMIs?

-- Steve

>
> > I would also argue that architectures without increment-memory
> > instructions shouldn't have NMIs ;-)
>
> I would also argue a lot of things, but objective reality does not take my
> opinions into account all that often. Which might be a good thing. ;-)
>
> Thanx, Paul


2018-08-08 14:51:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 07:10:53AM -0700, Joel Fernandes wrote:
> On Wed, Aug 8, 2018 at 6:00 AM, Paul E. McKenney
> <[email protected]> wrote:
> > On Tue, Aug 07, 2018 at 08:53:54PM -0700, Joel Fernandes wrote:
> >> On Tue, Aug 7, 2018 at 8:44 PM, Joel Fernandes <[email protected]> wrote:
> >> > Hi Steve,
> >> >
> >> > On Tue, Aug 7, 2018 at 7:28 PM, Steven Rostedt <[email protected]> wrote:
> >> [...]
> >> >>> @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
> >> >>> } while ((++it_func_ptr)->func); \
> >> >>> } \
> >> >>> \
> >> >>> - if (rcuidle) \
> >> >>> - srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
> >> >>> + srcu_read_unlock_notrace(ss, idx); \
> >> >>
> >> >> Hmm, why do we have the two different srcu handles?
> >> >
> >> > Because if the memory operations happening on the normal SRCU handle
> >> > (during srcu_read_lock) is interrupted by NMI, then the other handle
> >> > (devoted to NMI) could be used instead and not bother the interrupted
> >> > handle. Does that makes sense?
> >> >
> >> > When I talked to Paul few months ago about SRCU from NMI context, he
> >> > mentioned the per-cpu memory operations during srcu_read_lock can be
> >> > NMI interrupted, that's why we added that warning.
> >>
> >> So I looked more closely, __srcu_read_lock on 2 different handles may
> >> still be doing a this_cpu_inc on the same location..
> >> (sp->sda->srcu_lock_count). :-(
> >>
> >> Paul any ideas on how to solve this?
> >
> > You lost me on this one. When you said "2 different handles", I assumed
> > that you meant two different values of "sp", which would have two
> > different addresses for &sp->sda->srcu_lock_count. What am I missing?
>
> Thanks a lot for the reply.
> I thought "sda" is the same for different srcu_struct(s). May be it
> was too late for me in the night, that's why I thought so? Which makes
> no sense now that I think of it.

I know that feeling! ;-)

> In that case based on what you're saying, the patch I sent to using
> different srcu_struct for NMI is still good I guess...

As long as you wait for both SRCU grace periods. Hmmm... Maybe that means
that there is still a use for synchronize_rcu_mult():

void call_srcu_nmi(struct rcu_head *rhp, rcu_callback_t func)
{
call_srcu(&trace_srcu_struct_nmi, rhp, func);
}

void call_srcu_nonmi(struct rcu_head *rhp, rcu_callback_t func)
{
call_srcu(&trace_srcu_struct_nonmi, rhp, func);
}

...

/* Wait concurrently on the two grace periods. */
synchronize_rcu_mult(call_srcu_nmi, call_srcu_nonmi);

On the other hand, I bet that doing this is just fine in your use case:

synchronize_srcu(&trace_srcu_struct_nmi);
synchronize_srcu(&trace_srcu_struct_nonmi);

But please note that synchronize_rcu_mult() is no longer in my -rcu
tree, so if you do want it please let me know (and please let me know
why it is important).
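
If the two-handle approach is taken, the unregistration path presumably
also needs a third wait; a sketch, reusing the hypothetical handle
names from above:

static inline void tracepoint_synchronize_unregister(void)
{
	synchronize_srcu(&trace_srcu_struct_nmi);	/* NMI readers      */
	synchronize_srcu(&trace_srcu_struct_nonmi);	/* rcuidle readers  */
	synchronize_sched();				/* ordinary readers */
}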

> >> It does start to seem like a show stopper :-(
> >
> > I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
> > be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
> > whether this would be fast enough to be useful, but easy to provide:
> >
> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> > {
> > int idx;
> >
> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
> > atomic_inc(&sp->sda->srcu_lock_count[idx]);
> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> > return idx;
> > }
> >
> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> > {
> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> > atomic_inc(&sp->sda->srcu_unlock_count[idx]);
> > }
> >
> > With appropriate adjustments to also allow Tiny RCU to also work.
> >
> > Note that you have to use _nmi() everywhere, not just in NMI handlers.
> > In fact, the NMI handlers are the one place you -don't- need to use
> > _nmi(), strangely enough.
> >
> > Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
> > some architectures, for example.
>
> Continuing Steve's question on regular interrupts, do we need to use
> this atomic_inc API for regular interrupts as well?

If NMIs use one srcu_struct and non-NMI uses another, the current
srcu_read_lock() and srcu_read_unlock() will work just fine. If any given
srcu_struct needs both NMI and non-NMI readers, then we really do need
__srcu_read_lock_nmi() and __srcu_read_unlock_nmi() for that srcu_struct.

Thanx, Paul


2018-08-08 15:07:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 10:49:10AM -0400, Steven Rostedt wrote:
> On Wed, 8 Aug 2018 07:33:10 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Wed, Aug 08, 2018 at 09:07:24AM -0400, Steven Rostedt wrote:
> > > On Wed, 8 Aug 2018 06:03:02 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
> > > > > it wont be atomic for the capture of the idx. But I also don't see
> > > > > interrupts being disabled, thus an NMI is no different than any
> > > > > interrupt doing the same thing, right?
> > > >
> > > > On architectures without increment-memory instructions, if you take an NMI
> > > > between the load from sp->sda->srcu_lock_count and the later store, you
> > > > lose a count. Note that both __srcu_read_lock() and __srcu_read_unlock()
> > > > do increments of different locations, so you cannot rely on the usual
> > > > "NMI fixes up before exit" semantics you get when incrementing and
> > > > decrementing the same location.
> > >
> > > And how is this handled in the interrupt case? Interrupts are not
> > > disabled here.
> >
> > Actually, on most architectures interrupts are in fact disabled:
> >
> > #define this_cpu_generic_to_op(pcp, val, op) \
> > do { \
> > unsigned long __flags; \
> > raw_local_irq_save(__flags); \
> > raw_cpu_generic_to_op(pcp, val, op); \
> > raw_local_irq_restore(__flags); \
> > } while (0)
> >
> > NMIs, not so much.
>
> And do these archs have NMIs?

It would appear so:

$ find . -name 'Kconfig*' -exec grep -l 'select HAVE_NMI\>' {} \;
./arch/sparc/Kconfig
./arch/s390/Kconfig
./arch/arm/Kconfig
./arch/arm64/Kconfig
./arch/mips/Kconfig
./arch/sh/Kconfig
./arch/powerpc/Kconfig
./arch/x86/Kconfig

Thanx, Paul


2018-08-08 15:24:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, 8 Aug 2018 08:05:58 -0700
"Paul E. McKenney" <[email protected]> wrote:

> On Wed, Aug 08, 2018 at 10:49:10AM -0400, Steven Rostedt wrote:
> > On Wed, 8 Aug 2018 07:33:10 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > On Wed, Aug 08, 2018 at 09:07:24AM -0400, Steven Rostedt wrote:
> > > > On Wed, 8 Aug 2018 06:03:02 -0700
> > > > "Paul E. McKenney" <[email protected]> wrote:
> > > >
> > > > > What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
> > > > > > it wont be atomic for the capture of the idx. But I also don't see
> > > > > > interrupts being disabled, thus an NMI is no different than any
> > > > > > interrupt doing the same thing, right?
> > > > >
> > > > > On architectures without increment-memory instructions, if you take an NMI
> > > > > between the load from sp->sda->srcu_lock_count and the later store, you
> > > > > lose a count. Note that both __srcu_read_lock() and __srcu_read_unlock()
> > > > > do increments of different locations, so you cannot rely on the usual
> > > > > "NMI fixes up before exit" semantics you get when incrementing and
> > > > > decrementing the same location.
> > > >
> > > > And how is this handled in the interrupt case? Interrupts are not
> > > > disabled here.
> > >
> > > Actually, on most architectures interrupts are in fact disabled:
> > >
> > > #define this_cpu_generic_to_op(pcp, val, op) \
> > > do { \
> > > unsigned long __flags; \
> > > raw_local_irq_save(__flags); \
> > > raw_cpu_generic_to_op(pcp, val, op); \
> > > raw_local_irq_restore(__flags); \
> > > } while (0)
> > >
> > > NMIs, not so much.
> >
> > And do these archs have NMIs?
>
> It would appear so:

Well, the next question is which of the archs that use the generic
implementation are in this list.

>
> $ find . -name 'Kconfig*' -exec grep -l 'select HAVE_NMI\>' {} \;
> ./arch/sparc/Kconfig
> ./arch/s390/Kconfig
> ./arch/arm/Kconfig
> ./arch/arm64/Kconfig
> ./arch/mips/Kconfig
> ./arch/sh/Kconfig
> ./arch/powerpc/Kconfig

Note, I know that powerpc "imitates" an NMI. It just runs the NMI at a
priority higher than other interrupts.

> ./arch/x86/Kconfig
>

And we get this:

$ git grep this_cpu_add_4
arch/arm64/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) _percpu_add(pcp, val)
arch/s390/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
arch/s390/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) arch_this_cpu_add(pcp, val, "laa", "asi", int)
arch/x86/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) percpu_add_op((pcp), val)

Which leaves us with sparc, arm, mips, sh and powerpc.

sh is almost dead, and powerpc can be fixed, which I guess leaves us
with sparc, arm and mips.

-- Steve


> Thanx, Paul


2018-08-08 15:28:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, 8 Aug 2018 07:42:00 -0700
"Paul E. McKenney" <[email protected]> wrote:

> > There's also a local_inc() if you are using per cpu pointers, that is
> > suppose to guarantee atomicity for single cpu operations. That's what
> > the ftrace ring buffer uses.
>
> Good point, that becomes atomic_long_inc() or equivalent on most
> architectures, but an incl instruction (not locked) on x86. So updating
> my earlier still-untested thought:
>
> int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> {
> int idx;
>
> idx = READ_ONCE(sp->srcu_idx) & 0x1;
> local_inc(&sp->sda->srcu_lock_count[idx]);
> smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> return idx;
> }
>
> void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> {
> smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> local_inc(&sp->sda->srcu_unlock_count[idx]);
> }
>
> Would that work, or is there a better way to handle this?

This would work much better than using the atomic ops, and I think it
would be doable.

I may need to revert the SRCU change for trace_*_rcuidle() for now, as
I want most of the other changes in this merge window, and it's getting
too late to do it with these updates.

-- Steve


2018-08-08 16:04:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 11:23:09AM -0400, Steven Rostedt wrote:
> On Wed, 8 Aug 2018 08:05:58 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Wed, Aug 08, 2018 at 10:49:10AM -0400, Steven Rostedt wrote:
> > > On Wed, 8 Aug 2018 07:33:10 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > On Wed, Aug 08, 2018 at 09:07:24AM -0400, Steven Rostedt wrote:
> > > > > On Wed, 8 Aug 2018 06:03:02 -0700
> > > > > "Paul E. McKenney" <[email protected]> wrote:
> > > > >
> > > > > > What's wrong with a this_cpu_inc()? It's atomic for the CPU. Although
> > > > > > > it wont be atomic for the capture of the idx. But I also don't see
> > > > > > > interrupts being disabled, thus an NMI is no different than any
> > > > > > > interrupt doing the same thing, right?
> > > > > >
> > > > > > On architectures without increment-memory instructions, if you take an NMI
> > > > > > between the load from sp->sda->srcu_lock_count and the later store, you
> > > > > > lose a count. Note that both __srcu_read_lock() and __srcu_read_unlock()
> > > > > > do increments of different locations, so you cannot rely on the usual
> > > > > > "NMI fixes up before exit" semantics you get when incrementing and
> > > > > > decrementing the same location.
> > > > >
> > > > > And how is this handled in the interrupt case? Interrupts are not
> > > > > disabled here.
> > > >
> > > > Actually, on most architectures interrupts are in fact disabled:
> > > >
> > > > #define this_cpu_generic_to_op(pcp, val, op) \
> > > > do { \
> > > > unsigned long __flags; \
> > > > raw_local_irq_save(__flags); \
> > > > raw_cpu_generic_to_op(pcp, val, op); \
> > > > raw_local_irq_restore(__flags); \
> > > > } while (0)
> > > >
> > > > NMIs, not so much.
> > >
> > > And do these archs have NMIs?
> >
> > It would appear so:
>
> Well the next question is, which of these archs that use it are in this
> list.
>
> > $ find . -name 'Kconfig*' -exec grep -l 'select HAVE_NMI\>' {} \;
> > ./arch/sparc/Kconfig
> > ./arch/s390/Kconfig
> > ./arch/arm/Kconfig
> > ./arch/arm64/Kconfig
> > ./arch/mips/Kconfig
> > ./arch/sh/Kconfig
> > ./arch/powerpc/Kconfig
>
> Note, I know that powerpc "imitates" an NMI. It just sets the NMI as a
> priority higher than other interrupts.

Plus, as you say below, its local_inc() is atomic and thus NMI-safe,
so the _nmi() approach would work.

> > ./arch/x86/Kconfig
>
> And we get this:
>
> $ git grep this_cpu_add_4
> arch/arm64/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) _percpu_add(pcp, val)
> arch/s390/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
> arch/s390/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) arch_this_cpu_add(pcp, val, "laa", "asi", int)
> arch/x86/include/asm/percpu.h:#define this_cpu_add_4(pcp, val) percpu_add_op((pcp), val)
>
> Which leaves us with sparc, arm, mips, sh and powerpc.
>
> sh is almost dead, and powerpc can be fixed, which I guess leaves us
> with sparc, arm and mips.

If we want to stick with the current srcu_read_lock() and srcu_read_unlock(),
you mean? I would like that sort of outcome, at least assuming we are not
hammering any of the architectures.

Thanx, Paul


2018-08-08 16:05:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 11:27:05AM -0400, Steven Rostedt wrote:
> On Wed, 8 Aug 2018 07:42:00 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > There's also a local_inc() if you are using per cpu pointers, that is
> > > suppose to guarantee atomicity for single cpu operations. That's what
> > > the ftrace ring buffer uses.
> >
> > Good point, that becomes atomic_long_inc() or equivalent on most
> > architectures, but an incl instruction (not locked) on x86. So updating
> > my earlier still-untested thought:
> >
> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> > {
> > int idx;
> >
> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
> > local_inc(&sp->sda->srcu_lock_count[idx]);
> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> > return idx;
> > }
> >
> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> > {
> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> > local_inc(&sp->sda->srcu_unlock_count[idx]);
> > }
> >
> > Would that work, or is there a better way to handle this?
>
> This would work much better than using the atomic ops, and I think it
> would be doable.

OK, here is hoping!

> I may need to revert the srcu for trace_*_rcidle() for now, as I want
> most of the other changes in this merge window, and it's getting too
> late to do it with these updates.

Agreed, especially since I normally freeze for the next merge window
shortly after -rc5. ;-)

Thanx, Paul


2018-08-08 16:25:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, 8 Aug 2018 09:02:43 -0700
"Paul E. McKenney" <[email protected]> wrote:

> > Which leaves us with sparc, arm, mips, sh and powerpc.
> >
> > sh is almost dead, and powerpc can be fixed, which I guess leaves us
> > with sparc, arm and mips.
>
> If we want to stick with the current srcu_read_lock() and srcu_read_unlock(),
> you mean? I would like that sort of outcome, at least assuming we are not
> hammering any of the architectures.

I would go with the local_inc approach, and even add a
srcu_read_un/lock_nmi() that does that if you want. Probably should add
lockdep to detect if the _nmi calls are ever used along with non-_nmi
calls and complain about that.
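
A sketch of what such a check could look like; none of these names
exist in the kernel as of this thread, and the flavor field and helper
are hypothetical:

enum srcu_reader_flavor {
	SRCU_READER_NONE = 0,
	SRCU_READER_NORMAL,
	SRCU_READER_NMI,
};

/* Assumes a new field in struct srcu_struct: int srcu_reader_flavor; */
static void srcu_check_reader_flavor(struct srcu_struct *sp, int flavor)
{
	int old = READ_ONCE(sp->srcu_reader_flavor);

	if (old == SRCU_READER_NONE)
		old = cmpxchg(&sp->srcu_reader_flavor, SRCU_READER_NONE, flavor);
	WARN_ONCE(old != SRCU_READER_NONE && old != flavor,
		  "srcu: mixing NMI-safe and normal readers on one srcu_struct\n");
}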

But this will be something for the next merge window, not the one
coming up.

-- Steve

2018-08-08 17:22:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 12:24:04PM -0400, Steven Rostedt wrote:
> On Wed, 8 Aug 2018 09:02:43 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > Which leaves us with sparc, arm, mips, sh and powerpc.
> > >
> > > sh is almost dead, and powerpc can be fixed, which I guess leaves us
> > > with sparc, arm and mips.
> >
> > If we want to stick with the current srcu_read_lock() and srcu_read_unlock(),
> > you mean? I would like that sort of outcome, at least assuming we are not
> > hammering any of the architectures.
>
> I would go with the local_inc approach, and even add a
> srcu_read_un/lock_nmi() that does that if you want. Probably should add
> lockdep to detect if the _nmi calls is ever used along with non _nmi
> calls and complain about that.

Would it be reasonable to also add a check for non-_nmi calls being
used in both NMI and non-NMI contexts?

> But this will be something for the next merge window, not the one
> coming up.

Completely agreed!

Thanx, Paul


2018-08-08 19:26:03

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 8, 2018 at 7:49 AM, Paul E. McKenney
<[email protected]> wrote:
[...]
>
>> In that case based on what you're saying, the patch I sent to using
>> different srcu_struct for NMI is still good I guess...
>
> As long as you wait for both SRCU grace periods. Hmmm... Maybe that means
> that there is still a use for synchronize_rcu_mult():
>
> void call_srcu_nmi(struct rcu_head *rhp, rcu_callback_t func)
> {
> call_srcu(&trace_srcu_struct_nmi, rhp, func);
> }
>
> void call_srcu_nonmi(struct rcu_head *rhp, rcu_callback_t func)
> {
> call_srcu(&trace_srcu_struct_nonmi, rhp, func);
> }
>
> ...
>
> /* Wait concurrently on the two grace periods. */
> synchronize_rcu_mult(call_srcu_nmi, call_srcu_nonmi);
>
> On the other hand, I bet that doing this is just fine in your use case:
>
> synchronize_srcu(&trace_srcu_struct_nmi);
> synchronize_srcu(&trace_srcu_struct_nonmi);
>
> But please note that synchronize_rcu_mult() is no more in my -rcu tree,
> so if you do want it please let me know (and please let me know why it
> is important).

I did the chaining thing (one callback calling another); that should
work too, right? I believe that is needed so that the tracepoint
callbacks are freed only once both the NMI and non-NMI read-side
sections have completed.
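
For the two-handle case, the chaining presumably looks something like
the sketch below; the handle names are borrowed from Paul's
hypothetical call_srcu_nmi()/call_srcu_nonmi() above, and the
single-handle version of this chain is what the actual patch does:

static void srcu_nmi_free_old_probes(struct rcu_head *head)
{
	/* Both SRCU grace periods and the sched-RCU one have elapsed. */
	kfree(container_of(head, struct tp_probes, rcu));
}

static void srcu_free_old_probes(struct rcu_head *head)
{
	/* Non-NMI SRCU grace period done; now wait out the NMI one. */
	call_srcu(&trace_srcu_struct_nmi, head, srcu_nmi_free_old_probes);
}

static void rcu_free_old_probes(struct rcu_head *head)
{
	/* Sched-RCU grace period done; now wait out non-NMI SRCU. */
	call_srcu(&trace_srcu_struct_nonmi, head, srcu_free_old_probes);
}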

>> >> It does start to seem like a show stopper :-(
>> >
>> > I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
>> > be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
>> > whether this would be fast enough to be useful, but easy to provide:
>> >
>> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
>> > {
>> > int idx;
>> >
>> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
>> > atomic_inc(&sp->sda->srcu_lock_count[idx]);
>> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
>> > return idx;
>> > }
>> >
>> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
>> > {
>> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
>> > atomic_inc(&sp->sda->srcu_unlock_count[idx]);
>> > }
>> >
>> > With appropriate adjustments to also allow Tiny RCU to also work.
>> >
>> > Note that you have to use _nmi() everywhere, not just in NMI handlers.
>> > In fact, the NMI handlers are the one place you -don't- need to use
>> > _nmi(), strangely enough.
>> >
>> > Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
>> > some architectures, for example.
>>
>> Continuing Steve's question on regular interrupts, do we need to use
>> this atomic_inc API for regular interrupts as well? So I guess
>
> If NMIs use one srcu_struct and non-NMI uses another, the current
> srcu_read_lock() and srcu_read_unlock() will work just fine. If any given
> srcu_struct needs both NMI and non-NMI readers, then we really do need
> __srcu_read_lock_nmi() and __srcu_read_unlock_nmi() for that srcu_struct.

Yes, I believe as long as in_nmi() works reliably, we can use the
right srcu_struct (NMI vs non-NMI) and it would be fine.
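
A sketch of what the handle selection could look like in the rcuidle
path of __DO_TRACE; the second, NMI-only handle is hypothetical here:

extern struct srcu_struct tracepoint_srcu;
extern struct srcu_struct tracepoint_srcu_nmi;	/* hypothetical */

static inline int tracepoint_srcu_lock(struct srcu_struct **ssp)
{
	/* NMI readers must never touch the non-NMI handle's counters. */
	*ssp = in_nmi() ? &tracepoint_srcu_nmi : &tracepoint_srcu;
	return srcu_read_lock_notrace(*ssp);
}

static inline void tracepoint_srcu_unlock(struct srcu_struct *ss, int idx)
{
	srcu_read_unlock_notrace(ss, idx);
}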

Going through this thread, though, it sounds like this_cpu_inc may not
be reliable on all architectures even for non-NMI interrupts, and
local_inc may be the way to go.

For the next merge window (not this one), let's do that then? Paul, if
you could provide me an SRCU API that uses local_inc, then I believe
that coupled with this patch should be all that's needed:
https://lore.kernel.org/patchwork/patch/972657/

Steve did express concern, though, about whether in_nmi() works
reliably (i.e. that a tracepoint doesn't fire from "thunk" code before
in_nmi() is available). Any thoughts on that, Steve?

thanks!

- Joel

2018-08-08 20:20:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 12:24:20PM -0700, Joel Fernandes wrote:
> On Wed, Aug 8, 2018 at 7:49 AM, Paul E. McKenney
> <[email protected]> wrote:
> [...]
> >
> >> In that case based on what you're saying, the patch I sent to using
> >> different srcu_struct for NMI is still good I guess...
> >
> > As long as you wait for both SRCU grace periods. Hmmm... Maybe that means
> > that there is still a use for synchronize_rcu_mult():
> >
> > void call_srcu_nmi(struct rcu_head *rhp, rcu_callback_t func)
> > {
> > call_srcu(&trace_srcu_struct_nmi, rhp, func);
> > }
> >
> > void call_srcu_nonmi(struct rcu_head *rhp, rcu_callback_t func)
> > {
> > call_srcu(&trace_srcu_struct_nonmi, rhp, func);
> > }
> >
> > ...
> >
> > /* Wait concurrently on the two grace periods. */
> > synchronize_rcu_mult(call_srcu_nmi, call_srcu_nonmi);
> >
> > On the other hand, I bet that doing this is just fine in your use case:
> >
> > synchronize_srcu(&trace_srcu_struct_nmi);
> > synchronize_srcu(&trace_srcu_struct_nonmi);
> >
> > But please note that synchronize_rcu_mult() is no more in my -rcu tree,
> > so if you do want it please let me know (and please let me know why it
> > is important).
>
> I did the chaining thing (one callback calling another), that should
> work too right? I believe that is needed so that the tracepoint
> callbacks are freed at one point and only when both NMI and non-NMI
> read sections have completed.

Yes, that works also. It is possible to make that happen concurrently
via atomic_dec_and_test() or similar, but if the latency is not a problem,
why bother?

> >> >> It does start to seem like a show stopper :-(
> >> >
> >> > I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
> >> > be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
> >> > whether this would be fast enough to be useful, but easy to provide:
> >> >
> >> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> >> > {
> >> > int idx;
> >> >
> >> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
> >> > atomic_inc(&sp->sda->srcu_lock_count[idx]);
> >> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> >> > return idx;
> >> > }
> >> >
> >> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> >> > {
> >> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> >> > atomic_inc(&sp->sda->srcu_unlock_count[idx]);
> >> > }
> >> >
> >> > With appropriate adjustments to also allow Tiny RCU to also work.
> >> >
> >> > Note that you have to use _nmi() everywhere, not just in NMI handlers.
> >> > In fact, the NMI handlers are the one place you -don't- need to use
> >> > _nmi(), strangely enough.
> >> >
> >> > Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
> >> > some architectures, for example.
> >>
> >> Continuing Steve's question on regular interrupts, do we need to use
> >> this atomic_inc API for regular interrupts as well? So I guess
> >
> > If NMIs use one srcu_struct and non-NMI uses another, the current
> > srcu_read_lock() and srcu_read_unlock() will work just fine. If any given
> > srcu_struct needs both NMI and non-NMI readers, then we really do need
> > __srcu_read_lock_nmi() and __srcu_read_unlock_nmi() for that srcu_struct.
>
> Yes, I believe as long as in_nmi() works reliably, we can use the
> right srcu_struct (NMI vs non-NMI) and it would be fine.
>
> Going through this thread, it sounds though that this_cpu_inc may not
> be reliable on all architectures even for non-NMI interrupts and
> local_inc may be the way to go.

My understanding is that this_cpu_inc() is defined to handle interrupts,
so any architecture on which it is unreliable needs to fix its bug. ;-)

> For next merge window (not this one), lets do that then? Paul, if you
> could provide me an SRCU API that uses local_inc, then I believe that
> coupled with this patch should be all that's needed:
> https://lore.kernel.org/patchwork/patch/972657/
>
> Steve did express concern though if in_nmi() works reliably (i.e.
> tracepoint doesn't fire from "thunk" code before in_nmi() is
> available). Any thoughts on that Steve?

Agreed, not the upcoming merge window. But we do need to work out
exactly what is the right way to do this.

Thanx, Paul


2018-08-08 22:17:18

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 8, 2018 at 1:18 PM, Paul E. McKenney
<[email protected]> wrote:
[...]
>> >> >> It does start to seem like a show stopper :-(
>> >> >
>> >> > I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
>> >> > be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
>> >> > whether this would be fast enough to be useful, but easy to provide:
>> >> >
>> >> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
>> >> > {
>> >> > int idx;
>> >> >
>> >> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
>> >> > atomic_inc(&sp->sda->srcu_lock_count[idx]);
>> >> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
>> >> > return idx;
>> >> > }
>> >> >
>> >> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
>> >> > {
>> >> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
>> >> > atomic_inc(&sp->sda->srcu_unlock_count[idx]);
>> >> > }
>> >> >
>> >> > With appropriate adjustments to also allow Tiny RCU to also work.
>> >> >
>> >> > Note that you have to use _nmi() everywhere, not just in NMI handlers.
>> >> > In fact, the NMI handlers are the one place you -don't- need to use
>> >> > _nmi(), strangely enough.
>> >> >
>> >> > Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
>> >> > some architectures, for example.
>> >>
>> >> Continuing Steve's question on regular interrupts, do we need to use
>> >> this atomic_inc API for regular interrupts as well? So I guess
>> >
>> > If NMIs use one srcu_struct and non-NMI uses another, the current
>> > srcu_read_lock() and srcu_read_unlock() will work just fine. If any given
>> > srcu_struct needs both NMI and non-NMI readers, then we really do need
>> > __srcu_read_lock_nmi() and __srcu_read_unlock_nmi() for that srcu_struct.
>>
>> Yes, I believe as long as in_nmi() works reliably, we can use the
>> right srcu_struct (NMI vs non-NMI) and it would be fine.
>>
>> Going through this thread, it sounds though that this_cpu_inc may not
>> be reliable on all architectures even for non-NMI interrupts and
>> local_inc may be the way to go.
>
> My understanding is that this_cpu_inc() is defined to handle interrupts,
> so any architecture on which it is unreliable needs to fix its bug. ;-)

Yes that's my understanding as well.

Then maybe I'm missing something about your and Steve's conversation
this morning about why we need to bother with the local_inc then. So
the current SRCU code with the separate NMI handle should work fine
(for future merge windows) as long as we're using a separate
srcu_struct for NMI. :-)

>
>> For next merge window (not this one), lets do that then? Paul, if you
>> could provide me an SRCU API that uses local_inc, then I believe that
>> coupled with this patch should be all that's needed:
>> https://lore.kernel.org/patchwork/patch/972657/
>>
>> Steve did express concern though if in_nmi() works reliably (i.e.
>> tracepoint doesn't fire from "thunk" code before in_nmi() is
>> available). Any thoughts on that Steve?
>
> Agreed, not the upcoming merge window. But we do need to work out
> exactly what is the right way to do this.

Agreed, thanks!

- Joel

2018-08-08 22:49:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage

On Wed, Aug 08, 2018 at 03:15:31PM -0700, Joel Fernandes wrote:
> On Wed, Aug 8, 2018 at 1:18 PM, Paul E. McKenney
> <[email protected]> wrote:
> [...]
> >> >> >> It does start to seem like a show stopper :-(
> >> >> >
> >> >> > I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
> >> >> > be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
> >> >> > whether this would be fast enough to be useful, but easy to provide:
> >> >> >
> >> >> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
> >> >> > {
> >> >> > int idx;
> >> >> >
> >> >> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
> >> >> > atomic_inc(&sp->sda->srcu_lock_count[idx]);
> >> >> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
> >> >> > return idx;
> >> >> > }
> >> >> >
> >> >> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
> >> >> > {
> >> >> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
> >> >> > atomic_inc(&sp->sda->srcu_unlock_count[idx]);
> >> >> > }
> >> >> >
> >> >> > With appropriate adjustments to also allow Tiny RCU to also work.
> >> >> >
> >> >> > Note that you have to use _nmi() everywhere, not just in NMI handlers.
> >> >> > In fact, the NMI handlers are the one place you -don't- need to use
> >> >> > _nmi(), strangely enough.
> >> >> >
> >> >> > Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
> >> >> > some architectures, for example.
> >> >>
> >> >> Continuing Steve's question on regular interrupts, do we need to use
> >> >> this atomic_inc API for regular interrupts as well? So I guess
> >> >
> >> > If NMIs use one srcu_struct and non-NMI uses another, the current
> >> > srcu_read_lock() and srcu_read_unlock() will work just fine. If any given
> >> > srcu_struct needs both NMI and non-NMI readers, then we really do need
> >> > __srcu_read_lock_nmi() and __srcu_read_unlock_nmi() for that srcu_struct.
> >>
> >> Yes, I believe as long as in_nmi() works reliably, we can use the
> >> right srcu_struct (NMI vs non-NMI) and it would be fine.
> >>
> >> Going through this thread, it sounds though that this_cpu_inc may not
> >> be reliable on all architectures even for non-NMI interrupts and
> >> local_inc may be the way to go.
> >
> > My understanding is that this_cpu_inc() is defined to handle interrupts,
> > so any architecture on which it is unreliable needs to fix its bug. ;-)
>
> Yes that's my understanding as well.
>
> Then may be I'm missing something about yours/Steve's conversations in
> the morning, about why we need bother with the local_inc then. So the
> current SRCU code with the separate NMI handle should work fine (for
> future merge windows) as long as we're using a separate srcu_struct
> for NMI. :-)

I do believe that to be true. But only as long as that separate
srcu_struct is -only- used for NMI.

If this is what you have been pushing for all along, please accept my
apologies for my being slow!

That said, your approach does require you to have a perfect way to
distinguish between NMI and not-NMI. If that distinction is even
slightly imperfect, then some sort of NMI-safe SRCU reader approach is
of course required.

Thanx, Paul

> >> For next merge window (not this one), lets do that then? Paul, if you
> >> could provide me an SRCU API that uses local_inc, then I believe that
> >> coupled with this patch should be all that's needed:
> >> https://lore.kernel.org/patchwork/patch/972657/
> >>
> >> Steve did express concern though if in_nmi() works reliably (i.e.
> >> tracepoint doesn't fire from "thunk" code before in_nmi() is
> >> available). Any thoughts on that Steve?
> >
> > Agreed, not the upcoming merge window. But we do need to work out
> > exactly what is the right way to do this.
>
> Agreed, thanks!
>
> - Joel
>


2018-08-09 12:20:01

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 3/3] tracing: Centralize preemptirq tracepoints and unify their usage



On August 8, 2018 6:47:16 PM EDT, "Paul E. McKenney" <[email protected]> wrote:
>On Wed, Aug 08, 2018 at 03:15:31PM -0700, Joel Fernandes wrote:
>> On Wed, Aug 8, 2018 at 1:18 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> [...]
>> >> >> >> It does start to seem like a show stopper :-(
>> >> >> >
>> >> >> > I suppose that an srcu_read_lock_nmi() and srcu_read_unlock_nmi() could
>> >> >> > be added, which would do atomic ops on sp->sda->srcu_lock_count. Not sure
>> >> >> > whether this would be fast enough to be useful, but easy to provide:
>> >> >> >
>> >> >> > int __srcu_read_lock_nmi(struct srcu_struct *sp) /* UNTESTED. */
>> >> >> > {
>> >> >> > int idx;
>> >> >> >
>> >> >> > idx = READ_ONCE(sp->srcu_idx) & 0x1;
>> >> >> > atomic_inc(&sp->sda->srcu_lock_count[idx]);
>> >> >> > smp_mb__after_atomic(); /* B */ /* Avoid leaking critical section. */
>> >> >> > return idx;
>> >> >> > }
>> >> >> >
>> >> >> > void __srcu_read_unlock_nmi(struct srcu_struct *sp, int idx)
>> >> >> > {
>> >> >> > smp_mb__before_atomic(); /* C */ /* Avoid leaking critical section. */
>> >> >> > atomic_inc(&sp->sda->srcu_unlock_count[idx]);
>> >> >> > }
>> >> >> >
>> >> >> > With appropriate adjustments to also allow Tiny RCU to also work.
>> >> >> >
>> >> >> > Note that you have to use _nmi() everywhere, not just in NMI handlers.
>> >> >> > In fact, the NMI handlers are the one place you -don't- need to use
>> >> >> > _nmi(), strangely enough.
>> >> >> >
>> >> >> > Might be worth a try -- smp_mb__{before,after}_atomic() is a no-op on
>> >> >> > some architectures, for example.
>> >> >>
>> >> >> Continuing Steve's question on regular interrupts, do we need to use
>> >> >> this atomic_inc API for regular interrupts as well? So I guess
>> >> >
>> >> > If NMIs use one srcu_struct and non-NMI uses another, the current
>> >> > srcu_read_lock() and srcu_read_unlock() will work just fine. If any given
>> >> > srcu_struct needs both NMI and non-NMI readers, then we really do need
>> >> > __srcu_read_lock_nmi() and __srcu_read_unlock_nmi() for that srcu_struct.
>> >>
>> >> Yes, I believe as long as in_nmi() works reliably, we can use the
>> >> right srcu_struct (NMI vs non-NMI) and it would be fine.
>> >>
>> >> Going through this thread, it sounds though that this_cpu_inc may not
>> >> be reliable on all architectures even for non-NMI interrupts and
>> >> local_inc may be the way to go.
>> >
>> > My understanding is that this_cpu_inc() is defined to handle interrupts,
>> > so any architecture on which it is unreliable needs to fix its bug. ;-)
>>
>> Yes that's my understanding as well.
>>
>> Then may be I'm missing something about yours/Steve's conversations in
>> the morning, about why we need bother with the local_inc then. So the
>> current SRCU code with the separate NMI handle should work fine (for
>> future merge windows) as long as we're using a separate srcu_struct
>> for NMI. :-)
>
>I do believe that to be true. But only as long as that separate
>srcu_struct is -only- used for NMI.
>
>If this is what you have been pushing for all along, please accept my
>apologies for my being slow!
>

That's OK; sorry I initially didn't describe it well, which may have caused confusion, but yes, that's what I was pushing for.

>That said, your approach does require you to have a perfect way to
>distinguish between NMI and not-NMI. If the distinguishing is even
>in the slightest imperfect, then some sort of NMI-safe SRCU reader
>approach is of course required.
>

Thanks Paul, agreed with everything and we are on the same page.

- Joel



> Thanx, Paul
>
>> >> For next merge window (not this one), lets do that then? Paul, if you
>> >> could provide me an SRCU API that uses local_inc, then I believe that
>> >> coupled with this patch should be all that's needed:
>> >> https://lore.kernel.org/patchwork/patch/972657/
>> >>
>> >> Steve did express concern though if in_nmi() works reliably (i.e.
>> >> tracepoint doesn't fire from "thunk" code before in_nmi() is
>> >> available). Any thoughts on that Steve?
>> >
>> > Agreed, not the upcoming merge window. But we do need to work out
>> > exactly what is the right way to do this.
>>
>> Agreed, thanks!
>>
>> - Joel
>>

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2018-08-10 16:13:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 2/3] tracepoint: Make rcuidle tracepoint callers use SRCU

On Mon, 30 Jul 2018 15:24:22 -0700
Joel Fernandes <[email protected]> wrote:

> From: "Joel Fernandes (Google)" <[email protected]>
>
> In recent tests with IRQ on/off tracepoints, a large performance
> overhead ~10% is noticed when running hackbench. This is root caused to
> calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> tracepoint code. Following a long discussion on the list [1] about this,
> we concluded that srcu is a better alternative for use during rcu idle.
> Although it does involve extra barriers, its lighter than the sched-rcu
> version which has to do additional RCU calls to notify RCU idle about
> entry into RCU sections.
>
> In this patch, we change the underlying implementation of the
> trace_*_rcuidle API to use SRCU. This has shown to improve performance
> alot for the high frequency irq enable/disable tracepoints.
>
> Test: Tested idle and preempt/irq tracepoints.
>
> Here are some performance numbers:
>
> With a run of the following 30 times on a single core x86 Qemu instance
> with 1GB memory:
> hackbench -g 4 -f 2 -l 3000
>
> Completion times in seconds. CONFIG_PROVE_LOCKING=y.
>
> No patches (without this series)
> Mean: 3.048
> Median: 3.025
> Std Dev: 0.064
>
> With Lockdep using irq tracepoints with RCU implementation:
> Mean: 3.451 (-11.66 %)
> Median: 3.447 (-12.22%)
> Std Dev: 0.049
>
> With Lockdep using irq tracepoints with SRCU implementation (this series):
> Mean: 3.020 (I would consider the improvement against the "without
> this series" case as just noise).
> Median: 3.013
> Std Dev: 0.033
>
> [1] https://patchwork.kernel.org/patch/10344297/
>
> [remove rcu_read_lock_sched_notrace as its the equivalent of
> preempt_disable_notrace and is unnecessary to call in tracepoint code]
> Cleaned-up-by: Peter Zijlstra <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Reviewed-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> ---
> include/linux/tracepoint.h | 41 ++++++++++++++++++++++++++++++--------
> kernel/tracepoint.c | 16 ++++++++++++++-
> 2 files changed, 48 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 19a690b559ca..6e7bc6ebfcd8 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -15,6 +15,7 @@
> */
>
> #include <linux/smp.h>
> +#include <linux/srcu.h>
> #include <linux/errno.h>
> #include <linux/types.h>
> #include <linux/cpumask.h>
> @@ -33,6 +34,8 @@ struct trace_eval_map {
>
> #define TRACEPOINT_DEFAULT_PRIO 10
>
> +extern struct srcu_struct tracepoint_srcu;
> +
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> extern int
> @@ -75,10 +78,16 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
> * probe unregistration and the end of module exit to make sure there is no
> * caller executing a probe when it is freed.
> */
> +#ifdef CONFIG_TRACEPOINTS
> static inline void tracepoint_synchronize_unregister(void)
> {
> + synchronize_srcu(&tracepoint_srcu);
> synchronize_sched();
> }
> +#else
> +static inline void tracepoint_synchronize_unregister(void)
> +{ }
> +#endif
>
> #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
> extern int syscall_regfunc(void);
> @@ -129,18 +138,32 @@ extern void syscall_unregfunc(void);
> * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
> * "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
> */
> -#define __DO_TRACE(tp, proto, args, cond, rcucheck) \
> +#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
> do { \
> struct tracepoint_func *it_func_ptr; \
> void *it_func; \
> void *__data; \
> + int __maybe_unused idx = 0; \
> \
> if (!(cond)) \
> return; \
> - if (rcucheck) \
> - rcu_irq_enter_irqson(); \
> - rcu_read_lock_sched_notrace(); \
> - it_func_ptr = rcu_dereference_sched((tp)->funcs); \
> + \
> + /* srcu can't be used from NMI */ \
> + if (rcuidle && in_nmi()) \
> + WARN_ON_ONCE(1); \
> + \
> + /* keep srcu and sched-rcu usage consistent */ \
> + preempt_disable_notrace(); \
> + \
> + /* \
> + * For rcuidle callers, use srcu since sched-rcu \
> + * doesn't work from the idle path. \
> + */ \
> + if (rcuidle) \
> + idx = srcu_read_lock_notrace(&tracepoint_srcu); \
> + \
> + it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> + \
> if (it_func_ptr) { \
> do { \
> it_func = (it_func_ptr)->func; \
> @@ -148,9 +171,11 @@ extern void syscall_unregfunc(void);
> ((void(*)(proto))(it_func))(args); \
> } while ((++it_func_ptr)->func); \
> } \
> - rcu_read_unlock_sched_notrace(); \
> - if (rcucheck) \
> - rcu_irq_exit_irqson(); \
> + \
> + if (rcuidle) \
> + srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
> + \
> + preempt_enable_notrace(); \
> } while (0)
>
> #ifndef MODULE
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 6dc6356c3327..955148d91b74 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -31,6 +31,9 @@
> extern struct tracepoint * const __start___tracepoints_ptrs[];
> extern struct tracepoint * const __stop___tracepoints_ptrs[];
>
> +DEFINE_SRCU(tracepoint_srcu);
> +EXPORT_SYMBOL_GPL(tracepoint_srcu);
> +
> /* Set to 1 to enable tracepoint debug output */
> static const int tracepoint_debug;
>
> @@ -67,16 +70,27 @@ static inline void *allocate_probes(int count)
> return p == NULL ? NULL : p->probes;
> }
>
> -static void rcu_free_old_probes(struct rcu_head *head)
> +static void srcu_free_old_probes(struct rcu_head *head)
> {
> kfree(container_of(head, struct tp_probes, rcu));
> }
>
> +static void rcu_free_old_probes(struct rcu_head *head)
> +{
> + call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
> +}
> +

Grumble. This fails my tests that test enabling tracepoints via the
command line with this:

WARNING: CPU: 0 PID: 13 at /work/build/trace/nobackup/linux-test.git/kernel/rcu/srcutree.c:236 check_init_srcu_struct+0xe/0x61
Modules linked in:
CPU: 0 PID: 13 Comm: watchdog/0 Not tainted 4.18.0-rc6-test+ #6
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
RIP: 0010:check_init_srcu_struct+0xe/0x61
Code: 48 c7 c6 ec 8a 65 b4 e8 ff 79 fe ff 48 89 df 31 f6 e8 f2 fa ff ff 5a 5b 41 5c 5d c3 0f 1f 44 00 00 83 3d 68 94 b8 01 01 75 02 <0f> 0b 48 8b 87 f0 0a 00 00 a8 03 74 45 55 48 89 e5 41 55 41 54 4c
RSP: 0000:ffff96eb9ea03e68 EFLAGS: 00010246
RAX: ffff96eb962b5b01 RBX: ffffffffb4a87420 RCX: 0000000000000001
RDX: ffffffffb3107969 RSI: ffff96eb962b5b40 RDI: ffffffffb4a87420
RBP: ffff96eb9ea03eb0 R08: ffffabbd00cd7f48 R09: 0000000000000000
R10: ffff96eb9ea03e68 R11: ffffffffb4a6eec0 R12: ffff96eb962b5b40
R13: ffff96eb9ea03ef8 R14: ffffffffb3107969 R15: ffffffffb3107948
FS: 0000000000000000(0000) GS:ffff96eb9ea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff96eb13ab2000 CR3: 0000000192a1e001 CR4: 00000000001606f0
Call Trace:
<IRQ>
? __call_srcu+0x2d/0x290
? rcu_process_callbacks+0x26e/0x448
? allocate_probes+0x2b/0x2b
call_srcu+0x13/0x15
rcu_free_old_probes+0x1f/0x21
rcu_process_callbacks+0x2ed/0x448
__do_softirq+0x172/0x336
irq_exit+0x62/0xb2
smp_apic_timer_interrupt+0x161/0x19e
apic_timer_interrupt+0xf/0x20
</IRQ>

Investigating it, the problem is that when we register more than one
event, the tracepoint code calls "release_probes" when adding new
tracepoints (as it updates the tracepoint array), and this is done very
early in boot up, causing this warning.

I'm still looking at ways to fix this. Any ideas would help.
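
One possible shape of a fix, sketched purely for illustration (an
assumption about the approach, not the patch that was actually posted):
stash frees requested before SRCU is ready on a list and release them
from an initcall once call_srcu() is safe. This assumes release_probes()
is still called under tracepoints_mutex, which serializes the list:

static bool ok_to_free_tracepoints;
static struct rcu_head *early_probes;

static inline void release_probes(struct tracepoint_func *old)
{
	if (old) {
		struct tp_probes *tp_probes = container_of(old,
			struct tp_probes, probes[0]);

		/* Too early for call_srcu(): stash the head, free later. */
		if (unlikely(!ok_to_free_tracepoints)) {
			tp_probes->rcu.next = early_probes;
			early_probes = &tp_probes->rcu;
			return;
		}
		call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
	}
}

/* SRCU is initialized at core_initcall() time, so flush after that. */
static __init int release_early_probes(void)
{
	struct rcu_head *tmp;

	ok_to_free_tracepoints = true;
	while (early_probes) {
		tmp = early_probes;
		early_probes = tmp->next;
		call_rcu_sched(tmp, rcu_free_old_probes);
	}
	return 0;
}
postcore_initcall(release_early_probes);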

-- Steve


> static inline void release_probes(struct tracepoint_func *old)
> {
> if (old) {
> struct tp_probes *tp_probes = container_of(old,
> struct tp_probes, probes[0]);
> + /*
> + * Tracepoint probes are protected by both sched RCU and SRCU,
> + * by calling the SRCU callback in the sched RCU callback we
> + * cover both cases. So let us chain the SRCU and sched RCU
> + * callbacks to wait for both grace periods.
> + */
> call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
> }
> }


2018-08-10 16:34:30

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v12 2/3] tracepoint: Make rcuidle tracepoint callers use SRCU

On Fri, 10 Aug 2018 11:35:54 -0400
Steven Rostedt <[email protected]> wrote:

> Investigating it, it's that when we register more than one event, the
> tracepoint code calls "release_probes" when adding new tracepoints (as
> it updated the tracepoint array), and this is done very early in boot
> up, causing this warning.
>
> I'm still looking at ways to fix this. Any ideas would help.

I just posted a patch:

http://lkml.kernel.org/r/[email protected]

It appears to fix the issue. I'm going to start testing it in my suite.
Does anyone have any concerns about it, or a better solution?

Thanks!

-- Steve