Subject: [RFC 0/3] preempt_tracer: Fix preempt_disable tracepoint

While playing with the model and working on a fix for the task
context & trace recursion, I ended up hitting two cases in which the
preempt_disable/enable tracepoint was supposed to fire, but did not.

There is an explanation for each case in the respective log message.

This is an RFC exposing the problem, along with possible ways to fix it.
But I bet there are better solutions as well, so this is a real RFC.

Daniel Bristot de Oliveira (3):
softirq: Use preempt_latency_stop/start to trace preemption
preempt_tracer: Disable IRQ while starting/stopping due to a
preempt_counter change
preempt_tracer: Use a percpu variable to control traceable calls

kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++-----------
kernel/softirq.c | 13 ++++-----
2 files changed, 55 insertions(+), 24 deletions(-)

--
2.20.1


Subject: [RFC 1/3] softirq: Use preempt_latency_stop/start to trace preemption

The preempt_disable/enable tracepoints fire only on the preemption
enabled <-> disabled transition. As preempt_latency_stop() and
preempt_latency_start() already implement this check, avoid code
duplication by using these functions in the softirq code as well.

RFC: Should we move preempt_latency_start/preempt_latency_stop
to a trace source file... or even a header?

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Matthias Kaehlcke <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Yangtao Li <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: [email protected]
---
kernel/sched/core.c | 4 ++--
kernel/softirq.c | 13 +++++--------
2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..8c0b414e45dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3152,7 +3152,7 @@ static inline void sched_tick_stop(int cpu) { }
* If the value passed in is equal to the current preempt count
* then we just disabled preemption. Start timing the latency.
*/
-static inline void preempt_latency_start(int val)
+void preempt_latency_start(int val)
{
if (preempt_count() == val) {
unsigned long ip = get_lock_parent_ip();
@@ -3189,7 +3189,7 @@ NOKPROBE_SYMBOL(preempt_count_add);
* If the value passed in equals to the current preempt count
* then we just enabled preemption. Stop timing the latency.
*/
-static inline void preempt_latency_stop(int val)
+void preempt_latency_stop(int val)
{
if (preempt_count() == val)
trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2c3382378d94..c9ad89c3dfed 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -108,6 +108,8 @@ static bool ksoftirqd_running(unsigned long pending)
* where hardirqs are disabled legitimately:
*/
#ifdef CONFIG_TRACE_IRQFLAGS
+extern void preempt_latency_start(int val);
+extern void preempt_latency_stop(int val);
void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
{
unsigned long flags;
@@ -130,12 +132,8 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
trace_softirqs_off(ip);
raw_local_irq_restore(flags);

- if (preempt_count() == cnt) {
-#ifdef CONFIG_DEBUG_PREEMPT
- current->preempt_disable_ip = get_lock_parent_ip();
-#endif
- trace_preempt_off(CALLER_ADDR0, get_lock_parent_ip());
- }
+ preempt_latency_start(cnt);
+
}
EXPORT_SYMBOL(__local_bh_disable_ip);
#endif /* CONFIG_TRACE_IRQFLAGS */
@@ -144,8 +142,7 @@ static void __local_bh_enable(unsigned int cnt)
{
lockdep_assert_irqs_disabled();

- if (preempt_count() == cnt)
- trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+ preempt_latency_stop(cnt);

if (softirq_count() == (cnt & SOFTIRQ_MASK))
trace_softirqs_on(_RET_IP_);
--
2.20.1

Subject: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

The preempt_disable/enable tracepoint only traces in the disable <-> enable
case, which is correct. But think about this case:

---------------------------- %< ------------------------------
   THREAD                          IRQ
      |                             |
   preempt_disable() {
       __preempt_count_add(1)
           ------->  smp_apic_timer_interrupt() {
                         preempt_disable()
                             do not trace (preempt count >= 1)
                         ....
                         preempt_enable()
                             do not trace (preempt count >= 1)
                     }
       trace_preempt_disable();
   }
---------------------------- >% ------------------------------

The tracepoint will be skipped.

It is possible to observe this problem using this kernel module:

http://bristot.me/files/efficient_verification/wip.tar.gz [*]

and doing the following trace:

# cd /sys/kernel/debug/tracing/
# echo 1 > snapshot
# cat available_events | grep preempt_ > set_event
# echo irq_vectors >> /sys/kernel/debug/tracing/set_event
# insmod wip.ko
/* wait for a snapshot creation */
# tail -100 snapshot
[...]
tail-5572 [001] ....1.. 2888.401184: preempt_enable: caller=_raw_spin_unlock_irqrestore+0x2a/0x70 parent= (null)
tail-5572 [001] ....1.. 2888.401184: preempt_disable: caller=migrate_disable+0x8b/0x1e0 parent=migrate_disable+0x8b/0x1e0
tail-5572 [001] ....111 2888.401184: preempt_enable: caller=migrate_disable+0x12f/0x1e0 parent=migrate_disable+0x12f/0x1e0
tail-5572 [001] d..h212 2888.401189: local_timer_entry: vector=236
tail-5572 [001] dN.h512 2888.401192: process_event: event sched_waking not expected in the state preemptive
tail-5572 [001] dN.h512 2888.401200: <stack trace>
=> process_event
=> __handle_event
=> ttwu_do_wakeup
=> try_to_wake_up
=> invoke_rcu_core
=> rcu_check_callbacks
=> update_process_times
=> tick_sched_handle
=> tick_sched_timer
=> __hrtimer_run_queues
=> hrtimer_interrupt
=> smp_apic_timer_interrupt
=> apic_timer_interrupt
=> trace_event_raw_event_preemptirq_template
=> trace_preempt_off
=> get_page_from_freelist
=> __alloc_pages_nodemask
=> __handle_mm_fault
=> handle_mm_fault
=> __do_page_fault
=> do_page_fault
=> async_page_fault

To avoid skipping the trace, the change in the counter should be "atomic"
with the start/stop, w.r.t. interrupts.

Disable interrupts around the add/start and stop/subtract pairs.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Matthias Kaehlcke <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Yangtao Li <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: [email protected]
---
[*] with some luck we will talk about this at the Plumbers.

kernel/sched/core.c | 48 +++++++++++++++++++++++++++++++++------------
kernel/softirq.c | 2 +-
2 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8c0b414e45dc..be4117d7384f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3163,6 +3163,16 @@ void preempt_latency_start(int val)
}
}

+static inline void preempt_add_start_latency(int val)
+{
+ unsigned long flags;
+
+ raw_local_irq_save(flags);
+ __preempt_count_add(val);
+ preempt_latency_start(val);
+ raw_local_irq_restore(flags);
+}
+
void preempt_count_add(int val)
{
#ifdef CONFIG_DEBUG_PREEMPT
@@ -3172,7 +3182,7 @@ void preempt_count_add(int val)
if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
return;
#endif
- __preempt_count_add(val);
+ preempt_add_start_latency(val);
#ifdef CONFIG_DEBUG_PREEMPT
/*
* Spinlock count overflowing soon?
@@ -3180,7 +3190,6 @@ void preempt_count_add(int val)
DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
PREEMPT_MASK - 10);
#endif
- preempt_latency_start(val);
}
EXPORT_SYMBOL(preempt_count_add);
NOKPROBE_SYMBOL(preempt_count_add);
@@ -3195,6 +3204,16 @@ void preempt_latency_stop(int val)
trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
}

+static inline void preempt_sub_stop_latency(int val)
+{
+ unsigned long flags;
+
+ raw_local_irq_save(flags);
+ preempt_latency_stop(val);
+ __preempt_count_sub(val);
+ raw_local_irq_restore(flags);
+}
+
void preempt_count_sub(int val)
{
#ifdef CONFIG_DEBUG_PREEMPT
@@ -3211,8 +3230,7 @@ void preempt_count_sub(int val)
return;
#endif

- preempt_latency_stop(val);
- __preempt_count_sub(val);
+ preempt_sub_stop_latency(val);
}
EXPORT_SYMBOL(preempt_count_sub);
NOKPROBE_SYMBOL(preempt_count_sub);
@@ -3220,6 +3238,16 @@ NOKPROBE_SYMBOL(preempt_count_sub);
#else
static inline void preempt_latency_start(int val) { }
static inline void preempt_latency_stop(int val) { }
+
+static inline void preempt_sub_stop_latency(int val)
+{
+ __preempt_count_sub(val);
+}
+
+static inline void preempt_add_start_latency(int val)
+{
+ __preempt_count_add(val);
+}
#endif

static inline unsigned long get_preempt_disable_ip(struct task_struct *p)
@@ -3585,11 +3613,9 @@ static void __sched notrace preempt_schedule_common(void)
* traced. The other to still record the preemption latency,
* which can also be traced by the function tracer.
*/
- preempt_disable_notrace();
- preempt_latency_start(1);
+ preempt_add_start_latency(1);
__schedule(true);
- preempt_latency_stop(1);
- preempt_enable_no_resched_notrace();
+ preempt_sub_stop_latency(1);

/*
* Check again in case we missed a preemption opportunity
@@ -3653,8 +3679,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
* traced. The other to still record the preemption latency,
* which can also be traced by the function tracer.
*/
- preempt_disable_notrace();
- preempt_latency_start(1);
+ preempt_add_start_latency(1);
/*
* Needs preempt disabled in case user_exit() is traced
* and the tracer calls preempt_enable_notrace() causing
@@ -3664,8 +3689,7 @@ asmlinkage __visible void __sched notrace preempt_schedule_notrace(void)
__schedule(true);
exception_exit(prev_ctx);

- preempt_latency_stop(1);
- preempt_enable_no_resched_notrace();
+ preempt_sub_stop_latency(1);
} while (need_resched());
}
EXPORT_SYMBOL_GPL(preempt_schedule_notrace);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index c9ad89c3dfed..9c64522ecc76 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -130,9 +130,9 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
*/
if (softirq_count() == (cnt & SOFTIRQ_MASK))
trace_softirqs_off(ip);
- raw_local_irq_restore(flags);

preempt_latency_start(cnt);
+ raw_local_irq_restore(flags);

}
EXPORT_SYMBOL(__local_bh_disable_ip);
--
2.20.1

Subject: [RFC 3/3] preempt_tracer: Use a percpu variable to control traceable calls

The preempt_disable tracepoint only traces the disable <-> enable case,
which is correct. But think about this case:

--------------------------- %< ----------------------
   THREAD                          IRQ
      |                             |
   preempt_disable_notrace() {
       __preempt_count_add(1)
   }

   /* preemption disabled/IRQs enabled */

           ------->  smp_apic_timer_interrupt() {
                         preempt_disable() {
                             do not trace (preempt count >= 1)
                         }
                         ....
                         preempt_enable() {
                             do not trace (preempt count >= 1)
                         }
           <-------  }
   preempt_enable_notrace() {
       __preempt_count_sub(1)
   }
--------------------------- >% ----------------------

The non-traceable preempt_disable can hide a legitimate preempt_disable
(which is worth tracing).

It is possible to observe this problem using this kernel module:

http://bristot.me/files/efficient_verification/wip.tar.gz

and doing the following trace:

# cd /sys/kernel/debug/tracing/
# echo 1 > snapshot
# cat available_events | grep preempt_ > set_event
# echo irq_vectors >> /sys/kernel/debug/tracing/set_event
# insmod wip.ko
/* wait for a snapshot creation */
# tail -100 snapshot
sshd-1159 [000] d...1.. 2440.866116: preempt_enable: caller=migrate_enable+0x1bb/0x330 parent=migrate_enable+0x1bb/0x330
sshd-1159 [000] d..h1.. 2440.866122: local_timer_entry: vector=236
sshd-1159 [000] d..h1.. 2440.866127: local_timer_exit: vector=236
sshd-1159 [000] d.L.4.. 2440.866129: process_event: event sched_waking not expected in the state preemptive
sshd-1159 [000] d.L.4.. 2440.866137: <stack trace>
=> process_event
=> __handle_event
=> ttwu_do_wakeup
=> try_to_wake_up
=> irq_exit
=> smp_apic_timer_interrupt
=> apic_timer_interrupt
=> kvm_clock_read
=> ktime_get_with_offset
=> posix_get_boottime
=> __x64_sys_clock_gettime
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

and kvm_clock_read() disables preemption without tracing:

--------------------------- %< ----------------------
static u64 kvm_clock_read(void)
{
        u64 ret;

        preempt_disable_notrace();
        ret = pvclock_clocksource_read(this_cpu_pvti());
        preempt_enable_notrace();
        return ret;
}
--------------------------- >% ----------------------

Use a percpu variable for the traced preempt_disable/enable, and use it
to decide whether or not to trace.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Matthias Kaehlcke <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Yangtao Li <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: [email protected]
---
kernel/sched/core.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be4117d7384f..2e07d4174778 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3148,19 +3148,25 @@ static inline void sched_tick_stop(int cpu) { }

#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
defined(CONFIG_TRACE_PREEMPT_TOGGLE))
+
+DEFINE_PER_CPU(int, __traced_preempt_count) = 0;
/*
* If the value passed in is equal to the current preempt count
* then we just disabled preemption. Start timing the latency.
*/
void preempt_latency_start(int val)
{
- if (preempt_count() == val) {
+ int curr = this_cpu_read(__traced_preempt_count);
+
+ if (!curr) {
unsigned long ip = get_lock_parent_ip();
#ifdef CONFIG_DEBUG_PREEMPT
current->preempt_disable_ip = ip;
#endif
trace_preempt_off(CALLER_ADDR0, ip);
}
+
+ this_cpu_write(__traced_preempt_count, curr + val);
}

static inline void preempt_add_start_latency(int val)
@@ -3200,8 +3206,12 @@ NOKPROBE_SYMBOL(preempt_count_add);
*/
void preempt_latency_stop(int val)
{
- if (preempt_count() == val)
+ int curr = this_cpu_read(__traced_preempt_count) - val;
+
+ if (!curr)
trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+
+ this_cpu_write(__traced_preempt_count, curr);
}

static inline void preempt_sub_stop_latency(int val)
--
2.20.1

2019-05-29 08:33:34

by Peter Zijlstra

Subject: Re: [RFC 1/3] softirq: Use preempt_latency_stop/start to trace preemption

On Tue, May 28, 2019 at 05:16:22PM +0200, Daniel Bristot de Oliveira wrote:
> prempt_disable/enable tracepoints occurs only in the preemption
> enabled <-> disable transition. As preempt_latency_stop() and
> preempt_latency_start() already do this control, avoid code
> duplication by using these functions in the softirq code as well.
>
> RFC: Should we move preempt_latency_start/preempt_latency_stop
> to a trace source file... or even a header?

Yeah, a header might not be a bad idea.

> @@ -130,12 +132,8 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
> trace_softirqs_off(ip);
> raw_local_irq_restore(flags);
>
> - if (preempt_count() == cnt) {
> -#ifdef CONFIG_DEBUG_PREEMPT
> - current->preempt_disable_ip = get_lock_parent_ip();
> -#endif
> - trace_preempt_off(CALLER_ADDR0, get_lock_parent_ip());
> - }
> + preempt_latency_start(cnt);
> +

I'm thinking we can do without that empty line.

> }
> EXPORT_SYMBOL(__local_bh_disable_ip);
> #endif /* CONFIG_TRACE_IRQFLAGS */

2019-05-29 08:38:37

by Peter Zijlstra

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Tue, May 28, 2019 at 05:16:23PM +0200, Daniel Bristot de Oliveira wrote:
> The preempt_disable/enable tracepoint only traces in the disable <-> enable
> case, which is correct. But think about this case:
>
> ---------------------------- %< ------------------------------
> THREAD IRQ
> | |
> preempt_disable() {
> __preempt_count_add(1)
> -------> smp_apic_timer_interrupt() {
> preempt_disable()
> do not trace (preempt count >= 1)
> ....
> preempt_enable()
> do not trace (preempt count >= 1)
> }
> trace_preempt_disable();
> }
> ---------------------------- >% ------------------------------
>
> The tracepoint will be skipped.

.... for the IRQ. But IRQs are not preemptible anyway, so what the
problem?

> To avoid skipping the trace, the change in the counter should be "atomic"
> with the start/stop, w.r.t the interrupts.
>
> Disable interrupts while the adding/starting stopping/subtracting.

> +static inline void preempt_add_start_latency(int val)
> +{
> + unsigned long flags;
> +
> + raw_local_irq_save(flags);
> + __preempt_count_add(val);
> + preempt_latency_start(val);
> + raw_local_irq_restore(flags);
> +}

> +static inline void preempt_sub_stop_latency(int val)
> +{
> + unsigned long flags;
> +
> + raw_local_irq_save(flags);
> + preempt_latency_stop(val);
> + __preempt_count_sub(val);
> + raw_local_irq_restore(flags);
> +}

That is hideously expensive :/

2019-05-29 08:43:29

by Peter Zijlstra

Subject: Re: [RFC 3/3] preempt_tracer: Use a percpu variable to control traceable calls

On Tue, May 28, 2019 at 05:16:24PM +0200, Daniel Bristot de Oliveira wrote:
> #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
> defined(CONFIG_TRACE_PREEMPT_TOGGLE))
> +
> +DEFINE_PER_CPU(int, __traced_preempt_count) = 0;
> /*
> * If the value passed in is equal to the current preempt count
> * then we just disabled preemption. Start timing the latency.
> */
> void preempt_latency_start(int val)
> {
> - if (preempt_count() == val) {
> + int curr = this_cpu_read(__traced_preempt_count);

We actually have this_cpu_add_return();

> +
> + if (!curr) {
> unsigned long ip = get_lock_parent_ip();
> #ifdef CONFIG_DEBUG_PREEMPT
> current->preempt_disable_ip = ip;
> #endif
> trace_preempt_off(CALLER_ADDR0, ip);
> }
> +
> + this_cpu_write(__traced_preempt_count, curr + val);
> }
>
> static inline void preempt_add_start_latency(int val)
> @@ -3200,8 +3206,12 @@ NOKPROBE_SYMBOL(preempt_count_add);
> */
> void preempt_latency_stop(int val)
> {
> - if (preempt_count() == val)
> + int curr = this_cpu_read(__traced_preempt_count) - val;

this_cpu_sub_return();

> +
> + if (!curr)
> trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
> +
> + this_cpu_write(__traced_preempt_count, curr);
> }

Can't say I love this, but it is miles better than the last patch.

2019-05-29 09:32:21

by Joel Fernandes

Subject: Re: [RFC 1/3] softirq: Use preempt_latency_stop/start to trace preemption

On Tue, May 28, 2019 at 05:16:22PM +0200, Daniel Bristot de Oliveira wrote:
> prempt_disable/enable tracepoints occurs only in the preemption
> enabled <-> disable transition. As preempt_latency_stop() and
> preempt_latency_start() already do this control, avoid code
> duplication by using these functions in the softirq code as well.
>
> RFC: Should we move preempt_latency_start/preempt_latency_stop
> to a trace source file... or even a header?

Yes, I think so. Also this patch changes CALLER_ADDR0 passed to the
tracepoint because there's one more level of a non-inlined function call
in the call chain right? Very least the changelog should document this
change in functional behavior, IMO.

Looks like a nice change otherwise, thanks!


> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> Cc: "Steven Rostedt (VMware)" <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Matthias Kaehlcke <[email protected]>
> Cc: "Joel Fernandes (Google)" <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Yangtao Li <[email protected]>
> Cc: Tommaso Cucinotta <[email protected]>
> Cc: [email protected]
> ---
> kernel/sched/core.c | 4 ++--
> kernel/softirq.c | 13 +++++--------
> 2 files changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427742a9..8c0b414e45dc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3152,7 +3152,7 @@ static inline void sched_tick_stop(int cpu) { }
> * If the value passed in is equal to the current preempt count
> * then we just disabled preemption. Start timing the latency.
> */
> -static inline void preempt_latency_start(int val)
> +void preempt_latency_start(int val)
> {
> if (preempt_count() == val) {
> unsigned long ip = get_lock_parent_ip();
> @@ -3189,7 +3189,7 @@ NOKPROBE_SYMBOL(preempt_count_add);
> * If the value passed in equals to the current preempt count
> * then we just enabled preemption. Stop timing the latency.
> */
> -static inline void preempt_latency_stop(int val)
> +void preempt_latency_stop(int val)
> {
> if (preempt_count() == val)
> trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 2c3382378d94..c9ad89c3dfed 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -108,6 +108,8 @@ static bool ksoftirqd_running(unsigned long pending)
> * where hardirqs are disabled legitimately:
> */
> #ifdef CONFIG_TRACE_IRQFLAGS
> +extern void preempt_latency_start(int val);
> +extern void preempt_latency_stop(int val);
> void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
> {
> unsigned long flags;
> @@ -130,12 +132,8 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
> trace_softirqs_off(ip);
> raw_local_irq_restore(flags);
>
> - if (preempt_count() == cnt) {
> -#ifdef CONFIG_DEBUG_PREEMPT
> - current->preempt_disable_ip = get_lock_parent_ip();
> -#endif
> - trace_preempt_off(CALLER_ADDR0, get_lock_parent_ip());
> - }
> + preempt_latency_start(cnt);
> +
> }
> EXPORT_SYMBOL(__local_bh_disable_ip);
> #endif /* CONFIG_TRACE_IRQFLAGS */
> @@ -144,8 +142,7 @@ static void __local_bh_enable(unsigned int cnt)
> {
> lockdep_assert_irqs_disabled();
>
> - if (preempt_count() == cnt)
> - trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
> + preempt_latency_stop(cnt);
>
> if (softirq_count() == (cnt & SOFTIRQ_MASK))
> trace_softirqs_on(_RET_IP_);
> --
> 2.20.1
>

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On 29/05/2019 10:33, Peter Zijlstra wrote:
> On Tue, May 28, 2019 at 05:16:23PM +0200, Daniel Bristot de Oliveira wrote:
>> The preempt_disable/enable tracepoint only traces in the disable <-> enable
>> case, which is correct. But think about this case:
>>
>> ---------------------------- %< ------------------------------
>> THREAD IRQ
>> | |
>> preempt_disable() {
>> __preempt_count_add(1)
>> -------> smp_apic_timer_interrupt() {
>> preempt_disable()
>> do not trace (preempt count >= 1)
>> ....
>> preempt_enable()
>> do not trace (preempt count >= 1)
>> }
>> trace_preempt_disable();
>> }
>> ---------------------------- >% ------------------------------
>>
>> The tracepoint will be skipped.
>
> .... for the IRQ. But IRQs are not preemptible anyway, so what the
> problem?


right, they are.

exposing my problem in a more specific way:

To show in a model that an event always takes place with preemption disabled,
but not necessarily with IRQs disabled, it is worth having the preemption
disable events separated from IRQ disable ones.

The main reason is that, although disabling IRQs postpones the execution of the
scheduler, it is more pessimistic, as it also delays IRQs. So the more precise
the model is, the less pessimistic the analysis will be.

But there are other use-cases, for instance:

(Steve, correct me if I am wrong)

The preempt_tracer will not notice a "preempt disabled" section in an IRQ
handler if the problem above happens.

(Yeah, I know these problems are very specific... but...)

>> To avoid skipping the trace, the change in the counter should be "atomic"
>> with the start/stop, w.r.t the interrupts.
>>
>> Disable interrupts while the adding/starting stopping/subtracting.
>
>> +static inline void preempt_add_start_latency(int val)
>> +{
>> + unsigned long flags;
>> +
>> + raw_local_irq_save(flags);
>> + __preempt_count_add(val);
>> + preempt_latency_start(val);
>> + raw_local_irq_restore(flags);
>> +}
>
>> +static inline void preempt_sub_stop_latency(int val)
>> +{
>> + unsigned long flags;
>> +
>> + raw_local_irq_save(flags);
>> + preempt_latency_stop(val);
>> + __preempt_count_sub(val);
>> + raw_local_irq_restore(flags);
>> +}
>
> That is hideously expensive :/

Yeah... :-( Is there another way to provide such "atomicity"?

Can I use the argument "if one has these tracepoints enabled, they are not
considering it as a hot-path?"

-- Daniel

Subject: Re: [RFC 3/3] preempt_tracer: Use a percpu variable to control traceable calls

On 29/05/2019 10:41, Peter Zijlstra wrote:
> On Tue, May 28, 2019 at 05:16:24PM +0200, Daniel Bristot de Oliveira wrote:
>> #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
>> defined(CONFIG_TRACE_PREEMPT_TOGGLE))
>> +
>> +DEFINE_PER_CPU(int, __traced_preempt_count) = 0;
>> /*
>> * If the value passed in is equal to the current preempt count
>> * then we just disabled preemption. Start timing the latency.
>> */
>> void preempt_latency_start(int val)
>> {
>> - if (preempt_count() == val) {
>> + int curr = this_cpu_read(__traced_preempt_count);
>
> We actually have this_cpu_add_return();
>
>> +
>> + if (!curr) {
>> unsigned long ip = get_lock_parent_ip();
>> #ifdef CONFIG_DEBUG_PREEMPT
>> current->preempt_disable_ip = ip;
>> #endif
>> trace_preempt_off(CALLER_ADDR0, ip);
>> }
>> +
>> + this_cpu_write(__traced_preempt_count, curr + val);
>> }
>>
>> static inline void preempt_add_start_latency(int val)
>> @@ -3200,8 +3206,12 @@ NOKPROBE_SYMBOL(preempt_count_add);
>> */
>> void preempt_latency_stop(int val)
>> {
>> - if (preempt_count() == val)
>> + int curr = this_cpu_read(__traced_preempt_count) - val;
>
> this_cpu_sub_return();
>
>> +
>> + if (!curr)
>> trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
>> +
>> + this_cpu_write(__traced_preempt_count, curr);
>> }
>
> Can't say I love this, but it is miles better than the last patch.
>

ack! I will change the methods (and remove some blank lines here and there... :-))
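
Something along these lines (untested sketch, keeping the
__traced_preempt_count percpu variable from the patch):

void preempt_latency_start(int val)
{
        if (this_cpu_add_return(__traced_preempt_count, val) == val) {
                unsigned long ip = get_lock_parent_ip();
#ifdef CONFIG_DEBUG_PREEMPT
                current->preempt_disable_ip = ip;
#endif
                trace_preempt_off(CALLER_ADDR0, ip);
        }
}

void preempt_latency_stop(int val)
{
        if (!this_cpu_sub_return(__traced_preempt_count, val))
                trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
}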

Thanks Peter!

-- Daniel

2019-05-29 10:23:35

by Peter Zijlstra

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, May 29, 2019 at 11:40:34AM +0200, Daniel Bristot de Oliveira wrote:
> On 29/05/2019 10:33, Peter Zijlstra wrote:
> > On Tue, May 28, 2019 at 05:16:23PM +0200, Daniel Bristot de Oliveira wrote:
> >> The preempt_disable/enable tracepoint only traces in the disable <-> enable
> >> case, which is correct. But think about this case:
> >>
> >> ---------------------------- %< ------------------------------
> >> THREAD IRQ
> >> | |
> >> preempt_disable() {
> >> __preempt_count_add(1)
> >> -------> smp_apic_timer_interrupt() {
> >> preempt_disable()
> >> do not trace (preempt count >= 1)
> >> ....
> >> preempt_enable()
> >> do not trace (preempt count >= 1)
> >> }
> >> trace_preempt_disable();
> >> }
> >> ---------------------------- >% ------------------------------
> >>
> >> The tracepoint will be skipped.
> >
> > .... for the IRQ. But IRQs are not preemptible anyway, so what the
> > problem?
>
>
> right, they are.
>
> exposing my problem in a more specific way:
>
> To show in a model that an event always takes place with preemption disabled,
> but not necessarily with IRQs disabled, it is worth having the preemption
> disable events separated from IRQ disable ones.
>
> The main reason is that, although IRQs disabled postpone the execution of the
> scheduler, it is more pessimistic, as it also delays IRQs. So the more precise
> the model is, the less pessimistic the analysis will be.

I'm not sure I follow, IRQs disabled fully implies !preemptible. I don't
see how the model would be more pessimistic than reality if it were to
use this knowledge.

Any !0 preempt_count(), which very much includes (Hard)IRQ and SoftIRQ
counts, means non-preemptible.

2019-05-29 12:24:53

by Steven Rostedt

Subject: Re: [RFC 1/3] softirq: Use preempt_latency_stop/start to trace preemption

On Wed, 29 May 2019 05:30:56 -0400
Joel Fernandes <[email protected]> wrote:

> Yes, I think so. Also this patch changes CALLER_ADDR0 passed to the
> tracepoint because there's one more level of a non-inlined function call
> in the call chain right? Very least the changelog should document this
> change in functional behavior, IMO.

This sounds more like a break in behavior not a functional change. I
guess moving it to a header and making it a static __always_inline
should be fine though.

-- Steve

2019-05-29 12:41:40

by Steven Rostedt

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, 29 May 2019 12:20:38 +0200
Peter Zijlstra <[email protected]> wrote:

> On Wed, May 29, 2019 at 11:40:34AM +0200, Daniel Bristot de Oliveira wrote:
> > On 29/05/2019 10:33, Peter Zijlstra wrote:
> > > On Tue, May 28, 2019 at 05:16:23PM +0200, Daniel Bristot de Oliveira wrote:
> > >> The preempt_disable/enable tracepoint only traces in the disable <-> enable
> > >> case, which is correct. But think about this case:
> > >>
> > >> ---------------------------- %< ------------------------------
> > >> THREAD IRQ
> > >> | |
> > >> preempt_disable() {
> > >> __preempt_count_add(1)
> > >> -------> smp_apic_timer_interrupt() {
> > >> preempt_disable()
> > >> do not trace (preempt count >= 1)
> > >> ....
> > >> preempt_enable()
> > >> do not trace (preempt count >= 1)
> > >> }
> > >> trace_preempt_disable();
> > >> }
> > >> ---------------------------- >% ------------------------------
> > >>
> > >> The tracepoint will be skipped.
> > >
> > > .... for the IRQ. But IRQs are not preemptible anyway, so what the
> > > problem?
> >
> >
> > right, they are.
> >
> > exposing my problem in a more specific way:
> >
> > To show in a model that an event always takes place with preemption disabled,
> > but not necessarily with IRQs disabled, it is worth having the preemption
> > disable events separated from IRQ disable ones.
> >
> > The main reason is that, although IRQs disabled postpone the execution of the
> > scheduler, it is more pessimistic, as it also delays IRQs. So the more precise
> > the model is, the less pessimistic the analysis will be.
>
> I'm not sure I follow, IRQs disabled fully implies !preemptible. I don't
> see how the model would be more pessimistic than reality if it were to
> use this knowledge.
>
> Any !0 preempt_count(), which very much includes (Hard)IRQ and SoftIRQ
> counts, means non-preemptible.

I believe I see what Daniel is talking about, but I hate the proposed
solution ;-)

First, if you care about real times that the CPU can't preempt
(preempt_count != 0 or interrupts disabled), then you want the
preempt_irqsoff tracer. The preempt_tracer is more academic where it
just shows you when we disable preemption via the counter. But even
with the preempt_irqsoff tracer you may not get the full length of time
due to the above explained race.

__preempt_count_add(1) <-- CPU now prevents preemption

<interrupt>
---->
trace_hardirqs_off() <-- Start preempt disable timer
handler();
trace_hardirqs_on() <-- Stop preempt disable timer (Diff A)
<----
trace_preempt_disable() <-- Start preempt disable timer

[..]

trace_preempt_enable() <-- Stop preempt disable timer (Diff B)

__preempt_count_sub(1) <-- CPU re-enables preemption

The preempt_irqsoff tracer will break this into two parts: Diff A and
Diff B, when in reality, the total time the CPU prevented preemption
was A + B.

Note, we can have the same race if an interrupt came in just after
calling trace_preempt_enable() and before the __preempt_count_sub().

What I would recommend is adding a flag to the task_struct that gets
set before the __preempt_count_add() and cleared by the tracing
function. If an interrupt goes off during this time, it will start the
total time to record, and not end it on the trace_hardirqs_on() part.
Now since we set this flag before disabling preemption, what if we get
preempted before calling __preempt_count_add()? Simple: have a hook in
the scheduler (just connect to the sched_switch tracepoint) that checks
that flag, and if it is set, it ends the preempt disable recording
time. Also on scheduling that task back in, if that flag is set, start
the preempt disable timer.

It may get a bit more complex, but we can contain that complexity
within the tracing code, and we don't have to do added irq disabling.
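
Very roughly (untested; TRACE_PREEMPT_OFF_PENDING is a made-up flag name,
and task_struct already has a word of tracer state flags that could hold it):

        /* before disabling preemption, e.g. in preempt_count_add() */
        current->trace |= TRACE_PREEMPT_OFF_PENDING;
        __preempt_count_add(val);
        preempt_latency_start(val);     /* clears the flag again */

        /*
         * sched_switch hook: if prev still has the flag set, end the
         * preempt-off timing for prev; when it is scheduled back in,
         * start it again.
         */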

-- Steve

2019-05-29 13:26:13

by Peter Zijlstra

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, May 29, 2019 at 08:39:30AM -0400, Steven Rostedt wrote:
> I believe I see what Daniel is talking about, but I hate the proposed
> solution ;-)
>
> First, if you care about real times that the CPU can't preempt
> (preempt_count != 0 or interrupts disabled), then you want the
> preempt_irqsoff tracer. The preempt_tracer is more academic where it
> just shows you when we disable preemption via the counter. But even
> with the preempt_irqsoff tracer you may not get the full length of time
> due to the above explained race.

IOW, that tracer gives a completely 'make believe' number? What's the
point? Just delete the pure preempt tracer.

And the preempt_irqoff tracer had better also consume the IRQ events,
and if it does that it can DTRT without extra bits on, even with that
race.

Consider:

preempt_disable()
preempt_count += 1;
<IRQ>
trace_irq_enter();

trace_irq_exit();
</IRQ>
trace_preempt_disable();

/* does stuff */

preempt_enable()
preempt_count -= 1;
trace_preempt_enable();

You're saying preempt_irqoff() fails to connect the two because of the
hole between trace_irq_exit() and trace_preempt_disable() ?

But trace_irq_exit() can see the raised preempt_count and set state for
trace_preempt_disable() to connect.
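
Something like this, roughly (untested; the helpers are made up, the point
is just some per-cpu storage in the tracer):

        static DEFINE_PER_CPU(int, preempt_off_pending);

        /* called from the irq-exit hook; the task's count is still visible */
        static void note_irq_exit(void)
        {
                if (preempt_count() & PREEMPT_MASK)
                        this_cpu_write(preempt_off_pending, 1);
        }

        /* called from the preempt-off hook to connect the two sections */
        static bool connect_preempt_off(void)
        {
                return this_cpu_xchg(preempt_off_pending, 0);
        }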

> What I would recommend is adding a flag to the task_struct that gets
> set before the __preempt_count_add() and cleared by the tracing
> function. If an interrupt goes off during this time, it will start the
> total time to record, and not end it on the trace_hardirqs_on() part.
> Now since we set this flag before disabling preemption, what if we get
> preempted before calling __preempt_count_add()?. Simple, have a hook in
> the scheduler (just connect to the sched_switch tracepoint) that checks
> that flag, and if it is set, it ends the preempt disable recording
> time. Also on scheduling that task back in, if that flag is set, start
> the preempt disable timer.

I don't think that works, you also have to consider softirq. And yes you
can make it more complicated, but I still don't see the point.

And none of this is relevant for Daniel's model stuff. He just needs to
consider in-IRQ as !preempt.

2019-05-29 13:33:00

by Peter Zijlstra

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, May 29, 2019 at 03:19:57PM +0200, Peter Zijlstra wrote:
> On Wed, May 29, 2019 at 08:39:30AM -0400, Steven Rostedt wrote:
> > I believe I see what Daniel is talking about, but I hate the proposed
> > solution ;-)
> >
> > First, if you care about real times that the CPU can't preempt
> > (preempt_count != 0 or interrupts disabled), then you want the
> > preempt_irqsoff tracer. The preempt_tracer is more academic where it
> > just shows you when we disable preemption via the counter. But even
> > with the preempt_irqsoff tracer you may not get the full length of time
> > due to the above explained race.
>
> IOW, that tracer gives a completely 'make believe' number? What's the
> point? Just delete the pure preempt tracer.

Alternatively, fix the preempt tracer by having it completely disregard
IRQs.

2019-05-29 13:44:49

by Steven Rostedt

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, 29 May 2019 15:19:57 +0200
Peter Zijlstra <[email protected]> wrote:

> On Wed, May 29, 2019 at 08:39:30AM -0400, Steven Rostedt wrote:
> > I believe I see what Daniel is talking about, but I hate the proposed
> > solution ;-)
> >
> > First, if you care about real times that the CPU can't preempt
> > (preempt_count != 0 or interrupts disabled), then you want the
> > preempt_irqsoff tracer. The preempt_tracer is more academic where it
> > just shows you when we disable preemption via the counter. But even
> > with the preempt_irqsoff tracer you may not get the full length of time
> > due to the above explained race.
>
> IOW, that tracer gives a completely 'make believe' number? What's the
> point? Just delete the pure preempt tracer.

The preempt_tracer is there as part of the preempt_irqsoff tracer
implementation. By removing it, the only code we would remove is
displaying preemptoff as a tracer. I stated this when it was created,
that it was more of an academic exercise if you use it, but that code
was required to get preempt_irqsoff working.

>
> And the preempt_irqoff tracer had better also consume the IRQ events,
> and if it does that it can DTRT without extra bits on, even with that
> race.
>
> Consider:
>
> preempt_disable()
> preempt_count += 1;
> <IRQ>
> trace_irq_enter();
>
> trace_irq_exit();
> </IRQ>
> trace_preempt_disable();
>
> /* does stuff */
>
> preempt_enable()
> preempt_count -= 1;
> trace_preempt_enable();
>
> You're saying preempt_irqoff() fails to connect the two because of the
> hole between trace_irq_exit() and trace_preempt_disable() ?
>
> But trace_irq_exit() can see the raised preempt_count and set state
> for trace_preempt_disable() to connect.

That's basically what I was suggesting as the solution to this ;-)

>
> > What I would recommend is adding a flag to the task_struct that gets
> > set before the __preempt_count_add() and cleared by the tracing
> > function. If an interrupt goes off during this time, it will start
> > the total time to record, and not end it on the trace_hardirqs_on()
> > part. Now since we set this flag before disabling preemption, what
> > if we get preempted before calling __preempt_count_add()?. Simple,
> > have a hook in the scheduler (just connect to the sched_switch
> > tracepoint) that checks that flag, and if it is set, it ends the
> > preempt disable recording time. Also on scheduling that task back
> > in, if that flag is set, start the preempt disable timer.
>
> I don't think that works, you also have to consider softirq. And yes
> you can make it more complicated, but I still don't see the point.

Note, there are places that disable preemption without being traced. If
we trigger only on preemption being disabled and start the "timer",
there may not be any code to stop it. That was why I recommended the
flag in the code that starts the timing.

>
> And none of this is relevant for Daniels model stuff. He just needs to
> consider in-IRQ as !preempt.

But he does bring up an issue where preempt_irqsoff fails.

-- Steve

2019-05-29 13:52:39

by Peter Zijlstra

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, May 29, 2019 at 09:42:13AM -0400, Steven Rostedt wrote:
> > And the preempt_irqoff tracer had better also consume the IRQ events,
> > and if it does that it can DTRT without extra bits on, even with that
> > race.
> >
> > Consider:
> >
> > preempt_disable()
> > preempt_count += 1;
> > <IRQ>
> > trace_irq_enter();
> >
> > trace_irq_exit();
> > </IRQ>
> > trace_preempt_disable();
> >
> > /* does stuff */
> >
> > preempt_enable()
> > preempt_count -= 1;
> > trace_preempt_enable();
> >
> > You're saying preempt_irqoff() fails to connect the two because of the
> > hole between trace_irq_exit() and trace_preempt_disable() ?
> >
> > But trace_irq_exit() can see the raised preempt_count and set state
> > for trace_preempt_disable() to connect.
>
> That's basically what I was suggesting as the solution to this ;-)

You were wanting changes to preempt_disable() and task_struct, neither
of which is required. The above only needs some per-cpu storage in the
tracer implementation.

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On 29/05/2019 12:20, Peter Zijlstra wrote:
> On Wed, May 29, 2019 at 11:40:34AM +0200, Daniel Bristot de Oliveira wrote:
>> On 29/05/2019 10:33, Peter Zijlstra wrote:
>>> On Tue, May 28, 2019 at 05:16:23PM +0200, Daniel Bristot de Oliveira wrote:
>>>> The preempt_disable/enable tracepoint only traces in the disable <-> enable
>>>> case, which is correct. But think about this case:
>>>>
>>>> ---------------------------- %< ------------------------------
>>>> THREAD IRQ
>>>> | |
>>>> preempt_disable() {
>>>> __preempt_count_add(1)
>>>> -------> smp_apic_timer_interrupt() {
>>>> preempt_disable()
>>>> do not trace (preempt count >= 1)
>>>> ....
>>>> preempt_enable()
>>>> do not trace (preempt count >= 1)
>>>> }
>>>> trace_preempt_disable();
>>>> }
>>>> ---------------------------- >% ------------------------------
>>>>
>>>> The tracepoint will be skipped.
>>>
>>> .... for the IRQ. But IRQs are not preemptible anyway, so what the
>>> problem?
>>
>>
>> right, they are.
>>
>> exposing my problem in a more specific way:
>>
>> To show in a model that an event always takes place with preemption disabled,
>> but not necessarily with IRQs disabled, it is worth having the preemption
>> disable events separated from IRQ disable ones.
>>
>> The main reason is that, although IRQs disabled postpone the execution of the
>> scheduler, it is more pessimistic, as it also delays IRQs. So the more precise
>> the model is, the less pessimistic the analysis will be.
>
> I'm not sure I follow, IRQs disabled fully implies !preemptible. I don't
> see how the model would be more pessimistic than reality if it were to
> use this knowledge.

Maybe I did not express myself well... and the example was not good either.

"IRQs disabled fully implies !preemptible" is a "to big" step. In modeling (or
mathematical reasoning?), a good practice is to break the properties into small
piece, and then build more complex reasoning/implications using these "small
properties."

Doing "big steps" makes you prone "miss interpretations", creating ambiguity.
Then, -RT people are prone to be pessimist, non-RT optimistic, and so on... and
that is what models try to avoid.

For instance, explaining this using words is contradictory:
> Any !0 preempt_count(), which very much includes (Hard)IRQ and SoftIRQ
> counts, means non-preemptible.

One might argue that the preemption of a thread always takes place with
preempt_count() != 0, because __schedule() is always called with preemption
disabled, so the preemption takes place while non-preemptible.

A more elaborate example:

------------------ %< --------------------------
Thread A is running, and goes to sleep waiting for a timer...
schedule() {
    preempt_disable();
    __schedule() {
        smp_apic_timer_interrupt() {
            sched_wakeup (Thread A);
            sched_wakeup (Thread B: highest prio) {
                sched_set_need_resched();
            }
        }
        local_irq_disable()
        context switch to B, A leaves in state=R.
------------------ %< --------------------------

In this case, thread A suffered a "preemption" with "!0 preempt_count()".

The fact is, Linux does not fit straight into the "well known terminology" of
academic papers, because much of that terminology is based on the assumption
that operations are atomic. But they are not, and Linux has some behaviors that
call for new terminology/interpretation...

- WAIT! But you (Daniel) want to fake the atomicity between preempt_disable and
its tracepoint!

Yes, I do, but this is a very straightforward step/assumption: the atomicity is
about the real event and the tracepoint that notifies it. It is not about two
different events.

That is why it is worth letting the modeling rules clarify the behavior of the
system, without making non-obvious implications in the code, so we can have a
model that fits the Linux actions/events better and avoids ambiguity.

[ note 1: the tracepoint is only enabled if CONFIG_PREEMPTIRQ_TRACEPOINTS=y,
which is not enabled by default ]

[ note 2: I just saw that Steven replied while I was writing this email... I
will read his replies now... sorry for some repetition here ]

-- Daniel

2019-05-29 14:00:40

by Steven Rostedt

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, 29 May 2019 15:49:46 +0200
Peter Zijlstra <[email protected]> wrote:


> > That's basically what I was suggesting as the solution to this ;-)
>
> You were wanting changes to preempt_disable() and task_struct, neither
> of which is required. The above only needs some per-cpu storage in the
> tracer implementation.

The only changes were to the trace preempt_disable() code, which I still
think needs to be done regardless, to differentiate between when we are
tracing preempt disable and when we are not.

And no modification would need to be done to task_struct as we already
have a place to add tracing flags.

-- Steve



2019-05-29 18:24:52

by Peter Zijlstra

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, May 29, 2019 at 03:51:31PM +0200, Daniel Bristot de Oliveira wrote:
> On 29/05/2019 12:20, Peter Zijlstra wrote:

> > I'm not sure I follow, IRQs disabled fully implies !preemptible. I don't
> > see how the model would be more pessimistic than reality if it were to
> > use this knowledge.
>
> Maybe I did not expressed myself well... and the example was not good either.
>
> "IRQs disabled fully implies !preemptible" is a "to big" step. In modeling (or
> mathematical reasoning?), a good practice is to break the properties into small
> piece, and then build more complex reasoning/implications using these "small
> properties."
>
> Doing "big steps" makes you prone "miss interpretations", creating ambiguity.
> Then, -RT people are prone to be pessimist, non-RT optimistic, and so on... and
> that is what models try to avoid.

You already construct the big model out of small generators, this is
just one more of those little generators.

> For instance, explaining this using words is contradictory:>
> > Any !0 preempt_count(), which very much includes (Hard)IRQ and SoftIRQ
> > counts, means non-preemptible.
>
> One might argue that, the preemption of a thread always takes place with
> preempt_count() != 0, because __schedule() is always called with preemption
> disabled, so the preemption takes place while in non-preemptive.

Yeah, I know about that one; you've used it in your talks. Also, you've
modeled the schedule preempt disable as a special state. If you want we
can actually make it a special bit in the preempt_count word too, the
patch shouldn't be too hard, although it would make no practical
difference.

> - WAIT But you (daniel) wants to fake the atomicity between preempt_disable and
> its tracepoint!
>
> Yes, I do, but this is a very straightforward step/assumption: the atomicity is
> about the real-event and the tracepoint that notifies it. It is not about two
> different events.
>
> That is why it is worth letting the modeling rules to clarify the behavior of
> system, without doing non-obvious implication in the code part, so we can have a
> model that fits better in the Linux actions/events to avoid ambiguity.

You can easily build a little shim between the model and the tracehooks
that fixes this up if you don't want to stick it in the model proper.

All the information is there. Heck you can even do that 3/3 thing
internally I think.

2019-05-31 07:49:00

by Joel Fernandes

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Wed, May 29, 2019 at 11:40:34AM +0200, Daniel Bristot de Oliveira wrote:
> On 29/05/2019 10:33, Peter Zijlstra wrote:
> > On Tue, May 28, 2019 at 05:16:23PM +0200, Daniel Bristot de Oliveira wrote:
> >> The preempt_disable/enable tracepoint only traces in the disable <-> enable
> >> case, which is correct. But think about this case:
> >>
> >> ---------------------------- %< ------------------------------
> >> THREAD IRQ
> >> | |
> >> preempt_disable() {
> >> __preempt_count_add(1)
> >> -------> smp_apic_timer_interrupt() {
> >> preempt_disable()
> >> do not trace (preempt count >= 1)
> >> ....
> >> preempt_enable()
> >> do not trace (preempt count >= 1)
> >> }
> >> trace_preempt_disable();
> >> }
> >> ---------------------------- >% ------------------------------
> >>
> >> The tracepoint will be skipped.
> >
> > .... for the IRQ. But IRQs are not preemptible anyway, so what the
> > problem?
>
>
> right, they are.
>
> exposing my problem in a more specific way:
>
> To show in a model that an event always takes place with preemption disabled,
> but not necessarily with IRQs disabled, it is worth having the preemption
> disable events separated from IRQ disable ones.
>
> The main reason is that, although IRQs disabled postpone the execution of the
> scheduler, it is more pessimistic, as it also delays IRQs. So the more precise
> the model is, the less pessimistic the analysis will be.
>
> But there are other use-cases, for instance:
>
> (Steve, correct me if I am wrong)
>
> The preempt_tracer will not notice a "preempt disabled" section in an IRQ
> handler if the problem above happens.
>
> (Yeah, I know these problems are very specific... but...)

I agree with the problem. I think Daniel does not want to miss the preemption
disabled event caused by the IRQ disabling.

> >> To avoid skipping the trace, the change in the counter should be "atomic"
> >> with the start/stop, w.r.t the interrupts.
> >>
> >> Disable interrupts while the adding/starting stopping/subtracting.
> >
> >> +static inline void preempt_add_start_latency(int val)
> >> +{
> >> + unsigned long flags;
> >> +
> >> + raw_local_irq_save(flags);
> >> + __preempt_count_add(val);
> >> + preempt_latency_start(val);
> >> + raw_local_irq_restore(flags);
> >> +}
> >
> >> +static inline void preempt_sub_stop_latency(int val)
> >> +{
> >> + unsigned long flags;
> >> +
> >> + raw_local_irq_save(flags);
> >> + preempt_latency_stop(val);
> >> + __preempt_count_sub(val);
> >> + raw_local_irq_restore(flags);
> >> +}
> >
> > That is hideously expensive :/
>
> Yeah... :-( Is there another way to provide such "atomicity"?
>
> Can I use the argument "if one has these tracepoints enabled, they are not
> considering it as a hot-path?"

The only addition here seems to be the raw_local_irq_{save,restore} around the
calls to increment the preempt counter and start the latency tracking.

Is there any performance data with the tracepoint enabled and with/without
this patch? Like with hackbench?

Thanks.

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On 31/05/2019 09:47, Joel Fernandes wrote:
> On Wed, May 29, 2019 at 11:40:34AM +0200, Daniel Bristot de Oliveira wrote:
>> On 29/05/2019 10:33, Peter Zijlstra wrote:
>>> On Tue, May 28, 2019 at 05:16:23PM +0200, Daniel Bristot de Oliveira wrote:
>>>> The preempt_disable/enable tracepoint only traces in the disable <-> enable
>>>> case, which is correct. But think about this case:
>>>>
>>>> ---------------------------- %< ------------------------------
>>>> THREAD IRQ
>>>> | |
>>>> preempt_disable() {
>>>> __preempt_count_add(1)
>>>> -------> smp_apic_timer_interrupt() {
>>>> preempt_disable()
>>>> do not trace (preempt count >= 1)
>>>> ....
>>>> preempt_enable()
>>>> do not trace (preempt count >= 1)
>>>> }
>>>> trace_preempt_disable();
>>>> }
>>>> ---------------------------- >% ------------------------------
>>>>
>>>> The tracepoint will be skipped.
>>>
>>> .... for the IRQ. But IRQs are not preemptible anyway, so what the
>>> problem?
>>
>>
>> right, they are.
>>
>> exposing my problem in a more specific way:
>>
>> To show in a model that an event always takes place with preemption disabled,
>> but not necessarily with IRQs disabled, it is worth having the preemption
>> disable events separated from IRQ disable ones.
>>
>> The main reason is that, although IRQs disabled postpone the execution of the
>> scheduler, it is more pessimistic, as it also delays IRQs. So the more precise
>> the model is, the less pessimistic the analysis will be.
>>
>> But there are other use-cases, for instance:
>>
>> (Steve, correct me if I am wrong)
>>
>> The preempt_tracer will not notice a "preempt disabled" section in an IRQ
>> handler if the problem above happens.
>>
>> (Yeah, I know these problems are very specific... but...)
>
> I agree with the problem. I think Daniel does not want to miss the preemption
> disabled event caused by the IRQ disabling.

Hi Joel!

Correct, but ... look below.

>>>> To avoid skipping the trace, the change in the counter should be "atomic"
>>>> with the start/stop, w.r.t the interrupts.
>>>>
>>>> Disable interrupts while the adding/starting stopping/subtracting.
>>>
>>>> +static inline void preempt_add_start_latency(int val)
>>>> +{
>>>> + unsigned long flags;
>>>> +
>>>> + raw_local_irq_save(flags);
>>>> + __preempt_count_add(val);
>>>> + preempt_latency_start(val);
>>>> + raw_local_irq_restore(flags);
>>>> +}
>>>
>>>> +static inline void preempt_sub_stop_latency(int val)
>>>> +{
>>>> + unsigned long flags;
>>>> +
>>>> + raw_local_irq_save(flags);
>>>> + preempt_latency_stop(val);
>>>> + __preempt_count_sub(val);
>>>> + raw_local_irq_restore(flags);
>>>> +}
>>>
>>> That is hideously expensive :/
>>
>> Yeah... :-( Is there another way to provide such "atomicity"?
>>
>> Can I use the argument "if one has these tracepoints enabled, they are not
>> considering it as a hot-path?"
>
> The only addition here seems to the raw_local_irq_{save,restore} around the
> calls to increment the preempt counter and start the latency tracking.
>
> Is there any performance data with the tracepoint enabled and with/without
> this patch? Like with hackbench?

I discussed this with Steve at the Summit on the Summit (the reason why I did
not reply to this email earlier is that I was at the conference/traveling), and he
also agrees with peterz: disabling and (mainly) re-enabling IRQs costs too much.

We need to find another way to resolve this problem (or mitigate the cost).... :-(.

Ideas?

Thanks!!

-- Daniel

Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On 29/05/2019 20:21, Peter Zijlstra wrote:
> On Wed, May 29, 2019 at 03:51:31PM +0200, Daniel Bristot de Oliveira wrote:
>> On 29/05/2019 12:20, Peter Zijlstra wrote:
>
>>> I'm not sure I follow, IRQs disabled fully implies !preemptible. I don't
>>> see how the model would be more pessimistic than reality if it were to
>>> use this knowledge.
>>
>> Maybe I did not expressed myself well... and the example was not good either.
>>
>> "IRQs disabled fully implies !preemptible" is a "to big" step. In modeling (or
>> mathematical reasoning?), a good practice is to break the properties into small
>> piece, and then build more complex reasoning/implications using these "small
>> properties."
>>
>> Doing "big steps" makes you prone "miss interpretations", creating ambiguity.
>> Then, -RT people are prone to be pessimist, non-RT optimistic, and so on... and
>> that is what models try to avoid.
>
> You already construct the big model out of small generators, this is
> just one more of those little generators.

Yes, we can take that way too...

>> For instance, explaining this using words is contradictory:>
>>> Any !0 preempt_count(), which very much includes (Hard)IRQ and SoftIRQ
>>> counts, means non-preemptible.
>>
>> One might argue that, the preemption of a thread always takes place with
>> preempt_count() != 0, because __schedule() is always called with preemption
>> disabled, so the preemption takes place while in non-preemptive.
>
> Yeah, I know about that one; you've used it in your talks. Also, you've
> modeled the schedule preempt disable as a special state. If you want we
> can actually make it a special bit in the preempt_count word too, the
> patch shouldn't be too hard, although it would make no practical
> difference.

Good to know! I am trying not to change the code or abstractions. In the
user-space version I was using the caller address to figure out whether it was
a call from the scheduler or not. In the kernel version I am planning to use an
is_sched_function()-like approach. But if none of that works, I will explore
this possibility, thanks for the suggestion.
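
Roughly, the kind of check I have in mind (only a sketch of the idea, built on
the existing in_sched_functions() helper; the wrapper below is made up and not
a proposed patch):

#include <linux/sched/debug.h>  /* in_sched_functions() */

/*
 * Sketch only: decide whether a preempt disable/enable event came from
 * inside the scheduler, based on the caller's instruction pointer.
 */
static bool preempt_event_from_scheduler(unsigned long caller_ip)
{
        return in_sched_functions(caller_ip);
}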

>> - WAIT! But you (daniel) want to fake the atomicity between preempt_disable and
>> its tracepoint!
>>
>> Yes, I do, but this is a very straightforward step/assumption: the atomicity is
>> about the real-event and the tracepoint that notifies it. It is not about two
>> different events.
>>
>> That is why it is worth letting the modeling rules clarify the behavior of the
>> system, without making non-obvious implications in the code, so we can have a
>> model that better fits the Linux actions/events and avoids ambiguity.
>
> You can easily build a little shim between the model and the tracehooks
> that fixes this up if you don't want to stick it in the model proper.
>
> All the information is there. Heck you can even do that 3/3 thing
> internally I think.
>

Ack, let's see what I can come up with.

Thanks!

-- Daniel

Subject: Re: [RFC 1/3] softirq: Use preempt_latency_stop/start to trace preemption



On 29/05/2019 14:22, Steven Rostedt wrote:
> On Wed, 29 May 2019 05:30:56 -0400
> Joel Fernandes <[email protected]> wrote:
>
>> Yes, I think so. Also this patch changes CALLER_ADDR0 passed to the
>> tracepoint because there's one more level of a non-inlined function call
>> in the call chain, right? At the very least, the changelog should document this
>> change in functional behavior, IMO.

In practice I am seeing no change in the values printed, but there is another
problem in this regard: there are cases in which both the caller and the parent
have the same address.

I am quite sure it has to do with the in_lock_function() behavior. Anyway, I was
already planning to propose a cleanup of in_lock_function/in_sched_function.
I will investigate it further.

> This sounds more like a break in behavior not a functional change. I
> guess moving it to a header and making it a static __always_inline
> should be fine though.

Steve, which header should I use?

Thanks!

-- Daniel

> -- Steve
>

2019-06-04 12:03:52, by Steven Rostedt
Subject: Re: [RFC 1/3] softirq: Use preempt_latency_stop/start to trace preemption

On Tue, 4 Jun 2019 12:39:03 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> > This sounds more like a break in behavior not a functional change. I
> > guess moving it to a header and making it a static __always_inline
> > should be fine though.
>
> Steve, which header should I use?

I would say include/linux/preempt.h if Peter doesn't have any issues
with that.

-- Steve

2019-06-04 15:03:28, by Steven Rostedt
Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Tue, 4 Jun 2019 12:12:36 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> I discussed this with Steve at the Summit (the reason why I did not reply to
> this email earlier is that I was at the conf/traveling), and he also agrees
> with peterz: disabling and (mainly) re-enabling IRQs costs too much.
>
> We need to find another way to resolve this problem (or mitigate the cost).... :-(.
>
> Ideas?

I thought we talked about using flags in the pc to let us know that we
are about to trace the preemption off?


If an interrupt comes in, we check the flag:

irq_enter()
preempt_count_add(HARDIRQ_OFFSET)


Then we can have something like:

preempt_count_add(val) {
        int pc = preempt_count();
        int trace_val = TRACE_FLAG;

        if (val == HARDIRQ_OFFSET && (pc & TRACE_FLAG)) {
                __preempt_count_sub(TRACE_FLAG);
                trace_val |= TRACE_SET;
        }

        __preempt_count_add(val + trace_val);
        if (!pc)
                trace_preempt_disable();
        __preempt_count_sub(trace_val);


And in trace_preempt_enable() we have:

if ((preempt_count() & TRACE_SET) && in_irq())
        return;

Thus, we won't call the preempt_enable tracing code if we started it due
to an interrupt preempting the process of setting the trace.

Note, I'm writing this while extremely tired (still have yet to get
more than 4 hours of sleep) so it may still have bugs, but you should
get the idea ;-)

-- Steve
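
Spelling the outline above out a bit, it could look like the sketch below.
TRACE_FLAG and TRACE_SET are hypothetical spare preempt_count() bits (no such
bits exist today), and trace_preempt_off() stands in for the
trace_preempt_disable() placeholder; this is only a restatement of the idea,
not a tested patch.

/*
 * Sketch only: reserve two (hypothetical) spare bits of preempt_count()
 * for the handshake between the counter update and the tracepoint.
 */
#define TRACE_FLAG      0x20000000      /* "about to emit the preempt-off event" */
#define TRACE_SET       0x40000000      /* "an IRQ took over emitting the event" */

static void preempt_count_add_sketch(int val)
{
        int pc = preempt_count();
        int trace_val = TRACE_FLAG;

        /*
         * Entering hard IRQ context while the interrupted task still had
         * TRACE_FLAG raised: remember that the event is now being handled
         * from IRQ context.
         */
        if (val == HARDIRQ_OFFSET && (pc & TRACE_FLAG)) {
                __preempt_count_sub(TRACE_FLAG);
                trace_val |= TRACE_SET;
        }

        /* Raise the count with the marker bits set... */
        __preempt_count_add(val + trace_val);

        /* ...emit the event only on the 0 -> !0 transition... */
        if (!pc)
                trace_preempt_off(CALLER_ADDR0, CALLER_ADDR1);

        /* ...and drop the marker bits again. */
        __preempt_count_sub(trace_val);
}

/* On the enable side, skip the event if an IRQ already owns it. */
static bool preempt_enable_skip_trace(void)
{
        return (preempt_count() & TRACE_SET) && in_irq();
}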



2019-06-05 15:19:10, by Joel Fernandes
Subject: Re: [RFC 2/3] preempt_tracer: Disable IRQ while starting/stopping due to a preempt_counter change

On Tue, Jun 04, 2019 at 11:01:27AM -0400, Steven Rostedt wrote:
> On Tue, 4 Jun 2019 12:12:36 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
> > I discussed this with Steve at the Summit (the reason why I did not reply to
> > this email earlier is that I was at the conf/traveling), and he also agrees
> > with peterz: disabling and (mainly) re-enabling IRQs costs too much.
> >
> > We need to find another way to resolve this problem (or mitigate the cost).... :-(.
> >
> > Ideas?
>
> I thought we talked about using flags in the pc to let us know that we
> are about to trace the preemption off?
>
>
> If an interrupt comes in, we check the flag:
>
> irq_enter()
> preempt_count_add(HARDIRQ_OFFSET)
>
>
> Then we can have something like:
>
> preempt_count_add(val) {
>         int pc = preempt_count();
>         int trace_val = TRACE_FLAG;
>
>         if (val == HARDIRQ_OFFSET && (pc & TRACE_FLAG)) {
>                 __preempt_count_sub(TRACE_FLAG);
>                 trace_val |= TRACE_SET;
>         }
>
>         __preempt_count_add(val + trace_val);
>         if (!pc)
>                 trace_preempt_disable();
>         __preempt_count_sub(trace_val);
>
>
> And in trace_preempt_enable() we have:
>
> if ((preempt_count() & TRACE_SET) && in_irq())
>         return;
>
> Thus, we won't call the preempt_enable tracing code if we started it due
> to an interrupt preempting the process of setting the trace.

Hmm, the interrupt handler will not be calling preempt_enable anyway since
the preempt counter was already set by the interrupted code. The situation
Daniel describes in patch 2/3 is:

---------------------------- %< ------------------------------
        THREAD                          IRQ
           |                             |
        preempt_disable() {
            __preempt_count_add(1)
                -------->      smp_apic_timer_interrupt() {
                                    preempt_disable()
                                        do not trace (preempt count >= 1)
                                    ....
                                    preempt_enable()
                                        do not trace (preempt count >= 1)
                                }
            trace_preempt_disable();
        }
---------------------------- >% ------------------------------

Here the preempt_enable in the IRQ will not call tracing. If I understand
correctly, he *does* want to call tracing since the IRQ's preempt
disable/enable section could be running for some time, and so
it would not be nice to miss these events from the traces.

Regarding the preempt count flag idea, I really like it. I did not fully
follow the code snippet you shared above, though. My initial understanding of
your idea was: we set a flag in the pc, then increment the preempt count,
call tracing, and then reset the pc flag. In between the increment and the
call to the tracer, if we get interrupted, then we check the flag from IRQ
context and still call tracing. If we get interrupted after calling tracing,
but before resetting the flag, then we may end up with a nested pair of
preempt enable/disable calls, but that's OK. At least we won't miss any
events.

I think we only need to modify the preempt disable path; something like
this in pseudo code (in the THREAD context):

        set pc flag
        inc pc
            <--- if interrupted here, still trace from IRQ.
        call tracer
            <--- if interrupted here, still trace from IRQ.
        reset pc flag
            <--- if interrupted here, no need to trace from IRQ since
                 we already recorded at least one preempt disable event.


Did I get your idea right?

I need to think about this some more, but got to run to an appointment. Will
take a look once I'm back.

> Note, I'm writing this while extremely tired (still have yet to get
> more than 4 hours of sleep) so it may still have bugs, but you should
> get the idea ;-)

No problem ;-)

thanks,

- Joel
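
For reference, a minimal standalone sketch of the sequence Joel describes
above, using a single hypothetical spare preempt_count() bit
(PREEMPT_TRACE_PENDING does not exist in the kernel, the function names are
made up, and this only illustrates the handshake; it is not a proposed patch):

/* Hypothetical spare bit: "count already raised, event not yet emitted". */
#define PREEMPT_TRACE_PENDING   0x40000000

/* Thread-context preempt disable path (sketch). */
static void preempt_disable_traced(void)
{
        /* Set the flag and raise the count in one non-traced update. */
        __preempt_count_add(PREEMPT_TRACE_PENDING + 1);

        /* <-- an IRQ arriving here sees the flag and emits the event itself */

        if ((preempt_count() & ~PREEMPT_TRACE_PENDING) == 1)
                trace_preempt_off(CALLER_ADDR0, CALLER_ADDR1);

        /* <-- an IRQ arriving here traces too; we may get a duplicate event,
         *     but none is lost                                              */

        __preempt_count_sub(PREEMPT_TRACE_PENDING);
}

/* IRQ entry path (sketch): emit the event the thread did not get to emit. */
static void irq_enter_emit_pending_trace(void)
{
        if (preempt_count() & PREEMPT_TRACE_PENDING)
                trace_preempt_off(CALLER_ADDR0, CALLER_ADDR1);
}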