Zimuzo reported seeing occasional cases in perfetto traces where
tasks went from sleeping directly to trace_sched_wakeup()
without always seeing a trace_sched_waking().
Looking at the code, trace_sched_wakeup() is only called in
ttwu_do_wakeup()
The call paths that get you to ttwu_do_wakeup() are:
try_to_wake_up() -> ttwu_do_wakeup()
try_to_wake_up() -> ttwu_runnable() -> ttwu_do_wakeup()
try_to_wake_up() -> ttwu_queue() -> ttwu_do_activate() -> ttwu_do_wakeup()
sched_ttwu_pending() -> ttwu_do_activate() -> ttwu_do_wakeup()
where trace_sched_waking() is currently called only in
try_to_wake_up().
So add a trace_sched_waking() call to sched_ttwu_pending(), so
we see the same state machine transitions.
With this change, the number of unexpected state transitions in
perfetto was greatly reduced.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Reviewed-by: Phil Auld <[email protected]>
Reported-by: Zimuzo Ezeozue <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Minor commit message fix suggested by Phil Auld
---
kernel/sched/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..233f06360d6a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3894,6 +3894,7 @@ void sched_ttwu_pending(void *arg)
if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
set_task_cpu(p, cpu_of(rq));
+ trace_sched_waking(p);
ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
}
--
2.44.0.291.gc1ea87d7ee-goog
Hi,
On Mon, Mar 18, 2024 at 12:28:33PM -0700, John Stultz wrote:
> Zimuzo reported seeing occasional cases in perfetto traces where
> tasks went from sleeping directly to trace_sched_wakeup()
> without always seeing a trace_sched_waking().
>
> Looking at the code, trace_sched_wakeup() is only called in
> ttwu_do_wakeup()
>
> The call paths that get you to ttwu_do_wakeup() are:
> try_to_wake_up() -> ttwu_do_wakeup()
> try_to_wake_up() -> ttwu_runnable() -> ttwu_do_wakeup()
> try_to_wake_up() -> ttwu_queue() -> ttwu_do_activate() -> ttwu_do_wakeup()
> sched_ttwu_pending() -> ttwu_do_activate() -> ttwu_do_wakeup()
Notably, sched_ttwu_pending() is invoked for remote wakeups.
Given this, I anticipate a scenario similar to the following
occurred: When a process (P) is to be awakened on a remote CPU,
the scheduler adds P to the remote CPU's wakelist,a per-CPU queue,
and sends an IPI to the remote CPU. This action triggers
sched_ttwu_pending() on the remote CPU, which then processes the
wakelist and wakes up the queued processes.
In this scenario, the "waking trace" of P, signifying the initiation
of the wake-up, is recorded on the CPU where try_to_wake_up was executed.
Meanwhile, the "wakeup trace," denoting the completion of the wake-up,
is observed on the remote CPU where sched_ttwu_pending() is executed.
Is there a possibility that something other than the above occurred
in your case?
>
> where trace_sched_waking() is currently called only in
> try_to_wake_up().
>
> So add a trace_sched_waking() call to sched_ttwu_pending(), so
> we see the same state machine transitions.
>
> With this change, the number of unexpected state transitions in
> perfetto was greatly reduced.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Phil Auld <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Reviewed-by: Phil Auld <[email protected]>
> Reported-by: Zimuzo Ezeozue <[email protected]>
> Signed-off-by: John Stultz <[email protected]>
> ---
> v2:
> * Minor commit message fix suggested by Phil Auld
> ---
> kernel/sched/core.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9116bcc90346..233f06360d6a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3894,6 +3894,7 @@ void sched_ttwu_pending(void *arg)
> if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> set_task_cpu(p, cpu_of(rq));
>
> + trace_sched_waking(p);
> ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> }
>
> --
> 2.44.0.291.gc1ea87d7ee-goog
>
On Tue, Mar 19, 2024 at 10:24 PM Vishal Chourasia <[email protected]> wrote:
> On Mon, Mar 18, 2024 at 12:28:33PM -0700, John Stultz wrote:
> > Zimuzo reported seeing occasional cases in perfetto traces where
> > tasks went from sleeping directly to trace_sched_wakeup()
> > without always seeing a trace_sched_waking().
> >
> > Looking at the code, trace_sched_wakeup() is only called in
> > ttwu_do_wakeup()
> >
> > The call paths that get you to ttwu_do_wakeup() are:
> > try_to_wake_up() -> ttwu_do_wakeup()
> > try_to_wake_up() -> ttwu_runnable() -> ttwu_do_wakeup()
> > try_to_wake_up() -> ttwu_queue() -> ttwu_do_activate() -> ttwu_do_wakeup()
> > sched_ttwu_pending() -> ttwu_do_activate() -> ttwu_do_wakeup()
> Notably, sched_ttwu_pending() is invoked for remote wakeups.
>
> Given this, I anticipate a scenario similar to the following
> occurred: When a process (P) is to be awakened on a remote CPU,
> the scheduler adds P to the remote CPU's wakelist,a per-CPU queue,
> and sends an IPI to the remote CPU. This action triggers
> sched_ttwu_pending() on the remote CPU, which then processes the
> wakelist and wakes up the queued processes.
Hey! Thanks for taking a look here and sharing this feedback.
Indeed, looking closer here I do have a hard time seeing how we can
get to sched_ttwu_pending() without having gone through the paths in
try_to_wake_up() that would call trace_sched_wakeup():
try_to_wake_up()->[ttwu_queue ->]
ttwu_queue_wakelist()->sched_ttwu_pending()->...
> In this scenario, the "waking trace" of P, signifying the initiation
> of the wake-up, is recorded on the CPU where try_to_wake_up was executed.
> Meanwhile, the "wakeup trace," denoting the completion of the wake-up,
> is observed on the remote CPU where sched_ttwu_pending() is executed.
>
> Is there a possibility that something other than the above occurred
> in your case?
I suspect that the case reported may be that the task went to sleep
right after the trace_sched_waking(). Which could result in the
transition from sleeping to trace_sched_wakeup() without the
trace_sched_waking() inbetween. The added latency from the remote
wakeup probably increases the chance for this race to occur.
So I'll withdraw this change for now and sync back up to check if my
suspicions are correct or not.
Thanks so much again for pointing this out.
-john
On Wed, Mar 20, 2024 at 03:42:31PM -0700, John Stultz wrote:
> On Tue, Mar 19, 2024 at 10:24 PM Vishal Chourasia <[email protected]> wrote:
> > On Mon, Mar 18, 2024 at 12:28:33PM -0700, John Stultz wrote:
> > > Zimuzo reported seeing occasional cases in perfetto traces where
> > > tasks went from sleeping directly to trace_sched_wakeup()
> > > without always seeing a trace_sched_waking().
> > >
> > > Looking at the code, trace_sched_wakeup() is only called in
> > > ttwu_do_wakeup()
> > >
> > > The call paths that get you to ttwu_do_wakeup() are:
> > > try_to_wake_up() -> ttwu_do_wakeup()
> > > try_to_wake_up() -> ttwu_runnable() -> ttwu_do_wakeup()
> > > try_to_wake_up() -> ttwu_queue() -> ttwu_do_activate() -> ttwu_do_wakeup()
> > > sched_ttwu_pending() -> ttwu_do_activate() -> ttwu_do_wakeup()
> > Notably, sched_ttwu_pending() is invoked for remote wakeups.
> >
> > Given this, I anticipate a scenario similar to the following
> > occurred: When a process (P) is to be awakened on a remote CPU,
> > the scheduler adds P to the remote CPU's wakelist,a per-CPU queue,
> > and sends an IPI to the remote CPU. This action triggers
> > sched_ttwu_pending() on the remote CPU, which then processes the
> > wakelist and wakes up the queued processes.
>
> Hey! Thanks for taking a look here and sharing this feedback.
>
> Indeed, looking closer here I do have a hard time seeing how we can
> get to sched_ttwu_pending() without having gone through the paths in
> try_to_wake_up() that would call trace_sched_wakeup():
> try_to_wake_up()->[ttwu_queue ->]
> ttwu_queue_wakelist()->sched_ttwu_pending()->...
>
> > In this scenario, the "waking trace" of P, signifying the initiation
> > of the wake-up, is recorded on the CPU where try_to_wake_up was executed.
> > Meanwhile, the "wakeup trace," denoting the completion of the wake-up,
> > is observed on the remote CPU where sched_ttwu_pending() is executed.
> >
> > Is there a possibility that something other than the above occurred
> > in your case?
>
> I suspect that the case reported may be that the task went to sleep
> right after the trace_sched_waking(). Which could result in the
> transition from sleeping to trace_sched_wakeup() without the
> trace_sched_waking() inbetween. The added latency from the remote
> wakeup probably increases the chance for this race to occur.
Hi,
# trace-cmd record -e sched:* perf record -a -g -e sched:* ./schbench
This generates perf.data and trace.dat file.
Running a simple check for equality between sched_waking and
sched_wakeup. I see two different results.
# perf script -G | grep -i "sched_waking" | wc -l
35264
# perf script -G | grep -i "sched_wakeup:" | wc -l
35314
perf report suggests missing sched_waking events. Although trace
collected through ftrace (using trace-cmd) suggests equal waking and
wakeup events.
# trace-cmd report | grep -i "sched_waking" | wc -l
35959
# trace-cmd report | grep -i "sched_wakeup:" | wc -l
35959
Below is a stack trace for a sched_wakeup event for which sched_waking
was not recorded in perf.data file,
schbench 307346 [023] 600177.250150:
sched:sched_wakeup: perf:307321 [120] CPU:023
ttwu_do_activate+0x1c0 ([kernel.kallsyms])
print_fmt_xfs_perag_intents_class+0x0 ([xfs].data)
sched_ttwu_pending+0x118 ([kernel.kallsyms])
=> __flush_smp_call_function_queue+0x19c ([kernel.kallsyms])
smp_ipi_demux_relaxed+0xec ([kernel.kallsyms])
xive_muxed_ipi_action+0x20 ([kernel.kallsyms])
__handle_irq_event_percpu+0x8c ([kernel.kallsyms])
handle_irq_event_percpu+0x2c ([kernel.kallsyms])
handle_percpu_irq+0x80 ([kernel.kallsyms])
generic_handle_irq+0x50 ([kernel.kallsyms])
__do_irq+0xb8 ([kernel.kallsyms])
__do_IRQ+0x88 ([kernel.kallsyms])
print_fmt_xfs_perag_intents_class+0x0 ([xfs].data)
do_IRQ+0x50 ([kernel.kallsyms])
hardware_interrupt_common_virt+0x28c ([kernel.kallsyms])
worker_thread+0x268 (/usr/bin/schbench)
worker_thread+0x4cc (/usr/bin/schbench)
start_thread+0x170 (/usr/lib64/glibc-hwcaps/power10/libc.so.6)
__clone+0xa0 (/usr/lib64/glibc-hwcaps/power10/libc.so.6)
Also, it has been observed for cases where sched_waking is missing only
happens when perf task is woken up.
Looking at the code in __flush_smp_call_function_queue,
it invokes all the callbacks currently queue on the given cpu, and if
sched_ttwu_pending is being called, suggesting that csd of type CSD_TYPE_TTWU
was queued. Now, this happens inside try_to_wake_up.
if there are no other way to queue a csd of type CSD_TYPE_TTWU, then
this suggests some events are being missed by perf.
>
> So I'll withdraw this change for now and sync back up to check if my
> suspicions are correct or not.
>
> Thanks so much again for pointing this out.
> -john