When using perf timehist, sch delay is only computed for a waking task,
not for a pre empted task. This patches addresses this problem.
Example of `perf timehist` report before the patch for `stress` task
competing with each other.
First column is wait time, second column sch delay, third column
runtime.
```
1.492060 [0000] s stress[81] 1.999 0.000 2.000 R next: stress[83]
1.494060 [0000] s stress[83] 2.000 0.000 2.000 R next: stress[81]
1.496060 [0000] s stress[81] 2.000 0.000 2.000 R next: stress[83]
1.498060 [0000] s stress[83] 2.000 0.000 1.999 R next: stress[81]
```
After the patch, it looks like this (note that all wait time is now sch
delay instead of zero):
```
1.492060 [0000] s stress[81] 1.999 1.999 2.000 R next: stress[83]
1.494060 [0000] s stress[83] 2.000 2.000 2.000 R next: stress[81]
1.496060 [0000] s stress[81] 2.000 2.000 2.000 R next: stress[83]
1.498060 [0000] s stress[83] 2.000 2.000 1.999 R next: stress[81]
```
In timehist:
* wait time represents the duration waiting for any system resource
* sch delay represents the duration waiting for cpu system resources
This is based on perf comments (dt_wait = wait time, dt_delay = sch
delay):
```
/*
* Explanation of delta-time stats:
*
* t = time of current schedule out event
* tprev = time of previous sched out event
* also time of schedule-in event for current task
* last_time = time of last sched change event for current task
* (i.e, time process was last scheduled out)
* ready_to_run = time of wakeup for current task
*
* -----|------------|------------|------------|------
* last ready tprev t
* time to run
*
* |-------- dt_wait --------|
* |- dt_delay -|-- dt_run --|
*
* dt_run = run time of current task
* dt_wait = time between last schedule out event for task and tprev
* represents time spent off the cpu
* dt_delay = time between wakeup and schedule-in of task
*/
```
The problem with the current logic is that last time is only set when
waking a task. Therefore it is not set for a pre empted task. To fix
this, we set last time to the current sample time if a scheduled out
task (on the switch tracepoint) is in state running.
Signed-off-by: Fernand Sieber <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
---
tools/perf/builtin-sched.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 5977c49ae2c7..7422c930abaf 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -2659,7 +2659,10 @@ static int timehist_sched_change_event(struct perf_tool *tool,
tr->last_state = state;
/* sched out event for task so reset ready to run time */
- tr->ready_to_run = 0;
+ if (state == 'R')
+ tr->ready_to_run = t;
+ else
+ tr->ready_to_run = 0;
}
evsel__save_time(evsel, sample->time, sample->cpu);
--
2.40.1
Hi Fernand,
On 14/06/24 00:29, Fernand Sieber wrote:
> When using perf timehist, sch delay is only computed for a waking task,
> not for a pre empted task. This patches addresses this problem.
>
> Example of `perf timehist` report before the patch for `stress` task
> competing with each other.
>
> First column is wait time, second column sch delay, third column
> runtime.
> ```
> 1.492060 [0000] s stress[81] 1.999 0.000 2.000 R next: stress[83]
> 1.494060 [0000] s stress[83] 2.000 0.000 2.000 R next: stress[81]
> 1.496060 [0000] s stress[81] 2.000 0.000 2.000 R next: stress[83]
> 1.498060 [0000] s stress[83] 2.000 0.000 1.999 R next: stress[81]
> ```
>
> After the patch, it looks like this (note that all wait time is now sch
> delay instead of zero):
> ```
> 1.492060 [0000] s stress[81] 1.999 1.999 2.000 R next: stress[83]
> 1.494060 [0000] s stress[83] 2.000 2.000 2.000 R next: stress[81]
> 1.496060 [0000] s stress[81] 2.000 2.000 2.000 R next: stress[83]
> 1.498060 [0000] s stress[83] 2.000 2.000 1.999 R next: stress[81]
> ```
>
> In timehist:
> * wait time represents the duration waiting for any system resource
> * sch delay represents the duration waiting for cpu system resources
According to the documentation(perf-sched man page), scheduling delay is defined
as the "time between wakeup and actually running."
When a task switches out, if its state is 'R' , then the task would switch-in
again without waking and wakeup in between.Therefore, that duration is not
currently calculated as scheduling delay, which aligns with the man page
description.
I agree that the time a task is preempted involuntarily and is waiting to run
could be considered part of the scheduling delay. The definition is somewhat
vague in this regard.
If you decide to proceed with this change, please ensure the man page is updated
accordingly.
Regarding wait time, the man page defines it as the "time between sched-out and
the next sched-in events for the task." This measurement does not depend on the
task state and includes time spent on a voluntary switch.
Thanks and Regards
Madadi Vineeth Reddy
>
> This is based on perf comments (dt_wait = wait time, dt_delay = sch
> delay):
> ```
> /*
> * Explanation of delta-time stats:
> *
> * t = time of current schedule out event
> * tprev = time of previous sched out event
> * also time of schedule-in event for current task
> * last_time = time of last sched change event for current task
> * (i.e, time process was last scheduled out)
> * ready_to_run = time of wakeup for current task
> *
> * -----|------------|------------|------------|------
> * last ready tprev t
> * time to run
> *
> * |-------- dt_wait --------|
> * |- dt_delay -|-- dt_run --|
> *
> * dt_run = run time of current task
> * dt_wait = time between last schedule out event for task and tprev
> * represents time spent off the cpu
> * dt_delay = time between wakeup and schedule-in of task
> */
> ```
>
> The problem with the current logic is that last time is only set when
> waking a task. Therefore it is not set for a pre empted task. To fix
> this, we set last time to the current sample time if a scheduled out
> task (on the switch tracepoint) is in state running.
>
> Signed-off-by: Fernand Sieber <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> ---
> tools/perf/builtin-sched.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 5977c49ae2c7..7422c930abaf 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -2659,7 +2659,10 @@ static int timehist_sched_change_event(struct perf_tool *tool,
> tr->last_state = state;
>
> /* sched out event for task so reset ready to run time */
> - tr->ready_to_run = 0;
> + if (state == 'R')
> + tr->ready_to_run = t;
> + else
> + tr->ready_to_run = 0;
> }
>
> evsel__save_time(evsel, sample->time, sample->cpu);