2023-04-28 15:22:44

by Chen Yu

[permalink] [raw]
Subject: [PATCH v8 0/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

The main purpose is to avoid unnecessary cross-CPU wakeups. Frequent
cross-CPU wakeups hurt some workloads significantly, especially on
high core count systems:
1. The cross-CPU wakeup triggers a race condition in which several
wakers are stacked on one CPU, which delays the wakeup of their wakees.
2. The cross-CPU wakeup brings core-to-core (C2C) cache overhead if the
waker and the wakee share data.

Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU
if both the waker and the wakee are short-duration tasks and they
wake each other up.

The first patch introduces the definition of a short-duration task.
The second patch leverages the first patch to choose a local CPU for the wakee.
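
For reference, below is a condensed userspace sketch of the placement
condition. The struct, the helper and the 500000 ns constant here are only
illustrative stand-ins; the real check is wake_on_current() added by the
second patch, gated by the SIS_CURRENT feature and using
sysctl_sched_migration_cost (500000 ns by default) as the threshold:

/*
 * Illustrative sketch only: in the kernel the fields live in task_struct
 * (se.dur_avg, wakee_flips, last_wakee) and the threshold is
 * sysctl_sched_migration_cost.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct task_info {
	uint64_t dur_avg;	/* avg runtime between voluntary sleeps, in ns */
	uint64_t wakee_flips;	/* how often the task switches wakee targets */
	const struct task_info *last_wakee;
};

static const uint64_t short_task_ns = 500000;

/* Place the wakee on the waker's CPU only when all conditions hold. */
static bool wake_on_current(const struct task_info *waker,
			    const struct task_info *wakee,
			    unsigned int waker_rq_nr_running)
{
	/* 1. the waker is the only running task on its CPU */
	if (waker_rq_nr_running > 1)
		return false;
	/* 2. the waker is a short duration task (it will sleep soon) */
	if (!waker->dur_avg || waker->dur_avg >= short_task_ns)
		return false;
	/* 3. the wakee is a short duration task (minor impact on the waker) */
	if (!wakee->dur_avg || wakee->dur_avg >= short_task_ns)
		return false;
	/* 4/5. they form a stable 1:1 pair that only wakes each other up */
	if (waker->wakee_flips || wakee->wakee_flips)
		return false;
	if (waker->last_wakee != wakee || wakee->last_wakee != waker)
		return false;

	return true;
}

int main(void)
{
	struct task_info a = { .dur_avg = 120000, .wakee_flips = 0 };
	struct task_info b = { .dur_avg =  90000, .wakee_flips = 0 };

	a.last_wakee = &b;
	b.last_wakee = &a;
	/* Both durations are below the threshold and they wake only each other. */
	printf("wake B on A's CPU: %d\n", wake_on_current(&a, &b, 1));
	return 0;
}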

Overall there is a performance improvement when the benchmark has a close
to 1:1 waker/wakee relationship, such as will-it-scale, netperf and tbench.
netperf shows a universal improvement across many different utilization
levels. There is no noticeable impact on schbench, hackbench,
or an OLTP workload with a commercial RDBMS, according to tests on
Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs.

Per Prateek's test of v7 on Zen3, tbench/netperf show good
improvement at 128 clients, and SPECjbb2015 shows improvement in max-jOPS.
The result of v8 should not differ much from v7 because
the thresholds in v7 and v8 are comparable.

Changes since v7:
1. Replace sysctl_sched_min_granularity with sysctl_sched_migration_cost
to determine if a task is a short duration one, according to Peter's
suggestion.

Changes since v6:
1. Rename SIS_SHORT to SIS_CURRENT, which better describes this feature.
2. Remove the 50% utilization threshold and the has_idle_cores check.
After this change, SIS_CURRENT is applicable at all system utilization levels.
3. Add a condition that SIS_CURRENT is only enabled when the waker and the
wakee wake each other up, that is, A->last_wakee = B and B->last_wakee = A.

Changes since v5:
1. Check the wakee_flips of the waker/wakee. If the wakee_flips
of both the waker and the wakee are 0, it indicates that the waker and
the wakee are waking up each other. In this case, put them together on the
same CPU. This is to avoid stacking too many wakees on
one CPU, which might cause a regression on redis.

Changes since v4:
1. Dietmar commented on the task duration calculation, so the commit
log was refined to reduce confusion.
2. Change [PATCH 1/2] to only record the average duration of a task,
so this change could also benefit UTIL_EST_FASTER[1].
3. As v4 reported a regression on Zen3 and Kunpeng Arm64, add back
the system average utilization restriction: if the system
is not busy, do not enable the short-task wakeup. The above logic has
shown improvement on Zen3[2].
4. Restrict the wakeup target to the current CPU, rather than both the
current CPU and the task's previous CPU. This could also benefit
wakeup optimization from interrupts in the future, as
suggested by Yicong.

Changes since v3:
1. Honglei and Josh were concerned that the threshold for short
task duration could be too long. Decrease the threshold from
sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8);
the '8' comes from get_update_sysctl_factor().
2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
3. Move the calculation of average duration from put_prev_task_fair()
to dequeue_task_fair(), because in v3 put_prev_task_fair() is not
invoked by pick_next_task_fair() in the fast path, so dur_avg could
not be updated in time.
4. Fix the comment in PATCH 2/2 that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
on CPU0 happens earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
5. Move the scan for a CPU with a short duration task from select_idle_cpu()
to select_idle_sibling(), because there is no CPU scan involved, per
Yicong.

Changes since v2:

1. Peter suggested comparing the duration of the waker with the cost of
scanning for an idle CPU: if the cost is higher than the task duration,
do not waste time finding an idle CPU; choose the local or previous
CPU directly. A prototype was created based on this suggestion.
However, according to the test results, this prototype did not inhibit
the cross-CPU wakeup and did not bring improvement, because the cost
of finding an idle CPU is small in the problematic scenario. The root
cause of the problem is a race condition between scanning for an idle
CPU and task enqueue (please refer to the commit log in PATCH 2/2).
So v3 does not change the core logic of v2, with some refinement based
on Peter's suggestion.

2. Simplify the logic to record the task duration per Peter and Abel's suggestion.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

v6: https://lore.kernel.org/lkml/[email protected]/
v5: https://lore.kernel.org/lkml/[email protected]/
v4: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/

Chen Yu (2):
sched/fair: Record the average duration of a task
sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

 include/linux/sched.h   |  3 +++
 kernel/sched/core.c     |  2 ++
 kernel/sched/debug.c    |  1 +
 kernel/sched/fair.c     | 59 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 5 files changed, 66 insertions(+)

--
2.25.1


2023-04-28 15:23:34

by Chen Yu

[permalink] [raw]
Subject: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

[Problem Statement]
For a workload that is doing frequent context switches, the throughput
scales well until the number of instances reaches a peak point. After
that peak point, the throughput drops significantly if the number of
instances continues to increase.

The will-it-scale context_switch1 test case exposes the issue. The
test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
launches 1, 8, 16 ... instances respectively. Each instance is composed
of 2 tasks, and each pair of tasks would do ping-pong scheduling via
pipe_read() and pipe_write(). No task is bound to any CPU. It is found
that, once the number of instances is higher than 56, the throughput
drops accordingly:

          ^
throughput|
          |                 X
          |               X   X X
          |             X         X X
          |           X               X
          |         X                   X
          |       X
          |     X
          |   X
          | X
          |
          +-----------------.------------------->
                            56
                                 number of instances

[Symptom analysis]

One of the reasons for the performance degradation is the high
system idle percentage (around 20% ~ 30%). The CPUs waste a lot of
time being idle and doing nothing. As a comparison, if CPU affinity is
set for these workloads to stop them from migrating among CPUs,
the idle percentage drops to nearly 0% and the throughput
increases a lot. This indicates that there is room for optimization.

The cause of the high idle time is that there is no strict synchronization
between select_task_rq() and setting the ttwu_pending flag across
CPUs. This might be by design, because the scheduler prefers parallel
wakeups.

Suppose there are nr_cpus pairs of ping-pong scheduling
tasks. For example, p0' and p0 are ping-pong scheduling,
so are p1' <=> p1 and p2' <=> p2. None of these tasks are
bound to any CPUs. The problem can be summarized as:
more than one waker is stacked on one CPU, which slows down
the wakeup of their wakees:

CPU0                        CPU1                        CPU2

p0'                         p1' => idle                 p2'

try_to_wake_up(p0)                                      try_to_wake_up(p2);
CPU1 = select_task_rq(p0);                              CPU1 = select_task_rq(p2);
ttwu_queue(p0, CPU1);                                   ttwu_queue(p2, CPU1);
__ttwu_queue_wakelist(p0, CPU1);
WRITE_ONCE(CPU1->ttwu_pending, 1);
__smp_call_single_queue(CPU1, p0);  => ttwu_list->p0
                            quitting cpuidle_idle_call()

                                                        __ttwu_queue_wakelist(p2, CPU1);
                                                        WRITE_ONCE(CPU1->ttwu_pending, 1);
                            ttwu_list->p2->p0   <=      __smp_call_single_queue(CPU1, p2);

p0' => idle
                            sched_ttwu_pending()
                            enqueue_task(p2 and p0)

                            idle => p2

                            ...
                            p2 time slice expires
                            ...
                                      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                 <=== !!! p2 delays the wake up of p0' !!!
                                      !!! causes long idle on CPU0     !!!
                            p2 => p0  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                            p0 wakes up p0'

idle => p0'

Since there are many waker/wakee pairs in the system, the chain reaction
causes many CPUs to be victims. These idle CPUs wait for their waker to
be scheduled. Tianchen has mentioned the above issue here[1].

Besides the high idle percentage, waking up the tasks on different CPUs
could bring Core-to-Core cache overhead, which hurts the performance.

[Proposal]

Wake up the short task on the current CPU if the
following conditions are met:

1. waker A's rq->nr_running <= 1
2. waker A is a short duration task (waker will fall asleep soon)
3. wakee B is a short duration task (impact of B is minor to A)
4. A->wakee_flips is 0 and A->last_wakee = B
5. B->wakee_flips is 0 and B->last_wakee = A

The reason is that, if the waker is a short-duration task, it might
relinquish the CPU soon, and the wakee has a chance to be scheduled.
On the other hand, if the wakee is a short-duration task, putting it on
a non-idle CPU would have minimal impact on the running task.
The benefits of waking a short task on the current CPU are:
1. It reduces the race condition which causes the high idle percentage.
2. It increases cache sharing between the waker and the wakee.

The threshold to define a short duration task is sysctl_sched_migration_cost
(500 us by default). As suggested by Peter, this value is also used in
task_hot() to prevent migrations.

This wakeup strategy can be regarded as a dynamic version of WF_CURRENT_CPU[2]
proposed by Andrei Vagin, except that this change treats the current CPU as the
last resort when the previous CPU is not idle, and avoids stacking tasks
on the current CPU as much as possible.

[Benchmark results]
The baseline is v6.3-rc7 tip:sched/core, on top of
Commit f31dcb152a3d ("sched/clock: Fix local_clock() before sched_clock_init()").
The test platform Intel Sapphire Rapids has 2 x 56C/112T and 224 CPUs in total.
C-states deeper than C1E are disabled. Turbo is disabled. CPU frequency governor
is performance.

Overall there is a universal improvement for netperf/tbench/will-it-scale,
under different loads. And there is no significant impact on hackbench/schbench.

will-it-scale
=============
case load baseline compare%
context_switch1 224 groups 1.00 +552.84%

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.96) +15.23 ( 4.67)
TCP_RR 112-threads 1.00 ( 1.84) +88.83 ( 4.37)
TCP_RR 168-threads 1.00 ( 0.41) +475.45 ( 4.45)
TCP_RR 224-threads 1.00 ( 0.62) +806.85 ( 3.67)
TCP_RR 280-threads 1.00 ( 65.80) +162.66 ( 10.26)
TCP_RR 336-threads 1.00 ( 17.30) -0.19 ( 19.07)
TCP_RR 392-threads 1.00 ( 26.88) +3.38 ( 28.91)
TCP_RR 448-threads 1.00 ( 36.43) -0.26 ( 33.72)
UDP_RR 56-threads 1.00 ( 7.91) +3.77 ( 17.48)
UDP_RR 112-threads 1.00 ( 2.72) -15.02 ( 10.78)
UDP_RR 168-threads 1.00 ( 8.86) +131.77 ( 13.30)
UDP_RR 224-threads 1.00 ( 9.54) +178.73 ( 16.75)
UDP_RR 280-threads 1.00 ( 15.40) +189.69 ( 19.36)
UDP_RR 336-threads 1.00 ( 24.09) +0.54 ( 22.28)
UDP_RR 392-threads 1.00 ( 39.63) -3.90 ( 33.77)
UDP_RR 448-threads 1.00 ( 43.57) +1.57 ( 40.43)

tbench
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 0.50) +10.78 ( 0.52)
loopback 112-threads 1.00 ( 0.19) +2.73 ( 0.08)
loopback 168-threads 1.00 ( 0.09) +173.72 ( 0.47)
loopback 224-threads 1.00 ( 0.20) -2.13 ( 0.42)
loopback 280-threads 1.00 ( 0.06) -0.77 ( 0.15)
loopback 336-threads 1.00 ( 0.14) -0.08 ( 0.08)
loopback 392-threads 1.00 ( 0.17) -0.27 ( 0.86)
loopback 448-threads 1.00 ( 0.37) +0.32 ( 0.02)

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.94) -0.67 ( 0.45)
process-pipe 2-groups 1.00 ( 3.22) -3.00 ( 3.35)
process-pipe 4-groups 1.00 ( 1.66) -3.25 ( 1.87)
process-sockets 1-groups 1.00 ( 0.70) +1.34 ( 0.44)
process-sockets 2-groups 1.00 ( 0.24) +6.99 ( 11.23)
process-sockets 4-groups 1.00 ( 0.61) +1.72 ( 0.57)
threads-pipe 1-groups 1.00 ( 0.95) -0.66 ( 0.74)
threads-pipe 2-groups 1.00 ( 0.79) -0.59 ( 2.10)
threads-pipe 4-groups 1.00 ( 1.97) -1.23 ( 10.62)
threads-sockets 1-groups 1.00 ( 0.73) -2.59 ( 1.32)
threads-sockets 2-groups 1.00 ( 0.30) -1.95 ( 1.68)
threads-sockets 4-groups 1.00 ( 1.22) +1.86 ( 0.73)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.00) +0.88 ( 1.25)
normal 2-mthreads 1.00 ( 2.09) +0.85 ( 2.44)
normal 4-mthreads 1.00 ( 1.29) -1.82 ( 4.55)
normal 8-mthreads 1.00 ( 1.22) +3.45 ( 1.26)

Redis
=====
Launch 224 instances of redis-server on machine A and 224 instances
of redis-benchmark on machine B, and measure the SET/GET latency on B.
The test was run over a 1G NIC. The 99th percentile latency before vs
after SIS_CURRENT did not change much.
             baseline    sis_current
SET          115 ms      116 ms
GET          225 ms      228 ms

Prateek tested this patch on a dual socket Zen3 system (2 x 64C/128T).
tbench and netperf show good improvements at 128 clients. SpecJBB shows
some improvement in max-jOPS:
                           tip        SIS_CURRENT
SPECjbb2015 max-jOPS       100.00%    102.78%
SPECjbb2015 Critical-jOPS  100.00%    100.00%

Others are perf neutral.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Tim Chen <[email protected]>
Suggested-by: K Prateek Nayak <[email protected]>
Tested-by: kernel test robot <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
 kernel/sched/fair.c     | 46 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 47 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3236011658a2..642a9e830e8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6537,6 +6537,46 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+/*
+ * Wake up the task on current CPU, if the following conditions are met:
+ *
+ * 1. waker A is the only running task on this_cpu
+ * 2. A is a short duration task (waker will fall asleep soon)
+ * 3. wakee B is a short duration task (impact of B on A is minor)
+ * 4. A and B wake up each other alternately
+ */
+static bool
+wake_on_current(int this_cpu, struct task_struct *p)
+{
+	if (!sched_feat(SIS_CURRENT))
+		return false;
+
+	if (cpu_rq(this_cpu)->nr_running > 1)
+		return false;
+
+	/*
+	 * If a task switches in and then voluntarily relinquishes the
+	 * CPU quickly, it is regarded as a short duration task. In that
+	 * way, the short waker is likely to relinquish the CPU soon, which
+	 * provides room for the wakee. Meanwhile, a short wakee would bring
+	 * minor impact to the current rq. Putting the short waker and wakee
+	 * together benefits cache-sharing task pairs and avoids migration overhead.
+	 */
+	if (!current->se.dur_avg || current->se.dur_avg >= sysctl_sched_migration_cost)
+		return false;
+
+	if (!p->se.dur_avg || p->se.dur_avg >= sysctl_sched_migration_cost)
+		return false;
+
+	if (current->wakee_flips || p->wakee_flips)
+		return false;
+
+	if (current->last_wakee != p || p->last_wakee != current)
+		return false;
+
+	return true;
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -6630,6 +6670,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
+	if (target == nr_cpumask_bits && wake_on_current(this_cpu, p))
+		target = this_cpu;
+
 	schedstat_inc(p->stats.nr_wakeups_affine_attempts);
 	if (target != this_cpu)
 		return prev_cpu;
@@ -7151,6 +7194,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
+	if (smp_processor_id() == target && wake_on_current(target, p))
+		return target;
+
 	i = select_idle_cpu(p, sd, has_idle_core, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..a3e05827f7e8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CURRENT, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1

2023-04-28 15:23:59

by Chen Yu

[permalink] [raw]
Subject: [PATCH v8 1/2] sched/fair: Record the average duration of a task

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first glance, (p->se.sum_exec_runtime / p->nvcsw)
could be used to measure the task duration. However, the
long-past history is factored too heavily into such a formula.
Ideally, old activity should decay and not affect
the current status too much.

Although something based on PELT could be used, se.util_avg might
not be appropriate to describe the task duration:
if task p1 and task p2 are doing frequent ping-pong scheduling on
one CPU, both p1 and p2 have a short duration, but the util_avg
of each task can be up to 50%, which is inconsistent with the
short task duration.

It was found that there was once a similar feature to track the
duration of a task:
commit ad4b78bbcbab ("sched: Add new wakeup preemption mode: WAKEUP_RUNNING")
Unfortunately, it was reverted because it was an experiment. Pick that
idea up again by recording the average duration when a task voluntarily
switches out.

Suppose on CPU1, task p1 and p2 run alternately:

        --------------------> time

| p1 runs 1ms | p2 preempts p1 | p1 switches in, runs 0.5ms and blocks |
^             ^                ^
|_____________|                |_______________________________________|
                                                                        ^
                                                                        |
                                                                  p1 dequeued

p1's duration in one section is (1 + 0.5)ms: if p2 did not
preempt p1, p1 could have run for 1.5ms. This reflects the nature of a task:
how long it wishes to run at most.
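
To make the averaging concrete, here is a minimal userspace sketch of how
dur_avg evolves as per-section durations arrive. The 1/8 weighting below is
an assumption borrowed from the kernel's update_avg() helper
(avg += (sample - avg) / 8), and the sample values are made up purely for
illustration:

#include <stdint.h>
#include <stdio.h>

/* Assumed weighting: fold each new sample in with roughly 1/8 weight. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}

int main(void)
{
	/* per-section durations of p1 in ns, e.g. (1 + 0.5)ms = 1500000 ns */
	uint64_t samples[] = { 1500000, 1200000, 300000, 250000, 200000 };
	uint64_t dur_avg = 0;	/* p->se.dur_avg starts at 0 after fork */

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		update_avg(&dur_avg, samples[i]);
		printf("sample %u: dur = %7llu ns, dur_avg = %7llu ns\n",
		       i, (unsigned long long)samples[i],
		       (unsigned long long)dur_avg);
	}
	return 0;
}

Old sections decay away, so the average tracks recent behavior rather than
the task's entire history.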

Suggested-by: Tim Chen <[email protected]>
Suggested-by: Vincent Guittot <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
 include/linux/sched.h |  3 +++
 kernel/sched/core.c   |  2 ++
 kernel/sched/debug.c  |  1 +
 kernel/sched/fair.c   | 13 +++++++++++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 675298d6eb36..6ee6b00faa12 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -558,6 +558,9 @@ struct sched_entity {
 	u64			prev_sum_exec_runtime;
 
 	u64			nr_migrations;
+	u64			prev_sleep_sum_runtime;
+	/* average duration of a task */
+	u64			dur_avg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int			depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 898fa3bc2765..32eacd220e39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4452,6 +4452,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.dur_avg			= 0;
+	p->se.prev_sleep_sum_runtime	= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	__PS("nr_involuntary_switches", p->nivcsw);
 
 	P(se.load.weight);
+	P(se.dur_avg);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
 	P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f8135d7c89d..3236011658a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6319,6 +6319,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+	u64 dur;
+
+	if (!task_sleep)
+		return;
+
+	dur = p->se.sum_exec_runtime - p->se.prev_sleep_sum_runtime;
+	p->se.prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->se.dur_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6391,6 +6403,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	dur_avg_update(p, task_sleep);
 	hrtick_update(rq);
 }

--
2.25.1

2023-04-29 19:49:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> [Problem Statement]
> For a workload that is doing frequent context switches, the throughput
> scales well until the number of instances reaches a peak point. After
> that peak point, the throughput drops significantly if the number of
> instances continue to increase.
>
> The will-it-scale context_switch1 test case exposes the issue. The
> test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
> launches 1, 8, 16 ... instances respectively. Each instance is composed
> of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> that, once the number of instances is higher than 56, the throughput
> drops accordingly:
>
>           ^
> throughput|
>           |                 X
>           |               X   X X
>           |             X         X X
>           |           X               X
>           |         X                   X
>           |       X
>           |     X
>           |   X
>           | X
>           |
>           +-----------------.------------------->
>                             56
>                                  number of instances

Should these buddy pairs not start interfering with one another at 112
instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
which trying to turn waker/wakee overlap into throughput should tend
toward being a loser due to man-in-the-middle wakeup delay pain more
than offsetting overlap recovery gain, rendering sync wakeup thereafter
an ever more likely win.

Anyway..

What I see in my box, and I bet a virtual nickle it's a player in your
box as well, is WA_WEIGHT making a mess of things by stacking tasks,
sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in
crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
remotely as to not have noisy GUI muck up my demo.

...
8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms

Where I turned it on is hard to miss.

Short duration thread pool workers can be stacked all the way to the
ceiling by WA_WEIGHT during burst wakeups, with wake_wide() not being
able to intervene due to lack of cross coupling between waker/wakees
leading to heuristic failure. A (now long) while ago I caught that
happening with firefox event threads, it launched 32 of 'em in my 8 rq
box (hmm), and them being essentially the scheduler equivalent of
neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
away with it because those particular threads don't seem to do much of
anything. However, were they to go active, the latency hit that we set
up could have stung mightily. That scenario being highly generic leads
me to suspect that somewhere out there in the big wide world, folks are
eating that burst serialization.

-Mike

2023-05-01 08:45:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Sat, Apr 29, 2023 at 09:34:06PM +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continue to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
> > launches 1, 8, 16 ... instances respectively. Each instance is composed
> > of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> > that, once the number of instances is higher than 56, the throughput
> > drops accordingly:
> >
> >           ^
> > throughput|
> >           |                 X
> >           |               X   X X
> >           |             X         X X
> >           |           X               X
> >           |         X                   X
> >           |       X
> >           |     X
> >           |   X
> >           | X
> >           |
> >           +-----------------.------------------->
> >                             56
> >                                  number of instances
>
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser due to man-in-the-middle wakeup delay pain more
> than offsetting overlap recovery gain, rendering sync wakeup thereafter
> an ever more likely win.
>
> Anyway..
>
> What I see in my box, and I bet a virtual nickle it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely as to not have noisy GUI muck up my demo.
>
> ...
> 8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
> 8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
> 8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
> 8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
> 8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
> 8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
> 8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
> 8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
> 8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
> 8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
> 8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
> 8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
> 8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
> 8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
> 8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
> 8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
> 8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
> 8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms
>
> Where I turned it on is hard to miss.
>
> Short duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() not being
> able to intervene due to lack of cross coupling between waker/wakees
> leading to heuristic failure. A (now long) while ago I caught that
> happening with firefox event threads, it launched 32 of 'em in my 8 rq
> box (hmm), and them being essentially the scheduler equivalent of
> neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
> away with it because those particular threads don't seem to do much of
> anything. However, were they to go active, the latency hit that we set
> up could have stung mightily. That scenario being highly generic leads
> me to suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.

I'm thinking WA_BIAS makes this worse...

2023-05-01 13:58:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Sat, Apr 29, 2023 at 07:16:56AM +0800, Chen Yu wrote:
> netperf
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR 56-threads 1.00 ( 1.96) +15.23 ( 4.67)
> TCP_RR 112-threads 1.00 ( 1.84) +88.83 ( 4.37)
> TCP_RR 168-threads 1.00 ( 0.41) +475.45 ( 4.45)
> TCP_RR 224-threads 1.00 ( 0.62) +806.85 ( 3.67)
> TCP_RR 280-threads 1.00 ( 65.80) +162.66 ( 10.26)
> TCP_RR 336-threads 1.00 ( 17.30) -0.19 ( 19.07)
> TCP_RR 392-threads 1.00 ( 26.88) +3.38 ( 28.91)
> TCP_RR 448-threads 1.00 ( 36.43) -0.26 ( 33.72)
> UDP_RR 56-threads 1.00 ( 7.91) +3.77 ( 17.48)
> UDP_RR 112-threads 1.00 ( 2.72) -15.02 ( 10.78)
> UDP_RR 168-threads 1.00 ( 8.86) +131.77 ( 13.30)
> UDP_RR 224-threads 1.00 ( 9.54) +178.73 ( 16.75)
> UDP_RR 280-threads 1.00 ( 15.40) +189.69 ( 19.36)
> UDP_RR 336-threads 1.00 ( 24.09) +0.54 ( 22.28)
> UDP_RR 392-threads 1.00 ( 39.63) -3.90 ( 33.77)
> UDP_RR 448-threads 1.00 ( 43.57) +1.57 ( 40.43)
>
> tbench
> ======
> case load baseline(std%) compare%( std%)
> loopback 56-threads 1.00 ( 0.50) +10.78 ( 0.52)
> loopback 112-threads 1.00 ( 0.19) +2.73 ( 0.08)
> loopback 168-threads 1.00 ( 0.09) +173.72 ( 0.47)
> loopback 224-threads 1.00 ( 0.20) -2.13 ( 0.42)
> loopback 280-threads 1.00 ( 0.06) -0.77 ( 0.15)
> loopback 336-threads 1.00 ( 0.14) -0.08 ( 0.08)
> loopback 392-threads 1.00 ( 0.17) -0.27 ( 0.86)
> loopback 448-threads 1.00 ( 0.37) +0.32 ( 0.02)

So,... I've been poking around with this a bit today and I'm not seeing
it. On my ancient IVB-EP (2*10*2) with the code as in
queue/sched/core I get:

netperf                 NO_WA_WEIGHT   NO_WA_BIAS   NO_SIS_CURRENT   SIS_CURRENT
---------------------------------------------------------------------------------
TCP_SENDFILE-1 : Avg: 40495.7 41899.7 42001 40783.4
TCP_SENDFILE-10 : Avg: 37218.6 37200.1 37065.1 36604.4
TCP_SENDFILE-20 : Avg: 21495.1 21516.6 21004.4 21356.9
TCP_SENDFILE-40 : Avg: 6947.24 7917.64 7079.93 7231.3
TCP_SENDFILE-80 : Avg: 4081.91 3572.48 3582.98 3615.85
TCP_STREAM-1 : Avg: 37078.1 34469.4 37134.5 35095.4
TCP_STREAM-10 : Avg: 31532.1 31265.8 31260.7 31588.1
TCP_STREAM-20 : Avg: 17848 17914.9 17996.6 17937.4
TCP_STREAM-40 : Avg: 7844.3 7201.65 7710.4 7790.62
TCP_STREAM-80 : Avg: 2518.38 2932.74 2601.51 2903.89
TCP_RR-1 : Avg: 84347.1 81056.2 81167.8 83541.3
TCP_RR-10 : Avg: 71539.1 72099.5 71123.2 69447.9
TCP_RR-20 : Avg: 51053.3 50952.4 50905.4 52157.2
TCP_RR-40 : Avg: 46370.9 46477.5 46289.2 46350.7
TCP_RR-80 : Avg: 21515.2 22497.9 22024.4 22229.2
UDP_RR-1 : Avg: 96933 100076 95997.2 96553.3
UDP_RR-10 : Avg: 83937.3 83054.3 83878.5 78998.6
UDP_RR-20 : Avg: 61974 61897.5 61838.8 62926
UDP_RR-40 : Avg: 56708.6 57053.9 56456.1 57115.2
UDP_RR-80 : Avg: 26950 27895.8 27635.2 27784.8
UDP_STREAM-1 : Avg: 52808.3 55296.8 52808.2 51908.6
UDP_STREAM-10 : Avg: 45810 42944.1 43115 43561.2
UDP_STREAM-20 : Avg: 19212.7 17572.9 18798.7 20066
UDP_STREAM-40 : Avg: 13105.1 13096.9 13070.5 13110.2
UDP_STREAM-80 : Avg: 6372.57 6367.96 6248.86 6413.09


tbench

NO_WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT

Throughput 626.57 MB/sec 2 clients 2 procs max_latency=0.095 ms
Throughput 1316.08 MB/sec 5 clients 5 procs max_latency=0.106 ms
Throughput 1905.19 MB/sec 10 clients 10 procs max_latency=0.161 ms
Throughput 2428.05 MB/sec 20 clients 20 procs max_latency=0.284 ms
Throughput 2323.16 MB/sec 40 clients 40 procs max_latency=0.381 ms
Throughput 2229.93 MB/sec 80 clients 80 procs max_latency=0.873 ms

WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT

Throughput 575.04 MB/sec 2 clients 2 procs max_latency=0.093 ms
Throughput 1285.37 MB/sec 5 clients 5 procs max_latency=0.122 ms
Throughput 1916.10 MB/sec 10 clients 10 procs max_latency=0.150 ms
Throughput 2422.54 MB/sec 20 clients 20 procs max_latency=0.292 ms
Throughput 2361.57 MB/sec 40 clients 40 procs max_latency=0.448 ms
Throughput 2479.70 MB/sec 80 clients 80 procs max_latency=1.249 ms

WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)

Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms

WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)

Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms


So what's going on here? I don't see anything exciting happening at the
40 mark. At the same time, I can't seem to reproduce Mike's latency pile
up either :/

2023-05-01 15:26:39

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

Hi Mike,
On 2023-04-29 at 21:34:06 +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continue to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
> > launches 1, 8, 16 ... instances respectively. Each instance is composed
> > of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> > that, once the number of instances is higher than 56, the throughput
> > drops accordingly:
> >
> >           ^
> > throughput|
> >           |                 X
> >           |               X   X X
> >           |             X         X X
> >           |           X               X
> >           |         X                   X
> >           |       X
> >           |     X
> >           |   X
> >           | X
> >           |
> >           +-----------------.------------------->
> >                             56
> >                                  number of instances
>
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser due to man-in-the-middle wakeup delay pain more
> than offsetting overlap recovery gain, rendering sync wakeup thereafter
> an ever more likely win.
>
Thank you for taking a look at this. Yes, you are right, I did not
describe this clearly. Actually, the figure above was drawn when I first
found the issue, when only 1 socket was online (112 threads in total).
I should update the figure for the 224 CPUs case.
> Anyway...
>
> What I see in my box, and I bet a virtual nickle it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely as to not have noisy GUI muck up my demo.
>
> ...
> 8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
> 8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
> 8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
> 8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
> 8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
> 8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
> 8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
> 8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
> 8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
> 8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
> 8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
> 8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
> 8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
> 8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
> 8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
> 8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
> 8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
> 8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms
>
> Where I turned it on is hard to miss.
>
> Short duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() not being
> able to intervene due to lack of cross coupling between waker/wakees
> leading to heuristic failure. A (now long) while ago I caught that
> happening with firefox event threads, it launched 32 of 'em in my 8 rq
> box (hmm), and them being essentially the scheduler equivalent of
> neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
> away with it because those particular threads don't seem to do much of
> anything. However, were they to go active, the latency hit that we set
> up could have stung mightily. That scenario being highly generic leads
> me to suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.
Thank you for this information. Yes, task stacking can be quite annoying
for latency. In my previous patch versions, Prateek from AMD, Abel
from Bytedance, and Yicong from Hisilicon expressed their concern
about the task stacking risk, so we tried several versions carefully to
not break their use cases. The WA_WEIGHT issue you described (and
I also saw your email offline) is interesting, and I'll try
to run a similar test on a 4C/8T system to reproduce it.

thanks,
Chenyu
>
> -Mike

2023-05-01 15:39:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
>
> Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
>
>
> So what's going on here? I don't see anything exciting happening at the
> 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> up either :/

Are you running tbench in the GUI so the per second output stimulates
assorted goo? I'm using KDE fwtw.

Caught this from my raspberry pi, tbench placement looks lovely, the
llvmpipe thingies otoh..

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
19109 git 20 0 23468 1920 1664 R 52.65 0.012 3:59.64 4 tbench
19110 git 20 0 23468 1664 1536 R 52.65 0.010 4:00.03 3 tbench
19104 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.15 1 tbench
19105 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.16 0 tbench
19108 git 20 0 23468 1792 1664 R 52.32 0.011 4:00.12 7 tbench
19111 git 20 0 23468 1792 1664 R 51.99 0.011 4:00.33 5 tbench
19106 git 20 0 23468 1664 1536 R 51.66 0.010 3:59.40 6 tbench
19107 git 20 0 23468 1664 1536 R 51.32 0.010 3:59.72 2 tbench
19114 git 20 0 6748 896 768 R 46.69 0.006 3:32.77 6 tbench_srv
19116 git 20 0 6748 768 768 S 46.69 0.005 3:32.17 7 tbench_srv
19118 git 20 0 6748 768 768 S 46.69 0.005 3:31.70 3 tbench_srv
19117 git 20 0 6748 768 768 S 46.36 0.005 3:32.99 4 tbench_srv
19112 git 20 0 6748 768 768 S 46.03 0.005 3:32.51 1 tbench_srv
19113 git 20 0 6748 768 768 R 46.03 0.005 3:32.48 0 tbench_srv
19119 git 20 0 6748 768 768 S 46.03 0.005 3:31.93 5 tbench_srv
19115 git 20 0 6748 768 768 R 45.70 0.005 3:32.70 2 tbench_srv
2492 root 20 0 392608 110044 70276 S 1.987 0.682 8:06.86 3 X
2860 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.38 6 llvmpipe-1
2861 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.44 6 llvmpipe-2
2863 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.94 6 llvmpipe-4
2864 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.72 6 llvmpipe-5
2866 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.49 6 llvmpipe-7
19562 root 20 0 26192 4876 3596 R 0.662 0.030 0:00.43 5 top
2837 root 20 0 2557284 183260 138568 S 0.331 1.135 1:51.39 5 kwin_x11
2859 root 20 0 2557284 183260 138568 S 0.331 1.135 2:07.56 6 llvmpipe-0
2862 root 20 0 2557284 183260 138568 S 0.331 1.135 2:05.97 6 llvmpipe-3
2865 root 20 0 2557284 183260 138568 S 0.331 1.135 2:03.84 6 llvmpipe-6
2966 root 20 0 3829152 323000 174992 S 0.331 2.001 0:12.71 4 llvmpipe-7
2998 root 20 0 1126332 116960 78032 S 0.331 0.725 0:25.58 3 konsole

2023-05-01 15:42:42

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, 2023-05-01 at 17:32 +0200, Mike Galbraith wrote:
> On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
> >
> > Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> > Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> > Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> > Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> > Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> > Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
> >
> >
> > So what's going on here? I don't see anything exciting happening at
> > the
> > 40 mark. At the same time, I can't seem to reproduce Mike's latency
> > pile
> > up either :/
>
> Are you running tbench in the GUI so the per second output stimulates
> assorted goo?  I'm using KDE fwtw.
>
> Caught this from my raspberry pi, tbench placement looks lovely, the
> llvmpipe thingies otoh..

erm, without 14 feet of whitespace.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
19109 git 20 0 23468 1920 1664 R 52.65 0.012 3:59.64 4 tbench
19110 git 20 0 23468 1664 1536 R 52.65 0.010 4:00.03 3 tbench
19104 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.15 1 tbench
19105 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.16 0 tbench
19108 git 20 0 23468 1792 1664 R 52.32 0.011 4:00.12 7 tbench
19111 git 20 0 23468 1792 1664 R 51.99 0.011 4:00.33 5 tbench
19106 git 20 0 23468 1664 1536 R 51.66 0.010 3:59.40 6 tbench
19107 git 20 0 23468 1664 1536 R 51.32 0.010 3:59.72 2 tbench
19114 git 20 0 6748 896 768 R 46.69 0.006 3:32.77 6 tbench_srv
19116 git 20 0 6748 768 768 S 46.69 0.005 3:32.17 7 tbench_srv
19118 git 20 0 6748 768 768 S 46.69 0.005 3:31.70 3 tbench_srv
19117 git 20 0 6748 768 768 S 46.36 0.005 3:32.99 4 tbench_srv
19112 git 20 0 6748 768 768 S 46.03 0.005 3:32.51 1 tbench_srv
19113 git 20 0 6748 768 768 R 46.03 0.005 3:32.48 0 tbench_srv
19119 git 20 0 6748 768 768 S 46.03 0.005 3:31.93 5 tbench_srv
19115 git 20 0 6748 768 768 R 45.70 0.005 3:32.70 2 tbench_srv
2492 root 20 0 392608 110044 70276 S 1.987 0.682 8:06.86 3 X
2860 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.38 6 llvmpipe-1
2861 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.44 6 llvmpipe-2
2863 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.94 6 llvmpipe-4
2864 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.72 6 llvmpipe-5
2866 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.49 6 llvmpipe-7
19562 root 20 0 26192 4876 3596 R 0.662 0.030 0:00.43 5 top
2837 root 20 0 2557284 183260 138568 S 0.331 1.135 1:51.39 5 kwin_x11
2859 root 20 0 2557284 183260 138568 S 0.331 1.135 2:07.56 6 llvmpipe-0
2862 root 20 0 2557284 183260 138568 S 0.331 1.135 2:05.97 6 llvmpipe-3
2865 root 20 0 2557284 183260 138568 S 0.331 1.135 2:03.84 6 llvmpipe-6
2966 root 20 0 3829152 323000 174992 S 0.331 2.001 0:12.71 4 llvmpipe-7
2998 root 20 0 1126332 116960 78032 S 0.331 0.725 0:25.58 3 konsole

2023-05-01 16:01:58

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

Hi Peter,
On 2023-05-01 at 15:48:27 +0200, Peter Zijlstra wrote:
> On Sat, Apr 29, 2023 at 07:16:56AM +0800, Chen Yu wrote:
> > netperf
> > =======
> > case load baseline(std%) compare%( std%)
> > TCP_RR 56-threads 1.00 ( 1.96) +15.23 ( 4.67)
> > TCP_RR 112-threads 1.00 ( 1.84) +88.83 ( 4.37)
> > TCP_RR 168-threads 1.00 ( 0.41) +475.45 ( 4.45)
> > TCP_RR 224-threads 1.00 ( 0.62) +806.85 ( 3.67)
> > TCP_RR 280-threads 1.00 ( 65.80) +162.66 ( 10.26)
> > TCP_RR 336-threads 1.00 ( 17.30) -0.19 ( 19.07)
> > TCP_RR 392-threads 1.00 ( 26.88) +3.38 ( 28.91)
> > TCP_RR 448-threads 1.00 ( 36.43) -0.26 ( 33.72)
> > UDP_RR 56-threads 1.00 ( 7.91) +3.77 ( 17.48)
> > UDP_RR 112-threads 1.00 ( 2.72) -15.02 ( 10.78)
> > UDP_RR 168-threads 1.00 ( 8.86) +131.77 ( 13.30)
> > UDP_RR 224-threads 1.00 ( 9.54) +178.73 ( 16.75)
> > UDP_RR 280-threads 1.00 ( 15.40) +189.69 ( 19.36)
> > UDP_RR 336-threads 1.00 ( 24.09) +0.54 ( 22.28)
> > UDP_RR 392-threads 1.00 ( 39.63) -3.90 ( 33.77)
> > UDP_RR 448-threads 1.00 ( 43.57) +1.57 ( 40.43)
> >
> > tbench
> > ======
> > case load baseline(std%) compare%( std%)
> > loopback 56-threads 1.00 ( 0.50) +10.78 ( 0.52)
> > loopback 112-threads 1.00 ( 0.19) +2.73 ( 0.08)
> > loopback 168-threads 1.00 ( 0.09) +173.72 ( 0.47)
> > loopback 224-threads 1.00 ( 0.20) -2.13 ( 0.42)
> > loopback 280-threads 1.00 ( 0.06) -0.77 ( 0.15)
> > loopback 336-threads 1.00 ( 0.14) -0.08 ( 0.08)
> > loopback 392-threads 1.00 ( 0.17) -0.27 ( 0.86)
> > loopback 448-threads 1.00 ( 0.37) +0.32 ( 0.02)
>
> So,... I've been poking around with this a bit today and I'm not seeing
> it. On my ancient IVB-EP (2*10*2) with the code as in
> queue/sched/core I get:
>
> netperf NO_WA_WEIGHT NO_SIS_CURRENT
> NO_WA_BIAS SIS_CURRENT
> -------------------------------------------------------------------
> TCP_SENDFILE-1 : Avg: 40495.7 41899.7 42001 40783.4
> TCP_SENDFILE-10 : Avg: 37218.6 37200.1 37065.1 36604.4
> TCP_SENDFILE-20 : Avg: 21495.1 21516.6 21004.4 21356.9
> TCP_SENDFILE-40 : Avg: 6947.24 7917.64 7079.93 7231.3
> TCP_SENDFILE-80 : Avg: 4081.91 3572.48 3582.98 3615.85
> TCP_STREAM-1 : Avg: 37078.1 34469.4 37134.5 35095.4
> TCP_STREAM-10 : Avg: 31532.1 31265.8 31260.7 31588.1
> TCP_STREAM-20 : Avg: 17848 17914.9 17996.6 17937.4
> TCP_STREAM-40 : Avg: 7844.3 7201.65 7710.4 7790.62
> TCP_STREAM-80 : Avg: 2518.38 2932.74 2601.51 2903.89
> TCP_RR-1 : Avg: 84347.1 81056.2 81167.8 83541.3
> TCP_RR-10 : Avg: 71539.1 72099.5 71123.2 69447.9
> TCP_RR-20 : Avg: 51053.3 50952.4 50905.4 52157.2
> TCP_RR-40 : Avg: 46370.9 46477.5 46289.2 46350.7
> TCP_RR-80 : Avg: 21515.2 22497.9 22024.4 22229.2
> UDP_RR-1 : Avg: 96933 100076 95997.2 96553.3
> UDP_RR-10 : Avg: 83937.3 83054.3 83878.5 78998.6
> UDP_RR-20 : Avg: 61974 61897.5 61838.8 62926
> UDP_RR-40 : Avg: 56708.6 57053.9 56456.1 57115.2
> UDP_RR-80 : Avg: 26950 27895.8 27635.2 27784.8
> UDP_STREAM-1 : Avg: 52808.3 55296.8 52808.2 51908.6
> UDP_STREAM-10 : Avg: 45810 42944.1 43115 43561.2
> UDP_STREAM-20 : Avg: 19212.7 17572.9 18798.7 20066
> UDP_STREAM-40 : Avg: 13105.1 13096.9 13070.5 13110.2
> UDP_STREAM-80 : Avg: 6372.57 6367.96 6248.86 6413.09
>
>
> tbench
>
> NO_WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
>
> Throughput 626.57 MB/sec 2 clients 2 procs max_latency=0.095 ms
> Throughput 1316.08 MB/sec 5 clients 5 procs max_latency=0.106 ms
> Throughput 1905.19 MB/sec 10 clients 10 procs max_latency=0.161 ms
> Throughput 2428.05 MB/sec 20 clients 20 procs max_latency=0.284 ms
> Throughput 2323.16 MB/sec 40 clients 40 procs max_latency=0.381 ms
> Throughput 2229.93 MB/sec 80 clients 80 procs max_latency=0.873 ms
>
> WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
>
> Throughput 575.04 MB/sec 2 clients 2 procs max_latency=0.093 ms
> Throughput 1285.37 MB/sec 5 clients 5 procs max_latency=0.122 ms
> Throughput 1916.10 MB/sec 10 clients 10 procs max_latency=0.150 ms
> Throughput 2422.54 MB/sec 20 clients 20 procs max_latency=0.292 ms
> Throughput 2361.57 MB/sec 40 clients 40 procs max_latency=0.448 ms
> Throughput 2479.70 MB/sec 80 clients 80 procs max_latency=1.249 ms
>
> WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)
>
> Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
> Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
> Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
> Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
> Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
> Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms
>
> WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)
>
> Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
> Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
> Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
> Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
> Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
> Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms
>
>
> So what's going on here? I don't see anything exciting happening at the
> 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> up either :/
>
Thank you very much for trying this patch. This patch was found to mainly
benefit systems with a large number of CPUs in 1 LLC. Previously I tested
it on Sapphire Rapids (2x56C/224T) and Ice Lake Server (2x32C/128T)[1], and it
seems to have benefit on them. The benefit seems to come from:
1. reducing the waker stacking among many CPUs within 1 LLC
2. reducing the C2C overhead within 1 LLC
As a comparison, Prateek has tested this patch on the Zen3 platform,
which has 16 threads per LLC, smaller than Sapphire Rapids and Ice
Lake Server. He did not observe too much difference with this patch
applied, and only saw some limited improvement on tbench and SPECjbb.
So far I have not received performance difference reports from LKP on desktop
test boxes. Let me queue the full test on some desktops to confirm
whether this change has any impact on them.

[1] https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu


The original symptom I found was that there is quite some idle time
(up to 30%) when running the will-it-scale context switch test with
the same number of instances as online CPUs. Waking up the task locally
reduces the race condition and the C2C overhead within 1 LLC,
which is more severe on a system with a large number of CPUs in 1 LLC.

2023-05-01 18:54:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, May 01, 2023 at 05:32:05PM +0200, Mike Galbraith wrote:
> On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
> >
> > Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> > Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> > Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> > Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> > Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> > Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
> >
> >
> > So what's going on here? I don't see anything exciting happening at the
> > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > up either :/
>
> Are you running tbench in the GUI so the per second output stimulates
> assorted goo? I'm using KDE fwtw.

Nah, the IVB-EP is headless, doesn't even have systemd on, still running
sysvinit.

2023-05-02 03:17:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, 2023-05-01 at 20:49 +0200, Peter Zijlstra wrote:
> On Mon, May 01, 2023 at 05:32:05PM +0200, Mike Galbraith wrote:
> > On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
> > >
> > > Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> > > Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> > > Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> > > Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> > > Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> > > Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
> > >
> > >
> > > So what's going on here? I don't see anything exciting happening at the
> > > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > > up either :/
> >
> > Are you running tbench in the GUI so the per second output stimulates
> > assorted goo?  I'm using KDE fwtw.
>
> Nah, the IVB-EP is headless, doesn't even have systemd on, still running
> sysvinit.

Turns out load distribution on Yu's box wasn't going wonky way early
after all, so WA_WEIGHT's evil side is off topic wrt $subject. Hohum.

-Mike

2023-05-02 12:04:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, May 01, 2023 at 11:52:47PM +0800, Chen Yu wrote:

> > So,... I've been poking around with this a bit today and I'm not seeing
> > it. On my ancient IVB-EP (2*10*2) with the code as in
> > queue/sched/core I get:
> >
> > netperf                 NO_SIS_CURRENT   SIS_CURRENT          %
> > -----------------------------------------------------------------
> > TCP_SENDFILE-1 : Avg: 42001 40783.4 -2.89898
> > TCP_SENDFILE-10 : Avg: 37065.1 36604.4 -1.24295
> > TCP_SENDFILE-20 : Avg: 21004.4 21356.9 1.67822
> > TCP_SENDFILE-40 : Avg: 7079.93 7231.3 2.13802
> > TCP_SENDFILE-80 : Avg: 3582.98 3615.85 0.917393

> > TCP_STREAM-1 : Avg: 37134.5 35095.4 -5.49112
> > TCP_STREAM-10 : Avg: 31260.7 31588.1 1.04732
> > TCP_STREAM-20 : Avg: 17996.6 17937.4 -0.328951
> > TCP_STREAM-40 : Avg: 7710.4 7790.62 1.04041
> > TCP_STREAM-80 : Avg: 2601.51 2903.89 11.6232

> > TCP_RR-1 : Avg: 81167.8 83541.3 2.92419
> > TCP_RR-10 : Avg: 71123.2 69447.9 -2.35549
> > TCP_RR-20 : Avg: 50905.4 52157.2 2.45907
> > TCP_RR-40 : Avg: 46289.2 46350.7 0.13286
> > TCP_RR-80 : Avg: 22024.4 22229.2 0.929878

> > UDP_RR-1 : Avg: 95997.2 96553.3 0.579288
> > UDP_RR-10 : Avg: 83878.5 78998.6 -5.81782
> > UDP_RR-20 : Avg: 61838.8 62926 1.75812
> > UDP_RR-40 : Avg: 56456.1 57115.2 1.16746
> > UDP_RR-80 : Avg: 27635.2 27784.8 0.541339

> > UDP_STREAM-1 : Avg: 52808.2 51908.6 -1.70352
> > UDP_STREAM-10 : Avg: 43115 43561.2 1.03491
> > UDP_STREAM-20 : Avg: 18798.7 20066 6.74142
> > UDP_STREAM-40 : Avg: 13070.5 13110.2 0.303737
> > UDP_STREAM-80 : Avg: 6248.86 6413.09 2.62816


> > tbench

> > WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)
> >
> > Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
> > Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
> > Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
> > Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
> > Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
> > Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms
> >
> > WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)
> >
> > Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
> > Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
> > Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
> > Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
> > Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
> > Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms
> >
> >
> > So what's going on here? I don't see anything exciting happening at the
> > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > up either :/
> >
> Thank you very much for trying this patch. This patch was found to mainly
> benefit system with large number of CPUs in 1 LLC. Previously I tested
> it on Sapphire Rapids(2x56C/224T) and Ice Lake Server(2x32C/128T)[1], it
> seems to have benefit on them. The benefit seems to come from:
> 1. reducing the waker stacking among many CPUs within 1 LLC

I should be seeing that at 10 cores per LLC. And when we look at the
tbench results (never the most stable -- let me run a few more of those)
it looks like SIS_CURRENT is actually making that worse.

That latency spike at 20 seems stable for me -- and 3ms is rather small,
I've seen it up to 11ms (but typical in the 4-6 range). This does not
happen with NO_SIS_CURRENT and is a fairly big point against these
patches.

> 2. reducing the C2C overhead within 1 LLC

This is due to how L3 became non-inclusive with Skylake? I can't see
that because I don't have anything that recent :/

> So far I have not received any performance difference reports from LKP on desktop
> test boxes. Let me queue the full test on some desktops to confirm
> if this change has any impact on them.

Right, so I've updated my netperf results above to have a relative
difference between NO_SIS_CURRENT and SIS_CURRENT and I see some losses
at the low end. For servers that gets compensated at the high end, but
desktops tend to not get there much.
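
For reference, the % column above is simply the relative change of SIS_CURRENT
versus NO_SIS_CURRENT; taking the first netperf row as a sanity check:

    (40783.4 - 42001) / 42001 * 100  ~=  -2.899   (the -2.89898 shown above)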


2023-05-04 11:24:07

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On 2023-05-02 at 13:54:08 +0200, Peter Zijlstra wrote:
> On Mon, May 01, 2023 at 11:52:47PM +0800, Chen Yu wrote:
>
> > > So,... I've been poking around with this a bit today and I'm not seeing
> > > it. On my ancient IVB-EP (2*10*2) with the code as in
> > > queue/sched/core I get:
> > >
> > > So what's going on here? I don't see anything exciting happening at the
> > > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > > up either :/
> > >
> > Thank you very much for trying this patch. This patch was found to mainly
> > benefit systems with a large number of CPUs in 1 LLC. Previously I tested
> > it on Sapphire Rapids (2x56C/224T) and Ice Lake Server (2x32C/128T)[1], and it
> > seems to benefit them. The benefit seems to come from:
> > 1. reducing the waker stacking among many CPUs within 1 LLC
>
> I should be seeing that at 10 cores per LLC. And when we look at the
> tbench results (never the most stable -- let me run a few more of those)
> it looks like SIS_CURRENT is actually making that worse.
>
> That latency spike at 20 seems stable for me -- and 3ms is rather small,
> I've seen it up to 11ms (but typical in the 4-6 range). This does not
> happen with NO_SIS_CURRENT and is a fairly big point against these
> patches.
>
I tried to reproduce the issue on your kind of platform: I launched tbench with
nr_thread = 50% on an Ivy Bridge-EP, but I could not reproduce the issue
(one difference is that my default testing runs with perf record enabled).

I launched netperf/tbench under 50%/75%/100%/125% load on some platforms with
smaller numbers of CPUs, including:
Ivy Bridge-EP, nr_node: 2, nr_cpu: 48
Ivy Bridge, nr_node: 1, nr_cpu: 4
Coffee Lake, nr_node: 1, nr_cpu: 12
Comet Lake, nr_node: 1, nr_cpu: 20
Kaby Lake, nr_node: 1, nr_cpu: 8

All platforms are tested with the cpufreq governor set to performance to
get stable results. Each test lasts for 60 seconds.

Per the test results, no obvious netperf/tbench throughput regression was
detected on these platforms (within 3%), and some platforms such as
Comet Lake show some improvement.

The tbench.max_latency shows both improvement and degradation, and this
latency value seems unstable with or without the patch applied.
I am not sure how to interpret this value (should we look at the tail
latency as schbench does?), and the latency variance seems to be another
issue to be looked into.


netperf.Throughput_total_tps (higher is better):

Ivy Bridge-EP, nr_node: 2, nr_cpu: 48:
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 990828 -1.0% 980992
50%+UDP_RR: 1282489 +1.0% 1295717
75%+TCP_RR: 935827 +8.9% 1019470
75%+UDP_RR: 1164074 +11.6% 1298844
100%+TCP_RR: 1846962 -0.1% 1845311
100%+UDP_RR: 2557455 -2.3% 2497846
125%+TCP_RR: 1771652 -1.4% 1747653
125%+UDP_RR: 2415665 -1.1% 2388459

Ivy Bridge, nr_node: 1, nr_cpu: 4
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 52697 -1.2% 52088
50%+UDP_RR: 135397 -0.1% 135315
75%+TCP_RR: 135613 -0.6% 134777
75%+UDP_RR: 183439 -0.3% 182853
100%+TCP_RR: 183255 -1.3% 180859
100%+UDP_RR: 245836 -0.6% 244345
125%+TCP_RR: 174957 -2.1% 171258
125%+UDP_RR: 232509 -1.1% 229868


Coffee Lake, nr_node: 1, nr_cpu: 12
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 429718 -1.2% 424359
50%+UDP_RR: 536240 +0.1% 536646
75%+TCP_RR: 450310 -1.2% 444764
75%+UDP_RR: 538645 -1.0% 532995
100%+TCP_RR: 774423 -0.3% 771764
100%+UDP_RR: 971805 -0.3% 969223
125%+TCP_RR: 720546 +0.6% 724593
125%+UDP_RR: 911169 +0.2% 912576

Comet Lake, nr_node: 1, nr_cpu: 20
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+UDP_RR: 1174505 +4.6% 1228945
75%+TCP_RR: 833303 +20.2% 1001582
75%+UDP_RR: 1149171 +13.4% 1303623
100%+TCP_RR: 1928064 -0.5% 1917500
125%+TCP_RR: 74389 -0.1% 74304
125%+UDP_RR: 2564210 -1.1% 2535377


Kaby Lake, nr_node: 1, nr_cpu: 8
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 303956 -1.7% 298749
50%+UDP_RR: 382059 -0.8% 379176
75%+TCP_RR: 368399 -1.5% 362742
75%+UDP_RR: 459285 -0.3% 458020
100%+TCP_RR: 544630 -1.1% 538901
100%+UDP_RR: 684498 -0.6% 680450
125%+TCP_RR: 514266 +0.0% 514367
125%+UDP_RR: 645970 +0.2% 647473



tbench.max_latency (lower is better)

Ivy Bridge-EP, nr_node: 2, nr_cpu: 48:
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 45.31 -26.3% 33.41
75%: 269.36 -87.5% 33.72
100%: 274.76 -66.6% 91.85
125%: 723.34 -49.1% 368.29

Ivy Bridge, nr_node: 1, nr_cpu: 4
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.04 -70.5% 2.96
75%: 10.12 +63.0% 16.49
100%: 73.97 +148.1% 183.55
125%: 138.31 -39.9% 83.09


Comet Lake, nr_node: 1, nr_cpu: 20
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.59 +24.5% 13.18
75%: 11.53 -0.5% 11.47
100%: 414.65 -13.9% 356.93
125%: 411.51 -81.9% 74.56

Coffee Lake, nr_node: 1, nr_cpu: 12
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 452.07 -99.5% 2.06
75%: 4.42 +81.2% 8.00
100%: 76.11 -44.7% 42.12
125%: 47.06 +280.6% 179.09


Kaby Lake, nr_node: 1, nr_cpu: 8
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.52 +0.1% 10.53
75%: 12.95 +62.1% 20.99
100%: 25.63 +181.1% 72.05
125%: 94.05 -17.0% 78.06

> > 2. reducing the C2C overhead within 1 LLC
>
> This is due to how L3 became non-inclusive with Skylake? I can't see
> that because I don't have anything that recent :/
>
I checked with colleagues and it seems not to be related to the non-inclusive
L3, but rather to the number of CPUs: more CPUs make the distances across the
die longer, which adds to the latency.
> > So far I have not received any performance difference reports from LKP on desktop
> > test boxes. Let me queue the full test on some desktops to confirm
> > if this change has any impact on them.
>
> Right, so I've updated my netperf results above to have a relative
> difference between NO_SIS_CURRENT and SIS_CURRENT and I see some losses
> at the low end. For servers that gets compensated at the high end, but
> desktops tend to not get there much.
>
>
Since a large CPU count is one major motivation for waking up the task
locally, may I have your opinion on whether it is applicable to add llc_size
as a factor when deciding whether to wake up the wakee on the current CPU?
The smaller the llc_size is, the harder it would be for the wakee to be woken
up on the current CPU.
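
Not the posted patch, but a rough, self-contained illustration of what such an
llc_size scaling could mean in practice: the task_is_short() name, the linear
llc_size/64 scaling and the 100 us example duration below are all made up for
illustration; only the 0.5 ms default of sysctl_sched_migration_cost is taken
from the kernel.

/*
 * Hypothetical sketch only.  Scale the "short task" cutoff by the LLC
 * size, so that a small desktop LLC almost never keeps the wakee on the
 * current CPU while a large server LLC does so much more readily.
 */
#include <stdio.h>

typedef unsigned long long u64;

static const u64 sched_migration_cost_ns = 500000;   /* default 0.5 ms */

/* Would a task with this average duration count as "short" on this LLC? */
static int task_is_short(u64 dur_avg_ns, int llc_size)
{
        u64 thresh = sched_migration_cost_ns * llc_size / 64;

        return dur_avg_ns && dur_avg_ns < thresh;
}

int main(void)
{
        /* example LLC sizes: small client parts up to one SPR socket */
        const int llc_sizes[] = { 4, 8, 12, 20, 112 };
        const u64 dur_avg = 100000;     /* a task averaging 100 us per run */

        for (unsigned int i = 0; i < sizeof(llc_sizes) / sizeof(llc_sizes[0]); i++)
                printf("llc_size %3d: threshold %7llu ns -> short task? %s\n",
                       llc_sizes[i],
                       sched_migration_cost_ns * llc_sizes[i] / 64,
                       task_is_short(dur_avg, llc_sizes[i]) ? "yes" : "no");

        return 0;
}

With these made-up numbers, a 100 us task would only be treated as short on
LLCs of roughly 20 CPUs and up, which is the direction of the question above.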


thanks,
Chenyu

2023-05-12 08:42:36

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU



Hello,

kernel test robot noticed a 57.6% improvement of phoronix-test-suite.stress-ng.ContextSwitching.bogo_ops_s on:


commit: 485a13bb0c94668849d554f8f1f935a81f8dd2cd ("[PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU")
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Yu/sched-fair-Record-the-average-duration-of-a-task/20230428-232326
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git f31dcb152a3d0816e2f1deab4e64572336da197d
patch link: https://lore.kernel.org/all/4081178486e025c89dbb7cc0e62bbfab95fc794a.1682661027.git.yu.c.chen@intel.com/
patch subject: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

testcase: phoronix-test-suite
test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
parameters:

test: stress-ng-1.3.1
option_a: Context Switching
cpufreq_governor: performance

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/





Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/Context Switching/debian-x86_64-phoronix/lkp-csl-2sp7/stress-ng-1.3.1/phoronix-test-suite

commit:
26bcaaa5c6 ("sched/fair: Record the average duration of a task")
485a13bb0c ("sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU")

                  26bcaaa5c63c4ad3           485a13bb0c94668849d554f8f1f
                  ----------------           ---------------------------
                       %stddev      %change         %stddev
                           \            |               \
3465032 +57.6% 5461531 ? 14% phoronix-test-suite.stress-ng.ContextSwitching.bogo_ops_s
114.80 +327.0% 490.19 ? 10% phoronix-test-suite.time.elapsed_time
114.80 +327.0% 490.19 ? 10% phoronix-test-suite.time.elapsed_time.max
147934 ? 7% +14359.4% 21390420 ? 70% phoronix-test-suite.time.involuntary_context_switches
182173 +73.4% 315930 ? 5% phoronix-test-suite.time.minor_page_faults
4590 +26.9% 5826 ? 3% phoronix-test-suite.time.percent_of_cpu_this_job_got
4399 +448.9% 24148 ? 10% phoronix-test-suite.time.system_time
870.48 +408.4% 4425 ? 11% phoronix-test-suite.time.user_time
5.651e+08 +668.6% 4.343e+09 ? 19% phoronix-test-suite.time.voluntary_context_switches
3.528e+09 ? 3% +190.8% 1.026e+10 ? 9% cpuidle..time
3.046e+08 +260.4% 1.098e+09 ? 11% cpuidle..usage
166.16 +226.0% 541.68 ? 9% uptime.boot
7741 +89.3% 14658 ? 6% uptime.idle
80501 ? 54% +94.4% 156530 ? 24% numa-meminfo.node0.AnonHugePages
76.50 ? 8% +788.0% 679.33 ? 50% numa-meminfo.node0.Unevictable
59794 ? 36% +123.0% 133361 ? 15% numa-meminfo.node1.Active
7692 ? 12% +967.3% 82097 ? 6% numa-meminfo.node1.Active(anon)
281420 ? 23% +177.0% 779557 ? 21% numa-numastat.node0.local_node
292813 ? 23% +167.7% 783881 ? 21% numa-numastat.node0.numa_hit
361211 ? 17% +121.8% 801062 ? 22% numa-numastat.node1.local_node
369650 ? 17% +120.7% 815910 ? 21% numa-numastat.node1.numa_hit
127614 +86.2% 237590 ? 31% meminfo.Active
11364 ? 6% +664.0% 86826 ? 7% meminfo.Active(anon)
140254 ? 11% +67.9% 235428 ? 6% meminfo.AnonHugePages
721246 +11.2% 802088 ? 2% meminfo.Shmem
83.00 +969.1% 887.33 ? 18% meminfo.Unevictable
38.83 -33.0% 26.00 ? 4% vmstat.cpu.id
806.50 ? 4% -65.6% 277.83 ? 62% vmstat.io.bi
68.83 +28.3% 88.33 ? 3% vmstat.procs.r
8102056 +44.8% 11728871 ? 10% vmstat.system.cs
965452 -6.7% 900326 ? 5% vmstat.system.in
38.37 -12.3 26.09 ? 3% mpstat.cpu.all.idle%
0.03 ? 7% -0.0 0.01 ? 14% mpstat.cpu.all.iowait%
7.20 -1.5 5.73 ? 7% mpstat.cpu.all.irq%
0.22 ? 2% -0.1 0.16 ? 7% mpstat.cpu.all.soft%
45.11 +12.4 57.48 mpstat.cpu.all.sys%
9.07 +1.5 10.52 ? 3% mpstat.cpu.all.usr%
1.67 ? 82% +9040.0% 152.33 ? 56% numa-vmstat.node0.nr_mlock
18.67 ? 7% +807.1% 169.33 ? 51% numa-vmstat.node0.nr_unevictable
18.67 ? 7% +807.1% 169.33 ? 51% numa-vmstat.node0.nr_zone_unevictable
292934 ? 23% +167.7% 784191 ? 21% numa-vmstat.node0.numa_hit
281540 ? 23% +177.0% 779868 ? 21% numa-vmstat.node0.numa_local
1888 ? 12% +986.6% 20520 ? 6% numa-vmstat.node1.nr_active_anon
1888 ? 12% +986.6% 20520 ? 6% numa-vmstat.node1.nr_zone_active_anon
369806 ? 17% +120.7% 816096 ? 21% numa-vmstat.node1.numa_hit
361367 ? 17% +121.7% 801248 ? 22% numa-vmstat.node1.numa_local
2095 +10.6% 2317 turbostat.Avg_MHz
75.48 +8.5 83.97 turbostat.Busy%
1.658e+08 +224.5% 5.382e+08 ? 13% turbostat.C1
5.96 ? 4% -1.5 4.48 ? 11% turbostat.C1%
2506058 ? 56% +195.8% 7412626 ? 57% turbostat.C1E
20.85 ? 13% -35.0% 13.55 ? 13% turbostat.CPU%c1
1.14e+08 +288.9% 4.435e+08 ? 10% turbostat.IRQ
1.347e+08 +306.7% 5.477e+08 ? 10% turbostat.POLL
3.86 -0.4 3.44 ? 6% turbostat.POLL%
250.18 +6.5% 266.34 turbostat.PkgWatt
2872 ? 6% +655.6% 21702 ? 7% proc-vmstat.nr_active_anon
29062 +29.7% 37689 ? 46% proc-vmstat.nr_active_file
488299 +6.5% 519898 ? 5% proc-vmstat.nr_file_pages
22777 -2.2% 22281 proc-vmstat.nr_kernel_stack
21414 -5.0% 20337 proc-vmstat.nr_mapped
3.00 +6705.6% 204.17 ? 20% proc-vmstat.nr_mlock
3621 ? 2% +4.7% 3791 proc-vmstat.nr_page_table_pages
180334 +11.2% 200518 ? 2% proc-vmstat.nr_shmem
26676 +22.7% 32721 ? 35% proc-vmstat.nr_slab_reclaimable
55226 +3.4% 57117 ? 2% proc-vmstat.nr_slab_unreclaimable
20.00 +1005.8% 221.17 ? 18% proc-vmstat.nr_unevictable
2872 ? 6% +655.6% 21702 ? 7% proc-vmstat.nr_zone_active_anon
29062 +29.7% 37689 ? 46% proc-vmstat.nr_zone_active_file
20.00 +1005.8% 221.17 ? 18% proc-vmstat.nr_zone_unevictable
669039 +140.9% 1611469 ? 8% proc-vmstat.numa_hit
649181 +144.5% 1587192 ? 8% proc-vmstat.numa_local
784187 +125.5% 1768689 ? 8% proc-vmstat.pgalloc_normal
896335 +162.1% 2349602 ? 8% proc-vmstat.pgfault
686654 ? 2% +139.5% 1644338 ? 8% proc-vmstat.pgfree
110663 +207.5% 340238 ? 8% proc-vmstat.pgreuse
1068544 +263.0% 3878912 ? 9% proc-vmstat.unevictable_pgs_scanned
1.054e+10 +40.3% 1.479e+10 ? 9% perf-stat.i.branch-instructions
2.04e+08 +11.5% 2.274e+08 ? 4% perf-stat.i.branch-misses
4915549 ? 4% +11.9% 5499505 ? 8% perf-stat.i.cache-misses
1.07e+09 -11.9% 9.427e+08 ? 5% perf-stat.i.cache-references
8379617 +41.0% 11817356 ? 9% perf-stat.i.context-switches
4.01 ? 5% -11.7% 3.54 ? 7% perf-stat.i.cpi
2.062e+11 +8.7% 2.243e+11 perf-stat.i.cpu-cycles
3206852 -15.2% 2720851 ? 7% perf-stat.i.cpu-migrations
1.425e+10 +43.7% 2.047e+10 ? 10% perf-stat.i.dTLB-loads
0.07 ? 12% -0.0 0.05 ? 14% perf-stat.i.dTLB-store-miss-rate%
5278267 ? 3% -19.0% 4276247 ? 7% perf-stat.i.dTLB-store-misses
8.04e+09 +46.2% 1.176e+10 ? 10% perf-stat.i.dTLB-stores
29270062 +22.1% 35725942 ? 4% perf-stat.i.iTLB-load-misses
52729822 +27.5% 67248151 ? 6% perf-stat.i.iTLB-loads
5.149e+10 +41.5% 7.284e+10 ? 9% perf-stat.i.instructions
1.64 ? 6% -74.5% 0.42 ? 10% perf-stat.i.major-faults
2.15 +8.8% 2.34 perf-stat.i.metric.GHz
352.95 +41.6% 499.62 ? 9% perf-stat.i.metric.M/sec
5633 -24.3% 4266 perf-stat.i.minor-faults
90.31 +3.0 93.28 perf-stat.i.node-load-miss-rate%
2212505 ? 2% +12.6% 2490436 ? 3% perf-stat.i.node-load-misses
5635 -24.3% 4266 perf-stat.i.page-faults
20.77 -36.9% 13.11 ? 13% perf-stat.overall.MPKI
1.93 -0.4 1.54 ? 4% perf-stat.overall.branch-miss-rate%
0.46 ? 3% +0.1 0.58 ? 7% perf-stat.overall.cache-miss-rate%
4.00 -22.4% 3.11 ? 9% perf-stat.overall.cpi
0.38 ? 8% -0.1 0.24 ? 8% perf-stat.overall.dTLB-load-miss-rate%
0.07 ? 2% -0.0 0.04 ? 16% perf-stat.overall.dTLB-store-miss-rate%
35.70 -1.0 34.72 perf-stat.overall.iTLB-load-miss-rate%
1759 +15.6% 2034 ? 5% perf-stat.overall.instructions-per-iTLB-miss
0.25 +30.2% 0.33 ? 10% perf-stat.overall.ipc
93.90 +1.0 94.94 perf-stat.overall.node-load-miss-rate%
1.044e+10 +41.3% 1.476e+10 ? 9% perf-stat.ps.branch-instructions
2.02e+08 +12.3% 2.269e+08 ? 4% perf-stat.ps.branch-misses
4888066 ? 4% +12.3% 5487899 ? 8% perf-stat.ps.cache-misses
1.059e+09 -11.3% 9.402e+08 ? 5% perf-stat.ps.cache-references
8300663 +42.1% 11794849 ? 10% perf-stat.ps.context-switches
2.043e+11 +9.5% 2.237e+11 perf-stat.ps.cpu-cycles
3175301 -14.6% 2712811 ? 7% perf-stat.ps.cpu-migrations
1.411e+10 +44.8% 2.043e+10 ? 10% perf-stat.ps.dTLB-loads
5227681 ? 3% -18.4% 4263828 ? 7% perf-stat.ps.dTLB-store-misses
7.965e+09 +47.3% 1.174e+10 ? 10% perf-stat.ps.dTLB-stores
28995942 +22.9% 35648237 ? 4% perf-stat.ps.iTLB-load-misses
52224757 +28.5% 67109656 ? 6% perf-stat.ps.iTLB-loads
5.101e+10 +42.5% 7.27e+10 ? 10% perf-stat.ps.instructions
1.62 ? 6% -74.3% 0.42 ? 10% perf-stat.ps.major-faults
5627 -24.4% 4256 perf-stat.ps.minor-faults
2193451 ? 2% +13.2% 2483610 ? 3% perf-stat.ps.node-load-misses
5629 -24.4% 4257 perf-stat.ps.page-faults
5.883e+12 +506.6% 3.569e+13 ? 14% perf-stat.total.instructions
196075 ? 20% +675.2% 1519913 ? 22% sched_debug.cfs_rq:/.MIN_vruntime.avg
880934 ? 2% +959.5% 9333279 ? 22% sched_debug.cfs_rq:/.MIN_vruntime.max
361929 ? 7% +821.7% 3335971 ? 21% sched_debug.cfs_rq:/.MIN_vruntime.stddev
0.57 ? 7% +78.3% 1.01 ? 4% sched_debug.cfs_rq:/.h_nr_running.avg
0.60 ? 12% +30.6% 0.78 ? 2% sched_debug.cfs_rq:/.h_nr_running.stddev
7613 ? 5% +45.7% 11092 ? 2% sched_debug.cfs_rq:/.load.avg
32.92 ? 40% -46.2% 17.70 ? 6% sched_debug.cfs_rq:/.load_avg.avg
556.17 ? 21% -67.0% 183.28 ? 28% sched_debug.cfs_rq:/.load_avg.max
96.85 ? 32% -67.7% 31.28 ? 22% sched_debug.cfs_rq:/.load_avg.stddev
196075 ? 20% +675.2% 1519913 ? 22% sched_debug.cfs_rq:/.max_vruntime.avg
880934 ? 2% +959.5% 9333279 ? 22% sched_debug.cfs_rq:/.max_vruntime.max
361929 ? 7% +821.7% 3335971 ? 21% sched_debug.cfs_rq:/.max_vruntime.stddev
889655 ? 2% +1037.3% 10118475 ? 12% sched_debug.cfs_rq:/.min_vruntime.avg
903843 ? 2% +1051.9% 10411690 ? 12% sched_debug.cfs_rq:/.min_vruntime.max
876105 ? 2% +1017.0% 9785960 ? 13% sched_debug.cfs_rq:/.min_vruntime.min
4900 ? 5% +4241.8% 212778 ? 40% sched_debug.cfs_rq:/.min_vruntime.stddev
0.38 ? 5% +72.0% 0.66 ? 3% sched_debug.cfs_rq:/.nr_running.avg
564.00 +68.6% 951.12 ? 4% sched_debug.cfs_rq:/.runnable_avg.avg
1263 ? 10% +41.2% 1784 ? 5% sched_debug.cfs_rq:/.runnable_avg.max
203.26 ? 7% +52.2% 309.32 ? 5% sched_debug.cfs_rq:/.runnable_avg.stddev
9113 ? 67% +1960.8% 187805 ? 61% sched_debug.cfs_rq:/.spread0.avg
23301 ? 26% +1963.7% 480870 ? 42% sched_debug.cfs_rq:/.spread0.max
-4421 +3175.2% -144798 sched_debug.cfs_rq:/.spread0.min
4878 ? 5% +4258.8% 212641 ? 40% sched_debug.cfs_rq:/.spread0.stddev
378.07 ? 3% +59.1% 601.51 ? 5% sched_debug.cfs_rq:/.util_avg.avg
939.33 ? 3% +23.3% 1158 ? 5% sched_debug.cfs_rq:/.util_avg.max
146.76 ? 4% +42.2% 208.76 ? 6% sched_debug.cfs_rq:/.util_avg.stddev
13.05 ? 7% +1272.9% 179.21 ? 22% sched_debug.cfs_rq:/.util_est_enqueued.avg
475.17 ? 12% +102.1% 960.48 ? 5% sched_debug.cfs_rq:/.util_est_enqueued.max
56.28 ? 8% +334.6% 244.56 ? 3% sched_debug.cfs_rq:/.util_est_enqueued.stddev
447090 -71.0% 129667 ? 38% sched_debug.cpu.avg_idle.avg
690047 ? 12% -66.2% 233542 ? 44% sched_debug.cpu.avg_idle.max
105039 ? 10% -67.7% 33934 ? 36% sched_debug.cpu.avg_idle.stddev
80235 +243.3% 275481 ? 8% sched_debug.cpu.clock.avg
80242 +243.3% 275492 ? 8% sched_debug.cpu.clock.max
80227 +243.4% 275469 ? 8% sched_debug.cpu.clock.min
4.22 ? 12% +61.6% 6.81 ? 17% sched_debug.cpu.clock.stddev
78554 +238.5% 265921 ? 8% sched_debug.cpu.clock_task.avg
78911 +239.8% 268136 ? 8% sched_debug.cpu.clock_task.max
63258 +293.4% 248843 ? 8% sched_debug.cpu.clock_task.min
1588 ? 2% +51.4% 2404 ? 15% sched_debug.cpu.clock_task.stddev
1756 ? 4% +284.1% 6747 ? 12% sched_debug.cpu.curr->pid.avg
4095 +160.2% 10654 ? 6% sched_debug.cpu.curr->pid.max
1614 +170.2% 4363 ? 12% sched_debug.cpu.curr->pid.stddev
669155 ? 19% -22.9% 515937 sched_debug.cpu.max_idle_balance_cost.max
20831 ? 59% -90.5% 1983 ? 28% sched_debug.cpu.max_idle_balance_cost.stddev
0.00 ? 13% +87.8% 0.00 ? 12% sched_debug.cpu.next_balance.stddev
0.53 ? 5% +83.7% 0.98 ? 5% sched_debug.cpu.nr_running.avg
0.59 ? 8% +30.4% 0.77 ? 7% sched_debug.cpu.nr_running.stddev
2227313 ? 2% +1137.4% 27560829 ? 16% sched_debug.cpu.nr_switches.avg
2457119 ? 3% +1230.5% 32693099 ? 15% sched_debug.cpu.nr_switches.max
2000173 ? 3% +1032.1% 22644429 ? 20% sched_debug.cpu.nr_switches.min
204086 ? 31% +2005.5% 4297102 ? 37% sched_debug.cpu.nr_switches.stddev
80228 +243.4% 275469 ? 8% sched_debug.cpu_clk
79513 +245.5% 274754 ? 8% sched_debug.ktime
81061 +240.8% 276272 ? 8% sched_debug.sched_clk
58611259 +100.0% 1.172e+08 sched_debug.sysctl_sched.sysctl_sched_features
43.00 -20.3 22.74 ? 68% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
42.55 -20.0 22.50 ? 68% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
42.53 -20.0 22.50 ? 68% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
42.46 -20.0 22.46 ? 68% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
20.34 -9.7 10.63 ? 70% perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
17.07 -8.2 8.89 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
16.28 -7.8 8.47 ? 70% perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry
16.15 -7.8 8.40 ? 70% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle
15.06 -7.2 7.85 ? 70% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue
14.45 -7.1 7.33 ? 70% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
14.67 -7.1 7.60 ? 70% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
12.54 -6.1 6.48 ? 70% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
12.50 -6.0 6.46 ? 70% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
7.14 -3.8 3.38 ? 70% perf-profile.calltrace.cycles-pp.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
7.19 -3.4 3.77 ? 70% perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_entity.dequeue_task_fair.__schedule.schedule
5.84 -2.9 2.99 ? 70% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
5.73 -2.7 3.05 ? 70% perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
5.60 -2.6 2.98 ? 70% perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
7.18 -2.5 4.66 ? 47% perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
4.80 -2.1 2.66 ? 70% perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
6.55 -2.0 4.52 ? 31% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_write
4.77 -2.0 2.76 ? 70% perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
6.54 -2.0 4.54 ? 29% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_read
5.92 -1.7 4.26 ? 36% perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function
3.32 -1.6 1.69 ? 70% perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
6.75 -1.5 5.22 ? 26% perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common
3.12 -1.5 1.64 ? 71% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single
3.20 -1.4 1.76 ? 70% perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single
6.84 -1.4 5.42 ? 22% perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
2.90 -1.4 1.52 ? 71% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__sysvec_call_function_single
4.70 -1.3 3.42 ? 29% perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
2.32 -1.1 1.17 ? 70% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
1.38 -1.1 0.26 ?100% perf-profile.calltrace.cycles-pp.migrate_task_rq_fair.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common
2.25 -1.0 1.22 ? 70% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
1.25 -1.0 0.25 ?100% perf-profile.calltrace.cycles-pp.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common
2.14 -1.0 1.16 ? 70% perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.flush_smp_call_function_queue.do_idle.cpu_startup_entry
2.10 -1.0 1.13 ? 70% perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.flush_smp_call_function_queue.do_idle
2.09 -0.9 1.15 ? 70% perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
1.99 -0.9 1.06 ? 70% perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry
1.82 -0.9 0.90 ? 70% perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
1.97 -0.9 1.06 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.flush_smp_call_function_queue
1.82 -0.9 0.96 ? 70% perf-profile.calltrace.cycles-pp.set_next_entity.pick_next_task_fair.__schedule.schedule_idle.do_idle
1.58 -0.8 0.79 ? 70% perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
1.62 -0.8 0.84 ? 70% perf-profile.calltrace.cycles-pp.update_load_avg.set_next_entity.pick_next_task_fair.__schedule.schedule_idle
1.46 ? 2% -0.7 0.75 ? 70% perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
1.04 -0.5 0.52 ? 70% perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue
0.87 ? 2% -0.4 0.46 ? 70% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.82 -0.4 0.43 ? 70% perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.intel_idle_irq.cpuidle_enter_state.cpuidle_enter
0.75 ? 2% -0.4 0.36 ? 70% perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
0.79 -0.4 0.41 ? 70% perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.intel_idle_irq.cpuidle_enter_state
0.74 ? 2% -0.4 0.39 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.intel_idle_irq
0.78 -0.4 0.42 ? 70% perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.69 ? 2% -0.3 0.36 ? 70% perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
0.69 -0.3 0.38 ? 70% perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.78 ? 2% -0.3 0.48 ? 70% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.65 ? 2% -0.3 0.37 ? 70% perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.72 ? 2% -0.3 0.45 ? 70% perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.poll_idle.cpuidle_enter_state.cpuidle_enter
0.70 ? 2% -0.3 0.42 ? 70% perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.poll_idle.cpuidle_enter_state
0.65 ? 2% -0.2 0.40 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.poll_idle
0.56 -0.0 0.53 ? 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_write.vfs_write.ksys_write
0.51 +0.4 0.87 ? 49% perf-profile.calltrace.cycles-pp.copy_page_from_iter.pipe_write.vfs_write.ksys_write.do_syscall_64
0.92 +0.6 1.51 ? 44% perf-profile.calltrace.cycles-pp.restore_fpregs_from_fpstate.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
0.00 +0.8 0.79 ? 47% perf-profile.calltrace.cycles-pp._copy_from_iter.copy_page_from_iter.pipe_write.vfs_write.ksys_write
1.46 +0.9 2.33 ? 30% perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.51 +1.1 2.57 ? 34% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
11.60 +2.0 13.62 ? 6% perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_write.vfs_write.ksys_write
11.63 +2.1 13.74 ? 7% perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_read.vfs_read.ksys_read
11.65 +2.2 13.83 ? 7% perf-profile.calltrace.cycles-pp.schedule.pipe_write.vfs_write.ksys_write.do_syscall_64
11.70 +2.3 13.97 ? 8% perf-profile.calltrace.cycles-pp.schedule.pipe_read.vfs_read.ksys_read.do_syscall_64
7.84 +2.6 10.46 ? 5% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read
8.12 +2.6 10.76 ? 5% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read.ksys_read
7.90 +2.7 10.58 ? 6% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read
7.78 +2.7 10.53 ? 7% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write
6.05 +2.8 8.80 ? 35% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +2.8 2.75 ? 68% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
8.07 +2.8 10.84 ? 7% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write.ksys_write
8.31 +2.8 11.09 ? 6% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_read.vfs_read.ksys_read.do_syscall_64
7.85 +2.8 10.66 ? 8% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write
8.26 +2.9 11.18 ? 9% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.vfs_write.ksys_write.do_syscall_64
0.00 +3.7 3.72 ? 43% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
23.31 +6.5 29.83 ? 12% perf-profile.calltrace.cycles-pp.pipe_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
22.83 +6.6 29.45 ? 13% perf-profile.calltrace.cycles-pp.pipe_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.81 +7.0 30.77 ? 14% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.32 +7.1 30.44 ? 14% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
24.04 +7.2 31.19 ? 14% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.57 +7.3 30.90 ? 15% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.11 ? 3% +7.9 9.01 ? 30% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
0.88 ? 44% +7.9 8.81 ? 30% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.autoremove_wake_function
1.13 ? 3% +8.7 9.85 ? 36% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
53.92 +17.5 71.42 ? 17% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
54.24 +18.3 72.56 ? 18% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
43.00 -20.3 22.74 ? 68% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
43.00 -20.3 22.74 ? 68% perf-profile.children.cycles-pp.cpu_startup_entry
42.94 -20.2 22.70 ? 68% perf-profile.children.cycles-pp.do_idle
42.55 -20.0 22.50 ? 68% perf-profile.children.cycles-pp.start_secondary
21.77 -10.2 11.58 ? 69% perf-profile.children.cycles-pp.sched_ttwu_pending
20.67 -9.8 10.87 ? 69% perf-profile.children.cycles-pp.flush_smp_call_function_queue
14.82 -7.1 7.75 ? 68% perf-profile.children.cycles-pp.cpuidle_idle_call
12.67 -6.1 6.60 ? 68% perf-profile.children.cycles-pp.cpuidle_enter
12.65 -6.1 6.59 ? 68% perf-profile.children.cycles-pp.cpuidle_enter_state
16.44 -5.1 11.38 ? 34% perf-profile.children.cycles-pp.enqueue_entity
13.12 -4.0 9.16 ? 29% perf-profile.children.cycles-pp.dequeue_entity
7.29 -3.8 3.46 ? 70% perf-profile.children.cycles-pp.intel_idle_irq
15.52 -3.4 12.13 ? 22% perf-profile.children.cycles-pp.update_load_avg
5.80 -2.7 3.11 ? 68% perf-profile.children.cycles-pp.schedule_idle
5.23 -2.3 2.96 ? 68% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
4.93 -2.1 2.78 ? 69% perf-profile.children.cycles-pp.sysvec_call_function_single
4.79 -2.1 2.69 ? 69% perf-profile.children.cycles-pp.__sysvec_call_function_single
4.86 -2.0 2.85 ? 68% perf-profile.children.cycles-pp.poll_idle
4.85 -1.9 2.97 ? 51% perf-profile.children.cycles-pp.select_idle_cpu
5.97 -1.6 4.33 ? 35% perf-profile.children.cycles-pp.select_idle_sibling
3.35 -1.6 1.72 ? 69% perf-profile.children.cycles-pp.select_idle_core
6.76 -1.5 5.24 ? 25% perf-profile.children.cycles-pp.select_task_rq_fair
3.39 -1.5 1.88 ? 65% perf-profile.children.cycles-pp.available_idle_cpu
6.85 -1.4 5.42 ? 22% perf-profile.children.cycles-pp.select_task_rq
2.10 -0.8 1.30 ? 47% perf-profile.children.cycles-pp.ttwu_queue_wakelist
1.58 -0.8 0.79 ? 70% perf-profile.children.cycles-pp.set_task_cpu
1.38 -0.7 0.67 ? 70% perf-profile.children.cycles-pp.migrate_task_rq_fair
1.48 ? 2% -0.7 0.78 ? 66% perf-profile.children.cycles-pp.menu_select
1.36 -0.6 0.74 ? 70% perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
1.27 -0.6 0.67 ? 70% perf-profile.children.cycles-pp.__smp_call_single_queue
1.22 -0.6 0.64 ? 70% perf-profile.children.cycles-pp.llist_add_batch
1.59 -0.5 1.12 ? 29% perf-profile.children.cycles-pp.finish_task_switch
2.62 -0.5 2.17 ? 15% perf-profile.children.cycles-pp._raw_spin_lock
0.96 -0.4 0.51 ? 70% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.76 ? 2% -0.4 0.37 ? 70% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.53 ? 2% -0.3 0.24 ? 70% perf-profile.children.cycles-pp.remove_entity_load_avg
0.49 -0.3 0.22 ? 70% perf-profile.children.cycles-pp.tick_nohz_next_event
0.56 ? 2% -0.2 0.31 ? 70% perf-profile.children.cycles-pp.wake_affine
0.54 -0.2 0.30 ? 70% perf-profile.children.cycles-pp._find_next_bit
0.54 ? 2% -0.2 0.31 ? 70% perf-profile.children.cycles-pp.nohz_run_idle_balance
0.46 ? 2% -0.2 0.24 ? 70% perf-profile.children.cycles-pp.llist_reverse_order
0.41 -0.2 0.21 ? 70% perf-profile.children.cycles-pp.send_call_function_single_ipi
0.37 ? 2% -0.2 0.18 ? 70% perf-profile.children.cycles-pp.get_next_timer_interrupt
0.35 -0.2 0.19 ? 70% perf-profile.children.cycles-pp.pick_next_task_idle
0.30 -0.2 0.15 ? 70% perf-profile.children.cycles-pp.attach_entity_load_avg
0.29 -0.1 0.15 ? 70% perf-profile.children.cycles-pp.task_h_load
0.30 ? 2% -0.1 0.17 ? 70% perf-profile.children.cycles-pp.__update_idle_core
0.28 ? 4% -0.1 0.15 ? 59% perf-profile.children.cycles-pp.__irq_exit_rcu
0.23 ? 3% -0.1 0.12 ? 70% perf-profile.children.cycles-pp.call_cpuidle
0.21 ? 3% -0.1 0.10 ? 70% perf-profile.children.cycles-pp.__do_softirq
0.21 -0.1 0.12 ? 70% perf-profile.children.cycles-pp.ct_kernel_exit_state
0.20 ? 7% -0.1 0.11 ? 71% perf-profile.children.cycles-pp.hrtimer_next_event_without
0.20 ? 4% -0.1 0.10 ? 70% perf-profile.children.cycles-pp.hrtimer_get_next_event
0.72 ? 5% -0.1 0.63 ? 4% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.18 ? 4% -0.1 0.10 ? 70% perf-profile.children.cycles-pp.__bitmap_andnot
0.63 ? 5% -0.1 0.56 ? 4% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.14 ? 3% -0.1 0.07 ? 70% perf-profile.children.cycles-pp.put_prev_task_fair
0.13 ? 5% -0.1 0.06 ? 70% perf-profile.children.cycles-pp.rebalance_domains
0.16 ? 3% -0.1 0.09 ? 70% perf-profile.children.cycles-pp.tick_nohz_idle_exit
0.14 ? 4% -0.1 0.08 ? 70% perf-profile.children.cycles-pp.cpuidle_governor_latency_req
0.14 ? 4% -0.1 0.08 ? 70% perf-profile.children.cycles-pp.local_clock
0.13 ? 3% -0.1 0.08 ? 70% perf-profile.children.cycles-pp.native_irq_return_iret
0.12 ? 3% -0.1 0.07 ? 70% perf-profile.children.cycles-pp.resched_curr
0.11 ? 6% -0.1 0.06 ? 70% perf-profile.children.cycles-pp.newidle_balance
0.11 ? 4% -0.0 0.06 ? 70% perf-profile.children.cycles-pp.ct_idle_exit
0.10 ? 3% -0.0 0.06 ? 71% perf-profile.children.cycles-pp.__task_rq_lock
0.07 ? 6% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.load_balance
0.09 ? 4% -0.0 0.05 ? 71% perf-profile.children.cycles-pp.ct_kernel_enter
0.08 ? 5% -0.0 0.05 ? 70% perf-profile.children.cycles-pp.__mutex_lock
0.06 +0.1 0.13 ? 63% perf-profile.children.cycles-pp.__get_task_ioprio
0.07 +0.1 0.16 ? 62% perf-profile.children.cycles-pp.clear_buddies
0.11 ? 6% +0.1 0.20 ? 45% perf-profile.children.cycles-pp.put_prev_entity
0.10 ? 5% +0.1 0.19 ? 54% perf-profile.children.cycles-pp.__rdgsbase_inactive
0.00 +0.1 0.11 ? 77% perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.06 ? 7% +0.1 0.20 ? 67% perf-profile.children.cycles-pp.__cond_resched
0.11 ? 3% +0.1 0.25 ? 63% perf-profile.children.cycles-pp.__list_add_valid
0.18 ? 2% +0.2 0.32 ? 52% perf-profile.children.cycles-pp.__wrgsbase_inactive
0.07 ? 6% +0.2 0.24 ? 70% perf-profile.children.cycles-pp.__list_del_entry_valid
0.10 ? 5% +0.2 0.26 ? 68% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.10 ? 5% +0.2 0.27 ? 71% perf-profile.children.cycles-pp.__entry_text_start
0.08 ? 11% +0.2 0.26 ? 72% perf-profile.children.cycles-pp.__cgroup_account_cputime
0.11 ? 4% +0.2 0.29 ? 72% perf-profile.children.cycles-pp.__might_sleep
0.23 ? 2% +0.2 0.42 ? 50% perf-profile.children.cycles-pp.finish_wait
0.28 ? 3% +0.2 0.47 ? 50% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.13 ? 2% +0.2 0.34 ? 68% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.10 ? 7% +0.2 0.31 ? 71% perf-profile.children.cycles-pp.pick_next_entity
0.24 ? 3% +0.2 0.45 ? 55% perf-profile.children.cycles-pp.atime_needs_update
0.14 ? 3% +0.2 0.35 ? 76% perf-profile.children.cycles-pp.current_time
0.05 ? 8% +0.2 0.27 ?104% perf-profile.children.cycles-pp.set_next_buddy
0.15 ? 7% +0.2 0.38 ? 66% perf-profile.children.cycles-pp.cpuacct_charge
0.46 +0.2 0.68 ? 41% perf-profile.children.cycles-pp.native_sched_clock
0.38 +0.2 0.61 ? 47% perf-profile.children.cycles-pp._copy_to_iter
0.32 ? 2% +0.3 0.57 ? 53% perf-profile.children.cycles-pp.touch_atime
0.00 +0.3 0.26 ? 61% perf-profile.children.cycles-pp.wake_on_current
0.40 ? 2% +0.3 0.67 ? 49% perf-profile.children.cycles-pp.copy_page_to_iter
0.32 ? 2% +0.3 0.59 ? 59% perf-profile.children.cycles-pp.apparmor_file_permission
0.13 +0.3 0.42 ? 78% perf-profile.children.cycles-pp.__might_fault
0.39 +0.3 0.71 ? 51% perf-profile.children.cycles-pp.sched_clock_cpu
0.48 +0.3 0.81 ? 48% perf-profile.children.cycles-pp._copy_from_iter
0.36 ? 3% +0.3 0.70 ? 61% perf-profile.children.cycles-pp.security_file_permission
0.18 ? 5% +0.4 0.54 ? 75% perf-profile.children.cycles-pp.update_min_vruntime
0.52 +0.4 0.88 ? 49% perf-profile.children.cycles-pp.copy_page_from_iter
0.30 ? 2% +0.4 0.67 ? 62% perf-profile.children.cycles-pp.__might_resched
0.34 +0.5 0.86 ? 65% perf-profile.children.cycles-pp.os_xsave
0.50 ? 2% +0.5 1.03 ? 57% perf-profile.children.cycles-pp.mutex_lock
0.00 +0.6 0.57 ?107% perf-profile.children.cycles-pp.check_preempt_wakeup
0.92 +0.6 1.51 ? 44% perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
0.46 ? 2% +0.6 1.07 ? 63% perf-profile.children.cycles-pp.update_rq_clock
0.19 ? 3% +0.6 0.84 ? 87% perf-profile.children.cycles-pp.check_preempt_curr
0.17 ? 2% +0.7 0.84 ? 89% perf-profile.children.cycles-pp.__calc_delta
0.63 ? 2% +0.8 1.42 ? 66% perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
0.38 ? 3% +0.8 1.19 ? 75% perf-profile.children.cycles-pp.__update_load_avg_se
1.46 +0.9 2.34 ? 30% perf-profile.children.cycles-pp.switch_fpu_return
1.52 +1.1 2.58 ? 35% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
28.96 +1.6 30.56 perf-profile.children.cycles-pp.__schedule
1.26 +1.6 2.88 ? 63% perf-profile.children.cycles-pp.switch_mm_irqs_off
0.46 ? 3% +1.8 2.22 ? 87% perf-profile.children.cycles-pp.reweight_entity
1.01 ? 3% +2.1 3.15 ? 79% perf-profile.children.cycles-pp.update_curr
6.06 +2.8 8.82 ? 35% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
23.36 +4.5 27.87 ? 8% perf-profile.children.cycles-pp.schedule
15.63 +5.4 21.03 ? 6% perf-profile.children.cycles-pp.try_to_wake_up
16.19 +5.4 21.61 ? 6% perf-profile.children.cycles-pp.__wake_up_common
15.75 +5.5 21.26 ? 7% perf-profile.children.cycles-pp.autoremove_wake_function
16.58 +5.7 22.29 ? 8% perf-profile.children.cycles-pp.__wake_up_common_lock
23.33 +6.6 29.91 ? 12% perf-profile.children.cycles-pp.pipe_read
22.85 +6.7 29.51 ? 13% perf-profile.children.cycles-pp.pipe_write
23.82 +7.0 30.78 ? 14% perf-profile.children.cycles-pp.vfs_read
23.33 +7.1 30.45 ? 14% perf-profile.children.cycles-pp.vfs_write
24.04 +7.2 31.19 ? 14% perf-profile.children.cycles-pp.ksys_read
23.57 +7.3 30.90 ? 15% perf-profile.children.cycles-pp.ksys_write
53.93 +17.5 71.46 ? 17% perf-profile.children.cycles-pp.do_syscall_64
54.24 +18.3 72.56 ? 18% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
13.80 -4.6 9.18 ? 46% perf-profile.self.cycles-pp.update_load_avg
6.40 -3.4 3.00 ? 69% perf-profile.self.cycles-pp.intel_idle_irq
3.97 -1.7 2.29 ? 68% perf-profile.self.cycles-pp.poll_idle
3.32 -1.5 1.84 ? 65% perf-profile.self.cycles-pp.available_idle_cpu
3.10 -1.1 1.96 ? 37% perf-profile.self.cycles-pp.try_to_wake_up
1.36 -0.6 0.74 ? 70% perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
1.20 -0.6 0.63 ? 70% perf-profile.self.cycles-pp.llist_add_batch
1.02 -0.5 0.54 ? 70% perf-profile.self.cycles-pp.select_idle_core
0.82 -0.4 0.42 ? 70% perf-profile.self.cycles-pp.migrate_task_rq_fair
2.06 -0.3 1.78 ? 4% perf-profile.self.cycles-pp._raw_spin_lock
0.54 ? 2% -0.3 0.28 ? 70% perf-profile.self.cycles-pp.menu_select
0.51 ? 2% -0.2 0.27 ? 70% perf-profile.self.cycles-pp.__flush_smp_call_function_queue
0.50 -0.2 0.28 ? 70% perf-profile.self.cycles-pp._find_next_bit
0.44 ? 3% -0.2 0.23 ? 70% perf-profile.self.cycles-pp.llist_reverse_order
0.40 -0.2 0.21 ? 70% perf-profile.self.cycles-pp.send_call_function_single_ipi
0.40 -0.2 0.22 ? 70% perf-profile.self.cycles-pp.sched_ttwu_pending
0.36 -0.2 0.19 ? 70% perf-profile.self.cycles-pp.flush_smp_call_function_queue
0.40 -0.2 0.23 ? 70% perf-profile.self.cycles-pp.do_idle
0.30 -0.1 0.15 ? 70% perf-profile.self.cycles-pp.attach_entity_load_avg
0.29 -0.1 0.15 ? 70% perf-profile.self.cycles-pp.task_h_load
0.25 ? 2% -0.1 0.14 ? 71% perf-profile.self.cycles-pp.nohz_run_idle_balance
0.58 -0.1 0.47 ? 8% perf-profile.self.cycles-pp.finish_task_switch
0.22 ? 3% -0.1 0.12 ? 70% perf-profile.self.cycles-pp.call_cpuidle
0.21 -0.1 0.12 ? 70% perf-profile.self.cycles-pp.ct_kernel_exit_state
0.22 ? 2% -0.1 0.12 ? 70% perf-profile.self.cycles-pp.__update_idle_core
0.44 ? 2% -0.1 0.35 ? 12% perf-profile.self.cycles-pp.__wake_up_common
0.20 -0.1 0.11 ? 70% perf-profile.self.cycles-pp.cpuidle_idle_call
0.20 ? 6% -0.1 0.12 ? 71% perf-profile.self.cycles-pp.set_task_cpu
0.17 ? 4% -0.1 0.09 ? 70% perf-profile.self.cycles-pp.__bitmap_andnot
0.13 ? 3% -0.1 0.08 ? 70% perf-profile.self.cycles-pp.native_irq_return_iret
0.11 ? 4% -0.1 0.06 ? 70% perf-profile.self.cycles-pp.cpuidle_enter_state
0.12 ? 3% -0.1 0.07 ? 70% perf-profile.self.cycles-pp.resched_curr
0.13 ? 3% -0.1 0.08 ? 70% perf-profile.self.cycles-pp.wake_affine
0.10 ? 4% -0.0 0.06 ? 71% perf-profile.self.cycles-pp.newidle_balance
0.08 ? 6% -0.0 0.03 ? 70% perf-profile.self.cycles-pp.remove_entity_load_avg
0.12 ? 4% -0.0 0.08 ? 25% perf-profile.self.cycles-pp.copyin
0.07 ? 6% -0.0 0.04 ? 71% perf-profile.self.cycles-pp.__hrtimer_next_event_base
0.08 ? 6% -0.0 0.04 ? 71% perf-profile.self.cycles-pp.schedule_idle
0.09 ? 7% +0.1 0.14 ? 35% perf-profile.self.cycles-pp.put_prev_entity
0.06 ? 6% +0.1 0.13 ? 56% perf-profile.self.cycles-pp.activate_task
0.13 ? 3% +0.1 0.20 ? 41% perf-profile.self.cycles-pp.atime_needs_update
0.06 ? 6% +0.1 0.14 ? 61% perf-profile.self.cycles-pp.clear_buddies
0.00 +0.1 0.08 ? 49% perf-profile.self.cycles-pp.sched_clock_cpu
0.05 ? 8% +0.1 0.14 ? 68% perf-profile.self.cycles-pp.__wake_up_common_lock
0.10 ? 5% +0.1 0.18 ? 52% perf-profile.self.cycles-pp.__rdgsbase_inactive
0.09 +0.1 0.18 ? 59% perf-profile.self.cycles-pp.select_task_rq
0.00 +0.1 0.10 ? 68% perf-profile.self.cycles-pp._copy_to_iter
0.00 +0.1 0.11 ? 71% perf-profile.self.cycles-pp.__cond_resched
0.00 +0.1 0.11 ? 72% perf-profile.self.cycles-pp.security_file_permission
0.06 ? 6% +0.1 0.18 ? 75% perf-profile.self.cycles-pp.ksys_read
0.08 ? 4% +0.1 0.20 ? 63% perf-profile.self.cycles-pp._copy_from_iter
0.06 ? 7% +0.1 0.20 ? 73% perf-profile.self.cycles-pp.__cgroup_account_cputime
0.10 ? 4% +0.1 0.25 ? 62% perf-profile.self.cycles-pp.__list_add_valid
0.07 ? 5% +0.1 0.21 ? 70% perf-profile.self.cycles-pp.__list_del_entry_valid
0.06 ? 6% +0.1 0.20 ? 84% perf-profile.self.cycles-pp.check_preempt_curr
0.00 +0.1 0.15 ? 64% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.18 ? 2% +0.2 0.32 ? 52% perf-profile.self.cycles-pp.__wrgsbase_inactive
0.06 ? 6% +0.2 0.21 ? 77% perf-profile.self.cycles-pp.ksys_write
0.09 ? 5% +0.2 0.25 ? 70% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.09 ? 4% +0.2 0.25 ? 70% perf-profile.self.cycles-pp.__might_sleep
0.12 ? 4% +0.2 0.29 ? 61% perf-profile.self.cycles-pp.set_next_entity
0.10 ? 5% +0.2 0.27 ? 71% perf-profile.self.cycles-pp.__entry_text_start
0.25 ? 3% +0.2 0.43 ? 54% perf-profile.self.cycles-pp.vfs_read
0.10 ? 5% +0.2 0.28 ? 70% perf-profile.self.cycles-pp.pick_next_entity
0.12 ? 4% +0.2 0.30 ? 72% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.48 +0.2 0.67 ? 9% perf-profile.self.cycles-pp.select_idle_sibling
0.13 ? 3% +0.2 0.32 ? 68% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.23 ? 2% +0.2 0.43 ? 52% perf-profile.self.cycles-pp.mutex_lock
0.05 ? 7% +0.2 0.26 ?104% perf-profile.self.cycles-pp.set_next_buddy
0.09 ? 4% +0.2 0.31 ? 79% perf-profile.self.cycles-pp.do_syscall_64
0.15 ? 7% +0.2 0.37 ? 65% perf-profile.self.cycles-pp.cpuacct_charge
0.22 ? 2% +0.2 0.45 ? 55% perf-profile.self.cycles-pp.update_rq_clock
0.44 +0.2 0.67 ? 41% perf-profile.self.cycles-pp.native_sched_clock
0.27 +0.2 0.50 ? 52% perf-profile.self.cycles-pp.vfs_write
0.00 +0.3 0.26 ? 61% perf-profile.self.cycles-pp.wake_on_current
0.10 ? 3% +0.3 0.37 ? 80% perf-profile.self.cycles-pp.schedule
0.54 +0.3 0.82 ? 5% perf-profile.self.cycles-pp.switch_fpu_return
0.20 ? 3% +0.3 0.50 ? 71% perf-profile.self.cycles-pp.pick_next_task_fair
0.17 ? 6% +0.3 0.50 ? 76% perf-profile.self.cycles-pp.update_min_vruntime
0.24 ? 2% +0.3 0.58 ? 63% perf-profile.self.cycles-pp.dequeue_entity
0.30 ? 3% +0.4 0.66 ? 62% perf-profile.self.cycles-pp.__might_resched
0.23 +0.4 0.60 ? 70% perf-profile.self.cycles-pp.select_task_rq_fair
0.41 +0.4 0.83 ? 64% perf-profile.self.cycles-pp.enqueue_task_fair
0.36 ? 2% +0.4 0.78 ? 60% perf-profile.self.cycles-pp.pipe_write
0.00 +0.5 0.46 ?107% perf-profile.self.cycles-pp.check_preempt_wakeup
0.60 ? 2% +0.5 1.06 ? 56% perf-profile.self.cycles-pp.enqueue_entity
0.34 +0.5 0.86 ? 65% perf-profile.self.cycles-pp.os_xsave
0.92 +0.6 1.51 ? 44% perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
0.36 +0.6 0.97 ? 64% perf-profile.self.cycles-pp.reweight_entity
0.17 ? 3% +0.7 0.83 ? 89% perf-profile.self.cycles-pp.__calc_delta
0.61 ? 2% +0.8 1.37 ? 66% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.46 ? 6% +0.8 1.24 ? 78% perf-profile.self.cycles-pp.update_curr
0.37 ? 3% +0.8 1.16 ? 76% perf-profile.self.cycles-pp.__update_load_avg_se
0.30 +0.8 1.10 ? 56% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
1.24 +1.6 2.85 ? 63% perf-profile.self.cycles-pp.switch_mm_irqs_off
4.50 +1.6 6.14 ? 34% perf-profile.self.cycles-pp.syscall_exit_to_user_mode




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


