2023-04-21 08:21:47

by Chen Yu

Subject: [PATCH v7 0/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

The main purpose is to avoid too many cross-CPU wakeups when they are
unnecessary. Frequent cross-CPU wakeups bring significant damage
to some workloads, especially on systems with a high core count:
1. The cross-CPU wakeup triggers a race condition in which some wakers are
stacked on one CPU, which delays the wakeup of their wakees.
2. The cross-CPU wakeup brings cache-to-cache (c2c) overhead if the waker
and wakee share cache resources.

Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU
if both the waker and wakee are short-duration tasks and they wake up
each other.

The first patch introduces the definition of a short-duration task.
The second patch leverages the first patch to choose a local CPU for the wakee.

Overall there is a performance improvement when the benchmark has a
close-to-1:1 waker/wakee relationship, such as will-it-scale, netperf and tbench.
Netperf in particular shows a universal performance improvement under many
different utilization levels. There is no noticeable impact on schbench, hackbench,
or an OLTP workload with a commercial RDBMS, according to tests on
Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs.

Per the latest test on Zen3 from Prateek, tbench/netperf show good
improvement at 128 clients, and SPECjbb2015 shows improvement in max-jOPS.


Changes since v6:
1. Rename SIS_SHORT to SIS_CURRENT, which better describes this feature.
2. Remove the 50% utilization threshold and the has_idle_cores check.
After this change, SIS_CURRENT is applicable at all system utilization levels.
3. Add a condition that SIS_CURRENT is only enabled when the waker and wakee
wake up each other, that is, A->last_wakee = B and B->last_wakee = A.

Changes since v5:
1. Check the wakee_flips of the waker/wakee. If the wakee_flips
of both the waker and the wakee are 0, it indicates that the waker and the
wakee are waking up each other. In this case, put them together on the
same CPU. This is to avoid too many wakees being stacked on
one CPU, which might cause a regression on Redis.

Changes since v4:
1. Dietmar commented on the task duration calculation, so the
commit log was refined to reduce confusion.
2. Change [PATCH 1/2] to only record the average duration of a task,
so this change could benefit UTIL_EST_FASTER[1].
3. As v4 reported a regression on Zen3 and Kunpeng Arm64, add back
the system average utilization restriction: if the system
is not busy, do not enable the short wakeup. This logic has
shown improvement on Zen3[2].
4. Restrict the wakeup target to the current CPU, rather than both
the current CPU and the task's previous CPU. This could also benefit
wakeup optimization from interrupts in the future, as
suggested by Yicong.

Changes since v3:
1. Honglei and Josh were concerned that the threshold for short
task duration could be too long. Decreased the threshold from
sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8);
the '8' comes from get_update_sysctl_factor().
2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
3. Move the calculation of average duration from put_prev_task_fair()
to dequeue_task_fair(), because in v3 put_prev_task_fair() is not
invoked by pick_next_task_fair() in the fast path, so dur_avg could
not be updated in time.
4. Fix the comment in PATCH 2/2 that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
5. Move the scan for a CPU with a short-duration task from select_idle_cpu()
to select_idle_siblings(), because there is no CPU scan involved, per
Yicong.

Changes since v2:

1. Peter suggested comparing the duration of the waker with the cost of
scanning for an idle CPU: if the cost is higher than the task duration,
do not waste time finding an idle CPU; choose the local or previous
CPU directly. A prototype was created based on this suggestion.
However, according to the test results, this prototype did not inhibit
the cross-CPU wakeup and did not bring improvement, because the cost
of finding an idle CPU is small in the problematic scenario. The root
cause of the problem is a race condition between scanning for an idle
CPU and task enqueue (please refer to the commit log in PATCH 2/2).
So v3 does not change the core logic of v2, with some refinement based
on Peter's suggestion.

2. Simplify the logic to record the task duration, per Peter's and Abel's suggestions.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

v6: https://lore.kernel.org/lkml/[email protected]/
v5: https://lore.kernel.org/lkml/[email protected]/
v4: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/

Chen Yu (2):
sched/fair: Record the average duration of a task
sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 59 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
5 files changed, 66 insertions(+)

--
2.25.1


2023-04-21 08:21:54

by Chen Yu

Subject: [PATCH v7 1/2] sched/fair: Record the average duration of a task

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first glance, (p->se.sum_exec_runtime / p->nvcsw)
could be used to measure the task duration. However, history
from long ago is weighted too heavily in such a formula.
Ideally, old activity should decay and not affect
the current status too much.

Although something based on PELT could be used, se.util_avg might
not be appropriate to describe the task duration:
if task p1 and task p2 are doing frequent ping-pong scheduling on
one CPU, both p1 and p2 have a short duration, but the util_avg
can be up to 50%, which is inconsistent with the task duration.

It was found that there was once a similar feature to track the
duration of a task:
commit ad4b78bbcbab ("sched: Add new wakeup preemption mode: WAKEUP_RUNNING")
Unfortunately, it was reverted because it was an experiment. Pick the
patch up again, by recording the average duration when a task voluntarily
switches out.

For example, suppose on CPU1, task p1 and p2 run alternatively:

--------------------> time

| p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
^             ^               ^
|_____________|               |_____________________________________|
                                                                    ^
                                                                    |
                                                           p1 dequeued

p1's duration in one section is (1 + 0.5) ms, because if p2 did not
preempt p1, p1 could have run for 1.5 ms. This reflects the nature of a task:
how long it wishes to run at most.
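
For illustration, the standalone userspace sketch below mimics the dur_avg
bookkeeping added by this patch: the per-section duration is the runtime
accumulated since the last voluntary sleep, folded into dur_avg. The sample
durations and the local update_avg() helper here are assumptions for the
example (assuming the scheduler's usual 1/8 exponential moving average), not
kernel code:

#include <stdio.h>
#include <stdint.h>

/* assumed weighting: fold in 1/8 of the delta per sample */
static void update_avg(uint64_t *avg, uint64_t sample)
{
        int64_t diff = (int64_t)(sample - *avg);

        *avg += diff / 8;
}

int main(void)
{
        uint64_t sum_exec_runtime = 0;  /* plays the role of p->se.sum_exec_runtime */
        uint64_t prev_sleep_sum = 0;    /* plays the role of p->se.prev_sleep_sum_runtime */
        uint64_t dur_avg = 0;           /* plays the role of p->se.dur_avg */
        /* hypothetical runtime (ns) accumulated between two voluntary sleeps */
        uint64_t section_ns[] = { 1500000, 400000, 600000, 500000 };
        unsigned int i;

        for (i = 0; i < 4; i++) {
                sum_exec_runtime += section_ns[i];
                /* what dur_avg_update() does on a task_sleep dequeue */
                update_avg(&dur_avg, sum_exec_runtime - prev_sleep_sum);
                prev_sleep_sum = sum_exec_runtime;
                printf("section %u: dur_avg = %llu ns\n", i,
                       (unsigned long long)dur_avg);
        }

        return 0;
}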

Suggested-by: Tim Chen <[email protected]>
Suggested-by: Vincent Guittot <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 13 +++++++++++++
4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d654eb4cabd..f94e6aa159b0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -558,6 +558,9 @@ struct sched_entity {
u64 prev_sum_exec_runtime;

u64 nr_migrations;
+ u64 prev_sleep_sum_runtime;
+ /* average duration of a task */
+ u64 dur_avg;

#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d861db8aa7ab..59a6c0414a19 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4446,6 +4446,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
+ p->se.dur_avg = 0;
+ p->se.prev_sleep_sum_runtime = 0;
INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
__PS("nr_involuntary_switches", p->nivcsw);

P(se.load.weight);
+ P(se.dur_avg);
#ifdef CONFIG_SMP
P(se.avg.load_sum);
P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5da01a6b35a..4af5799b90fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6283,6 +6283,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

static void set_next_buddy(struct sched_entity *se);

+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+ u64 dur;
+
+ if (!task_sleep)
+ return;
+
+ dur = p->se.sum_exec_runtime - p->se.prev_sleep_sum_runtime;
+ p->se.prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+ update_avg(&p->se.dur_avg, dur);
+}
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -6355,6 +6367,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
+ dur_avg_update(p, task_sleep);
hrtick_update(rq);
}

--
2.25.1

2023-04-21 08:23:13

by Chen Yu

Subject: [PATCH v7 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

[Problem Statement]
For a workload that is doing frequent context switches, the throughput
scales well until the number of instances reaches a peak point. After
that peak point, the throughput drops significantly if the number of
instances continues to increase.

The will-it-scale context_switch1 test case exposes the issue. The
test platform is Intel Sapphire Rapids, which has 2 x 56C/112T and
224 CPUs in total. will-it-scale launches 1, 8, 16, ... instances
in turn. Each instance is composed of 2 tasks, and each pair
of tasks does ping-pong scheduling via pipe_read() and pipe_write().
No task is bound to any CPU. It was found that, once the number of
instances is higher than 56, the throughput drops accordingly:

          ^
throughput|
          |                 X
          |               X   X   X
          |             X           X   X
          |           X                   X
          |         X                       X
          |       X
          |     X
          |   X
          | X
          |
          +-----------------.------------------->
                            56
                                   number of instances

[Symptom analysis]

The performance degradation was caused by a high system idle
percentage (around 20% ~ 30%). The CPUs waste a lot of time in
idle and do nothing. As a comparison, if CPU affinity is set for
these workloads to stop them from migrating among CPUs,
the idle percentage drops to nearly 0%, and the throughput
increases a lot. This indicates room for optimization.

The cause is the race condition between select_task_rq() and
the task enqueue.

Suppose there are nr_cpus pairs of ping-pong scheduling
tasks. For example, p0' and p0 are ping-pong scheduling,
so are p1' <=> p1 and p2' <=> p2. None of these tasks are
bound to any CPU. The problem can be summarized as:
more than one waker is stacked on one CPU, which slows down
the wakeup of their wakees:

 CPU0                                   CPU1                            CPU2

 p0'                                    p1' => idle                     p2'

 try_to_wake_up(p0)                                                     try_to_wake_up(p2);
 CPU1 = select_task_rq(p0);                                             CPU1 = select_task_rq(p2);
 ttwu_queue(p0, CPU1);                                                  ttwu_queue(p2, CPU1);
   __ttwu_queue_wakelist(p0, CPU1);
   WRITE_ONCE(CPU1->ttwu_pending, 1);
   __smp_call_single_queue(CPU1, p0);   => ttwu_list->p0
                                        quitting cpuidle_idle_call()

                                                                        __ttwu_queue_wakelist(p2, CPU1);
                                                                        WRITE_ONCE(CPU1->ttwu_pending, 1);
                                        ttwu_list->p2->p0          <=   __smp_call_single_queue(CPU1, p2);

 p0' => idle
                                        sched_ttwu_pending()
                                          enqueue_task(p2 and p0)

                                        idle => p2

                                        ...
                                        p2 time slice expires
                                        ...
                                                                        !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                                                 <===   !!! p2 delays the wake up of p0' !!!
                                                                        !!! causes long idle on CPU0     !!!
                                        p2 => p0                        !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                        p0 wakes up p0'

 idle => p0'

Since there are many waker/wakee pairs in the system, the chain reaction
causes many CPUs to be victims. These idle CPUs wait for their waker to
be scheduled.

Tianchen mentioned the above issue here[1].

[Proposal]
The root cause is that there is no strict synchronization between
select_task_rq() and the setting of the ttwu_pending flag across several CPUs.
This might be by design, because the scheduler prefers parallel wakeups.

Avoid this problem indirectly: wake up the short task on the current CPU
if the following conditions are met:

1. waker A's rq->nr_running <= 1
2. waker A is a short duration task (waker will fall asleep soon)
3. wakee B is a short duration task (impact of B is minor to A)
4. A->wakee_flips is 0 and A->last_wakee = B
5. B->wakee_flips is 0 and B->last_wakee = A

The reason is that, if the waker is a short-duration task, it might
relinquish the CPU soon, and the wakee then has a chance to be scheduled.
On the other hand, if the wakee is a short-duration task, putting it on a
non-idle CPU brings minimal impact to the running task. Besides,
if two tasks wake up each other alternately, it suggests that
they usually share resources and should be put together.

This wakeup strategy can be regarded as a dynamic version of WF_CURRENT_CPU[2],
proposed by Andrei Vagin, except that this change treats the current CPU as the
last resort when the previous CPU is not idle, and avoids stacking tasks
on the current CPU as much as possible.

[Benchmark results]
The baseline is v6.3-rc3 tip:sched/core, on top of
commit 9b8e17813aec ("sched/core: Make sched_dynamic_mutex static").
The test platform, Intel Sapphire Rapids, has 2 x 56C/112T and 224 CPUs in total.
C-states deeper than C1E are disabled, Turbo is disabled, and the CPU frequency
governor is set to performance.

Overall there is a universal improvement for netperf/tbench/will-it-scale
under different loads, and there is no significant impact on hackbench/schbench.

will-it-scale
=============
case load baseline compare%
context_switch1 224 groups 1.00 +1467.97%

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 2.20) +13.15 ( 4.36)
TCP_RR 112-threads 1.00 ( 2.08) +87.39 ( 4.35)
TCP_RR 168-threads 1.00 ( 17.82) +419.70 ( 7.34)
TCP_RR 224-threads 1.00 ( 15.60) +779.49 ( 3.24)
TCP_RR 280-threads 1.00 ( 73.01) +192.72 ( 9.17)
TCP_RR 336-threads 1.00 ( 18.42) +0.21 ( 19.25)
TCP_RR 392-threads 1.00 ( 41.34) +26.30 ( 26.97)
TCP_RR 448-threads 1.00 ( 27.61) +0.02 ( 28.65)
UDP_RR 56-threads 1.00 ( 4.82) -0.60 ( 11.28)
UDP_RR 112-threads 1.00 ( 29.67) -15.97 ( 44.92)
UDP_RR 168-threads 1.00 ( 21.62) +113.99 ( 45.49)
UDP_RR 224-threads 1.00 ( 18.54) +184.40 ( 44.54)
UDP_RR 280-threads 1.00 ( 24.41) +199.72 ( 42.38)
UDP_RR 336-threads 1.00 ( 41.28) +0.40 ( 38.10)
UDP_RR 392-threads 1.00 ( 32.40) +1.73 ( 39.68)
UDP_RR 448-threads 1.00 ( 36.24) +3.69 ( 48.73)

tbench
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 1.13) +11.54 ( 1.29)
loopback 112-threads 1.00 ( 0.18) +2.82 ( 0.27)
loopback 168-threads 1.00 ( 0.02) +154.13 ( 0.20)
loopback 224-threads 1.00 ( 58.38) +82.56 ( 0.49)
loopback 280-threads 1.00 ( 0.06) -0.07 ( 0.28)
loopback 336-threads 1.00 ( 0.37) -1.24 ( 0.33)
loopback 392-threads 1.00 ( 0.11) +0.17 ( 0.09)
loopback 448-threads 1.00 ( 0.29) +0.32 ( 0.14)

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.95) +0.02 ( 0.60)
process-pipe 2-groups 1.00 ( 8.51) -0.43 ( 3.21)
process-pipe 4-groups 1.00 ( 7.39) -9.09 ( 2.50)
process-sockets 1-groups 1.00 ( 1.28) +1.03 ( 0.95)
process-sockets 2-groups 1.00 ( 0.85) +6.27 ( 10.32)
process-sockets 4-groups 1.00 ( 1.28) -0.51 ( 0.66)
threads-pipe 1-groups 1.00 ( 1.69) +0.66 ( 0.40)
threads-pipe 2-groups 1.00 ( 8.25) -7.78 ( 3.31)
threads-pipe 4-groups 1.00 ( 1.52) +1.50 ( 4.98)
threads-sockets 1-groups 1.00 ( 1.50) +0.96 ( 2.02)
threads-sockets 2-groups 1.00 ( 1.87) -1.15 ( 1.91)
threads-sockets 4-groups 1.00 ( 0.77) -0.73 ( 2.15)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 1.25) +0.00 ( 1.25)
normal 2-mthreads 1.00 ( 0.00) +0.88 ( 3.31)
normal 4-mthreads 1.00 ( 1.29) +0.91 ( 1.30)
normal 8-mthreads 1.00 ( 2.21) +0.00 ( 2.21)

Redis
=====
Launch 224 instances of redis-server on machine A and 224 instances
of redis-benchmark on machine B, and measure the SET/GET latency on B.
It was tested over a 1G NIC. The 99th percentile latency before vs. after
SIS_CURRENT did not change much:
        baseline   sis_current
SET     115 ms     116 ms
GET     225 ms     228 ms

Thanks to Prateek for testing on a dual-socket Zen3 system (2 x 64C/128T):
tbench and netperf show good improvements at 128 clients, and SPECjbb2015
shows some improvement in max-jOPS:
                            tip       SIS_CURRENT
SPECjbb2015 max-jOPS        100.00%   102.78%
SPECjbb2015 Critical-jOPS   100.00%   100.00%

Other benchmarks are performance-neutral.

[Limitations/Miscellaneous]

[a]
Peter has suggested[3] comparing task duration with the cost of searching
for an idle CPU. If the latter is higher, then give up the scan, to
achieve better task affine. However, this method does not fit in the case
encountered in this patch. Because there are plenty of (fast)idle CPUs in
the system, it will not take too long to find an idle CPU. The bottleneck is
caused by the race condition mentioned above.

[b]
The short task threshold is sysctl_sched_min_granularity / 8.
According to get_update_sysctl_factor(), the sysctl_sched_min_granularity
could be 0.75 msec * 4 for SCHED_TUNABLESCALING_LOG,
or 0.75 msec * ncpus for SCHED_TUNABLESCALING_LINEAR.
Choosing 8 as the divisor is a trade-off. Thanks Honglei for pointing
this out.
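
A quick worked example (values assumed, not from the patch): with
SCHED_TUNABLESCALING_LOG and a scaling factor of 4, sysctl_sched_min_granularity
is 0.75 msec * 4 = 3 msec, so a task counts as short when its dur_avg stays below
3 msec / 8 = 375 usec. The standalone snippet below mirrors the dur_avg check used
in wake_on_current():

#include <stdbool.h>
#include <stdio.h>

/* assumed value: 0.75 msec * factor 4 for SCHED_TUNABLESCALING_LOG */
#define MIN_GRANULARITY_NS      (750000ULL * 4)

/* mirrors: dur_avg != 0 && dur_avg * 8 < sysctl_sched_min_granularity */
static bool is_short_task(unsigned long long dur_avg_ns)
{
        return dur_avg_ns && (dur_avg_ns * 8) < MIN_GRANULARITY_NS;
}

int main(void)
{
        printf("dur_avg 200 us -> short: %d\n", is_short_task(200000)); /* 1 */
        printf("dur_avg 500 us -> short: %d\n", is_short_task(500000)); /* 0 */
        return 0;
}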

[c]
Inspired by Abel's Redis test, the short task definination not only considers
the duration, but also checks the wakee_flips/last_wakee of both waker and
wakee. A waker can only wake up the wakee on current CPU if the wakee_flips is 0
and last_wakee is the wakee. That is to say, task A only wakes up task B, and
task B only wakes up A, A and B can be put together on one CPU.

[d]
Inspired by Andrei's WF_CURRENT_CPU proposal, makes SIS_CURRENT a dynamic version
for WF_CURRENT_CPU: short tasks having close wakeup relationship with each other,
should be put on 1 CPU to benefit cache sharing.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/Y2O8a%[email protected]/

Suggested-by: Tim Chen <[email protected]>
Suggested-by: K Prateek Nayak <[email protected]>
Tested-by: kernel test robot <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 47 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4af5799b90fc..46c1321c0407 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6501,6 +6501,46 @@ static int wake_wide(struct task_struct *p)
return 1;
}

+/*
+ * Wake up the task on current CPU, if the following conditions are met:
+ *
+ * 1. waker A is the only running task on this_cpu
+ * 3. A is a short duration task (waker will fall asleep soon)
+ * 4. wakee B is a short duration task (impact of B on A is minor)
+ * 5. A and B wake up each other alternately
+ */
+static bool
+wake_on_current(int this_cpu, struct task_struct *p)
+{
+ if (!sched_feat(SIS_CURRENT))
+ return false;
+
+ if (cpu_rq(this_cpu)->nr_running > 1)
+ return false;
+
+ /*
+ * If a task switches in and then voluntarily relinquishes the
+ * CPU quickly, it is regarded as a short duration task. In that
+ * way, the short waker is likely to relinquish the CPU soon, which
+ * provides room for the wakee. Meanwhile, a short wakee would bring
+ * minor impact to the target rq. Put the short waker and wakee together
+ * bring benefit to cache-share task pairs and avoid migration overhead.
+ */
+ if (!current->se.dur_avg || ((current->se.dur_avg * 8) >= sysctl_sched_min_granularity))
+ return false;
+
+ if (!p->se.dur_avg || ((p->se.dur_avg * 8) >= sysctl_sched_min_granularity))
+ return false;
+
+ if (current->wakee_flips || p->wakee_flips)
+ return false;
+
+ if (current->last_wakee != p || p->last_wakee != current)
+ return false;
+
+ return true;
+}
+
/*
* The purpose of wake_affine() is to quickly determine on which CPU we can run
* soonest. For the purpose of speed we only consider the waking and previous
@@ -6594,6 +6634,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

+ if (target == nr_cpumask_bits && wake_on_current(this_cpu, p))
+ target = this_cpu;
+
schedstat_inc(p->stats.nr_wakeups_affine_attempts);
if (target != this_cpu)
return prev_cpu;
@@ -7115,6 +7158,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}
}

+ if (smp_processor_id() == target && wake_on_current(target, p))
+ return target;
+
i = select_idle_cpu(p, sd, has_idle_core, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..a3e05827f7e8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CURRENT, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1

2023-04-24 07:04:14

by kernel test robot

Subject: Re: [PATCH v7 0/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

Hello,

kernel test robot noticed a 2250.1% improvement of stress-ng.switch.ops_per_sec on:

patch: "sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU"

testcase: stress-ng
test machine: 224 threads 2 sockets (Sapphire Rapids) with 256G memory
parameters:

nr_threads: 100%
testtime: 60s
sc_pid_max: 4194304
class: scheduler
test: switch
cpufreq_governor: performance


Details are as below:

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
scheduler/gcc-11/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-spr-r02/switch/stress-ng/60s

commit:
dac54350b7 ("sched/fair: Record the average duration of a task")
f153e964b7 ("sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU")

dac54350b7363c69 f153e964b7d24fd0375f0efad66
---------------- ---------------------------
%stddev %change %stddev
\ | \
1.49e+08 ? 6% +2250.2% 3.503e+09 stress-ng.switch.ops
2483795 ? 6% +2250.1% 58370891 stress-ng.switch.ops_per_sec
69985 ? 4% +4.4e+06% 3.088e+09 stress-ng.time.involuntary_context_switches
31001 -5.1% 29406 stress-ng.time.minor_page_faults
12849 +60.9% 20680 stress-ng.time.percent_of_cpu_this_job_got
7325 +47.5% 10802 stress-ng.time.system_time
672.80 +209.1% 2079 stress-ng.time.user_time
2.836e+08 ? 6% +1273.7% 3.895e+09 stress-ng.time.voluntary_context_switches
14182 -14.6% 12109 uptime.idle
2.93e+09 -64.8% 1.03e+09 cpuidle..time
1.946e+08 ? 5% -82.0% 34997812 ? 41% cpuidle..usage
52479 ? 14% +90.9% 100199 ? 2% meminfo.Active
52368 ? 14% +91.1% 100095 ? 2% meminfo.Active(anon)
51616 ? 15% +90.0% 98071 ? 2% numa-meminfo.node1.Active
51523 ? 15% +90.3% 98037 ? 2% numa-meminfo.node1.Active(anon)
12896 ? 15% +90.7% 24589 ? 2% numa-vmstat.node1.nr_active_anon
12896 ? 15% +90.7% 24589 ? 2% numa-vmstat.node1.nr_zone_active_anon
29.58 -21.7 7.89 ? 2% mpstat.cpu.all.idle%
6.86 -5.6 1.28 ? 3% mpstat.cpu.all.irq%
0.41 ? 6% -0.4 0.05 ? 5% mpstat.cpu.all.soft%
57.53 +17.9 75.48 mpstat.cpu.all.sys%
5.62 +9.7 15.30 mpstat.cpu.all.usr%
31.00 -68.8% 9.67 ? 4% vmstat.cpu.id
5.00 +183.3% 14.17 ? 2% vmstat.cpu.us
185.00 +59.0% 294.17 vmstat.procs.r
7520613 ? 6% +1324.1% 1.071e+08 vmstat.system.cs
819272 ? 4% -35.1% 531594 ? 20% vmstat.system.in
13105 ? 14% +91.1% 25040 ? 2% proc-vmstat.nr_active_anon
135920 -6.0% 127705 proc-vmstat.nr_inactive_anon
57438 ? 3% +6.0% 60898 proc-vmstat.nr_shmem
13105 ? 14% +91.1% 25040 ? 2% proc-vmstat.nr_zone_active_anon
135920 -6.0% 127705 proc-vmstat.nr_zone_inactive_anon
905140 +1.4% 917932 proc-vmstat.numa_hit
702189 +1.8% 714949 proc-vmstat.numa_local
2252 ? 13% +91.7% 4317 ? 2% proc-vmstat.pgactivate
987449 +1.2% 998848 proc-vmstat.pgalloc_normal
2327 +16.0% 2700 turbostat.Avg_MHz
81.46 +12.1 93.60 turbostat.Busy%
23636974 ? 25% -87.4% 2970987 ? 18% turbostat.C1
1.60 ? 20% -1.4 0.15 ? 14% turbostat.C1%
1.678e+08 ? 3% -93.0% 11676240 ? 5% turbostat.C1E
15.38 ? 2% -12.1 3.25 ? 2% turbostat.C1E%
18.52 -65.5% 6.38 ? 2% turbostat.CPU%c1
0.06 ? 7% +773.7% 0.55 turbostat.IPC
53133659 ? 4% -34.6% 34758470 ? 21% turbostat.IRQ
2707361 ? 2% +633.0% 19845296 ? 69% turbostat.POLL
551.77 +21.5% 670.20 turbostat.PkgWatt
17.13 +8.1% 18.52 turbostat.RAMWatt
606383 ? 18% -97.7% 13786 ? 44% sched_debug.cfs_rq:/.MIN_vruntime.avg
1289912 ? 8% -84.0% 205876 ? 44% sched_debug.cfs_rq:/.MIN_vruntime.stddev
0.55 ? 4% +57.7% 0.87 sched_debug.cfs_rq:/.h_nr_running.avg
3.83 ? 22% -47.8% 2.00 ? 14% sched_debug.cfs_rq:/.h_nr_running.max
0.60 ? 8% -41.3% 0.36 ? 4% sched_debug.cfs_rq:/.h_nr_running.stddev
606383 ? 18% -97.7% 13786 ? 44% sched_debug.cfs_rq:/.max_vruntime.avg
1289912 ? 8% -84.0% 205876 ? 44% sched_debug.cfs_rq:/.max_vruntime.stddev
3441744 +73.7% 5979755 sched_debug.cfs_rq:/.min_vruntime.avg
3673914 +70.4% 6258914 sched_debug.cfs_rq:/.min_vruntime.max
2245166 +65.6% 3719103 sched_debug.cfs_rq:/.min_vruntime.min
88019 ? 2% +106.3% 181555 ? 7% sched_debug.cfs_rq:/.min_vruntime.stddev
0.38 ? 4% +40.3% 0.53 sched_debug.cfs_rq:/.nr_running.avg
0.34 ? 3% -61.9% 0.13 ? 26% sched_debug.cfs_rq:/.nr_running.stddev
622.22 ? 2% +33.3% 829.67 sched_debug.cfs_rq:/.runnable_avg.avg
2083 ? 13% -10.9% 1855 ? 5% sched_debug.cfs_rq:/.runnable_avg.max
320.17 ? 5% -45.6% 174.13 ? 10% sched_debug.cfs_rq:/.runnable_avg.stddev
-1218600 +84.3% -2246378 sched_debug.cfs_rq:/.spread0.min
87973 ? 2% +107.2% 182295 ? 7% sched_debug.cfs_rq:/.spread0.stddev
400.48 ? 2% +45.1% 581.11 sched_debug.cfs_rq:/.util_avg.avg
188.48 ? 6% -20.9% 149.09 ? 9% sched_debug.cfs_rq:/.util_avg.stddev
75.30 ? 8% +411.6% 385.20 ? 2% sched_debug.cfs_rq:/.util_est_enqueued.avg
106.05 ? 7% +53.1% 162.35 ? 6% sched_debug.cfs_rq:/.util_est_enqueued.stddev
648805 ? 6% +55.8% 1010843 ? 13% sched_debug.cpu.avg_idle.max
86024 ? 6% +44.6% 124370 ? 11% sched_debug.cpu.avg_idle.stddev
2332 ? 7% +81.6% 4235 sched_debug.cpu.curr->pid.avg
2566 ? 2% -77.1% 587.02 ? 7% sched_debug.cpu.curr->pid.stddev
0.47 ? 6% +65.4% 0.77 sched_debug.cpu.nr_running.avg
0.56 ? 2% -34.3% 0.37 ? 4% sched_debug.cpu.nr_running.stddev
1031696 ? 6% +1343.5% 14892376 sched_debug.cpu.nr_switches.avg
1140186 ? 4% +1348.9% 16520256 sched_debug.cpu.nr_switches.max
678987 ? 6% +1130.3% 8353556 ? 15% sched_debug.cpu.nr_switches.min
54009 ? 29% +1497.3% 862684 ? 11% sched_debug.cpu.nr_switches.stddev
58611259 +100.0% 1.172e+08 sched_debug.sysctl_sched.sysctl_sched_features
15.59 -90.5% 1.47 ? 4% perf-stat.i.MPKI
1.127e+10 ? 5% +912.3% 1.141e+11 perf-stat.i.branch-instructions
1.43 -0.5 0.95 perf-stat.i.branch-miss-rate%
1.572e+08 ? 6% +529.5% 9.894e+08 perf-stat.i.branch-misses
1.27 ? 14% +29.3 30.61 ? 6% perf-stat.i.cache-miss-rate%
5226056 ? 7% +840.6% 49157630 ? 5% perf-stat.i.cache-misses
9.044e+08 ? 6% -76.3% 2.145e+08 ? 2% perf-stat.i.cache-references
7735617 ? 6% +1348.7% 1.121e+08 perf-stat.i.context-switches
8.96 ? 5% -82.5% 1.56 ? 4% perf-stat.i.cpi
5.207e+11 +17.6% 6.122e+11 perf-stat.i.cpu-cycles
2826254 ? 6% -92.6% 210370 ? 5% perf-stat.i.cpu-migrations
105045 ? 7% -82.7% 18181 ? 6% perf-stat.i.cycles-between-cache-misses
0.37 ? 7% -0.3 0.03 ? 7% perf-stat.i.dTLB-load-miss-rate%
59575414 ? 9% -92.4% 4518176 ? 8% perf-stat.i.dTLB-load-misses
1.541e+10 ? 5% +969.1% 1.647e+11 perf-stat.i.dTLB-loads
0.08 -0.1 0.01 ? 7% perf-stat.i.dTLB-store-miss-rate%
6651717 ? 5% -89.8% 680779 ? 5% perf-stat.i.dTLB-store-misses
8.526e+09 ? 6% +1076.1% 1.003e+11 perf-stat.i.dTLB-stores
5.657e+10 ? 5% +905.0% 5.686e+11 perf-stat.i.instructions
0.14 ? 7% +536.8% 0.91 perf-stat.i.ipc
2.32 +17.7% 2.73 perf-stat.i.metric.GHz
94.15 ? 4% +1221.0% 1243 perf-stat.i.metric.K/sec
160.83 ? 5% +951.6% 1691 perf-stat.i.metric.M/sec
96.30 +2.7 99.03 perf-stat.i.node-load-miss-rate%
2066362 ? 8% +1017.1% 23083138 ? 5% perf-stat.i.node-load-misses
78248 ? 11% -45.9% 42316 ? 11% perf-stat.i.node-loads
16.04 -97.7% 0.37 perf-stat.overall.MPKI
1.39 -0.5 0.87 perf-stat.overall.branch-miss-rate%
0.57 ? 12% +22.7 23.25 ? 7% perf-stat.overall.cache-miss-rate%
9.27 ? 5% -88.4% 1.07 perf-stat.overall.cpi
102426 ? 7% -87.8% 12467 ? 5% perf-stat.overall.cycles-between-cache-misses
0.39 ? 6% -0.4 0.00 ? 9% perf-stat.overall.dTLB-load-miss-rate%
0.08 -0.1 0.00 ? 6% perf-stat.overall.dTLB-store-miss-rate%
0.11 ? 5% +759.8% 0.93 perf-stat.overall.ipc
96.35 +3.5 99.82 perf-stat.overall.node-load-miss-rate%
1.094e+10 ? 5% +927.6% 1.124e+11 perf-stat.ps.branch-instructions
1.522e+08 ? 6% +540.5% 9.746e+08 perf-stat.ps.branch-misses
4975869 ? 6% +873.2% 48427022 ? 5% perf-stat.ps.cache-misses
8.813e+08 ? 6% -76.3% 2.086e+08 ? 2% perf-stat.ps.cache-references
7529020 ? 6% +1367.2% 1.105e+08 perf-stat.ps.context-switches
216929 +1.3% 219851 perf-stat.ps.cpu-clock
5.072e+11 +18.7% 6.019e+11 perf-stat.ps.cpu-cycles
2754210 ? 6% -92.8% 198251 ? 5% perf-stat.ps.cpu-migrations
58032406 ? 9% -92.6% 4269838 ? 8% perf-stat.ps.dTLB-load-misses
1.496e+10 ? 5% +984.7% 1.623e+11 perf-stat.ps.dTLB-loads
6474992 ? 5% -90.0% 647675 ? 5% perf-stat.ps.dTLB-store-misses
8.286e+09 ? 6% +1092.6% 9.882e+10 perf-stat.ps.dTLB-stores
5.491e+10 ? 5% +920.3% 5.602e+11 perf-stat.ps.instructions
1984455 ? 7% +1046.4% 22749741 ? 5% perf-stat.ps.node-load-misses
74637 ? 12% -44.7% 41249 ? 11% perf-stat.ps.node-loads
216929 +1.3% 219851 perf-stat.ps.task-clock
3.397e+12 ? 6% +921.9% 3.471e+13 perf-stat.total.instructions



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



2023-04-26 14:15:42

by Peter Zijlstra

Subject: Re: [PATCH v7 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Sat, Apr 22, 2023 at 12:08:18AM +0800, Chen Yu wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4af5799b90fc..46c1321c0407 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6501,6 +6501,46 @@ static int wake_wide(struct task_struct *p)
> return 1;
> }
>
> +/*
> + * Wake up the task on current CPU, if the following conditions are met:
> + *
> + * 1. waker A is the only running task on this_cpu
> + * 3. A is a short duration task (waker will fall asleep soon)
> + * 4. wakee B is a short duration task (impact of B on A is minor)
> + * 5. A and B wake up each other alternately
> + */
> +static bool
> +wake_on_current(int this_cpu, struct task_struct *p)
> +{
> + if (!sched_feat(SIS_CURRENT))
> + return false;
> +
> + if (cpu_rq(this_cpu)->nr_running > 1)
> + return false;
> +
> + /*
> + * If a task switches in and then voluntarily relinquishes the
> + * CPU quickly, it is regarded as a short duration task. In that
> + * way, the short waker is likely to relinquish the CPU soon, which
> + * provides room for the wakee. Meanwhile, a short wakee would bring
> + * minor impact to the target rq. Put the short waker and wakee together
> + * bring benefit to cache-share task pairs and avoid migration overhead.
> + */
> + if (!current->se.dur_avg || ((current->se.dur_avg * 8) >= sysctl_sched_min_granularity))
> + return false;
> +
> + if (!p->se.dur_avg || ((p->se.dur_avg * 8) >= sysctl_sched_min_granularity))
> + return false;
> +
> + if (current->wakee_flips || p->wakee_flips)
> + return false;
> +
> + if (current->last_wakee != p || p->last_wakee != current)
> + return false;
> +
> + return true;
> +}

So I was going to play with this and found I needed to change things up
since these sysctl's no longer exist in my EEVDF branch.

And while I can easily do
's/sysctl_sched_min_granularity/sysctl_sched_base_slice/', it did make
me wonder if that's the right value to use.

min_gran/base_slice is related to how long we want a task to run before
switching, but that is not related to how long it needs to run to
establish a cache footprint.

Would not sched_migration_cost be a better measure to compare against?
That is also used in task_hot() to prevent migrations.
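
A minimal sketch of that substitution (untested, and assuming
sysctl_sched_migration_cost keeps its 500 usec default) would replace the two
duration checks in wake_on_current() with something like:

        /* sketch only: bound dur_avg by sched_migration_cost instead */
        if (!current->se.dur_avg ||
            current->se.dur_avg >= sysctl_sched_migration_cost)
                return false;

        if (!p->se.dur_avg || p->se.dur_avg >= sysctl_sched_migration_cost)
                return false;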

2023-04-27 08:04:27

by Chen Yu

Subject: Re: [PATCH v7 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On 2023-04-26 at 16:03:24 +0200, Peter Zijlstra wrote:
> On Sat, Apr 22, 2023 at 12:08:18AM +0800, Chen Yu wrote:
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4af5799b90fc..46c1321c0407 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6501,6 +6501,46 @@ static int wake_wide(struct task_struct *p)
> > return 1;
> > }
> >
> > +/*
> > + * Wake up the task on current CPU, if the following conditions are met:
> > + *
> > + * 1. waker A is the only running task on this_cpu
> > + * 3. A is a short duration task (waker will fall asleep soon)
> > + * 4. wakee B is a short duration task (impact of B on A is minor)
> > + * 5. A and B wake up each other alternately
> > + */
> > +static bool
> > +wake_on_current(int this_cpu, struct task_struct *p)
> > +{
> > + if (!sched_feat(SIS_CURRENT))
> > + return false;
> > +
> > + if (cpu_rq(this_cpu)->nr_running > 1)
> > + return false;
> > +
> > + /*
> > + * If a task switches in and then voluntarily relinquishes the
> > + * CPU quickly, it is regarded as a short duration task. In that
> > + * way, the short waker is likely to relinquish the CPU soon, which
> > + * provides room for the wakee. Meanwhile, a short wakee would bring
> > + * minor impact to the target rq. Put the short waker and wakee together
> > + * bring benefit to cache-share task pairs and avoid migration overhead.
> > + */
> > + if (!current->se.dur_avg || ((current->se.dur_avg * 8) >= sysctl_sched_min_granularity))
> > + return false;
> > +
> > + if (!p->se.dur_avg || ((p->se.dur_avg * 8) >= sysctl_sched_min_granularity))
> > + return false;
> > +
> > + if (current->wakee_flips || p->wakee_flips)
> > + return false;
> > +
> > + if (current->last_wakee != p || p->last_wakee != current)
> > + return false;
> > +
> > + return true;
> > +}
>
> So I was going to play with this and found I needed to change things up
> since these sysctl's no longer exist in my EEVDF branch.
>
> And while I can easily do
> 's/sysctl_sched_min_granularity/sysctl_sched_base_slice/', it did make
> me wonder if that's the right value to use.
>
> min_gran/base_slice is related to how long we want a task to run before
> switching, but that is not related to how long it needs to run to
> establish a cache footprint.
>
> Would not sched_migration_cost be a better measure to compare against?
> That is also used in task_hot() to prevent migrations.
Yes, thanks for the suggestion, this looks more reasonable. I'll tune based on
this and check the result.

thanks,
Chenyu