2023-02-03 05:16:32

by Chen Yu

Subject: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

The main purpose is to avoid unnecessary cross-CPU wakeups. Frequent
cross-CPU wakeups hurt some workloads significantly, especially on
high core count systems.

Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU,
if both the waker and the wakee are short-duration tasks. A
short-duration task could become a troublemaker on a high-load system,
because it can cause frequent context switches. So this strategy only
takes effect when the system is busy. Besides, it is unreasonable to
inhibit the idle CPU scan when there are still idle CPUs.

The first patch introduces the definition of a short-duration task.
The second patch leverages it to choose a local CPU for the wakee.

Overall there is a significant performance improvement on an Intel
2 x 56C/112T platform, such as will-it-scale (1200+%) and
netperf (600+%) in some cases, and no noticeable impact on
schbench, hackbench, tbench and an OLTP workload with a commercial RDBMS.

Test results on other platforms, such as Zen3 and Kunpeng Arm64, are
welcome. Prateek and Yicong, it would be appreciated if you could give
this version a try.

Changes since v4:
1. Dietmar has commented on the task duration calculation. So refined
the commit log to reduce confusion.
2. Change [PATCH 1/2] to only record the average duration of a task.
So this change could benefit UTIL_EST_FASTER[1].
3. As v4 reported regression on Zen3 and Kunpeng Arm64, add back
the system average utilization restriction that, if the system
is not busy, do not enable the short wake up. Above logic has
shown improvement on Zen3[2].
4. Restrict the wakeup target to be current CPU, rather than both
current CPU and task's previous CPU. This could also benefit
wakeup optimization from interrupt in the future, which is
suggested by Yicong.

Changes since v3:
1. Honglei and Josh have concern that the threshold of short
task duration could be too long. Decreased the threshold from
sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
and the '8' comes from get_update_sysctl_factor().
2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
3. Move the calculation of average duration from put_prev_task_fair()
to dequeue_task_fair(). Because there is an issue in v3 that,
put_prev_task_fair() will not be invoked by pick_next_task_fair()
in fast path, thus the dur_avg could not be updated timely.
4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
5. Move the scan for CPU with short duration task from select_idle_cpu()
to select_idle_sibling(), because there is no CPU scan involved, per
Yicong.

Changes since v2:

1. Peter suggested comparing the duration of waker and the cost to
scan for an idle CPU: If the cost is higher than the task duration,
do not waste time finding an idle CPU, choose the local or previous
CPU directly. A prototype was created based on this suggestion.
However, according to the test result, this prototype did not inhibit
the cross-CPU wakeup and did not bring improvement, because the cost
to find an idle CPU is small in the problematic scenario. The root
cause of the problem is a race condition between scanning for an idle
CPU and task enqueue(please refer to the commit log in PATCH 2/2).
So v3 does not change the core logic of v2, with some refinement based
on Peter's suggestion.

2. Simplify the logic to record the task duration per Peter and Abel's suggestion.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

v4: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/

Chen Yu (2):
sched/fair: Record the average duration of a task
sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
5 files changed, 46 insertions(+)

--
2.25.1



2023-02-03 05:16:47

by Chen Yu

Subject: [PATCH v5 1/2] sched/fair: Record the average duration of a task

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first glance, (p->se.sum_exec_runtime / p->nvcsw) could be used
to measure the task duration. However, long-past history is weighted
too heavily in such a formula. Ideally, old activity should decay
and not affect the current status too much.
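
A minimal sketch of such a decaying average, assuming the 1/8 weighting
that the scheduler's update_avg() helper uses today (the helper itself
is reused by this patch, see the diff below):

	static void ewma_update(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		/* avg <- 7/8 * avg + 1/8 * sample: old samples decay geometrically */
		*avg += diff / 8;
	}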

Although something based on PELT could be used, se.util_avg might
not be appropriate to describe the task duration: if task p1 and
task p2 are doing frequent ping-pong scheduling on one CPU, both p1
and p2 have a short duration, but the util_avg of each can be up to
50%, which is inconsistent with the task duration.

It was found that there was once a similar feature to track the
duration of a task:
commit ad4b78bbcbab ("sched: Add new wakeup preemption mode: WAKEUP_RUNNING")
Unfortunately, it was reverted because it was an experiment. Pick the
patch up again, by recording the average duration when a task voluntarily
switches out.

For example, suppose task p1 and task p2 run alternately on CPU1:

                                 --------------------> time

 | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
 ^             ^               ^
 |_____________|               |_____________________________________|
                                                                      ^
                                                                      |
                                                                 p1 dequeued

p1's duration in one section is (1 + 0.5) ms, because if p2 had not
preempted p1, p1 could have run for 1.5 ms. This reflects the nature
of a task: how long it wishes to run at most.
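
With the accounting below, this is exactly what gets recorded:
p->se.sum_exec_runtime only advances while p1 runs, so at the dequeue
point dur = sum_exec_runtime - prev_sleep_sum_runtime = 1 ms + 0.5 ms
= 1.5 ms (the preemption in the middle does not reset the measurement),
and prev_sleep_sum_runtime is then advanced for the next section.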

Suggested-by: Tim Chen <[email protected]>
Suggested-by: Vincent Guittot <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 13 +++++++++++++
4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4df2b3e76b30..e21709402a31 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -557,6 +557,9 @@ struct sched_entity {
 	u64				prev_sum_exec_runtime;
 
 	u64				nr_migrations;
+	u64				prev_sleep_sum_runtime;
+	/* average duration of a task */
+	u64				dur_avg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 03b8529db73f..b805c5bdc7ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4379,6 +4379,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.dur_avg			= 0;
+	p->se.prev_sleep_sum_runtime	= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	__PS("nr_involuntary_switches", p->nivcsw);
 
 	P(se.load.weight);
+	P(se.dur_avg);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
 	P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4db72f8f84e..aa16611c7263 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6271,6 +6271,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+	u64 dur;
+
+	if (!task_sleep)
+		return;
+
+	dur = p->se.sum_exec_runtime - p->se.prev_sleep_sum_runtime;
+	p->se.prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->se.dur_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6343,6 +6355,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	dur_avg_update(p, task_sleep);
 	hrtick_update(rq);
 }

--
2.25.1


2023-02-03 05:17:11

by Chen Yu

Subject: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

[Problem Statement]
For a workload that is doing frequent context switches, the throughput
scales well until the number of instances reaches a peak point. After
that peak point, the throughput drops significantly if the number of
instances continues to increase.

The will-it-scale context_switch1 test case exposes the issue. The
test platform has 112 CPUs per LLC domain. The will-it-scale test
launches 1, 8, 16 ... 112 instances respectively. Each instance is
composed of 2 tasks, and each pair of tasks does ping-pong scheduling
via pipe_read() and pipe_write(). No task is bound to any CPU. It is
found that, once the number of instances is higher than 56 (112 tasks
in total, so every CPU has 1 task), the throughput drops accordingly
if the instance number continues to increase:

           ^
 throughput|
           |                    X
           |                  X   X  X
           |                X          X  X
           |              X                  X
           |            X                        X
           |          X
           |        X
           |      X
           |    X
           |
           +--------------------.---------------------------->
                                56
                                           number of instances

[Symptom analysis]

The performance degradation was caused by a high system idle
percentage (around 20% ~ 30%): the CPUs waste a lot of time being
idle and doing nothing. As a comparison, if CPU affinity is set for
these workloads so that they stop migrating among CPUs, the idle
percentage drops to nearly 0% and the throughput increases a lot.
This indicates that there is room for optimization.

The cause is the race condition between select_task_rq() and
the task enqueue.

Suppose there are nr_cpus pairs of ping-pong scheduling
tasks. For example, p0' and p0 are a ping-pong pair, and so are
p1' <=> p1 and p2' <=> p2. None of these tasks are bound to any CPUs.
The problem can be summarized as: more than one waker is stacked on
one CPU, which slows down the wakeup of their wakees:

CPU0                                    CPU1                                    CPU2

p0'                                     p1' => idle                             p2'

try_to_wake_up(p0)                                                              try_to_wake_up(p2);
CPU1 = select_task_rq(p0);                                                      CPU1 = select_task_rq(p2);
ttwu_queue(p0, CPU1);                                                           ttwu_queue(p2, CPU1);
  __ttwu_queue_wakelist(p0, CPU1);
    WRITE_ONCE(CPU1->ttwu_pending, 1);
    __smp_call_single_queue(CPU1, p0);  => ttwu_list->p0
                                        quitting cpuidle_idle_call()
                                                                                __ttwu_queue_wakelist(p2, CPU1);
                                                                                  WRITE_ONCE(CPU1->ttwu_pending, 1);
                                        ttwu_list->p2->p0 <=                      __smp_call_single_queue(CPU1, p2);

p0' => idle
                                        sched_ttwu_pending()
                                          enqueue_task(p2 and p0)

                                        idle => p2

                                        ...
                                        p2 time slice expires
                                        ...
                                                                                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                                                       <===     !!! p2 delays the wake up of p0' !!!
                                                                                !!! causes long idle on CPU0     !!!
                                        p2 => p0                                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                        p0 wakes up p0'

idle => p0'

Since there are many waker/wakee pairs in the system, the chain reaction
causes many CPUs to be victims. These idle CPUs wait for their waker to
be scheduled.

Tianchen has mentioned the above issue here[1].

[Proposal]
The root cause is that there is no strict synchronization between
select_task_rq() and the setting of the ttwu_pending flag among
several CPUs. And this might be by design, because the scheduler
prefers parallel wakeups.

Avoid this problem indirectly. If the system is busy, and if the waker
and the wakee are both short-duration tasks, wake up the wakee on the
current CPU.

The reason is that, if the waker is a short-duration task, it might
relinquish the CPU soon, so the wakee gets a chance to be scheduled.
On the other hand, if the wakee is a short-duration task, putting it on
a non-idle CPU brings minimal impact to the running task. The mechanism
also requires that there is no idle core in the system, so that it does
not inhibit spreading the tasks when the system is not busy.

This wakeup strategy can be viewed as a dynamic WF_SYNC, except that
WF_SYNC does not treat a non-idle CPU as a candidate CPU.
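
Condensing the main condition added to select_idle_cpu() in this
patch, the placement decision is roughly the following (a sketch with
made-up function and parameter names, not the literal kernel code):

	static bool place_wakee_on_waking_cpu(struct task_struct *waker,
					      struct task_struct *wakee,
					      bool has_idle_core,
					      int nr_scan, int llc_weight,
					      unsigned int waker_nr_running)
	{
		if (has_idle_core)			/* idle cores exist: keep spreading */
			return false;
		if (5 * nr_scan >= 3 * llc_weight)	/* LLC is less than ~50% utilized */
			return false;
		if (waker_nr_running > 1)		/* waking CPU is already stacked */
			return false;
		/* both the waker and the wakee must be short-duration tasks */
		return is_short_task(waker) && is_short_task(wakee);
	}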

[Benchmark results]
The baseline is v6.2-rc1 tip:sched/core, on top of
Commit be06c7d02443a ("cpuidle: Fix poll_idle() noinstr annotation").
The test platform has 2 x 56C/112T and 224 CPUs in total. C-states
deeper than C1E are disabled. Turbo is disabled. CPU frequency governor
is performance.

will-it-scale
=============
case load baseline compare%
context_switch1 224 groups 1.00 +1262.06%

There is a huge improvement in the fast context switch test case,
especially when the number of groups equals the number of CPUs.

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 0.86) -0.16 ( 0.98)
TCP_RR 112-threads 1.00 ( 0.65) +0.24 ( 0.50)
TCP_RR 168-threads 1.00 ( 5.87) +4.99 ( 4.81)
TCP_RR 224-threads 1.00 ( 4.63) +687.45 ( 3.77)
TCP_RR 280-threads 1.00 ( 9.91) +0.39 ( 13.05)
TCP_RR 336-threads 1.00 ( 21.27) +0.08 ( 15.32)
TCP_RR 392-threads 1.00 ( 40.60) +20.30 ( 30.95)
TCP_RR 448-threads 1.00 ( 29.85) +0.05 ( 33.18)
UDP_RR 56-threads 1.00 ( 2.46) -0.19 ( 2.50)
UDP_RR 112-threads 1.00 ( 12.59) +0.00 ( 12.31)
UDP_RR 168-threads 1.00 ( 18.55) +6.33 ( 62.39)
UDP_RR 224-threads 1.00 ( 13.31) +131.74 ( 22.13)
UDP_RR 280-threads 1.00 ( 32.69) -0.54 ( 23.84)
UDP_RR 336-threads 1.00 ( 28.52) +0.14 ( 26.45)
UDP_RR 392-threads 1.00 ( 26.80) +0.23 ( 32.52)
UDP_RR 448-threads 1.00 ( 41.65) -0.20 ( 42.97)

There is a significant 600+% improvement for TCP_RR and 100+% for
UDP_RR when the number of threads equals the number of CPUs.

tbench
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 1.05) +0.56 ( 0.48)
loopback 112-threads 1.00 ( 1.04) +0.83 ( 0.65)
loopback 168-threads 1.00 ( 29.60) +37.37 ( 33.77)
loopback 224-threads 1.00 ( 0.22) +0.11 ( 0.02)
loopback 280-threads 1.00 ( 0.04) -0.11 ( 0.06)
loopback 336-threads 1.00 ( 0.10) -0.11 ( 0.11)
loopback 392-threads 1.00 ( 0.42) -0.06 ( 0.05)
loopback 448-threads 1.00 ( 0.08) +0.19 ( 0.07)

There is no noticeable impact on tbench. There is run-to-run variance
in the 168-threads case, with or without this patch applied; it might
be another issue and needs to be investigated later.

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 11.63) +10.24 ( 17.85)
process-sockets 1-groups 1.00 ( 15.36) -17.58 ( 20.96)
threads-pipe 1-groups 1.00 ( 2.78) +2.86 ( 4.14)
threads-sockets 1-groups 1.00 ( 1.44) -0.57 ( 1.09)
process-pipe 2-groups 1.00 ( 4.93) -3.48 ( 8.04)
process-sockets 2-groups 1.00 ( 2.44) -3.76 ( 2.34)
threads-pipe 2-groups 1.00 ( 2.26) +4.36 ( 1.77)
threads-sockets 2-groups 1.00 ( 2.50) +1.86 ( 4.46)
process-pipe 4-groups 1.00 ( 1.97) +13.06 ( 7.60)
process-sockets 4-groups 1.00 ( 0.11) -0.57 ( 0.66)
threads-pipe 4-groups 1.00 ( 2.48) +1.90 ( 3.81)
threads-sockets 4-groups 1.00 ( 2.41) +0.28 ( 1.78)
process-pipe 8-groups 1.00 ( 1.45) +1.62 ( 0.13)
process-sockets 8-groups 1.00 ( 0.21) +0.05 ( 0.33)
threads-pipe 8-groups 1.00 ( 0.36) -1.19 ( 0.85)
threads-sockets 8-groups 1.00 ( 0.39) +1.03 ( 0.33)

Overall there is no noticeable impact on hackbench. There is a large
run-to-run variance when the load is low, with or without this patch
applied. Similar to tbench, this issue needs to be investigated too.

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.53) -0.22 ( 1.33)
normal 2-mthreads 1.00 ( 3.04) -3.53 ( 4.88)
normal 4-mthreads 1.00 ( 2.28) +1.73 ( 1.66)
normal 8-mthreads 1.00 ( 2.18) -1.75 ( 1.42)

There should be no impact on schbench in theory, because the default task
duration of schbench is 30 ms, which is much longer than the short task
threshold.

[Limitations/Miscellaneous]

[a]
Peter has suggested[2] comparing the task duration with the cost of
searching for an idle CPU: if the latter is higher, then give up the
scan, to achieve better task affinity. However, this method does not
fit the case encountered by this patch. Because there are plenty of
(fast-)idle CPUs in the system, it does not take too long to find an
idle CPU. The bottleneck is caused by the race condition mentioned
above.

[b]
The short task threshold is sysctl_sched_min_granularity / 8.
According to get_update_sysctl_factor(), the sysctl_sched_min_granularity
could be 0.75 msec * 4 for SCHED_TUNABLESCALING_LOG,
or 0.75 msec * ncpus for SCHED_TUNABLESCALING_LINEAR.
Choosing 8 as the divisor is a trade-off. Thanks Honglei for pointing
this out.
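
For example, with the SCHED_TUNABLESCALING_LOG factor of 4 mentioned
above, sysctl_sched_min_granularity is 0.75 msec * 4 = 3 msec, so a
task is treated as short when its dur_avg is below 3 msec / 8 =
0.375 msec. With SCHED_TUNABLESCALING_LINEAR on a 224-CPU system the
threshold would instead be 0.75 msec * 224 / 8 = 21 msec.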

[c]
SIS_SHORT leverages SIS_UTIL to do better task placement. If the scan
number suggested by SIS_UTIL is smaller than 60% of llc_weight, it
indicates that the util_avg% of the LLC domain is higher than 50%.
A 50% util_avg indicates a half-busy LLC domain, which serves as a
double confirmation together with !has_idle_core, so as not to stack
tasks while the system still has idle CPUs. A system busier than this
could lower its bar to choose a compromised "idle" CPU.
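
(The "smaller than 60% of llc_weight" check is written with integer
arithmetic in the hunk below: 5 * nr < 3 * sd->span_weight is the same
as nr < 0.6 * llc_weight.)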

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/Y2O8a%[email protected]/

Suggested-by: Tim Chen <[email protected]>
Suggested-by: K Prateek Nayak <[email protected]>
Tested-by: kernel test robot <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 26 ++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 27 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa16611c7263..d50097e5fcc1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+/*
+ * If a task switches in and then voluntarily relinquishes the
+ * CPU quickly, it is regarded as a short duration task.
+ *
+ * SIS_SHORT tries to wake up the short wakee on current CPU. This
+ * aims to avoid race condition among CPUs due to frequent context
+ * switch.
+ */
+static inline int is_short_task(struct task_struct *p)
+{
+	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
+	       ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (available_idle_cpu(prev_cpu))
 		return prev_cpu;
 
+	/* The only running task is a short duration one. */
+	if (cpu_rq(this_cpu)->nr_running == 1 &&
+	    is_short_task(rcu_dereference(cpu_curr(this_cpu))))
+		return this_cpu;
+
 	return nr_cpumask_bits;
 }
 
@@ -6899,6 +6918,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 			/* overloaded LLC is unlikely to have idle cpu/core */
 			if (nr == 1)
 				return -1;
+
+			if (!has_idle_core && this == target &&
+			    (5 * nr < 3 * sd->span_weight) &&
+			    cpu_rq(target)->nr_running <= 1 &&
+			    is_short_task(p) &&
+			    is_short_task(rcu_dereference(cpu_curr(target))))
+				return target;
 		}
 	}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..efdc29c42161 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SHORT, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1


2023-02-16 12:56:09

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

Hi Chen,

I've tested this patchset (with modification) on our Redis proxy
servers, and the results seem promising.

On 2/3/23 1:18 PM, Chen Yu wrote:
> ...
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa16611c7263..d50097e5fcc1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
> return 1;
> }
>
> +/*
> + * If a task switches in and then voluntarily relinquishes the
> + * CPU quickly, it is regarded as a short duration task.
> + *
> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
> + * aims to avoid race condition among CPUs due to frequent context
> + * switch.
> + */
> +static inline int is_short_task(struct task_struct *p)
> +{
> + return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> + ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
> +}

I changed the factor to fit into the shape of tasks in question.

    static inline int is_short_task(struct task_struct *p)
    {
        u64 dur = sysctl_sched_min_granularity / 8;

        if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
            return false;

        /*
         * Bare tracepoint to allow dynamically changing
         * the threshold.
         */
        trace_sched_short_task_tp(p, &dur);

        return p->se.dur_avg < dur;
    }

I'm not sure it is the right way to provide such flexibility, but
definition of 'short' can be workload specific.

> +
> /*
> * The purpose of wake_affine() is to quickly determine on which CPU we can run
> * soonest. For the purpose of speed we only consider the waking and previous
> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> if (available_idle_cpu(prev_cpu))
> return prev_cpu;
>
> + /* The only running task is a short duration one. */
> + if (cpu_rq(this_cpu)->nr_running == 1 &&
> + is_short_task(rcu_dereference(cpu_curr(this_cpu))))
> + return this_cpu;

Since the proxy server handles simple data delivery, the tasks are
generally short running ones and hate task stacking, which may
introduce scheduling latency (even when there are only 2 short tasks
competing with each other). So this part brings a slight regression on
the proxy case. But I still think this is good for most cases.

Speaking of task stacking, I found wake_affine_weight() can be
much more dangerous. It chooses the less loaded one between the
prev & this cpu as a candidate, so 'small' tasks can be easily
stacked on this cpu when waking up several tasks at one time if
this cpu is unloaded. This really hurts if the 'small' tasks are
latency-sensitive, although wake_affine_weight() does the right
thing from the point of view of 'load'.

The following change greatly reduced the p99lat of Redis service
from 150ms to 0.9ms, at exactly the same throughput (QPS).

@@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
+	if (is_short_task(p))
+		return nr_cpumask_bits;
+
 	this_eff_load = cpu_load(cpu_rq(this_cpu));
 
 	if (sync) {

I know that 'short' tasks are not necessarily 'small' tasks, e.g.
their sleeping duration is small or they have large weights, but this
works really well for this case. This is partly because delivering
data is memory bandwidth intensive and hence prefers cache hot cpus.
And I think this is also applicable for general purposes: do NOT let
the short running tasks suffer from cache misses caused by migration.

Best regards,
Abel

2023-02-16 15:27:57

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

Hi Abel,
On 2023-02-16 at 20:55:29 +0800, Abel Wu wrote:
> Hi Chen,
>
> I've tested this patchset (with modification) on our Redis proxy
> servers, and the results seems promising.
>
> On 2/3/23 1:18 PM, Chen Yu wrote:
> > ...
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index aa16611c7263..d50097e5fcc1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
> > return 1;
> > }
> > +/*
> > + * If a task switches in and then voluntarily relinquishes the
> > + * CPU quickly, it is regarded as a short duration task.
> > + *
> > + * SIS_SHORT tries to wake up the short wakee on current CPU. This
> > + * aims to avoid race condition among CPUs due to frequent context
> > + * switch.
> > + */
> > +static inline int is_short_task(struct task_struct *p)
> > +{
> > + return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> > + ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
> > +}
>
> I changed the factor to fit into the shape of tasks in question.
>
> static inline int is_short_task(struct task_struct *p)
> {
> u64 dur = sysctl_sched_min_granularity / 8;
>
> if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
> return false;
>
> /*
> * Bare tracepoint to allow dynamically changing
> * the threshold.
> */
> trace_sched_short_task_tp(p, &dur);
>
> return p->se.dur_avg < dur;
> }
>
> I'm not sure it is the right way to provide such flexibility, but
> definition of 'short' can be workload specific.
>
I'm not sure either.
> > +
> > /*
> > * The purpose of wake_affine() is to quickly determine on which CPU we can run
> > * soonest. For the purpose of speed we only consider the waking and previous
> > @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > if (available_idle_cpu(prev_cpu))
> > return prev_cpu;
> > + /* The only running task is a short duration one. */
> > + if (cpu_rq(this_cpu)->nr_running == 1 &&
> > + is_short_task(rcu_dereference(cpu_curr(this_cpu))))
> > + return this_cpu;
>
> Since proxy server handles simple data delivery, the tasks are
> generally short running ones and hate task stacking which may
> introduce scheduling latency (even there are only 2 short tasks
> competing each other). So this part brings slight regression on
> the proxy case. But I still think this is good for most cases.
>
> Speaking of task stacking, I found wake_affine_weight() can be
> much more dangerous. It chooses the less loaded one between the
> prev & this cpu as a candidate, so 'small' tasks can be easily
> stacked on this cpu when wake up several tasks at one time if
> this cpu is unloaded. This really hurts if the 'small' tasks are
> latency-sensitive, although wake_affine_weight() does the right
> thing from the point of view of 'load'.
>
> The following change greatly reduced the p99lat of Redis service
> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>
> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> task_struct *p,
> s64 this_eff_load, prev_eff_load;
> unsigned long task_load;
>
> + if (is_short_task(p))
> + return nr_cpumask_bits;
> +
So above change wants to wake up the short task on its previous
CPU if I understand correctly.
> this_eff_load = cpu_load(cpu_rq(this_cpu));
>
> if (sync) {
>
> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> sleeping duration is small or have large weights, but this works
> really well for this case. This is partly because delivering data
> is memory bandwidth intensive hence prefer cache hot cpus. And I
> think this is also applicable to the general purposes: do NOT let
> the short running tasks suffering from cache misses caused by
> migration.
>
I see. My original thought was to mitigate short task migration
as much as possible. Waking up the task on either the current CPU or
the previous CPU should achieve that goal in theory. Could you please
describe a little more about how the Redis proxy server was tested?
Was it tested locally or using multiple machines? I ask this because,
for network benchmarks, it might be better to wake the task close to
the waker (maybe the NIC interrupt) due to the hot network buffer.
Anyway I will test a slightly adjusted version of your change to see
the impact, and also Redis. But it would be even better if you could
provide some simple test steps I can try locally : )

thanks,
Chenyu
> Best regards,
> Abel

2023-02-17 02:45:08

by Abel Wu

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

On 2/16/23 11:24 PM, Chen Yu wrote:
>> The following change greatly reduced the p99lat of Redis service
>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>
>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
>> task_struct *p,
>> s64 this_eff_load, prev_eff_load;
>> unsigned long task_load;
>>
>> + if (is_short_task(p))
>> + return nr_cpumask_bits;
>> +
> So above change wants to wake up the short task on its previous
> CPU if I understand correctly.

Yes.

>> this_eff_load = cpu_load(cpu_rq(this_cpu));
>>
>> if (sync) {
>>
>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>> sleeping duration is small or have large weights, but this works
>> really well for this case. This is partly because delivering data
>> is memory bandwidth intensive hence prefer cache hot cpus. And I
>> think this is also applicable to the general purposes: do NOT let
>> the short running tasks suffering from cache misses caused by
>> migration.
>>
> I see. My original thought was to mitigate short task migration
> as much as possible. Either waking up the task on current CPU or previous
> CPU should both achieve the goal in theory. Could you please describe
> a little more about how Redis proxy server was tested? Was it tested
> locally or using multiple machines? I asked this because for network
> benchmarks, it might be better to wake the task close to the waker(maybe
> the NIC interrupt) due to hot network buffer. Anyway I will test
> your change slightly changed to see the impact, and also Redis. But it
> would be even better if you could provide some simple test steps I can
> try locally : )

Sorry for missing the info. The test was done in a production
environment, and what I have done is only update the kernel on several
machines which are highly loaded, that is, over 85% cpu util observed
by mpstat. Please let me know if you want any specific info.

Best,
Abel

2023-02-17 08:37:50

by Honglei Wang

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU



On 2023/2/16 20:55, Abel Wu wrote:
> Hi Chen,
>
> I've tested this patchset (with modification) on our Redis proxy
> servers, and the results seems promising.
>
> On 2/3/23 1:18 PM, Chen Yu wrote:
>> ...
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aa16611c7263..d50097e5fcc1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
>>       return 1;
>>   }
>> +/*
>> + * If a task switches in and then voluntarily relinquishes the
>> + * CPU quickly, it is regarded as a short duration task.
>> + *
>> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
>> + * aims to avoid race condition among CPUs due to frequent context
>> + * switch.
>> + */
>> +static inline int is_short_task(struct task_struct *p)
>> +{
>> +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
>> +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
>> +}
>
> I changed the factor to fit into the shape of tasks in question.
>
>     static inline int is_short_task(struct task_struct *p)
>     {
>         u64 dur = sysctl_sched_min_granularity / 8;
>
>         if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
>             return false;
>
>         /*
>          * Bare tracepoint to allow dynamically changing
>          * the threshold.
>          */
>         trace_sched_short_task_tp(p, &dur);
>
>         return p->se.dur_avg < dur;
>     }
>
> I'm not sure it is the right way to provide such flexibility, but
> definition of 'short' can be workload specific.
>
>> +
>>   /*
>>    * The purpose of wake_affine() is to quickly determine on which CPU
>> we can run
>>    * soonest. For the purpose of speed we only consider the waking and
>> previous
>> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
>> int sync)
>>       if (available_idle_cpu(prev_cpu))
>>           return prev_cpu;
>> +    /* The only running task is a short duration one. */
>> +    if (cpu_rq(this_cpu)->nr_running == 1 &&
>> +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
>> +        return this_cpu;
>
> Since proxy server handles simple data delivery, the tasks are
> generally short running ones and hate task stacking which may
> introduce scheduling latency (even there are only 2 short tasks
> competing each other). So this part brings slight regression on
> the proxy case. But I still think this is good for most cases.
>
> Speaking of task stacking, I found wake_affine_weight() can be
> much more dangerous. It chooses the less loaded one between the
> prev & this cpu as a candidate, so 'small' tasks can be easily
> stacked on this cpu when wake up several tasks at one time if
> this cpu is unloaded. This really hurts if the 'small' tasks are
> latency-sensitive, although wake_affine_weight() does the right
> thing from the point of view of 'load'.
>
> The following change greatly reduced the p99lat of Redis service
> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>
> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> task_struct *p,
>     s64 this_eff_load, prev_eff_load;
>     unsigned long task_load;
>
> +    if (is_short_task(p))
> +        return nr_cpumask_bits;
> +
>     this_eff_load = cpu_load(cpu_rq(this_cpu));
>
>     if (sync) {
>
> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> sleeping duration is small or have large weights, but this works
> really well for this case. This is partly because delivering data
> is memory bandwidth intensive hence prefer cache hot cpus. And I
> think this is also applicable to the general purposes: do NOT let
> the short running tasks suffering from cache misses caused by
> migration.
>

Redis is a bit special. It runs quickly and is really sensitive to
scheduling latency. The purpose of this 'short task' feature from Yu
is to mitigate the migration and tend to place the woken task on the
local cpu, which is somewhat the opposite of what workloads such as
Redis want. The changes you did remind me of the latency-prio stuff.
Maybe we can do something based on both the 'short task' and
'latency-prio' to make your changes more general. Thoughts?

Thanks,
Honglei

> Best regards,
>     Abel

2023-02-17 10:40:44

by Abel Wu

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

Hi Honglei,

On 2/17/23 4:35 PM, Honglei Wang wrote:
>> The following change greatly reduced the p99lat of Redis service
>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>
>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd,
>> struct task_struct *p,
>>      s64 this_eff_load, prev_eff_load;
>>      unsigned long task_load;
>>
>> +    if (is_short_task(p))
>> +        return nr_cpumask_bits;
>> +
>>      this_eff_load = cpu_load(cpu_rq(this_cpu));
>>
>>      if (sync) {
>>
>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>> sleeping duration is small or have large weights, but this works
>> really well for this case. This is partly because delivering data
>> is memory bandwidth intensive hence prefer cache hot cpus. And I
>> think this is also applicable to the general purposes: do NOT let
>> the short running tasks suffering from cache misses caused by
>> migration.
>>
>
> Redis is a bit special. It runs quick and really sensitive on schedule
> latency. The purpose of this 'short task' feature from Yu is to mitigate
> the migration and tend to place the waking task on local cpu, this is
> somehow on the opposite side of workload such as Redis. The changes you
> did remind me of the latency-prio stuff. Maybe we can do something base
> on both the 'short task' and 'latency-prio' to make your changes more
> general. thoughts?

I think it is more of an enhancement than a conflict. Chen Yu's patch
treats the cpus with only one short task as idle, to make the idle cpu
scan more efficient. So if this cpu is such an 'idle' cpu, just choose
it. What I suggested is to ignore this cpu if it is not idle.

But as you pointed out, Redis is a bit special in being sensitive to
scheduling latency, so the change in wake_affine_weight() may be
inappropriate, as 'weight' implies more about throughput than latency.

Best Regards,
Abel

2023-02-17 19:36:07

by K Prateek Nayak

Subject: Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

Hello Chenyu and Abel,

I'll leave the detailed results from testing on a dual socket Zen3 system
(2 x 64C/128T) below.

tl;dr

o Most benchmark results see small wins or are comparable to tip.
o SpecJBB Max-jOPS sees a small hit but Critical-jOPS improves.
o ycsb-mongodb sees a small uplift in NPS1 mode.
o Numbers for netperf runs are pending, which I'll share in the
coming week.
o Abel's suggestion on top of v5 seems promising, but there are a
few regressions I noticed on larger workloads.

Detailed Results:

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255

Benchmark Results:

Kernel versions:
- tip: 6.2.0-rc6 tip sched/core
- sis_short: 6.2.0-rc6 tip sched/core + this series

When the testing started, the tip was at:
commit 4d627628d758 "cpuidle: Fix poll_idle() noinstr annotation"

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip sis_short
1-groups: 4.38 (0.00 pct) 4.49 (-2.51 pct)
2-groups: 5.12 (0.00 pct) 5.20 (-1.56 pct)
4-groups: 4.21 (0.00 pct) 4.24 (-0.71 pct)
8-groups: 4.68 (0.00 pct) 4.73 (-1.06 pct)
16-groups: 6.13 (0.00 pct) 6.35 (-3.58 pct)

o NPS2

Test: tip sis_short
1-groups: 4.51 (0.00 pct) 4.36 (3.32 pct)
2-groups: 4.31 (0.00 pct) 4.35 (0.92 pct)
4-groups: 4.17 (0.00 pct) 4.08 (2.15 pct)
8-groups: 4.58 (0.00 pct) 4.49 (1.96 pct)
16-groups: 5.74 (0.00 pct) 5.93 (-3.31 pct)

o NPS4

Test: tip sis_short
1-groups: 4.47 (0.00 pct) 4.51 (-0.89 pct)
2-groups: 4.97 (0.00 pct) 5.04 (-1.40 pct)
4-groups: 4.26 (0.00 pct) 4.28 (-0.46 pct)
8-groups: 5.46 (0.00 pct) 5.56 (-1.83 pct)
16-groups: 6.38 (0.00 pct) 6.10 (4.38 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers: tip sis_short
1: 36.00 (0.00 pct) 27.00 (25.00 pct)
2: 37.00 (0.00 pct) 32.00 (13.51 pct)
4: 41.00 (0.00 pct) 34.00 (17.07 pct)
8: 46.00 (0.00 pct) 43.00 (6.52 pct)
16: 66.00 (0.00 pct) 66.00 (0.00 pct)
32: 111.00 (0.00 pct) 108.00 (2.70 pct)
64: 207.00 (0.00 pct) 206.00 (0.48 pct)
128: 483.00 (0.00 pct) 481.00 (0.41 pct)
256: 46272.00 (0.00 pct) 45120.00 (2.48 pct)
512: 76160.00 (0.00 pct) 77696.00 (-2.01 pct)

o NPS2

#workers: tip sis_short
1: 33.00 (0.00 pct) 31.00 (6.06 pct)
2: 35.00 (0.00 pct) 31.00 (11.42 pct)
4: 38.00 (0.00 pct) 38.00 (0.00 pct)
8: 51.00 (0.00 pct) 47.00 (7.84 pct)
16: 64.00 (0.00 pct) 67.00 (-4.68 pct)
32: 118.00 (0.00 pct) 116.00 (1.69 pct)
64: 214.00 (0.00 pct) 217.00 (-1.40 pct)
128: 497.00 (0.00 pct) 504.00 (-1.40 pct)
256: 45632.00 (0.00 pct) 44352.00 (2.80 pct)
512: 81024.00 (0.00 pct) 78464.00 (3.15 pct)

o NPS4

#workers: tip sis_short
1: 33.00 (0.00 pct) 32.00 (3.03 pct)
2: 40.00 (0.00 pct) 32.00 (20.00 pct)
4: 42.00 (0.00 pct) 38.00 (9.52 pct)
8: 64.00 (0.00 pct) 65.00 (-1.56 pct)
16: 73.00 (0.00 pct) 69.00 (5.47 pct)
32: 112.00 (0.00 pct) 112.00 (0.00 pct)
64: 215.00 (0.00 pct) 207.00 (3.72 pct)
128: 615.00 (0.00 pct) 593.00 (3.73 pct)
256: 46144.00 (0.00 pct) 45376.00 (1.66 pct)
512: 78208.00 (0.00 pct) 77696.00 (0.65 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip sis_short
1 536.78 (0.00 pct) 537.38 (0.11 pct)
2 1050.74 (0.00 pct) 1058.74 (0.76 pct)
4 1993.47 (0.00 pct) 1976.79 (-0.83 pct)
8 3498.02 (0.00 pct) 3657.16 (4.54 pct)
16 6202.01 (0.00 pct) 6014.62 (-3.02 pct)
32 11544.55 (0.00 pct) 11847.47 (2.62 pct)
64 21828.75 (0.00 pct) 21754.85 (-0.33 pct)
128 31095.92 (0.00 pct) 31643.35 (1.76 pct)
256 54828.12 (0.00 pct) 55432.29 (1.10 pct)
512 54888.10 (0.00 pct) 55917.91 (1.87 pct)
1024 54916.75 (0.00 pct) 53468.79 (-2.63 pct)

o NPS2

Clients: tip sis_short
1 543.08 (0.00 pct) 544.49 (0.25 pct)
2 1074.55 (0.00 pct) 1060.33 (-1.32 pct)
4 1980.75 (0.00 pct) 1992.86 (0.61 pct)
8 3628.36 (0.00 pct) 3507.73 (-3.32 pct)
16 5806.00 (0.00 pct) 5790.82 (-0.26 pct)
32 11351.94 (0.00 pct) 10937.21 (-3.26 pct)
64 19987.40 (0.00 pct) 20739.38 (3.76 pct)
128 29554.40 (0.00 pct) 30011.99 (1.54 pct)
256 53594.11 (0.00 pct) 51473.78 (-3.95 pct)
512 54304.03 (0.00 pct) 52998.31 (-2.40 pct)
1024 54338.25 (0.00 pct) 53265.51 (-1.97 pct)

o NPS4

Clients: tip sis_short
1 541.29 (0.00 pct) 536.21 (-0.93 pct)
2 1045.15 (0.00 pct) 1054.94 (0.93 pct)
4 1973.01 (0.00 pct) 1988.63 (0.79 pct)
8 3490.55 (0.00 pct) 3535.27 (1.28 pct)
16 5920.12 (0.00 pct) 5846.04 (-1.25 pct)
32 10933.38 (0.00 pct) 10944.33 (0.10pct)
64 19628.34 (0.00 pct) 19328.66 (1.01 pct)
128 29785.23 (0.00 pct) 28749.48 (-4.55 pct)
256 51999.72 (0.00 pct) 51336.20 (-1.27 pct)
512 53619.42 (0.00 pct) 53269.04 (-0.65 pct)
1024 53956.57 (0.00 pct) 53666.14 (-0.53 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

10 Runs:

Test: tip sis_short
Copy: 320576.16 (0.00 pct) 328194.56 (2.37 pct)
Scale: 212869.80 (0.00 pct) 216713.96 (1.80 pct)
Add: 241556.74 (0.00 pct) 247467.26 (2.44 pct)
Triad: 250637.58 (0.00 pct) 245538.49 (-2.03 pct)

100 Runs:

Test: tip sis_short
Copy: 330058.38 (0.00 pct) 329339.60 (-0.21 pct)
Scale: 216475.85 (0.00 pct) 219334.10 (1.32 pct)
Add: 243028.82 (0.00 pct) 244037.77 (0.41 pct)
Triad: 252907.98 (0.00 pct) 257210.37 (1.70 pct)

o NPS2

10 Runs:

Test: tip sis_short
Copy: 339946.34 (0.00 pct) 327261.79 (-3.73 pct)
Scale: 217453.46 (0.00 pct) 221366.66 (1.79 pct)
Add: 258099.63 (0.00 pct) 258472.44 (0.14 pct)
Triad: 264974.76 (0.00 pct) 262618.99 (-0.88 pct)

100 Runs:

Test: tip sis_short
Copy: 335725.30 (0.00 pct) 320797.67 (-4.44 pct)
Scale: 229985.45 (0.00 pct) 221706.62 (-3.59 pct)
Add: 260546.33 (0.00 pct) 250668.80 (-3.79 pct)
Triad: 267925.27 (0.00 pct) 262959.86 (-1.85 pct)

o NPS4

10 Runs:

Test: tip sis_short
Copy: 369037.34 (0.00 pct) 371514.46 (0.67 pct)
Scale: 238235.39 (0.00 pct) 237661.29 (-0.24 pct)
Add: 263626.48 (0.00 pct) 263436.20 (-0.07 pct)
Triad: 280881.43 (0.00 pct) 288059.52 (2.55 pct)

100 Runs:

Test: tip sis_short
Copy: 339036.66 (0.00 pct) 346904.09 (2.32 pct)
Scale: 246638.02 (0.00 pct) 230195.65 (-6.66 pct)
Add: 259898.86 (0.00 pct) 244631.77 (-5.87 pct)
Triad: 265719.02 (0.00 pct) 264620.50 (-0.41 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip : 133514.00 (var: 2.07%)
sis-short : 137664.67 (var: 1.45%) (3.11%)

o NPS2:

tip : 132193.33 (var: 1.46%)
sis-short : 131189.33 (var: 1.69%) (-0.75%)

o NPS4:

tip : 133285.67 (var: 1.77%)
sis-short : 133891.33 (var: 1.58%) (0.45%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48665321.00 ( 0.00%) 48553432.30 ( -0.23%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6281376826.80 ( 0.00%) 6277335150.50 ( -0.06%)
unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2682044.73 * 0.26%*
unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7290524.47 * -0.84%*
unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2426076.17 * -1.70%*
unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 293462420.03 * -0.84%*
unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4229.59 ( 1.56%)
unixbench-spawn Hmean unixbench-spawn-512 79950.80 ( 0.00%) 76439.30 ( -4.39%)
unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4151.37 ( 0.95%)
unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11756.46 ( -0.25%)

o NPS2

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49671827.09 ( 0.00%) 49077076.00 ( -1.20%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6282239821.90 ( 0.00%) 6283671307.30 ( 0.02%)
unixbench-syscall Amean unixbench-syscall-1 2688504.20 ( 0.00%) 2676278.60 * 0.45%*
unixbench-syscall Amean unixbench-syscall-512 7321621.07 ( 0.00%) 7784926.60 * 6.33%*
unixbench-pipe Hmean unixbench-pipe-1 2469941.97 ( 0.00%) 2419584.09 * -2.04%*
unixbench-pipe Hmean unixbench-pipe-512 296146392.10 ( 0.00%) 293156913.86 * -1.01%*
unixbench-spawn Hmean unixbench-spawn-1 5029.05 ( 0.00%) 5015.18 ( -0.28%)
unixbench-spawn Hmean unixbench-spawn-512 77198.79 ( 0.00%) 80409.23 * 4.16%*
unixbench-execl Hmean unixbench-execl-1 4092.59 ( 0.00%) 4158.36 * 1.61%*
unixbench-execl Hmean unixbench-execl-512 12293.67 ( 0.00%) 12169.31 ( -1.01%)

o NPS4

Test Metric Parallelism tip sis_short
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48944542.05 ( 0.00%) 49490899.03 * 1.12%*
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6291259625.50 ( 0.00%) 6299305899.90 ( 0.13%)
unixbench-syscall Amean unixbench-syscall-1 2686991.73 ( 0.00%) 2682940.53 * 0.15%*
unixbench-syscall Amean unixbench-syscall-512 7902201.47 ( 0.00%) 7931906.47 ( -0.38%)
unixbench-pipe Hmean unixbench-pipe-1 2468813.43 ( 0.00%) 2422272.88 * -1.89%*
unixbench-pipe Hmean unixbench-pipe-512 297109244.52 ( 0.00%) 294589928.27 * -0.85%*
unixbench-spawn Hmean unixbench-spawn-1 5161.67 ( 0.00%) 5012.58 ( -2.89%)
unixbench-spawn Hmean unixbench-spawn-512 78657.60 ( 0.00%) 78572.80 ( -0.11%)
unixbench-execl Hmean unixbench-execl-1 4112.02 ( 0.00%) 4122.16 ( 0.25%)
unixbench-execl Hmean unixbench-execl-512 13700.99 ( 0.00%) 14173.20 * 3.44%*

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel tip sis_short
Max-jOPS 100% 98.53%
Critical-jOPS 100% 105.61%

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel : tip sis_short
8C/16T : 100.00% 100.54%
16C/32T : 100.00% 100.19%
32C/64T : 100.00% 98.08%
64C/128T : 100.00% 98.34%


--------------- With Abel's suggestion added to v5 ---------------

I've added the hunk suggested by Abel in the thread to the v5 and
following are results for the same set of benchmarks but only for
machine running in NPS1 mode.

sis_short_v5.1: 6.2.0-rc6 tip sched/core + this series + Abel's suggestion

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip sis_short_v5.1
1-groups: 4.38 (0.00 pct) 4.08 (6.84 pct)
2-groups: 5.12 (0.00 pct) 5.10 (0.39 pct)
4-groups: 4.21 (0.00 pct) 4.23 (-0.47 pct)
8-groups: 4.68 (0.00 pct) 4.69 (-0.21 pct)
16-groups: 6.13 (0.00 pct) 5.94 (3.09 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers: tip sis_short_v5.1
1: 36.00 (0.00 pct) 36.00 (0.00 pct)
2: 37.00 (0.00 pct) 39.00 (-5.40 pct)
4: 41.00 (0.00 pct) 40.00 (2.43 pct)
8: 46.00 (0.00 pct) 46.00 (0.00 pct)
16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
32: 111.00 (0.00 pct) 112.00 (-0.90 pct)
64: 207.00 (0.00 pct) 238.00 (-14.97 pct)
64: 227.00 (0.00 pct) 219.00 (3.52 pct)
128: 483.00 (0.00 pct) 494.00 (-2.27 pct)
256: 46272.00 (0.00 pct) 41280.00 (10.78 pct)
512: 78293.00 (0.00 pct) 79325.00 (-1.31 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip sis_short_v5.1
1 536.78 (0.00 pct) 535.90 (-0.16 pct)
2 1050.74 (0.00 pct) 1067.32 (1.57 pct)
4 1993.47 (0.00 pct) 1971.63 (-1.09 pct)
8 3601.77 (0.00 pct) 3599.17 (-0.07 pct)
16 6202.01 (0.00 pct) 6115.08 (-1.40 pct)
32 11544.55 (0.00 pct) 11423.52 (-1.04 pct)
64 21828.75 (0.00 pct) 21403.94 (-1.94 pct)
128 31095.92 (0.00 pct) 30783.55 (-1.00 pct)
256 54828.12 (0.00 pct) 55328.94 (0.91 pct)
512 54888.10 (0.00 pct) 53483.33 (-2.55 pct)
1024 48407.14 (0.00 pct) 48998.95 (1.22 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

10 Runs:

Test: tip sis_short_v5.1
Copy: 320576.16 (0.00 pct) 331810.14 (3.50 pct)
Scale: 212869.80 (0.00 pct) 214725.82 (0.87 pct)
Add: 241556.74 (0.00 pct) 242340.92 (0.32 pct)
Triad: 250637.58 (0.00 pct) 251271.53 (0.25 pct)

100 Runs:

Test: tip sis_short_v5.1
Copy: 330058.38 (0.00 pct) 331966.60 (0.57 pct)
Scale: 216475.85 (0.00 pct) 222777.84 (2.91 pct)
Add: 243028.82 (0.00 pct) 250873.78 (3.22 pct)
Triad: 252907.98 (0.00 pct) 253791.20 (0.34 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip : 133514.00 (var: 2.07%)
sis-short_v5.1 : 129172.67 (var: 2.32%) (-3.25%) **

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test Metric Parallelism tip sis_short_v5.1
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49266026.90 ( 0.00%) 49054799.90 ( -0.43%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6285063007.68 ( 0.00%) 6280424934.15 ( -0.07%)
unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2677968.03 * 0.41%*
unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7354325.40 ( -0.03%)
unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2351117.60 * -4.73%*
unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 295769918.99 ( -0.06%)
unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4331.89 * 4.01%*
unixbench-spawn Hmean unixbench-spawn-512 79626.61 ( 0.00%) 77865.32 * -2.21%*
unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4145.85 ( 0.82%)
unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11935.41 ( 1.27%)

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel tip sis_short_V5.1
Max-jOPS 100% 91.99% ** (-8.01%)
Critical-jOPS 100% 99.29%

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1 - Throughput normalized to baseline (tip)

Kernel : tip sis_short_V5.1
8C/16T : 100.00% 93.75% ** (-6.25%)
16C/32T : 100.00% 100.43%
32C/64T : 100.00% 101.12%
64C/128T : 100.00% 100.21%

o Follow wake_affine_bias() if waker's cpu and prev_cpu are on same LLC?

There are cases with Abel's suggestion where some of the larger
benchmarks regress. I wonder if wake_affine_bias() can still be
considered for short running tasks if the waker's CPU and the
prev_cpu share caches. In the DeathStarBench 8C/16T case, the
services are all pinned to the CPUs of the same MC domain. The
regression observed seems to arise from the missed opportunity
to distribute load among the CPUs sharing the same L3. I do not
have data for this currently, but I'll update the thread with any
findings.
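
One way to express that idea, as a hypothetical variant of Abel's hunk
above (only a sketch to illustrate the question, not something that
has been tested here; cpus_share_cache() is the existing helper):

	/* Skip the load-based bias only when prev_cpu is on another LLC. */
	if (is_short_task(p) && !cpus_share_cache(this_cpu, prev_cpu))
		return nr_cpumask_bits;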

I'll also queue up a Redis run from mmtests to see if I can reproduce
Abel's observations on my system; however, I'm not sure if the
utilization will be high enough to emulate the same scenario as
Abel's prod environment. If the migrations within the same MC

On 2/3/2023 10:47 AM, Chen Yu wrote:
> The main purpose is to avoid unnecessary cross-CPU wakeups. Frequent
> cross-CPU wakeups hurt some workloads significantly, especially on
> high core count systems.
>
> Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU,
> if both the waker and the wakee are short-duration tasks. A
> short-duration task could become a troublemaker on a high-load system,
> because it can cause frequent context switches. So this strategy only
> takes effect when the system is busy. Besides, it is unreasonable to
> inhibit the idle CPU scan when there are still idle CPUs.
>
> The first patch introduces the definition of a short-duration task.
> The second patch leverages it to choose a local CPU for the wakee.
>
> Overall there is a significant performance improvement on an Intel
> 2 x 56C/112T platform, such as will-it-scale (1200+%) and
> netperf (600+%) in some cases, and no noticeable impact on
> schbench, hackbench, tbench and an OLTP workload with a commercial RDBMS.
>
> Test results on other platforms, such as Zen3 and Kunpeng Arm64, are
> welcome. Prateek and Yicong, it would be appreciated if you could give
> this version a try.
>
> Changes since v4:
> 1. Dietmar has commented on the task duration calculation. So refined
> the commit log to reduce confusion.
> 2. Change [PATCH 1/2] to only record the average duration of a task.
> So this change could benefit UTIL_EST_FASTER[1].
> 3. As v4 reported regression on Zen3 and Kunpeng Arm64, add back
> the system average utilization restriction that, if the system
> is not busy, do not enable the short wake up. Above logic has
> shown improvement on Zen3[2].
> 4. Restrict the wakeup target to be current CPU, rather than both
> current CPU and task's previous CPU. This could also benefit
> wakeup optimization from interrupt in the future, which is
> suggested by Yicong.
>
> Changes since v3:
> 1. Honglei and Josh have concern that the threshold of short
> task duration could be too long. Decreased the threshold from
> sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
> and the '8' comes from get_update_sysctl_factor().
> 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> 3. Move the calculation of average duration from put_prev_task_fair()
> to dequeue_task_fair(). Because there is an issue in v3 that,
> put_prev_task_fair() will not be invoked by pick_next_task_fair()
> in fast path, thus the dur_avg could not be updated timely.
> 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
> on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> 5. Move the scan for CPU with short duration task from select_idle_cpu()
> to select_idle_sibling(), because there is no CPU scan involved, per
> Yicong.
>
> Changes since v2:
>
> 1. Peter suggested comparing the duration of waker and the cost to
> scan for an idle CPU: If the cost is higher than the task duration,
> do not waste time finding an idle CPU, choose the local or previous
> CPU directly. A prototype was created based on this suggestion.
> However, according to the test result, this prototype did not inhibit
> the cross-CPU wakeup and did not bring improvement, because the cost
> to find an idle CPU is small in the problematic scenario. The root
> cause of the problem is a race condition between scanning for an idle
> CPU and task enqueue(please refer to the commit log in PATCH 2/2).
> So v3 does not change the core logic of v2, with some refinement based
> on Peter's suggestion.
>
> 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
>
> v4: https://lore.kernel.org/lkml/[email protected]/
> v3: https://lore.kernel.org/lkml/[email protected]/
> v2: https://lore.kernel.org/all/[email protected]/
> v1: https://lore.kernel.org/lkml/[email protected]/
>
> Chen Yu (2):
> sched/fair: Record the average duration of a task
> sched/fair: Introduce SIS_SHORT to wake up short task on current CPU
>
> include/linux/sched.h | 3 +++
> kernel/sched/core.c | 2 ++
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 5 files changed, 46 insertions(+)
>

The netperf results are still pending and I'll update the thread
with them in the coming week. If you would like me to test
or gather some data for a specific workload on the test system,
please do let me know.
--
Thanks and Regards,
Prateek

2023-02-20 04:55:45

by Chen Yu

Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

On 2023-02-17 at 10:44:51 +0800, Abel Wu wrote:
> On 2/16/23 11:24 PM, Chen Yu wrote:
> > > The following change greatly reduced the p99lat of Redis service
> > > from 150ms to 0.9ms, at exactly the same throughput (QPS).
> > >
> > > @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> > > task_struct *p,
> > > s64 this_eff_load, prev_eff_load;
> > > unsigned long task_load;
> > >
> > > + if (is_short_task(p))
> > > + return nr_cpumask_bits;
> > > +
> > So above change wants to wake up the short task on its previous
> > CPU if I understand correctly.
>
> Yes.
>
> > > this_eff_load = cpu_load(cpu_rq(this_cpu));
> > >
> > > if (sync) {
> > >
> > > I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> > > sleeping duration is small or have large weights, but this works
> > > really well for this case. This is partly because delivering data
> > > is memory bandwidth intensive hence prefer cache hot cpus. And I
> > > think this is also applicable to the general purposes: do NOT let
> > > the short running tasks suffering from cache misses caused by
> > > migration.
> > >
> > I see. My original thought was to mitigate short task migration
> > as much as possible. Either waking up the task on current CPU or previous
> > CPU should both achieve the goal in theory. Could you please describe
> > a little more about how Redis proxy server was tested? Was it tested
> > locally or using multiple machines? I asked this because for network
> > benchmarks, it might be better to wake the task close to the waker(maybe
> > the NIC interrupt) due to hot network buffer. Anyway I will test
> > your change slightly changed to see the impact, and also Redis. But it
> > would be even better if you could provide some simple test steps I can
> > try locally : )
>
> Sorry for missing the info. The test was done in production environment,
> and what I have done is only updating the kernel in several machines
> which are highly loaded, that is over 85% cpu util observed by mpstat.
> Please let me know if you want any specific info.
>
I've modified the code a little bit, inspired by your statement
"'small' tasks can be easily stacked on this cpu when waking up several tasks
at one time if this cpu is unloaded". I added an extra check on wakee_flips, so
that tasks waking up too many different tasks will not be treated as 'small'
ones. That is to say, only tasks that wake up each other exclusively will be
put together. So far this change has been verified to keep the improvement for
netperf/will-it-scale, and I just launched the Redis test to see what will
happen.
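
A minimal sketch of the idea, to make it concrete: the threshold value and
the exact placement of the check below are illustrative assumptions rather
than the code of the next version; only p->se.dur_avg and the p->wakee_flips
statistic (already maintained for wake_wide()) come from the existing code.

    /*
     * Sketch only: a task that keeps flipping between many different
     * wakees is not treated as a short-duration one, so it will not be
     * aggregated on the waking CPU. The threshold '8' is an example.
     */
    static inline int is_short_task(struct task_struct *p)
    {
        if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
            return 0;

        /* Waking too many different tasks: skip the short-task path. */
        if (p->wakee_flips > 8)
            return 0;

        return (p->se.dur_avg * 8) < sysctl_sched_min_granularity;
    }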

thanks,
Chenyu

2023-02-20 04:58:52

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

On 2023-02-17 at 16:35:24 +0800, Honglei Wang wrote:
>
>
> On 2023/2/16 20:55, Abel Wu wrote:
> > Hi Chen,
> >
> > I've tested this patchset (with modification) on our Redis proxy
> > servers, and the results seems promising.
> >
> > On 2/3/23 1:18 PM, Chen Yu wrote:
> > > ...
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index aa16611c7263..d50097e5fcc1 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
> > >       return 1;
> > >   }
> > > +/*
> > > + * If a task switches in and then voluntarily relinquishes the
> > > + * CPU quickly, it is regarded as a short duration task.
> > > + *
> > > + * SIS_SHORT tries to wake up the short wakee on current CPU. This
> > > + * aims to avoid race condition among CPUs due to frequent context
> > > + * switch.
> > > + */
> > > +static inline int is_short_task(struct task_struct *p)
> > > +{
> > > +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> > > +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
> > > +}
> >
> > I changed the factor to fit into the shape of tasks in question.
> >
> >     static inline int is_short_task(struct task_struct *p)
> >     {
> >         u64 dur = sysctl_sched_min_granularity / 8;
> >
> >         if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
> >             return false;
> >
> >         /*
> >          * Bare tracepoint to allow dynamically changing
> >          * the threshold.
> >          */
> >         trace_sched_short_task_tp(p, &dur);
> >
> >         return p->se.dur_avg < dur;
> >     }
> >
> > I'm not sure it is the right way to provide such flexibility, but
> > definition of 'short' can be workload specific.
> >
> > > +
> > >   /*
> > >    * The purpose of wake_affine() is to quickly determine on which
> > > CPU we can run
> > >    * soonest. For the purpose of speed we only consider the waking
> > > and previous
> > > @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
> > > int sync)
> > >       if (available_idle_cpu(prev_cpu))
> > >           return prev_cpu;
> > > +    /* The only running task is a short duration one. */
> > > +    if (cpu_rq(this_cpu)->nr_running == 1 &&
> > > +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
> > > +        return this_cpu;
> >
> > Since the proxy server handles simple data delivery, the tasks are
> > generally short running ones and hate task stacking, which may
> > introduce scheduling latency (even if there are only 2 short tasks
> > competing with each other). So this part brings a slight regression on
> > the proxy case. But I still think this is good for most cases.
> >
> > Speaking of task stacking, I found wake_affine_weight() can be
> > much more dangerous. It chooses the less loaded one between the
> > prev & this cpu as a candidate, so 'small' tasks can be easily
> > stacked on this cpu when waking up several tasks at one time if
> > this cpu is unloaded. This really hurts if the 'small' tasks are
> > latency-sensitive, although wake_affine_weight() does the right
> > thing from the point of view of 'load'.
> >
> > The following change greatly reduced the p99lat of Redis service
> > from 150ms to 0.9ms, at exactly the same throughput (QPS).
> >
> > @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> > task_struct *p,
> >     s64 this_eff_load, prev_eff_load;
> >     unsigned long task_load;
> >
> > +    if (is_short_task(p))
> > +        return nr_cpumask_bits;
> > +
> >     this_eff_load = cpu_load(cpu_rq(this_cpu));
> >
> >     if (sync) {
> >
> > I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> > their sleeping duration is small or they have large weights, but this works
> > really well for this case. This is partly because delivering data
> > is memory bandwidth intensive and hence prefers cache-hot cpus. And I
> > think this is also applicable for general purposes: do NOT let
> > the short running tasks suffer from cache misses caused by
> > migration.
> >
>
> Redis is a bit special. It runs quickly and is really sensitive to scheduling
> latency. The purpose of this 'short task' feature from Yu is to mitigate the
> migration and tend to place the waking task on the local cpu, which is somehow
> the opposite of what workloads such as Redis want. The changes you did remind me
> of the latency-prio stuff. Maybe we can do something based on both the 'short
> task' and 'latency-prio' to make your changes more general. Thoughts?
>
Looks reasonable, I suppose you were referring to 'latency nice' proposed by
Vincent. For now I'd like to keep this patch simple enough; later we can
extend it.

thanks,
Chenyu
> Thanks,
> Honglei
>
> > Best regards,
> > ????Abel

2023-02-20 05:59:15

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

Hi Prateek,
On 2023-02-18 at 01:05:32 +0530, K Prateek Nayak wrote:
> Hello Chenyu and Abel,
>
> I'll leave the detailed results from testing on a dual socket Zen3 system
> (2 x 64C/128T) below.
>
Thanks for the test!
> tl;dr
>
> o Most benchmark results see small wins or are comparable to tip.
> o SpecJBB Max-jOPS sees a small hit but Critical-jOPS improves.
I assume that this change should be within the acceptable variance range, because
in the previous version we did not restrict the local wakeup as strictly as in the
current version, and we did not see a hit on Max-jOPS in v4. Anyway, I've
enhanced the restriction per Abel's feedback and launched some tests,
so as to make Redis and SpecJBB feel better.
> o ycsb-mongodb sees a small uplift in NPS1 mode.
> o Numbers for Netperf runs are pending, which I'll share in the
> coming week.
> o Abel's suggestion on top of v5 seems promising, but there are a
> few regressions I noticed on larger workloads.
>
> Detailed Results:
>
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 6.2.0-rc6 tip sched/core
> - sis_short: 6.2.0-rc6 tip sched/core + this series
>
> When the testing started, the tip was at:
> commit 4d627628d758 "cpuidle: Fix poll_idle() noinstr annotation"
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_short
> 1-groups: 4.38 (0.00 pct) 4.49 (-2.51 pct)
> 2-groups: 5.12 (0.00 pct) 5.20 (-1.56 pct)
> 4-groups: 4.21 (0.00 pct) 4.24 (-0.71 pct)
> 8-groups: 4.68 (0.00 pct) 4.73 (-1.06 pct)
> 16-groups: 6.13 (0.00 pct) 6.35 (-3.58 pct)
>
> o NPS2
>
> Test: tip sis_short
> 1-groups: 4.51 (0.00 pct) 4.36 (3.32 pct)
> 2-groups: 4.31 (0.00 pct) 4.35 (0.92 pct)
> 4-groups: 4.17 (0.00 pct) 4.08 (2.15 pct)
> 8-groups: 4.58 (0.00 pct) 4.49 (1.96 pct)
> 16-groups: 5.74 (0.00 pct) 5.93 (-3.31 pct)
>
> o NPS4
>
> Test: tip sis_short
> 1-groups: 4.47 (0.00 pct) 4.51 (-0.89 pct)
> 2-groups: 4.97 (0.00 pct) 5.04 (-1.40 pct)
> 4-groups: 4.26 (0.00 pct) 4.28 (-0.46 pct)
> 8-groups: 5.46 (0.00 pct) 5.56 (-1.83 pct)
> 16-groups: 6.38 (0.00 pct) 6.10 (4.38 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_short
> 1: 36.00 (0.00 pct) 27.00 (25.00 pct)
> 2: 37.00 (0.00 pct) 32.00 (13.51 pct)
> 4: 41.00 (0.00 pct) 34.00 (17.07 pct)
> 8: 46.00 (0.00 pct) 43.00 (6.52 pct)
> 16: 66.00 (0.00 pct) 66.00 (0.00 pct)
> 32: 111.00 (0.00 pct) 108.00 (2.70 pct)
> 64: 207.00 (0.00 pct) 206.00 (0.48 pct)
> 128: 483.00 (0.00 pct) 481.00 (0.41 pct)
> 256: 46272.00 (0.00 pct) 45120.00 (2.48 pct)
> 512: 76160.00 (0.00 pct) 77696.00 (-2.01 pct)
>
> o NPS2
>
> #workers: tip sis_short
> 1: 33.00 (0.00 pct) 31.00 (6.06 pct)
> 2: 35.00 (0.00 pct) 31.00 (11.42 pct)
> 4: 38.00 (0.00 pct) 38.00 (0.00 pct)
> 8: 51.00 (0.00 pct) 47.00 (7.84 pct)
> 16: 64.00 (0.00 pct) 67.00 (-4.68 pct)
> 32: 118.00 (0.00 pct) 116.00 (1.69 pct)
> 64: 214.00 (0.00 pct) 217.00 (-1.40 pct)
> 128: 497.00 (0.00 pct) 504.00 (-1.40 pct)
> 256: 45632.00 (0.00 pct) 44352.00 (2.80 pct)
> 512: 81024.00 (0.00 pct) 78464.00 (3.15 pct)
>
> o NPS4
>
> #workers: tip sis_short
> 1: 33.00 (0.00 pct) 32.00 (3.03 pct)
> 2: 40.00 (0.00 pct) 32.00 (20.00 pct)
> 4: 42.00 (0.00 pct) 38.00 (9.52 pct)
> 8: 64.00 (0.00 pct) 65.00 (-1.56 pct)
> 16: 73.00 (0.00 pct) 69.00 (5.47 pct)
> 32: 112.00 (0.00 pct) 112.00 (0.00 pct)
> 64: 215.00 (0.00 pct) 207.00 (3.72 pct)
> 128: 615.00 (0.00 pct) 593.00 (3.73 pct)
> 256: 46144.00 (0.00 pct) 45376.00 (1.66 pct)
> 512: 78208.00 (0.00 pct) 77696.00 (0.65 pct)
>
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_short
> 1 536.78 (0.00 pct) 537.38 (0.11 pct)
> 2 1050.74 (0.00 pct) 1058.74 (0.76 pct)
> 4 1993.47 (0.00 pct) 1976.79 (-0.83 pct)
> 8 3498.02 (0.00 pct) 3657.16 (4.54 pct)
> 16 6202.01 (0.00 pct) 6014.62 (-3.02 pct)
> 32 11544.55 (0.00 pct) 11847.47 (2.62 pct)
> 64 21828.75 (0.00 pct) 21754.85 (-0.33 pct)
> 128 31095.92 (0.00 pct) 31643.35 (1.76 pct)
> 256 54828.12 (0.00 pct) 55432.29 (1.10 pct)
> 512 54888.10 (0.00 pct) 55917.91 (1.87 pct)
> 1024 54916.75 (0.00 pct) 53468.79 (-2.63 pct)
>
> o NPS2
>
> Clients: tip sis_short
> 1 543.08 (0.00 pct) 544.49 (0.25 pct)
> 2 1074.55 (0.00 pct) 1060.33 (-1.32 pct)
> 4 1980.75 (0.00 pct) 1992.86 (0.61 pct)
> 8 3628.36 (0.00 pct) 3507.73 (-3.32 pct)
> 16 5806.00 (0.00 pct) 5790.82 (-0.26 pct)
> 32 11351.94 (0.00 pct) 10937.21 (-3.26 pct)
> 64 19987.40 (0.00 pct) 20739.38 (3.76 pct)
> 128 29554.40 (0.00 pct) 30011.99 (1.54 pct)
> 256 53594.11 (0.00 pct) 51473.78 (-3.95 pct)
> 512 54304.03 (0.00 pct) 52998.31 (-2.40 pct)
> 1024 54338.25 (0.00 pct) 53265.51 (-1.97 pct)
>
> o NPS4
>
> Clients: tip sis_short
> 1 541.29 (0.00 pct) 536.21 (-0.93 pct)
> 2 1045.15 (0.00 pct) 1054.94 (0.93 pct)
> 4 1973.01 (0.00 pct) 1988.63 (0.79 pct)
> 8 3490.55 (0.00 pct) 3535.27 (1.28 pct)
> 16 5920.12 (0.00 pct) 5846.04 (-1.25 pct)
> 32 10933.38 (0.00 pct) 10944.33 (0.10 pct)
> 64 19628.34 (0.00 pct) 19328.66 (1.01 pct)
> 128 29785.23 (0.00 pct) 28749.48 (-4.55 pct)
> 256 51999.72 (0.00 pct) 51336.20 (-1.27 pct)
> 512 53619.42 (0.00 pct) 53269.04 (-0.65 pct)
> 1024 53956.57 (0.00 pct) 53666.14 (-0.53 pct)
>
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 320576.16 (0.00 pct) 328194.56 (2.37 pct)
> Scale: 212869.80 (0.00 pct) 216713.96 (1.80 pct)
> Add: 241556.74 (0.00 pct) 247467.26 (2.44 pct)
> Triad: 250637.58 (0.00 pct) 245538.49 (-2.03 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 330058.38 (0.00 pct) 329339.60 (-0.21 pct)
> Scale: 216475.85 (0.00 pct) 219334.10 (1.32 pct)
> Add: 243028.82 (0.00 pct) 244037.77 (0.41 pct)
> Triad: 252907.98 (0.00 pct) 257210.37 (1.70 pct)
>
> o NPS2
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 339946.34 (0.00 pct) 327261.79 (-3.73 pct)
> Scale: 217453.46 (0.00 pct) 221366.66 (1.79 pct)
> Add: 258099.63 (0.00 pct) 258472.44 (0.14 pct)
> Triad: 264974.76 (0.00 pct) 262618.99 (-0.88 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 335725.30 (0.00 pct) 320797.67 (-4.44 pct)
> Scale: 229985.45 (0.00 pct) 221706.62 (-3.59 pct)
> Add: 260546.33 (0.00 pct) 250668.80 (-3.79 pct)
> Triad: 267925.27 (0.00 pct) 262959.86 (-1.85 pct)
>
> o NPS4
>
> 10 Runs:
>
> Test: tip sis_short
> Copy: 369037.34 (0.00 pct) 371514.46 (0.67 pct)
> Scale: 238235.39 (0.00 pct) 237661.29 (-0.24 pct)
> Add: 263626.48 (0.00 pct) 263436.20 (-0.07 pct)
> Triad: 280881.43 (0.00 pct) 288059.52 (2.55 pct)
>
> 100 Runs:
>
> Test: tip sis_short
> Copy: 339036.66 (0.00 pct) 346904.09 (2.32 pct)
> Scale: 246638.02 (0.00 pct) 230195.65 (-6.66 pct)
> Add: 259898.86 (0.00 pct) 244631.77 (-5.87 pct)
> Triad: 265719.02 (0.00 pct) 264620.50 (-0.41 pct)
>
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
>
> o NPS1:
>
> tip : 133514.00 (var: 2.07%)
> sis-short : 137664.67 (var: 1.45%) (3.11%)
>
> o NPS2:
>
> tip : 132193.33 (var: 1.46%)
> sis-short : 131189.33 (var: 1.69%) (-0.75%)
>
> o NPS4:
>
> tip : 133285.67 (var: 1.77%)
> sis-short : 133891.33 (var: 1.58%) (0.45%)
>
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48665321.00 ( 0.00%) 48553432.30 ( -0.23%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6281376826.80 ( 0.00%) 6277335150.50 ( -0.06%)
> unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2682044.73 * 0.26%*
> unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7290524.47 * -0.84%*
> unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2426076.17 * -1.70%*
> unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 293462420.03 * -0.84%*
> unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4229.59 ( 1.56%)
> unixbench-spawn Hmean unixbench-spawn-512 79950.80 ( 0.00%) 76439.30 ( -4.39%)
> unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4151.37 ( 0.95%)
> unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11756.46 ( -0.25%)
>
> o NPS2
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49671827.09 ( 0.00%) 49077076.00 ( -1.20%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6282239821.90 ( 0.00%) 6283671307.30 ( 0.02%)
> unixbench-syscall Amean unixbench-syscall-1 2688504.20 ( 0.00%) 2676278.60 * 0.45%*
> unixbench-syscall Amean unixbench-syscall-512 7321621.07 ( 0.00%) 7784926.60 * 6.33%*
> unixbench-pipe Hmean unixbench-pipe-1 2469941.97 ( 0.00%) 2419584.09 * -2.04%*
> unixbench-pipe Hmean unixbench-pipe-512 296146392.10 ( 0.00%) 293156913.86 * -1.01%*
> unixbench-spawn Hmean unixbench-spawn-1 5029.05 ( 0.00%) 5015.18 ( -0.28%)
> unixbench-spawn Hmean unixbench-spawn-512 77198.79 ( 0.00%) 80409.23 * 4.16%*
> unixbench-execl Hmean unixbench-execl-1 4092.59 ( 0.00%) 4158.36 * 1.61%*
> unixbench-execl Hmean unixbench-execl-512 12293.67 ( 0.00%) 12169.31 ( -1.01%)
>
> o NPS4
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48944542.05 ( 0.00%) 49490899.03 * 1.12%*
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6291259625.50 ( 0.00%) 6299305899.90 ( 0.13%)
> unixbench-syscall Amean unixbench-syscall-1 2686991.73 ( 0.00%) 2682940.53 * 0.15%*
> unixbench-syscall Amean unixbench-syscall-512 7902201.47 ( 0.00%) 7931906.47 ( -0.38%)
> unixbench-pipe Hmean unixbench-pipe-1 2468813.43 ( 0.00%) 2422272.88 * -1.89%*
> unixbench-pipe Hmean unixbench-pipe-512 297109244.52 ( 0.00%) 294589928.27 * -0.85%*
> unixbench-spawn Hmean unixbench-spawn-1 5161.67 ( 0.00%) 5012.58 ( -2.89%)
> unixbench-spawn Hmean unixbench-spawn-512 78657.60 ( 0.00%) 78572.80 ( -0.11%)
> unixbench-execl Hmean unixbench-execl-1 4112.02 ( 0.00%) 4122.16 ( 0.25%)
> unixbench-execl Hmean unixbench-execl-512 13700.99 ( 0.00%) 14173.20 * 3.44%*
>
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel tip sis_short
> Max-jOPS 100% 98.53%
> Critical-jOPS 100% 105.61%
>
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel : tip sis_short
> 8C/16T : 100.00% 100.54%
> 16C/32T : 100.00% 100.19%
> 32C/64T : 100.00% 98.08%
> 64C/128T : 100.00% 98.34%
>
>
> --------------- With Abel's suggestion added to v5 ---------------
>
> I've added the hunk suggested by Abel in the thread on top of v5, and
> the following are results for the same set of benchmarks, but only for
> the machine running in NPS1 mode.
>
> sis_short_v5.1: 6.2.0-rc6 tip sched/core + this series + Abel's suggestion
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip sis_short_v5.1
> 1-groups: 4.38 (0.00 pct) 4.08 (6.84 pct)
> 2-groups: 5.12 (0.00 pct) 5.10 (0.39 pct)
> 4-groups: 4.21 (0.00 pct) 4.23 (-0.47 pct)
> 8-groups: 4.68 (0.00 pct) 4.69 (-0.21 pct)
> 16-groups: 6.13 (0.00 pct) 5.94 (3.09 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip sis_short_v5.1
> 1: 36.00 (0.00 pct) 36.00 (0.00 pct)
> 2: 37.00 (0.00 pct) 39.00 (-5.40 pct)
> 4: 41.00 (0.00 pct) 40.00 (2.43 pct)
> 8: 46.00 (0.00 pct) 46.00 (0.00 pct)
> 16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
> 32: 111.00 (0.00 pct) 112.00 (-0.90 pct)
> 64: 207.00 (0.00 pct) 238.00 (-14.97 pct)
> 64: 227.00 (0.00 pct) 219.00 (3.52 pct)
> 128: 483.00 (0.00 pct) 494.00 (-2.27 pct)
> 256: 46272.00 (0.00 pct) 41280.00 (10.78 pct)
> 512: 78293.00 (0.00 pct) 79325.00 (-1.31 pct)
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip sis_short_v5.1
> 1 536.78 (0.00 pct) 535.90 (-0.16 pct)
> 2 1050.74 (0.00 pct) 1067.32 (1.57 pct)
> 4 1993.47 (0.00 pct) 1971.63 (-1.09 pct)
> 8 3601.77 (0.00 pct) 3599.17 (-0.07 pct)
> 16 6202.01 (0.00 pct) 6115.08 (-1.40 pct)
> 32 11544.55 (0.00 pct) 11423.52 (-1.04 pct)
> 64 21828.75 (0.00 pct) 21403.94 (-1.94 pct)
> 128 31095.92 (0.00 pct) 30783.55 (-1.00 pct)
> 256 54828.12 (0.00 pct) 55328.94 (0.91 pct)
> 512 54888.10 (0.00 pct) 53483.33 (-2.55 pct)
> 1024 48407.14 (0.00 pct) 48998.95 (1.22 pct)
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> 10 Runs:
>
> Test: tip sis_short_v5.1
> Copy: 320576.16 (0.00 pct) 331810.14 (3.50 pct)
> Scale: 212869.80 (0.00 pct) 214725.82 (0.87 pct)
> Add: 241556.74 (0.00 pct) 242340.92 (0.32 pct)
> Triad: 250637.58 (0.00 pct) 251271.53 (0.25 pct)
>
> 100 Runs:
>
> Test: tip sis_short_v5.1
> Copy: 330058.38 (0.00 pct) 331966.60 (0.57 pct)
> Scale: 216475.85 (0.00 pct) 222777.84 (2.91 pct)
> Add: 243028.82 (0.00 pct) 250873.78 (3.22 pct)
> Triad: 252907.98 (0.00 pct) 253791.20 (0.34 pct)
>
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
>
> o NPS1:
>
> tip : 133514.00 (var: 2.07%)
> sis-short_v5.1 : 129172.67 (var: 2.32%) (-3.25%) **
>
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test Metric Parallelism tip sis_short_v5.1
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49266026.90 ( 0.00%) 49054799.90 ( -0.43%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6285063007.68 ( 0.00%) 6280424934.15 ( -0.07%)
> unixbench-syscall Amean unixbench-syscall-1 2689026.67 ( 0.00%) 2677968.03 * 0.41%*
> unixbench-syscall Amean unixbench-syscall-512 7352453.23 ( 0.00%) 7354325.40 ( -0.03%)
> unixbench-pipe Hmean unixbench-pipe-1 2467955.46 ( 0.00%) 2351117.60 * -4.73%*
> unixbench-pipe Hmean unixbench-pipe-512 295937232.39 ( 0.00%) 295769918.99 ( -0.06%)
> unixbench-spawn Hmean unixbench-spawn-1 4164.75 ( 0.00%) 4331.89 * 4.01%*
> unixbench-spawn Hmean unixbench-spawn-512 79626.61 ( 0.00%) 77865.32 * -2.21%*
> unixbench-execl Hmean unixbench-execl-1 4112.25 ( 0.00%) 4145.85 ( 0.82%)
> unixbench-execl Hmean unixbench-execl-512 11785.88 ( 0.00%) 11935.41 ( 1.27%)
>
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
>
> o NPS1 - Normalized to baseline (tip)
>
> Kernel tip sis_short_V5.1
> Max-jOPS 100% 91.99% ** (-8.01%)
> Critical-jOPS 100% 99.29%
>
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> o NPS1 - Throughput normalized to baseline (tip)
>
> Kernel : tip sis_short_V5.1
> 8C/16T : 100.00% 93.75% ** (-6.25%)
> 16C/32T : 100.00% 100.43%
> 32C/64T : 100.00% 101.12%
> 64C/128T : 100.00% 100.21%
>
> o Follow wake_affine_bias() if waker's cpu and prev_cpu are on same LLC?
>
> There are cases with Abel's suggestion where some of the larger
> benchmarks regress. I wonder if wake_affine_bias() can still be
> considered for short running tasks if the waker's CPU and the
> prev_cpu share caches. In the DeathStarBench 8C/16T case, the
> services are all pinned to the CPUs of the same MC domain. The
> regression observed seems to arise from the missed opportunity
> to distribute load among the CPUs sharing the same L3. I do not
> have data for this currently, but I'll update the thread with any
> findings.
Good observation. Just like select_idle_sibling(), the prev cpu should only
be chosen if the target cpu and the prev cpu share the LLC. My next
version still prefers the current cpu over the prev cpu, and adds the wakee_flips
check to aggregate tasks on the current CPU only when the waker and wakee wake
up each other frequently. This could somehow mitigate the problem Abel
mentioned, that too many short tasks get stacked on the current CPU.
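
To illustrate the constraint, a rough sketch assuming the existing
cpus_share_cache() and available_idle_cpu() helpers; the helper name below is
made up for the example and this is not the actual v6 code:

    /*
     * Sketch only: if a short task were ever to fall back to its
     * previous CPU instead of the waking CPU, only allow that when
     * prev shares the LLC with the target and is idle, mirroring
     * what select_idle_sibling() requires.
     */
    static inline bool short_task_prev_ok(int target, int prev)
    {
        return prev != target &&
               cpus_share_cache(prev, target) &&
               available_idle_cpu(prev);
    }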

thanks,
Chenyu
>
> I'll also queue up a Redis run from mmtests to see if I can reproduce
> Abel's observations on my system; however, I'm not sure if the
> utilization will be high enough to emulate the same scenario as
> Abel's prod environment. If the migrations within the same MC
>
I'll launch more tests and send out the results later.

thanks,
Chenyu

2023-02-20 07:27:13

by Honglei Wang

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU



On 2023/2/20 12:58, Chen Yu wrote:
> On 2023-02-17 at 16:35:24 +0800, Honglei Wang wrote:
>>
>>
>> On 2023/2/16 20:55, Abel Wu wrote:
>>> Hi Chen,
>>>
>>> I've tested this patchset (with modification) on our Redis proxy
>>> servers, and the results seems promising.
>>>
>>> On 2/3/23 1:18 PM, Chen Yu wrote:
>>>> ...
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index aa16611c7263..d50097e5fcc1 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
>>>>       return 1;
>>>>   }
>>>> +/*
>>>> + * If a task switches in and then voluntarily relinquishes the
>>>> + * CPU quickly, it is regarded as a short duration task.
>>>> + *
>>>> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
>>>> + * aims to avoid race condition among CPUs due to frequent context
>>>> + * switch.
>>>> + */
>>>> +static inline int is_short_task(struct task_struct *p)
>>>> +{
>>>> +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
>>>> +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
>>>> +}
>>>
>>> I changed the factor to fit into the shape of tasks in question.
>>>
>>>     static inline int is_short_task(struct task_struct *p)
>>>     {
>>>         u64 dur = sysctl_sched_min_granularity / 8;
>>>
>>>         if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
>>>             return false;
>>>
>>>         /*
>>>          * Bare tracepoint to allow dynamically changing
>>>          * the threshold.
>>>          */
>>>         trace_sched_short_task_tp(p, &dur);
>>>
>>>         return p->se.dur_avg < dur;
>>>     }
>>>
>>> I'm not sure it is the right way to provide such flexibility, but
>>> definition of 'short' can be workload specific.
>>>
>>>> +
>>>>   /*
>>>>    * The purpose of wake_affine() is to quickly determine on which
>>>> CPU we can run
>>>>    * soonest. For the purpose of speed we only consider the waking
>>>> and previous
>>>> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
>>>> int sync)
>>>>       if (available_idle_cpu(prev_cpu))
>>>>           return prev_cpu;
>>>> +    /* The only running task is a short duration one. */
>>>> +    if (cpu_rq(this_cpu)->nr_running == 1 &&
>>>> +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
>>>> +        return this_cpu;
>>>
>>> Since the proxy server handles simple data delivery, the tasks are
>>> generally short running ones and hate task stacking, which may
>>> introduce scheduling latency (even if there are only 2 short tasks
>>> competing with each other). So this part brings a slight regression on
>>> the proxy case. But I still think this is good for most cases.
>>>
>>> Speaking of task stacking, I found wake_affine_weight() can be
>>> much more dangerous. It chooses the less loaded one between the
>>> prev & this cpu as a candidate, so 'small' tasks can be easily
>>> stacked on this cpu when waking up several tasks at one time if
>>> this cpu is unloaded. This really hurts if the 'small' tasks are
>>> latency-sensitive, although wake_affine_weight() does the right
>>> thing from the point of view of 'load'.
>>>
>>> The following change greatly reduced the p99lat of Redis service
>>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>>
>>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
>>> task_struct *p,
>>>     s64 this_eff_load, prev_eff_load;
>>>     unsigned long task_load;
>>>
>>> +    if (is_short_task(p))
>>> +        return nr_cpumask_bits;
>>> +
>>>     this_eff_load = cpu_load(cpu_rq(this_cpu));
>>>
>>>     if (sync) {
>>>
>>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>>> their sleeping duration is small or they have large weights, but this works
>>> really well for this case. This is partly because delivering data
>>> is memory bandwidth intensive and hence prefers cache-hot cpus. And I
>>> think this is also applicable for general purposes: do NOT let
>>> the short running tasks suffer from cache misses caused by
>>> migration.
>>>
>>
>> Redis is a bit special. It runs quickly and is really sensitive to scheduling
>> latency. The purpose of this 'short task' feature from Yu is to mitigate the
>> migration and tend to place the waking task on the local cpu, which is somehow
>> the opposite of what workloads such as Redis want. The changes you did remind me
>> of the latency-prio stuff. Maybe we can do something based on both the 'short
>> task' and 'latency-prio' to make your changes more general. Thoughts?
>>
> Looks reasonable, I suppose you were referring to 'latency nice' proposed by
> Vincent. For now I'd like to keep this patch simple enough; later we can
> extend it.
>

Yep, agree to keep this patch as is for now.

Thanks,
Honglei

> thanks,
> Chenyu