2023-04-28 15:22:44

by Chen Yu

[permalink] [raw]
Subject: [PATCH v8 0/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

The main purpose is to avoid unnecessary cross-CPU wakeups. Frequent
cross-CPU wakeups hurt some workloads significantly, especially on
high core count systems:
1. The cross-CPU wakeup triggers a race condition in which several
wakers are stacked on one CPU, which delays the wakeup of their wakees.
2. The cross-CPU wakeup brings core-to-core (C2C) cache overhead if the
waker and the wakee share data.

Inhibit the cross-CPU wakeup by placing the wakee on the waking CPU
if both the waker and the wakee are short-duration tasks and they
wake each other up.

The first patch introduces the definition of a short-duration task.
The second patch leverages the first patch to choose a local CPU for the wakee.
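
For reference, below is a condensed userspace sketch of the placement
condition. The struct, the helper and the 500000 ns constant here are only
illustrative stand-ins; the real check is wake_on_current() added by the
second patch, gated by the SIS_CURRENT feature and using
sysctl_sched_migration_cost (500000 ns by default) as the threshold:

/*
 * Illustrative sketch only: in the kernel the fields live in task_struct
 * (se.dur_avg, wakee_flips, last_wakee) and the threshold is
 * sysctl_sched_migration_cost.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct task_info {
	uint64_t dur_avg;	/* avg runtime between voluntary sleeps, in ns */
	uint64_t wakee_flips;	/* how often the task switches wakee targets */
	const struct task_info *last_wakee;
};

static const uint64_t short_task_ns = 500000;

/* Place the wakee on the waker's CPU only when all conditions hold. */
static bool wake_on_current(const struct task_info *waker,
			    const struct task_info *wakee,
			    unsigned int waker_rq_nr_running)
{
	/* 1. the waker is the only running task on its CPU */
	if (waker_rq_nr_running > 1)
		return false;
	/* 2. the waker is a short duration task (it will sleep soon) */
	if (!waker->dur_avg || waker->dur_avg >= short_task_ns)
		return false;
	/* 3. the wakee is a short duration task (minor impact on the waker) */
	if (!wakee->dur_avg || wakee->dur_avg >= short_task_ns)
		return false;
	/* 4/5. they form a stable 1:1 pair that only wakes each other up */
	if (waker->wakee_flips || wakee->wakee_flips)
		return false;
	if (waker->last_wakee != wakee || wakee->last_wakee != waker)
		return false;

	return true;
}

int main(void)
{
	struct task_info a = { .dur_avg = 120000, .wakee_flips = 0 };
	struct task_info b = { .dur_avg =  90000, .wakee_flips = 0 };

	a.last_wakee = &b;
	b.last_wakee = &a;
	/* Both durations are below the threshold and they wake only each other. */
	printf("wake B on A's CPU: %d\n", wake_on_current(&a, &b, 1));
	return 0;
}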

Overall there is a performance improvement when the benchmark has a close
to 1:1 waker/wakee relationship, such as will-it-scale, netperf and tbench.
netperf shows a universal improvement across many different utilization
levels. There is no noticeable impact on schbench, hackbench,
or an OLTP workload with a commercial RDBMS, according to tests on
Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs.

Per Prateek's test of v7 on Zen3, tbench/netperf show good
improvement at 128 clients, and SPECjbb2015 shows improvement in max-jOPS.
The result of v8 should not differ much from v7 because
the thresholds in v7 and v8 are comparable.

Changes since v7:
1. Replace sysctl_sched_min_granularity with sysctl_sched_migration_cost
to determine if a task is a short duration one, according to Peter's
suggestion.

Changes since v6:
1. Rename SIS_SHORT to SIS_CURRENT, which better describes this feature.
2. Remove the 50% utilization threshold and the has_idle_cores check.
After this change, SIS_CURRENT is applicable at all system utilization levels.
3. Add a condition that SIS_CURRENT is only enabled when the waker and the
wakee wake each other up, that is, A->last_wakee = B and B->last_wakee = A.

Changes since v5:
1. Check the wakee_flips of the waker/wakee. If the wakee_flips
of both the waker and the wakee are 0, it indicates that the waker and
the wakee are waking up each other. In this case, put them together on the
same CPU. This is to avoid stacking too many wakees on
one CPU, which might cause a regression on redis.

Changes since v4:
1. Dietmar commented on the task duration calculation, so the commit
log was refined to reduce confusion.
2. Change [PATCH 1/2] to only record the average duration of a task,
so this change could also benefit UTIL_EST_FASTER[1].
3. As v4 reported a regression on Zen3 and Kunpeng Arm64, add back
the system average utilization restriction: if the system
is not busy, do not enable the short-task wakeup. The above logic has
shown improvement on Zen3[2].
4. Restrict the wakeup target to the current CPU, rather than both the
current CPU and the task's previous CPU. This could also benefit
wakeup optimization from interrupts in the future, as
suggested by Yicong.

Changes since v3:
1. Honglei and Josh were concerned that the threshold for short
task duration could be too long. Decrease the threshold from
sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8);
the '8' comes from get_update_sysctl_factor().
2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
3. Move the calculation of average duration from put_prev_task_fair()
to dequeue_task_fair(), because in v3 put_prev_task_fair() is not
invoked by pick_next_task_fair() in the fast path, so dur_avg could
not be updated in time.
4. Fix the comment in PATCH 2/2 that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
on CPU0 happens earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
5. Move the scan for a CPU with a short duration task from select_idle_cpu()
to select_idle_sibling(), because there is no CPU scan involved, per
Yicong.

Changes since v2:

1. Peter suggested comparing the duration of the waker with the cost of
scanning for an idle CPU: if the cost is higher than the task duration,
do not waste time finding an idle CPU; choose the local or previous
CPU directly. A prototype was created based on this suggestion.
However, according to the test results, this prototype did not inhibit
the cross-CPU wakeup and did not bring improvement, because the cost
of finding an idle CPU is small in the problematic scenario. The root
cause of the problem is a race condition between scanning for an idle
CPU and task enqueue (please refer to the commit log in PATCH 2/2).
So v3 does not change the core logic of v2, with some refinement based
on Peter's suggestion.

2. Simplify the logic to record the task duration per Peter and Abel's suggestion.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

v6: https://lore.kernel.org/lkml/[email protected]/
v5: https://lore.kernel.org/lkml/[email protected]/
v4: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/

Chen Yu (2):
sched/fair: Record the average duration of a task
sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

 include/linux/sched.h   |  3 +++
 kernel/sched/core.c     |  2 ++
 kernel/sched/debug.c    |  1 +
 kernel/sched/fair.c     | 59 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 5 files changed, 66 insertions(+)

--
2.25.1


2023-04-28 15:23:34

by Chen Yu

[permalink] [raw]
Subject: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

[Problem Statement]
For a workload that is doing frequent context switches, the throughput
scales well until the number of instances reaches a peak point. After
that peak point, the throughput drops significantly if the number of
instances continues to increase.

The will-it-scale context_switch1 test case exposes the issue. The
test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
launches 1, 8, 16 ... instances respectively. Each instance is composed
of 2 tasks, and each pair of tasks would do ping-pong scheduling via
pipe_read() and pipe_write(). No task is bound to any CPU. It is found
that, once the number of instances is higher than 56, the throughput
drops accordingly:

          ^
throughput|
          |                 X
          |               X   X X
          |             X         X X
          |           X               X
          |         X                   X
          |       X
          |     X
          |   X
          | X
          |
          +-----------------.------------------->
                            56
                                 number of instances

[Symptom analysis]

One of the reasons for the performance degradation is the high
system idle percentage (around 20% ~ 30%). The CPUs waste a lot of
time being idle and doing nothing. As a comparison, if CPU affinity is
set for these workloads to stop them from migrating among CPUs,
the idle percentage drops to nearly 0% and the throughput
increases a lot. This indicates that there is room for optimization.

The cause of the high idle time is that there is no strict synchronization
between select_task_rq() and setting the ttwu_pending flag across
CPUs. This might be by design, because the scheduler prefers parallel
wakeups.

Suppose there are nr_cpus pairs of ping-pong scheduling
tasks. For example, p0' and p0 are ping-pong scheduling,
so are p1' <=> p1 and p2' <=> p2. None of these tasks are
bound to any CPUs. The problem can be summarized as:
more than one waker is stacked on one CPU, which slows down
the wakeup of their wakees:

CPU0                        CPU1                        CPU2

p0'                         p1' => idle                 p2'

try_to_wake_up(p0)                                      try_to_wake_up(p2);
CPU1 = select_task_rq(p0);                              CPU1 = select_task_rq(p2);
ttwu_queue(p0, CPU1);                                   ttwu_queue(p2, CPU1);
__ttwu_queue_wakelist(p0, CPU1);
WRITE_ONCE(CPU1->ttwu_pending, 1);
__smp_call_single_queue(CPU1, p0);  => ttwu_list->p0
                            quitting cpuidle_idle_call()

                                                        __ttwu_queue_wakelist(p2, CPU1);
                                                        WRITE_ONCE(CPU1->ttwu_pending, 1);
                            ttwu_list->p2->p0   <=      __smp_call_single_queue(CPU1, p2);

p0' => idle
                            sched_ttwu_pending()
                            enqueue_task(p2 and p0)

                            idle => p2

                            ...
                            p2 time slice expires
                            ...
                                      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                                 <=== !!! p2 delays the wake up of p0' !!!
                                      !!! causes long idle on CPU0     !!!
                            p2 => p0  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                            p0 wakes up p0'

idle => p0'

Since there are many waker/wakee pairs in the system, the chain reaction
causes many CPUs to be victims. These idle CPUs wait for their waker to
be scheduled. Tianchen has mentioned the above issue here[1].

Besides the high idle percentage, waking up the tasks on different CPUs
could bring Core-to-Core cache overhead, which hurts the performance.

[Proposal]

Wake up the short task on the current CPU if the
following conditions are met:

1. waker A's rq->nr_running <= 1
2. waker A is a short duration task (waker will fall asleep soon)
3. wakee B is a short duration task (impact of B is minor to A)
4. A->wakee_flips is 0 and A->last_wakee = B
5. B->wakee_flips is 0 and B->last_wakee = A

The reason is that, if the waker is a short-duration task, it might
relinquish the CPU soon, and the wakee has a chance to be scheduled.
On the other hand, if the wakee is a short-duration task, putting it on
a non-idle CPU would have minimal impact on the running task.
The benefits of waking a short task on the current CPU are:
1. It reduces the race condition which causes the high idle percentage.
2. It increases cache sharing between the waker and the wakee.

The threshold to define a short duration task is sysctl_sched_migration_cost
(500 us by default). As suggested by Peter, this value is also used in
task_hot() to prevent migrations.

This wakeup strategy can be regarded as a dynamic version of WF_CURRENT_CPU[2]
proposed by Andrei Vagin, except that this change treats the current CPU as the
last resort when the previous CPU is not idle, and avoids stacking tasks
on the current CPU as much as possible.

[Benchmark results]
The baseline is v6.3-rc7 tip:sched/core, on top of
Commit f31dcb152a3d ("sched/clock: Fix local_clock() before sched_clock_init()").
The test platform Intel Sapphire Rapids has 2 x 56C/112T and 224 CPUs in total.
C-states deeper than C1E are disabled. Turbo is disabled. CPU frequency governor
is performance.

Overall there is a universal improvement for netperf/tbench/will-it-scale,
under different loads. And there is no significant impact on hackbench/schbench.

will-it-scale
=============
case load baseline compare%
context_switch1 224 groups 1.00 +552.84%

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.96) +15.23 ( 4.67)
TCP_RR 112-threads 1.00 ( 1.84) +88.83 ( 4.37)
TCP_RR 168-threads 1.00 ( 0.41) +475.45 ( 4.45)
TCP_RR 224-threads 1.00 ( 0.62) +806.85 ( 3.67)
TCP_RR 280-threads 1.00 ( 65.80) +162.66 ( 10.26)
TCP_RR 336-threads 1.00 ( 17.30) -0.19 ( 19.07)
TCP_RR 392-threads 1.00 ( 26.88) +3.38 ( 28.91)
TCP_RR 448-threads 1.00 ( 36.43) -0.26 ( 33.72)
UDP_RR 56-threads 1.00 ( 7.91) +3.77 ( 17.48)
UDP_RR 112-threads 1.00 ( 2.72) -15.02 ( 10.78)
UDP_RR 168-threads 1.00 ( 8.86) +131.77 ( 13.30)
UDP_RR 224-threads 1.00 ( 9.54) +178.73 ( 16.75)
UDP_RR 280-threads 1.00 ( 15.40) +189.69 ( 19.36)
UDP_RR 336-threads 1.00 ( 24.09) +0.54 ( 22.28)
UDP_RR 392-threads 1.00 ( 39.63) -3.90 ( 33.77)
UDP_RR 448-threads 1.00 ( 43.57) +1.57 ( 40.43)

tbench
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 0.50) +10.78 ( 0.52)
loopback 112-threads 1.00 ( 0.19) +2.73 ( 0.08)
loopback 168-threads 1.00 ( 0.09) +173.72 ( 0.47)
loopback 224-threads 1.00 ( 0.20) -2.13 ( 0.42)
loopback 280-threads 1.00 ( 0.06) -0.77 ( 0.15)
loopback 336-threads 1.00 ( 0.14) -0.08 ( 0.08)
loopback 392-threads 1.00 ( 0.17) -0.27 ( 0.86)
loopback 448-threads 1.00 ( 0.37) +0.32 ( 0.02)

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.94) -0.67 ( 0.45)
process-pipe 2-groups 1.00 ( 3.22) -3.00 ( 3.35)
process-pipe 4-groups 1.00 ( 1.66) -3.25 ( 1.87)
process-sockets 1-groups 1.00 ( 0.70) +1.34 ( 0.44)
process-sockets 2-groups 1.00 ( 0.24) +6.99 ( 11.23)
process-sockets 4-groups 1.00 ( 0.61) +1.72 ( 0.57)
threads-pipe 1-groups 1.00 ( 0.95) -0.66 ( 0.74)
threads-pipe 2-groups 1.00 ( 0.79) -0.59 ( 2.10)
threads-pipe 4-groups 1.00 ( 1.97) -1.23 ( 10.62)
threads-sockets 1-groups 1.00 ( 0.73) -2.59 ( 1.32)
threads-sockets 2-groups 1.00 ( 0.30) -1.95 ( 1.68)
threads-sockets 4-groups 1.00 ( 1.22) +1.86 ( 0.73)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.00) +0.88 ( 1.25)
normal 2-mthreads 1.00 ( 2.09) +0.85 ( 2.44)
normal 4-mthreads 1.00 ( 1.29) -1.82 ( 4.55)
normal 8-mthreads 1.00 ( 1.22) +3.45 ( 1.26)

Redis
=====
Launch 224 instances of redis-server on machine A and 224 instances
of redis-benchmark on machine B, and measure the SET/GET latency on B.
The test was run over a 1G NIC. The 99th percentile latency before vs
after SIS_CURRENT did not change much.
             baseline    sis_current
SET          115 ms      116 ms
GET          225 ms      228 ms

Prateek tested this patch on a dual socket Zen3 system (2 x 64C/128T).
tbench and netperf show good improvements at 128 clients. SpecJBB shows
some improvement in max-jOPS:
                           tip        SIS_CURRENT
SPECjbb2015 max-jOPS       100.00%    102.78%
SPECjbb2015 Critical-jOPS  100.00%    100.00%

Others are perf neutral.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

Suggested-by: Tim Chen <[email protected]>
Suggested-by: K Prateek Nayak <[email protected]>
Tested-by: kernel test robot <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
 kernel/sched/fair.c     | 46 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 47 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3236011658a2..642a9e830e8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6537,6 +6537,46 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+/*
+ * Wake up the task on current CPU, if the following conditions are met:
+ *
+ * 1. waker A is the only running task on this_cpu
+ * 2. A is a short duration task (waker will fall asleep soon)
+ * 3. wakee B is a short duration task (impact of B on A is minor)
+ * 4. A and B wake up each other alternately
+ */
+static bool
+wake_on_current(int this_cpu, struct task_struct *p)
+{
+	if (!sched_feat(SIS_CURRENT))
+		return false;
+
+	if (cpu_rq(this_cpu)->nr_running > 1)
+		return false;
+
+	/*
+	 * If a task switches in and then voluntarily relinquishes the
+	 * CPU quickly, it is regarded as a short duration task. In that
+	 * way, the short waker is likely to relinquish the CPU soon, which
+	 * provides room for the wakee. Meanwhile, a short wakee would bring
+	 * minor impact to the current rq. Putting the short waker and wakee
+	 * together benefits cache-sharing task pairs and avoids migration overhead.
+	 */
+	if (!current->se.dur_avg || current->se.dur_avg >= sysctl_sched_migration_cost)
+		return false;
+
+	if (!p->se.dur_avg || p->se.dur_avg >= sysctl_sched_migration_cost)
+		return false;
+
+	if (current->wakee_flips || p->wakee_flips)
+		return false;
+
+	if (current->last_wakee != p || p->last_wakee != current)
+		return false;
+
+	return true;
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -6630,6 +6670,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
+	if (target == nr_cpumask_bits && wake_on_current(this_cpu, p))
+		target = this_cpu;
+
 	schedstat_inc(p->stats.nr_wakeups_affine_attempts);
 	if (target != this_cpu)
 		return prev_cpu;
@@ -7151,6 +7194,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
+	if (smp_processor_id() == target && wake_on_current(target, p))
+		return target;
+
 	i = select_idle_cpu(p, sd, has_idle_core, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..a3e05827f7e8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CURRENT, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1

2023-04-28 15:23:59

by Chen Yu

[permalink] [raw]
Subject: [PATCH v8 1/2] sched/fair: Record the average duration of a task

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first glance, (p->se.sum_exec_runtime / p->nvcsw)
could be used to measure the task duration. However, the
long-past history is factored too heavily into such a formula.
Ideally, old activity should decay and not affect
the current status too much.

Although something based on PELT could be used, se.util_avg might
not be appropriate to describe the task duration:
if task p1 and task p2 are doing frequent ping-pong scheduling on
one CPU, both p1 and p2 have a short duration, but the util_avg
of each task can be up to 50%, which is inconsistent with the
short task duration.

It was found that there was once a similar feature to track the
duration of a task:
commit ad4b78bbcbab ("sched: Add new wakeup preemption mode: WAKEUP_RUNNING")
Unfortunately, it was reverted because it was an experiment. Pick that
idea up again by recording the average duration when a task voluntarily
switches out.

Suppose on CPU1, task p1 and p2 run alternately:

        --------------------> time

| p1 runs 1ms | p2 preempts p1 | p1 switches in, runs 0.5ms and blocks |
^             ^                ^
|_____________|                |_______________________________________|
                                                                        ^
                                                                        |
                                                                  p1 dequeued

p1's duration in one section is (1 + 0.5)ms: if p2 did not
preempt p1, p1 could have run for 1.5ms. This reflects the nature of a task:
how long it wishes to run at most.
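
To make the averaging concrete, here is a minimal userspace sketch of how
dur_avg evolves as per-section durations arrive. The 1/8 weighting below is
an assumption borrowed from the kernel's update_avg() helper
(avg += (sample - avg) / 8), and the sample values are made up purely for
illustration:

#include <stdint.h>
#include <stdio.h>

/* Assumed weighting: fold each new sample in with roughly 1/8 weight. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}

int main(void)
{
	/* per-section durations of p1 in ns, e.g. (1 + 0.5)ms = 1500000 ns */
	uint64_t samples[] = { 1500000, 1200000, 300000, 250000, 200000 };
	uint64_t dur_avg = 0;	/* p->se.dur_avg starts at 0 after fork */

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		update_avg(&dur_avg, samples[i]);
		printf("sample %u: dur = %7llu ns, dur_avg = %7llu ns\n",
		       i, (unsigned long long)samples[i],
		       (unsigned long long)dur_avg);
	}
	return 0;
}

Old sections decay away, so the average tracks recent behavior rather than
the task's entire history.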

Suggested-by: Tim Chen <[email protected]>
Suggested-by: Vincent Guittot <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
 include/linux/sched.h |  3 +++
 kernel/sched/core.c   |  2 ++
 kernel/sched/debug.c  |  1 +
 kernel/sched/fair.c   | 13 +++++++++++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 675298d6eb36..6ee6b00faa12 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -558,6 +558,9 @@ struct sched_entity {
 	u64			prev_sum_exec_runtime;
 
 	u64			nr_migrations;
+	u64			prev_sleep_sum_runtime;
+	/* average duration of a task */
+	u64			dur_avg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int			depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 898fa3bc2765..32eacd220e39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4452,6 +4452,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.dur_avg			= 0;
+	p->se.prev_sleep_sum_runtime	= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	__PS("nr_involuntary_switches", p->nivcsw);
 
 	P(se.load.weight);
+	P(se.dur_avg);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
 	P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f8135d7c89d..3236011658a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6319,6 +6319,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+	u64 dur;
+
+	if (!task_sleep)
+		return;
+
+	dur = p->se.sum_exec_runtime - p->se.prev_sleep_sum_runtime;
+	p->se.prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->se.dur_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6391,6 +6403,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	dur_avg_update(p, task_sleep);
 	hrtick_update(rq);
 }

--
2.25.1

2023-04-29 19:49:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> [Problem Statement]
> For a workload that is doing frequent context switches, the throughput
> scales well until the number of instances reaches a peak point. After
> that peak point, the throughput drops significantly if the number of
> instances continue to increase.
>
> The will-it-scale context_switch1 test case exposes the issue. The
> test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
> launches 1, 8, 16 ... instances respectively. Each instance is composed
> of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> that, once the number of instances is higher than 56, the throughput
> drops accordingly:
>
>           ^
> throughput|
>           |                 X
>           |               X   X X
>           |             X         X X
>           |           X               X
>           |         X                   X
>           |       X
>           |     X
>           |   X
>           | X
>           |
>           +-----------------.------------------->
>                             56
>                                  number of instances

Should these buddy pairs not start interfering with one another at 112
instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
which trying to turn waker/wakee overlap into throughput should tend
toward being a loser due to man-in-the-middle wakeup delay pain more
than offsetting overlap recovery gain, rendering sync wakeup thereafter
an ever more likely win.

Anyway..

What I see in my box, and I bet a virtual nickle it's a player in your
box as well, is WA_WEIGHT making a mess of things by stacking tasks,
sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in
crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
remotely as to not have noisy GUI muck up my demo.

...
8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms

Where I turned it on is hard to miss.

Short duration thread pool workers can be stacked all the way to the
ceiling by WA_WEIGHT during burst wakeups, with wake_wide() not being
able to intervene due to lack of cross coupling between waker/wakees
leading to heuristic failure. A (now long) while ago I caught that
happening with firefox event threads, it launched 32 of 'em in my 8 rq
box (hmm), and them being essentially the scheduler equivalent of
neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
away with it because those particular threads don't seem to do much of
anything. However, were they to go active, the latency hit that we set
up could have stung mightily. That scenario being highly generic leads
me to suspect that somewhere out there in the big wide world, folks are
eating that burst serialization.

-Mike

2023-05-01 08:45:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Sat, Apr 29, 2023 at 09:34:06PM +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continue to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
> > launches 1, 8, 16 ... instances respectively. Each instance is composed
> > of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> > that, once the number of instances is higher than 56, the throughput
> > drops accordingly:
> >
> >           ^
> > throughput|
> >           |                 X
> >           |               X   X X
> >           |             X         X X
> >           |           X               X
> >           |         X                   X
> >           |       X
> >           |     X
> >           |   X
> >           | X
> >           |
> >           +-----------------.------------------->
> >                             56
> >                                  number of instances
>
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser due to man-in-the-middle wakeup delay pain more
> than offsetting overlap recovery gain, rendering sync wakeup thereafter
> an ever more likely win.
>
> Anyway..
>
> What I see in my box, and I bet a virtual nickle it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely as to not have noisy GUI muck up my demo.
>
> ...
> 8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
> 8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
> 8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
> 8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
> 8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
> 8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
> 8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
> 8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
> 8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
> 8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
> 8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
> 8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
> 8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
> 8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
> 8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
> 8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
> 8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
> 8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms
>
> Where I turned it on is hard to miss.
>
> Short duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() not being
> able to intervene due to lack of cross coupling between waker/wakees
> leading to heuristic failure. A (now long) while ago I caught that
> happening with firefox event threads, it launched 32 of 'em in my 8 rq
> box (hmm), and them being essentially the scheduler equivalent of
> neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
> away with it because those particular threads don't seem to do much of
> anything. However, were they to go active, the latency hit that we set
> up could have stung mightily. That scenario being highly generic leads
> me to suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.

I'm thinking WA_BIAS makes this worse...

2023-05-01 13:58:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Sat, Apr 29, 2023 at 07:16:56AM +0800, Chen Yu wrote:
> netperf
> =======
> case load baseline(std%) compare%( std%)
> TCP_RR 56-threads 1.00 ( 1.96) +15.23 ( 4.67)
> TCP_RR 112-threads 1.00 ( 1.84) +88.83 ( 4.37)
> TCP_RR 168-threads 1.00 ( 0.41) +475.45 ( 4.45)
> TCP_RR 224-threads 1.00 ( 0.62) +806.85 ( 3.67)
> TCP_RR 280-threads 1.00 ( 65.80) +162.66 ( 10.26)
> TCP_RR 336-threads 1.00 ( 17.30) -0.19 ( 19.07)
> TCP_RR 392-threads 1.00 ( 26.88) +3.38 ( 28.91)
> TCP_RR 448-threads 1.00 ( 36.43) -0.26 ( 33.72)
> UDP_RR 56-threads 1.00 ( 7.91) +3.77 ( 17.48)
> UDP_RR 112-threads 1.00 ( 2.72) -15.02 ( 10.78)
> UDP_RR 168-threads 1.00 ( 8.86) +131.77 ( 13.30)
> UDP_RR 224-threads 1.00 ( 9.54) +178.73 ( 16.75)
> UDP_RR 280-threads 1.00 ( 15.40) +189.69 ( 19.36)
> UDP_RR 336-threads 1.00 ( 24.09) +0.54 ( 22.28)
> UDP_RR 392-threads 1.00 ( 39.63) -3.90 ( 33.77)
> UDP_RR 448-threads 1.00 ( 43.57) +1.57 ( 40.43)
>
> tbench
> ======
> case load baseline(std%) compare%( std%)
> loopback 56-threads 1.00 ( 0.50) +10.78 ( 0.52)
> loopback 112-threads 1.00 ( 0.19) +2.73 ( 0.08)
> loopback 168-threads 1.00 ( 0.09) +173.72 ( 0.47)
> loopback 224-threads 1.00 ( 0.20) -2.13 ( 0.42)
> loopback 280-threads 1.00 ( 0.06) -0.77 ( 0.15)
> loopback 336-threads 1.00 ( 0.14) -0.08 ( 0.08)
> loopback 392-threads 1.00 ( 0.17) -0.27 ( 0.86)
> loopback 448-threads 1.00 ( 0.37) +0.32 ( 0.02)

So,... I've been poking around with this a bit today and I'm not seeing
it. On my ancient IVB-EP (2*10*2) with the code as in
queue/sched/core I get:

netperf                 NO_WA_WEIGHT   NO_WA_BIAS   NO_SIS_CURRENT   SIS_CURRENT
---------------------------------------------------------------------------------
TCP_SENDFILE-1 : Avg: 40495.7 41899.7 42001 40783.4
TCP_SENDFILE-10 : Avg: 37218.6 37200.1 37065.1 36604.4
TCP_SENDFILE-20 : Avg: 21495.1 21516.6 21004.4 21356.9
TCP_SENDFILE-40 : Avg: 6947.24 7917.64 7079.93 7231.3
TCP_SENDFILE-80 : Avg: 4081.91 3572.48 3582.98 3615.85
TCP_STREAM-1 : Avg: 37078.1 34469.4 37134.5 35095.4
TCP_STREAM-10 : Avg: 31532.1 31265.8 31260.7 31588.1
TCP_STREAM-20 : Avg: 17848 17914.9 17996.6 17937.4
TCP_STREAM-40 : Avg: 7844.3 7201.65 7710.4 7790.62
TCP_STREAM-80 : Avg: 2518.38 2932.74 2601.51 2903.89
TCP_RR-1 : Avg: 84347.1 81056.2 81167.8 83541.3
TCP_RR-10 : Avg: 71539.1 72099.5 71123.2 69447.9
TCP_RR-20 : Avg: 51053.3 50952.4 50905.4 52157.2
TCP_RR-40 : Avg: 46370.9 46477.5 46289.2 46350.7
TCP_RR-80 : Avg: 21515.2 22497.9 22024.4 22229.2
UDP_RR-1 : Avg: 96933 100076 95997.2 96553.3
UDP_RR-10 : Avg: 83937.3 83054.3 83878.5 78998.6
UDP_RR-20 : Avg: 61974 61897.5 61838.8 62926
UDP_RR-40 : Avg: 56708.6 57053.9 56456.1 57115.2
UDP_RR-80 : Avg: 26950 27895.8 27635.2 27784.8
UDP_STREAM-1 : Avg: 52808.3 55296.8 52808.2 51908.6
UDP_STREAM-10 : Avg: 45810 42944.1 43115 43561.2
UDP_STREAM-20 : Avg: 19212.7 17572.9 18798.7 20066
UDP_STREAM-40 : Avg: 13105.1 13096.9 13070.5 13110.2
UDP_STREAM-80 : Avg: 6372.57 6367.96 6248.86 6413.09


tbench

NO_WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT

Throughput 626.57 MB/sec 2 clients 2 procs max_latency=0.095 ms
Throughput 1316.08 MB/sec 5 clients 5 procs max_latency=0.106 ms
Throughput 1905.19 MB/sec 10 clients 10 procs max_latency=0.161 ms
Throughput 2428.05 MB/sec 20 clients 20 procs max_latency=0.284 ms
Throughput 2323.16 MB/sec 40 clients 40 procs max_latency=0.381 ms
Throughput 2229.93 MB/sec 80 clients 80 procs max_latency=0.873 ms

WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT

Throughput 575.04 MB/sec 2 clients 2 procs max_latency=0.093 ms
Throughput 1285.37 MB/sec 5 clients 5 procs max_latency=0.122 ms
Throughput 1916.10 MB/sec 10 clients 10 procs max_latency=0.150 ms
Throughput 2422.54 MB/sec 20 clients 20 procs max_latency=0.292 ms
Throughput 2361.57 MB/sec 40 clients 40 procs max_latency=0.448 ms
Throughput 2479.70 MB/sec 80 clients 80 procs max_latency=1.249 ms

WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)

Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms

WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)

Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms


So what's going on here? I don't see anything exciting happening at the
40 mark. At the same time, I can't seem to reproduce Mike's latency pile
up either :/

2023-05-01 15:26:39

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

Hi Mike,
On 2023-04-29 at 21:34:06 +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continue to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
> > launches 1, 8, 16 ... instances respectively. Each instance is composed
> > of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> > that, once the number of instances is higher than 56, the throughput
> > drops accordingly:
> >
> >           ^
> > throughput|
> >           |                 X
> >           |               X   X X
> >           |             X         X X
> >           |           X               X
> >           |         X                   X
> >           |       X
> >           |     X
> >           |   X
> >           | X
> >           |
> >           +-----------------.------------------->
> >                             56
> >                                  number of instances
>
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser due to man-in-the-middle wakeup delay pain more
> than offsetting overlap recovery gain, rendering sync wakeup thereafter
> an ever more likely win.
>
Thank you for taking a look at this. Yes, you are right, I did not
describe this clearly. Actually, the figure above was drawn when I first
found the issue, when only 1 socket was online (112 threads in total).
I should update the figure for the 224 CPUs case.
> Anyway...
>
> What I see in my box, and I bet a virtual nickle it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly. Below, I start NR_CPUS tbench buddy pairs in
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely as to not have noisy GUI muck up my demo.
>
> ...
> 8 3155749 3606.79 MB/sec warmup 38 sec latency 3.852 ms
> 8 3238485 3608.75 MB/sec warmup 39 sec latency 3.839 ms
> 8 3321578 3608.59 MB/sec warmup 40 sec latency 3.882 ms
> 8 3404746 3608.09 MB/sec warmup 41 sec latency 2.273 ms
> 8 3487885 3607.58 MB/sec warmup 42 sec latency 3.869 ms
> 8 3571034 3607.12 MB/sec warmup 43 sec latency 3.855 ms
> 8 3654067 3607.48 MB/sec warmup 44 sec latency 3.857 ms
> 8 3736973 3608.83 MB/sec warmup 45 sec latency 4.008 ms
> 8 3820160 3608.33 MB/sec warmup 46 sec latency 3.849 ms
> 8 3902963 3607.60 MB/sec warmup 47 sec latency 14.241 ms
> 8 3986117 3607.17 MB/sec warmup 48 sec latency 20.290 ms
> 8 4069256 3606.70 MB/sec warmup 49 sec latency 28.284 ms
> 8 4151986 3608.35 MB/sec warmup 50 sec latency 17.216 ms
> 8 4235070 3608.06 MB/sec warmup 51 sec latency 23.221 ms
> 8 4318221 3607.81 MB/sec warmup 52 sec latency 28.285 ms
> 8 4401456 3607.29 MB/sec warmup 53 sec latency 20.835 ms
> 8 4484606 3607.06 MB/sec warmup 54 sec latency 28.943 ms
> 8 4567609 3607.32 MB/sec warmup 55 sec latency 28.254 ms
>
> Where I turned it on is hard to miss.
>
> Short duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() not being
> able to intervene due to lack of cross coupling between waker/wakees
> leading to heuristic failure. A (now long) while ago I caught that
> happening with firefox event threads, it launched 32 of 'em in my 8 rq
> box (hmm), and them being essentially the scheduler equivalent of
> neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
> away with it because those particular threads don't seem to do much of
> anything. However, were they to go active, the latency hit that we set
> up could have stung mightily. That scenario being highly generic leads
> me to suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.
Thank you for this information. Yes, task stacking can be quite annoying
for latency. In my previous patch versions, Prateek from AMD, Abel
from Bytedance, and Yicong from Hisilicon expressed their concern
about the task stacking risk, so we tried several versions carefully to
not break their use cases. The WA_WEIGHT issue you described (and
I also saw your email offline) is interesting, and I'll try
to run a similar test on a 4C/8T system to reproduce it.

thanks,
Chenyu
>
> -Mike

2023-05-01 15:39:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
>
> Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
>
>
> So what's going on here? I don't see anything exciting happening at the
> 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> up either :/

Are you running tbench in the GUI so the per second output stimulates
assorted goo? I'm using KDE fwtw.

Caught this from my raspberry pi, tbench placement looks lovely, the
llvmpipe thingies otoh..

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
19109 git 20 0 23468 1920 1664 R 52.65 0.012 3:59.64 4 tbench
19110 git 20 0 23468 1664 1536 R 52.65 0.010 4:00.03 3 tbench
19104 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.15 1 tbench
19105 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.16 0 tbench
19108 git 20 0 23468 1792 1664 R 52.32 0.011 4:00.12 7 tbench
19111 git 20 0 23468 1792 1664 R 51.99 0.011 4:00.33 5 tbench
19106 git 20 0 23468 1664 1536 R 51.66 0.010 3:59.40 6 tbench
19107 git 20 0 23468 1664 1536 R 51.32 0.010 3:59.72 2 tbench
19114 git 20 0 6748 896 768 R 46.69 0.006 3:32.77 6 tbench_srv
19116 git 20 0 6748 768 768 S 46.69 0.005 3:32.17 7 tbench_srv
19118 git 20 0 6748 768 768 S 46.69 0.005 3:31.70 3 tbench_srv
19117 git 20 0 6748 768 768 S 46.36 0.005 3:32.99 4 tbench_srv
19112 git 20 0 6748 768 768 S 46.03 0.005 3:32.51 1 tbench_srv
19113 git 20 0 6748 768 768 R 46.03 0.005 3:32.48 0 tbench_srv
19119 git 20 0 6748 768 768 S 46.03 0.005 3:31.93 5 tbench_srv
19115 git 20 0 6748 768 768 R 45.70 0.005 3:32.70 2 tbench_srv
2492 root 20 0 392608 110044 70276 S 1.987 0.682 8:06.86 3 X
2860 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.38 6 llvmpipe-1
2861 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.44 6 llvmpipe-2
2863 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.94 6 llvmpipe-4
2864 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.72 6 llvmpipe-5
2866 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.49 6 llvmpipe-7
19562 root 20 0 26192 4876 3596 R 0.662 0.030 0:00.43 5 top
2837 root 20 0 2557284 183260 138568 S 0.331 1.135 1:51.39 5 kwin_x11
2859 root 20 0 2557284 183260 138568 S 0.331 1.135 2:07.56 6 llvmpipe-0
2862 root 20 0 2557284 183260 138568 S 0.331 1.135 2:05.97 6 llvmpipe-3
2865 root 20 0 2557284 183260 138568 S 0.331 1.135 2:03.84 6 llvmpipe-6
2966 root 20 0 3829152 323000 174992 S 0.331 2.001 0:12.71 4 llvmpipe-7
2998 root 20 0 1126332 116960 78032 S 0.331 0.725 0:25.58 3 konsole

2023-05-01 15:42:42

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, 2023-05-01 at 17:32 +0200, Mike Galbraith wrote:
> On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
> >
> > Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> > Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> > Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> > Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> > Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> > Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
> >
> >
> > So what's going on here? I don't see anything exciting happening at
> > the
> > 40 mark. At the same time, I can't seem to reproduce Mike's latency
> > pile
> > up either :/
>
> Are you running tbench in the GUI so the per second output stimulates
> assorted goo?  I'm using KDE fwtw.
>
> Caught this from my raspberry pi, tbench placement looks lovely, the
> llvmpipe thingies otoh..

erm, without 14 feet of whitespace.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
19109 git 20 0 23468 1920 1664 R 52.65 0.012 3:59.64 4 tbench
19110 git 20 0 23468 1664 1536 R 52.65 0.010 4:00.03 3 tbench
19104 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.15 1 tbench
19105 git 20 0 23468 1664 1536 R 52.32 0.010 4:00.16 0 tbench
19108 git 20 0 23468 1792 1664 R 52.32 0.011 4:00.12 7 tbench
19111 git 20 0 23468 1792 1664 R 51.99 0.011 4:00.33 5 tbench
19106 git 20 0 23468 1664 1536 R 51.66 0.010 3:59.40 6 tbench
19107 git 20 0 23468 1664 1536 R 51.32 0.010 3:59.72 2 tbench
19114 git 20 0 6748 896 768 R 46.69 0.006 3:32.77 6 tbench_srv
19116 git 20 0 6748 768 768 S 46.69 0.005 3:32.17 7 tbench_srv
19118 git 20 0 6748 768 768 S 46.69 0.005 3:31.70 3 tbench_srv
19117 git 20 0 6748 768 768 S 46.36 0.005 3:32.99 4 tbench_srv
19112 git 20 0 6748 768 768 S 46.03 0.005 3:32.51 1 tbench_srv
19113 git 20 0 6748 768 768 R 46.03 0.005 3:32.48 0 tbench_srv
19119 git 20 0 6748 768 768 S 46.03 0.005 3:31.93 5 tbench_srv
19115 git 20 0 6748 768 768 R 45.70 0.005 3:32.70 2 tbench_srv
2492 root 20 0 392608 110044 70276 S 1.987 0.682 8:06.86 3 X
2860 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.38 6 llvmpipe-1
2861 root 20 0 2557284 183260 138568 S 0.662 1.135 2:06.44 6 llvmpipe-2
2863 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.94 6 llvmpipe-4
2864 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.72 6 llvmpipe-5
2866 root 20 0 2557284 183260 138568 S 0.662 1.135 2:04.49 6 llvmpipe-7
19562 root 20 0 26192 4876 3596 R 0.662 0.030 0:00.43 5 top
2837 root 20 0 2557284 183260 138568 S 0.331 1.135 1:51.39 5 kwin_x11
2859 root 20 0 2557284 183260 138568 S 0.331 1.135 2:07.56 6 llvmpipe-0
2862 root 20 0 2557284 183260 138568 S 0.331 1.135 2:05.97 6 llvmpipe-3
2865 root 20 0 2557284 183260 138568 S 0.331 1.135 2:03.84 6 llvmpipe-6
2966 root 20 0 3829152 323000 174992 S 0.331 2.001 0:12.71 4 llvmpipe-7
2998 root 20 0 1126332 116960 78032 S 0.331 0.725 0:25.58 3 konsole

2023-05-01 16:01:58

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

Hi Peter,
On 2023-05-01 at 15:48:27 +0200, Peter Zijlstra wrote:
> On Sat, Apr 29, 2023 at 07:16:56AM +0800, Chen Yu wrote:
> > netperf
> > =======
> > case load baseline(std%) compare%( std%)
> > TCP_RR 56-threads 1.00 ( 1.96) +15.23 ( 4.67)
> > TCP_RR 112-threads 1.00 ( 1.84) +88.83 ( 4.37)
> > TCP_RR 168-threads 1.00 ( 0.41) +475.45 ( 4.45)
> > TCP_RR 224-threads 1.00 ( 0.62) +806.85 ( 3.67)
> > TCP_RR 280-threads 1.00 ( 65.80) +162.66 ( 10.26)
> > TCP_RR 336-threads 1.00 ( 17.30) -0.19 ( 19.07)
> > TCP_RR 392-threads 1.00 ( 26.88) +3.38 ( 28.91)
> > TCP_RR 448-threads 1.00 ( 36.43) -0.26 ( 33.72)
> > UDP_RR 56-threads 1.00 ( 7.91) +3.77 ( 17.48)
> > UDP_RR 112-threads 1.00 ( 2.72) -15.02 ( 10.78)
> > UDP_RR 168-threads 1.00 ( 8.86) +131.77 ( 13.30)
> > UDP_RR 224-threads 1.00 ( 9.54) +178.73 ( 16.75)
> > UDP_RR 280-threads 1.00 ( 15.40) +189.69 ( 19.36)
> > UDP_RR 336-threads 1.00 ( 24.09) +0.54 ( 22.28)
> > UDP_RR 392-threads 1.00 ( 39.63) -3.90 ( 33.77)
> > UDP_RR 448-threads 1.00 ( 43.57) +1.57 ( 40.43)
> >
> > tbench
> > ======
> > case load baseline(std%) compare%( std%)
> > loopback 56-threads 1.00 ( 0.50) +10.78 ( 0.52)
> > loopback 112-threads 1.00 ( 0.19) +2.73 ( 0.08)
> > loopback 168-threads 1.00 ( 0.09) +173.72 ( 0.47)
> > loopback 224-threads 1.00 ( 0.20) -2.13 ( 0.42)
> > loopback 280-threads 1.00 ( 0.06) -0.77 ( 0.15)
> > loopback 336-threads 1.00 ( 0.14) -0.08 ( 0.08)
> > loopback 392-threads 1.00 ( 0.17) -0.27 ( 0.86)
> > loopback 448-threads 1.00 ( 0.37) +0.32 ( 0.02)
>
> So,... I've been poking around with this a bit today and I'm not seeing
> it. On my ancient IVB-EP (2*10*2) with the code as in
> queue/sched/core I get:
>
> netperf NO_WA_WEIGHT NO_SIS_CURRENT
> NO_WA_BIAS SIS_CURRENT
> -------------------------------------------------------------------
> TCP_SENDFILE-1 : Avg: 40495.7 41899.7 42001 40783.4
> TCP_SENDFILE-10 : Avg: 37218.6 37200.1 37065.1 36604.4
> TCP_SENDFILE-20 : Avg: 21495.1 21516.6 21004.4 21356.9
> TCP_SENDFILE-40 : Avg: 6947.24 7917.64 7079.93 7231.3
> TCP_SENDFILE-80 : Avg: 4081.91 3572.48 3582.98 3615.85
> TCP_STREAM-1 : Avg: 37078.1 34469.4 37134.5 35095.4
> TCP_STREAM-10 : Avg: 31532.1 31265.8 31260.7 31588.1
> TCP_STREAM-20 : Avg: 17848 17914.9 17996.6 17937.4
> TCP_STREAM-40 : Avg: 7844.3 7201.65 7710.4 7790.62
> TCP_STREAM-80 : Avg: 2518.38 2932.74 2601.51 2903.89
> TCP_RR-1 : Avg: 84347.1 81056.2 81167.8 83541.3
> TCP_RR-10 : Avg: 71539.1 72099.5 71123.2 69447.9
> TCP_RR-20 : Avg: 51053.3 50952.4 50905.4 52157.2
> TCP_RR-40 : Avg: 46370.9 46477.5 46289.2 46350.7
> TCP_RR-80 : Avg: 21515.2 22497.9 22024.4 22229.2
> UDP_RR-1 : Avg: 96933 100076 95997.2 96553.3
> UDP_RR-10 : Avg: 83937.3 83054.3 83878.5 78998.6
> UDP_RR-20 : Avg: 61974 61897.5 61838.8 62926
> UDP_RR-40 : Avg: 56708.6 57053.9 56456.1 57115.2
> UDP_RR-80 : Avg: 26950 27895.8 27635.2 27784.8
> UDP_STREAM-1 : Avg: 52808.3 55296.8 52808.2 51908.6
> UDP_STREAM-10 : Avg: 45810 42944.1 43115 43561.2
> UDP_STREAM-20 : Avg: 19212.7 17572.9 18798.7 20066
> UDP_STREAM-40 : Avg: 13105.1 13096.9 13070.5 13110.2
> UDP_STREAM-80 : Avg: 6372.57 6367.96 6248.86 6413.09
>
>
> tbench
>
> NO_WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
>
> Throughput 626.57 MB/sec 2 clients 2 procs max_latency=0.095 ms
> Throughput 1316.08 MB/sec 5 clients 5 procs max_latency=0.106 ms
> Throughput 1905.19 MB/sec 10 clients 10 procs max_latency=0.161 ms
> Throughput 2428.05 MB/sec 20 clients 20 procs max_latency=0.284 ms
> Throughput 2323.16 MB/sec 40 clients 40 procs max_latency=0.381 ms
> Throughput 2229.93 MB/sec 80 clients 80 procs max_latency=0.873 ms
>
> WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
>
> Throughput 575.04 MB/sec 2 clients 2 procs max_latency=0.093 ms
> Throughput 1285.37 MB/sec 5 clients 5 procs max_latency=0.122 ms
> Throughput 1916.10 MB/sec 10 clients 10 procs max_latency=0.150 ms
> Throughput 2422.54 MB/sec 20 clients 20 procs max_latency=0.292 ms
> Throughput 2361.57 MB/sec 40 clients 40 procs max_latency=0.448 ms
> Throughput 2479.70 MB/sec 80 clients 80 procs max_latency=1.249 ms
>
> WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)
>
> Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
> Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
> Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
> Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
> Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
> Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms
>
> WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)
>
> Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
> Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
> Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
> Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
> Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
> Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms
>
>
> So what's going on here? I don't see anything exciting happening at the
> 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> up either :/
>
Thank you very much for trying this patch. This patch was found to mainly
benefit systems with a large number of CPUs in 1 LLC. Previously I tested
it on Sapphire Rapids (2x56C/224T) and Ice Lake Server (2x32C/128T)[1], and it
seems to have benefit on them. The benefit seems to come from:
1. reducing the waker stacking among many CPUs within 1 LLC
2. reducing the C2C overhead within 1 LLC
As a comparison, Prateek has tested this patch on the Zen3 platform,
which has 16 threads per LLC, smaller than Sapphire Rapids and Ice
Lake Server. He did not observe too much difference with this patch
applied, and only saw some limited improvement on tbench and SPECjbb.
So far I have not received performance difference reports from LKP on desktop
test boxes. Let me queue the full test on some desktops to confirm
whether this change has any impact on them.

[1] https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu


The original symptom I found was that there is quite some idle time
(up to 30%) when running the will-it-scale context switch test with
the same number of instances as online CPUs. Waking up the task locally
reduces the race condition and the C2C overhead within 1 LLC,
which is more severe on a system with a large number of CPUs in 1 LLC.

2023-05-01 18:54:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, May 01, 2023 at 05:32:05PM +0200, Mike Galbraith wrote:
> On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
> >
> > Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> > Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> > Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> > Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> > Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> > Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
> >
> >
> > So what's going on here? I don't see anything exciting happening at the
> > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > up either :/
>
> Are you running tbench in the GUI so the per second output stimulates
> assorted goo? I'm using KDE fwtw.

Nah, the IVB-EP is headless, doesn't even have systemd on, still running
sysvinit.

2023-05-02 03:17:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, 2023-05-01 at 20:49 +0200, Peter Zijlstra wrote:
> On Mon, May 01, 2023 at 05:32:05PM +0200, Mike Galbraith wrote:
> > On Mon, 2023-05-01 at 15:48 +0200, Peter Zijlstra wrote:
> > >
> > > Throughput  646.55 MB/sec   2 clients   2 procs  max_latency=0.104 ms
> > > Throughput 1361.06 MB/sec   5 clients   5 procs  max_latency=0.100 ms
> > > Throughput 1889.82 MB/sec  10 clients  10 procs  max_latency=0.154 ms
> > > Throughput 2406.57 MB/sec  20 clients  20 procs  max_latency=3.667 ms
> > > Throughput 2318.00 MB/sec  40 clients  40 procs  max_latency=0.390 ms
> > > Throughput 2384.85 MB/sec  80 clients  80 procs  max_latency=1.371 ms
> > >
> > >
> > > So what's going on here? I don't see anything exciting happening at the
> > > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > > up either :/
> >
> > Are you running tbench in the GUI so the per second output stimulates
> > assorted goo?  I'm using KDE fwtw.
>
> Nah, the IVB-EP is headless, doesn't even have systemd on, still running
> sysvinit.

Turns out load distribution on Yu's box wasn't going wonky way early
after all, so WA_WEIGHT's evil side is off topic wrt $subject. Hohum.

-Mike

2023-05-02 12:04:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On Mon, May 01, 2023 at 11:52:47PM +0800, Chen Yu wrote:

> > So,... I've been poking around with this a bit today and I'm not seeing
> > it. On my ancient IVB-EP (2*10*2) with the code as in
> > queue/sched/core I get:
> >
> > netperf                 NO_SIS_CURRENT   SIS_CURRENT          %
> > -----------------------------------------------------------------
> > TCP_SENDFILE-1 : Avg: 42001 40783.4 -2.89898
> > TCP_SENDFILE-10 : Avg: 37065.1 36604.4 -1.24295
> > TCP_SENDFILE-20 : Avg: 21004.4 21356.9 1.67822
> > TCP_SENDFILE-40 : Avg: 7079.93 7231.3 2.13802
> > TCP_SENDFILE-80 : Avg: 3582.98 3615.85 0.917393

> > TCP_STREAM-1 : Avg: 37134.5 35095.4 -5.49112
> > TCP_STREAM-10 : Avg: 31260.7 31588.1 1.04732
> > TCP_STREAM-20 : Avg: 17996.6 17937.4 -0.328951
> > TCP_STREAM-40 : Avg: 7710.4 7790.62 1.04041
> > TCP_STREAM-80 : Avg: 2601.51 2903.89 11.6232

> > TCP_RR-1 : Avg: 81167.8 83541.3 2.92419
> > TCP_RR-10 : Avg: 71123.2 69447.9 -2.35549
> > TCP_RR-20 : Avg: 50905.4 52157.2 2.45907
> > TCP_RR-40 : Avg: 46289.2 46350.7 0.13286
> > TCP_RR-80 : Avg: 22024.4 22229.2 0.929878

> > UDP_RR-1 : Avg: 95997.2 96553.3 0.579288
> > UDP_RR-10 : Avg: 83878.5 78998.6 -5.81782
> > UDP_RR-20 : Avg: 61838.8 62926 1.75812
> > UDP_RR-40 : Avg: 56456.1 57115.2 1.16746
> > UDP_RR-80 : Avg: 27635.2 27784.8 0.541339

> > UDP_STREAM-1 : Avg: 52808.2 51908.6 -1.70352
> > UDP_STREAM-10 : Avg: 43115 43561.2 1.03491
> > UDP_STREAM-20 : Avg: 18798.7 20066 6.74142
> > UDP_STREAM-40 : Avg: 13070.5 13110.2 0.303737
> > UDP_STREAM-80 : Avg: 6248.86 6413.09 2.62816


> > tbench

> > WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)
> >
> > Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
> > Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
> > Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
> > Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
> > Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
> > Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms
> >
> > WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)
> >
> > Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
> > Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
> > Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
> > Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
> > Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
> > Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms
> >
> >
> > So what's going on here? I don't see anything exciting happening at the
> > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > up either :/
> >
> Thank you very much for trying this patch. This patch was found to mainly
> benefit system with large number of CPUs in 1 LLC. Previously I tested
> it on Sapphire Rapids(2x56C/224T) and Ice Lake Server(2x32C/128T)[1], it
> seems to have benefit on them. The benefit seems to come from:
> 1. reducing the waker stacking among many CPUs within 1 LLC

I should be seeing that at 10 cores per LLC. And when we look at the
tbench results (never the most stable -- let me run a few more of those)
it looks like SIS_CURRENT is actually making that worse.

That latency spike at 20 seems stable for me -- and 3ms is rather small,
I've seen it up to 11ms (but typical in the 4-6 range). This does not
happen with NO_SIS_CURRENT and is a fairly big point against these
patches.

> 2. reducing the C2C overhead within 1 LLC

This is due to how L3 became non-inclusive with Skylake? I can't see
that because I don't have anything that recent :/

> So far I have not received any performance difference reports from LKP on desktop
> test boxes. Let me queue the full test on some desktops to confirm
> if this change has any impact on them.

Right, so I've updated my netperf results above to have a relative
difference between NO_SIS_CURRENT and SIS_CURRENT and I see some losses
at the low end. For servers that gets compensated at the high end, but
desktops tend to not get there much.
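
For reference, the % column above is simply the relative change of SIS_CURRENT
versus NO_SIS_CURRENT; taking the first netperf row as a sanity check:

    (40783.4 - 42001) / 42001 * 100  ~=  -2.899   (the -2.89898 shown above)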


2023-05-04 11:24:07

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

On 2023-05-02 at 13:54:08 +0200, Peter Zijlstra wrote:
> On Mon, May 01, 2023 at 11:52:47PM +0800, Chen Yu wrote:
>
> > > So,... I've been poking around with this a bit today and I'm not seeing
> > > it. On my ancient IVB-EP (2*10*2) with the code as in
> > > queue/sched/core I get:
> > >
> > > So what's going on here? I don't see anything exciting happening at the
> > > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > > up either :/
> > >
> > Thank you very much for trying this patch. This patch was found to mainly
> > benefit systems with a large number of CPUs in 1 LLC. Previously I tested
> > it on Sapphire Rapids (2x56C/224T) and Ice Lake Server (2x32C/128T)[1], and it
> > seems to benefit them. The benefit seems to come from:
> > 1. reducing the waker stacking among many CPUs within 1 LLC
>
> I should be seeing that at 10 cores per LLC. And when we look at the
> tbench results (never the most stable -- let me run a few more of those)
> it looks like SIS_CURRENT is actually making that worse.
>
> That latency spike at 20 seems stable for me -- and 3ms is rather small,
> I've seen it up to 11ms (but typical in the 4-6 range). This does not
> happen with NO_SIS_CURRENT and is a fairly big point against these
> patches.
>
I tried to reproduce the issue on your kind of platform: I launched tbench with
nr_thread = 50% on an Ivy Bridge-EP, but I could not reproduce the issue
(one difference is that my default testing runs with perf record enabled).

I launched netperf/tbench under 50%/75%/100%/125% load on some platforms with
smaller numbers of CPUs, including:
Ivy Bridge-EP, nr_node: 2, nr_cpu: 48
Ivy Bridge, nr_node: 1, nr_cpu: 4
Coffee Lake, nr_node: 1, nr_cpu: 12
Comet Lake, nr_node: 1, nr_cpu: 20
Kaby Lake, nr_node: 1, nr_cpu: 8

All platforms are tested with the cpufreq governor set to performance to
get stable results. Each test lasts for 60 seconds.

Per the test results, no obvious netperf/tbench throughput regression was
detected on these platforms (within 3%), and some platforms such as
Comet Lake show some improvement.

The tbench.max_latency shows both improvement and degradation, and this
latency value seems unstable with or without the patch applied.
I am not sure how to interpret this value (should we look at the tail
latency as schbench does?), and the latency variance seems to be another
issue to be looked into.


netperf.Throughput_total_tps (higher is better):

Ivy Bridge-EP, nr_node: 2, nr_cpu: 48:
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 990828 -1.0% 980992
50%+UDP_RR: 1282489 +1.0% 1295717
75%+TCP_RR: 935827 +8.9% 1019470
75%+UDP_RR: 1164074 +11.6% 1298844
100%+TCP_RR: 1846962 -0.1% 1845311
100%+UDP_RR: 2557455 -2.3% 2497846
125%+TCP_RR: 1771652 -1.4% 1747653
125%+UDP_RR: 2415665 -1.1% 2388459

Ivy Bridge, nr_node: 1, nr_cpu: 4
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 52697 -1.2% 52088
50%+UDP_RR: 135397 -0.1% 135315
75%+TCP_RR: 135613 -0.6% 134777
75%+UDP_RR: 183439 -0.3% 182853
100%+TCP_RR: 183255 -1.3% 180859
100%+UDP_RR: 245836 -0.6% 244345
125%+TCP_RR: 174957 -2.1% 171258
125%+UDP_RR: 232509 -1.1% 229868


Coffee Lake, nr_node: 1, nr_cpu: 12
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 429718 -1.2% 424359
50%+UDP_RR: 536240 +0.1% 536646
75%+TCP_RR: 450310 -1.2% 444764
75%+UDP_RR: 538645 -1.0% 532995
100%+TCP_RR: 774423 -0.3% 771764
100%+UDP_RR: 971805 -0.3% 969223
125%+TCP_RR: 720546 +0.6% 724593
125%+UDP_RR: 911169 +0.2% 912576

Comet Lake, nr_node: 1, nr_cpu: 20
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+UDP_RR: 1174505 +4.6% 1228945
75%+TCP_RR: 833303 +20.2% 1001582
75%+UDP_RR: 1149171 +13.4% 1303623
100%+TCP_RR: 1928064 -0.5% 1917500
125%+TCP_RR: 74389 -0.1% 74304
125%+UDP_RR: 2564210 -1.1% 2535377


Kaby Lake, nr_node: 1, nr_cpu: 8
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 303956 -1.7% 298749
50%+UDP_RR: 382059 -0.8% 379176
75%+TCP_RR: 368399 -1.5% 362742
75%+UDP_RR: 459285 -0.3% 458020
100%+TCP_RR: 544630 -1.1% 538901
100%+UDP_RR: 684498 -0.6% 680450
125%+TCP_RR: 514266 +0.0% 514367
125%+UDP_RR: 645970 +0.2% 647473



tbench.max_latency (lower is better)

Ivy Bridge-EP, nr_node: 2, nr_cpu: 48:
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 45.31 -26.3% 33.41
75%: 269.36 -87.5% 33.72
100%: 274.76 -66.6% 91.85
125%: 723.34 -49.1% 368.29

Ivy Bridge, nr_node: 1, nr_cpu: 4
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.04 -70.5% 2.96
75%: 10.12 +63.0% 16.49
100%: 73.97 +148.1% 183.55
125%: 138.31 -39.9% 83.09


Comet Lake, nr_node: 1, nr_cpu: 20
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.59 +24.5% 13.18
75%: 11.53 -0.5% 11.47
100%: 414.65 -13.9% 356.93
125%: 411.51 -81.9% 74.56

Coffee Lake, nr_node: 1, nr_cpu: 12
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 452.07 -99.5% 2.06
75%: 4.42 +81.2% 8.00
100%: 76.11 -44.7% 42.12
125%: 47.06 +280.6% 179.09


Kaby Lake, nr_node: 1, nr_cpu: 8
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.52 +0.1% 10.53
75%: 12.95 +62.1% 20.99
100%: 25.63 +181.1% 72.05
125%: 94.05 -17.0% 78.06

> > 2. reducing the C2C overhead within 1 LLC
>
> This is due to how L3 became non-inclusive with Skylake? I can't see
> that because I don't have anything that recent :/
>
I checked with colleagues and it seems not to be related to the non-inclusive
L3, but rather to the number of CPUs: more CPUs make the distances across the
die longer, which adds to the latency.
> > So far I have not received any performance difference reports from LKP on desktop
> > test boxes. Let me queue the full test on some desktops to confirm
> > if this change has any impact on them.
>
> Right, so I've updated my netperf results above to have a relative
> difference between NO_SIS_CURRENT and SIS_CURRENT and I see some losses
> at the low end. For servers that gets compensated at the high end, but
> desktops tend to not get there much.
>
>
Since a large CPU count is one major motivation for waking up the task
locally, may I have your opinion on whether it is applicable to add llc_size
as a factor when deciding whether to wake up the wakee on the current CPU?
The smaller the llc_size is, the harder it would be for the wakee to be woken
up on the current CPU.
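
Not the posted patch, but a rough, self-contained illustration of what such an
llc_size scaling could mean in practice: the task_is_short() name, the linear
llc_size/64 scaling and the 100 us example duration below are all made up for
illustration; only the 0.5 ms default of sysctl_sched_migration_cost is taken
from the kernel.

/*
 * Hypothetical sketch only.  Scale the "short task" cutoff by the LLC
 * size, so that a small desktop LLC almost never keeps the wakee on the
 * current CPU while a large server LLC does so much more readily.
 */
#include <stdio.h>

typedef unsigned long long u64;

static const u64 sched_migration_cost_ns = 500000;   /* default 0.5 ms */

/* Would a task with this average duration count as "short" on this LLC? */
static int task_is_short(u64 dur_avg_ns, int llc_size)
{
        u64 thresh = sched_migration_cost_ns * llc_size / 64;

        return dur_avg_ns && dur_avg_ns < thresh;
}

int main(void)
{
        /* example LLC sizes: small client parts up to one SPR socket */
        const int llc_sizes[] = { 4, 8, 12, 20, 112 };
        const u64 dur_avg = 100000;     /* a task averaging 100 us per run */

        for (unsigned int i = 0; i < sizeof(llc_sizes) / sizeof(llc_sizes[0]); i++)
                printf("llc_size %3d: threshold %7llu ns -> short task? %s\n",
                       llc_sizes[i],
                       sched_migration_cost_ns * llc_sizes[i] / 64,
                       task_is_short(dur_avg, llc_sizes[i]) ? "yes" : "no");

        return 0;
}

With these made-up numbers, a 100 us task would only be treated as short on
LLCs of roughly 20 CPUs and up, which is the direction of the question above.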


thanks,
Chenyu

2023-05-12 08:42:36

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU



Hello,

kernel test robot noticed a 57.6% improvement of phoronix-test-suite.stress-ng.ContextSwitching.bogo_ops_s on:


commit: 485a13bb0c94668849d554f8f1f935a81f8dd2cd ("[PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU")
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Yu/sched-fair-Record-the-average-duration-of-a-task/20230428-232326
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git f31dcb152a3d0816e2f1deab4e64572336da197d
patch link: https://lore.kernel.org/all/4081178486e025c89dbb7cc0e62bbfab95fc794a.1682661027.git.yu.c.chen@intel.com/
patch subject: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

testcase: phoronix-test-suite
test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
parameters:

test: stress-ng-1.3.1
option_a: Context Switching
cpufreq_governor: performance

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/





Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/Context Switching/debian-x86_64-phoronix/lkp-csl-2sp7/stress-ng-1.3.1/phoronix-test-suite

commit:
26bcaaa5c6 ("sched/fair: Record the average duration of a task")
485a13bb0c ("sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU")

                  26bcaaa5c63c4ad3           485a13bb0c94668849d554f8f1f
                  ----------------           ---------------------------
                       %stddev      %change         %stddev
                           \            |               \
3465032 +57.6% 5461531 ? 14% phoronix-test-suite.stress-ng.ContextSwitching.bogo_ops_s
114.80 +327.0% 490.19 ? 10% phoronix-test-suite.time.elapsed_time
114.80 +327.0% 490.19 ? 10% phoronix-test-suite.time.elapsed_time.max
147934 ? 7% +14359.4% 21390420 ? 70% phoronix-test-suite.time.involuntary_context_switches
182173 +73.4% 315930 ? 5% phoronix-test-suite.time.minor_page_faults
4590 +26.9% 5826 ? 3% phoronix-test-suite.time.percent_of_cpu_this_job_got
4399 +448.9% 24148 ? 10% phoronix-test-suite.time.system_time
870.48 +408.4% 4425 ? 11% phoronix-test-suite.time.user_time
5.651e+08 +668.6% 4.343e+09 ? 19% phoronix-test-suite.time.voluntary_context_switches
3.528e+09 ? 3% +190.8% 1.026e+10 ? 9% cpuidle..time
3.046e+08 +260.4% 1.098e+09 ? 11% cpuidle..usage
166.16 +226.0% 541.68 ? 9% uptime.boot
7741 +89.3% 14658 ? 6% uptime.idle
80501 ? 54% +94.4% 156530 ? 24% numa-meminfo.node0.AnonHugePages
76.50 ? 8% +788.0% 679.33 ? 50% numa-meminfo.node0.Unevictable
59794 ? 36% +123.0% 133361 ? 15% numa-meminfo.node1.Active
7692 ? 12% +967.3% 82097 ? 6% numa-meminfo.node1.Active(anon)
281420 ? 23% +177.0% 779557 ? 21% numa-numastat.node0.local_node
292813 ? 23% +167.7% 783881 ? 21% numa-numastat.node0.numa_hit
361211 ? 17% +121.8% 801062 ? 22% numa-numastat.node1.local_node
369650 ? 17% +120.7% 815910 ? 21% numa-numastat.node1.numa_hit
127614 +86.2% 237590 ? 31% meminfo.Active
11364 ? 6% +664.0% 86826 ? 7% meminfo.Active(anon)
140254 ? 11% +67.9% 235428 ? 6% meminfo.AnonHugePages
721246 +11.2% 802088 ? 2% meminfo.Shmem
83.00 +969.1% 887.33 ? 18% meminfo.Unevictable
38.83 -33.0% 26.00 ? 4% vmstat.cpu.id
806.50 ? 4% -65.6% 277.83 ? 62% vmstat.io.bi
68.83 +28.3% 88.33 ? 3% vmstat.procs.r
8102056 +44.8% 11728871 ? 10% vmstat.system.cs
965452 -6.7% 900326 ? 5% vmstat.system.in
38.37 -12.3 26.09 ? 3% mpstat.cpu.all.idle%
0.03 ? 7% -0.0 0.01 ? 14% mpstat.cpu.all.iowait%
7.20 -1.5 5.73 ? 7% mpstat.cpu.all.irq%
0.22 ? 2% -0.1 0.16 ? 7% mpstat.cpu.all.soft%
45.11 +12.4 57.48 mpstat.cpu.all.sys%
9.07 +1.5 10.52 ? 3% mpstat.cpu.all.usr%
1.67 ? 82% +9040.0% 152.33 ? 56% numa-vmstat.node0.nr_mlock
18.67 ? 7% +807.1% 169.33 ? 51% numa-vmstat.node0.nr_unevictable
18.67 ? 7% +807.1% 169.33 ? 51% numa-vmstat.node0.nr_zone_unevictable
292934 ? 23% +167.7% 784191 ? 21% numa-vmstat.node0.numa_hit
281540 ? 23% +177.0% 779868 ? 21% numa-vmstat.node0.numa_local
1888 ? 12% +986.6% 20520 ? 6% numa-vmstat.node1.nr_active_anon
1888 ? 12% +986.6% 20520 ? 6% numa-vmstat.node1.nr_zone_active_anon
369806 ? 17% +120.7% 816096 ? 21% numa-vmstat.node1.numa_hit
361367 ? 17% +121.7% 801248 ? 22% numa-vmstat.node1.numa_local
2095 +10.6% 2317 turbostat.Avg_MHz
75.48 +8.5 83.97 turbostat.Busy%
1.658e+08 +224.5% 5.382e+08 ? 13% turbostat.C1
5.96 ? 4% -1.5 4.48 ? 11% turbostat.C1%
2506058 ? 56% +195.8% 7412626 ? 57% turbostat.C1E
20.85 ? 13% -35.0% 13.55 ? 13% turbostat.CPU%c1
1.14e+08 +288.9% 4.435e+08 ? 10% turbostat.IRQ
1.347e+08 +306.7% 5.477e+08 ? 10% turbostat.POLL
3.86 -0.4 3.44 ? 6% turbostat.POLL%
250.18 +6.5% 266.34 turbostat.PkgWatt
2872 ? 6% +655.6% 21702 ? 7% proc-vmstat.nr_active_anon
29062 +29.7% 37689 ? 46% proc-vmstat.nr_active_file
488299 +6.5% 519898 ? 5% proc-vmstat.nr_file_pages
22777 -2.2% 22281 proc-vmstat.nr_kernel_stack
21414 -5.0% 20337 proc-vmstat.nr_mapped
3.00 +6705.6% 204.17 ? 20% proc-vmstat.nr_mlock
3621 ? 2% +4.7% 3791 proc-vmstat.nr_page_table_pages
180334 +11.2% 200518 ? 2% proc-vmstat.nr_shmem
26676 +22.7% 32721 ? 35% proc-vmstat.nr_slab_reclaimable
55226 +3.4% 57117 ? 2% proc-vmstat.nr_slab_unreclaimable
20.00 +1005.8% 221.17 ? 18% proc-vmstat.nr_unevictable
2872 ? 6% +655.6% 21702 ? 7% proc-vmstat.nr_zone_active_anon
29062 +29.7% 37689 ? 46% proc-vmstat.nr_zone_active_file
20.00 +1005.8% 221.17 ? 18% proc-vmstat.nr_zone_unevictable
669039 +140.9% 1611469 ? 8% proc-vmstat.numa_hit
649181 +144.5% 1587192 ? 8% proc-vmstat.numa_local
784187 +125.5% 1768689 ? 8% proc-vmstat.pgalloc_normal
896335 +162.1% 2349602 ? 8% proc-vmstat.pgfault
686654 ? 2% +139.5% 1644338 ? 8% proc-vmstat.pgfree
110663 +207.5% 340238 ? 8% proc-vmstat.pgreuse
1068544 +263.0% 3878912 ? 9% proc-vmstat.unevictable_pgs_scanned
1.054e+10 +40.3% 1.479e+10 ? 9% perf-stat.i.branch-instructions
2.04e+08 +11.5% 2.274e+08 ? 4% perf-stat.i.branch-misses
4915549 ? 4% +11.9% 5499505 ? 8% perf-stat.i.cache-misses
1.07e+09 -11.9% 9.427e+08 ? 5% perf-stat.i.cache-references
8379617 +41.0% 11817356 ? 9% perf-stat.i.context-switches
4.01 ? 5% -11.7% 3.54 ? 7% perf-stat.i.cpi
2.062e+11 +8.7% 2.243e+11 perf-stat.i.cpu-cycles
3206852 -15.2% 2720851 ? 7% perf-stat.i.cpu-migrations
1.425e+10 +43.7% 2.047e+10 ? 10% perf-stat.i.dTLB-loads
0.07 ? 12% -0.0 0.05 ? 14% perf-stat.i.dTLB-store-miss-rate%
5278267 ? 3% -19.0% 4276247 ? 7% perf-stat.i.dTLB-store-misses
8.04e+09 +46.2% 1.176e+10 ? 10% perf-stat.i.dTLB-stores
29270062 +22.1% 35725942 ? 4% perf-stat.i.iTLB-load-misses
52729822 +27.5% 67248151 ? 6% perf-stat.i.iTLB-loads
5.149e+10 +41.5% 7.284e+10 ? 9% perf-stat.i.instructions
1.64 ? 6% -74.5% 0.42 ? 10% perf-stat.i.major-faults
2.15 +8.8% 2.34 perf-stat.i.metric.GHz
352.95 +41.6% 499.62 ? 9% perf-stat.i.metric.M/sec
5633 -24.3% 4266 perf-stat.i.minor-faults
90.31 +3.0 93.28 perf-stat.i.node-load-miss-rate%
2212505 ? 2% +12.6% 2490436 ? 3% perf-stat.i.node-load-misses
5635 -24.3% 4266 perf-stat.i.page-faults
20.77 -36.9% 13.11 ? 13% perf-stat.overall.MPKI
1.93 -0.4 1.54 ? 4% perf-stat.overall.branch-miss-rate%
0.46 ? 3% +0.1 0.58 ? 7% perf-stat.overall.cache-miss-rate%
4.00 -22.4% 3.11 ? 9% perf-stat.overall.cpi
0.38 ? 8% -0.1 0.24 ? 8% perf-stat.overall.dTLB-load-miss-rate%
0.07 ? 2% -0.0 0.04 ? 16% perf-stat.overall.dTLB-store-miss-rate%
35.70 -1.0 34.72 perf-stat.overall.iTLB-load-miss-rate%
1759 +15.6% 2034 ? 5% perf-stat.overall.instructions-per-iTLB-miss
0.25 +30.2% 0.33 ? 10% perf-stat.overall.ipc
93.90 +1.0 94.94 perf-stat.overall.node-load-miss-rate%
1.044e+10 +41.3% 1.476e+10 ? 9% perf-stat.ps.branch-instructions
2.02e+08 +12.3% 2.269e+08 ? 4% perf-stat.ps.branch-misses
4888066 ? 4% +12.3% 5487899 ? 8% perf-stat.ps.cache-misses
1.059e+09 -11.3% 9.402e+08 ? 5% perf-stat.ps.cache-references
8300663 +42.1% 11794849 ? 10% perf-stat.ps.context-switches
2.043e+11 +9.5% 2.237e+11 perf-stat.ps.cpu-cycles
3175301 -14.6% 2712811 ? 7% perf-stat.ps.cpu-migrations
1.411e+10 +44.8% 2.043e+10 ? 10% perf-stat.ps.dTLB-loads
5227681 ? 3% -18.4% 4263828 ? 7% perf-stat.ps.dTLB-store-misses
7.965e+09 +47.3% 1.174e+10 ? 10% perf-stat.ps.dTLB-stores
28995942 +22.9% 35648237 ? 4% perf-stat.ps.iTLB-load-misses
52224757 +28.5% 67109656 ? 6% perf-stat.ps.iTLB-loads
5.101e+10 +42.5% 7.27e+10 ? 10% perf-stat.ps.instructions
1.62 ? 6% -74.3% 0.42 ? 10% perf-stat.ps.major-faults
5627 -24.4% 4256 perf-stat.ps.minor-faults
2193451 ? 2% +13.2% 2483610 ? 3% perf-stat.ps.node-load-misses
5629 -24.4% 4257 perf-stat.ps.page-faults
5.883e+12 +506.6% 3.569e+13 ? 14% perf-stat.total.instructions
196075 ? 20% +675.2% 1519913 ? 22% sched_debug.cfs_rq:/.MIN_vruntime.avg
880934 ? 2% +959.5% 9333279 ? 22% sched_debug.cfs_rq:/.MIN_vruntime.max
361929 ? 7% +821.7% 3335971 ? 21% sched_debug.cfs_rq:/.MIN_vruntime.stddev
0.57 ? 7% +78.3% 1.01 ? 4% sched_debug.cfs_rq:/.h_nr_running.avg
0.60 ? 12% +30.6% 0.78 ? 2% sched_debug.cfs_rq:/.h_nr_running.stddev
7613 ? 5% +45.7% 11092 ? 2% sched_debug.cfs_rq:/.load.avg
32.92 ? 40% -46.2% 17.70 ? 6% sched_debug.cfs_rq:/.load_avg.avg
556.17 ? 21% -67.0% 183.28 ? 28% sched_debug.cfs_rq:/.load_avg.max
96.85 ? 32% -67.7% 31.28 ? 22% sched_debug.cfs_rq:/.load_avg.stddev
196075 ? 20% +675.2% 1519913 ? 22% sched_debug.cfs_rq:/.max_vruntime.avg
880934 ? 2% +959.5% 9333279 ? 22% sched_debug.cfs_rq:/.max_vruntime.max
361929 ? 7% +821.7% 3335971 ? 21% sched_debug.cfs_rq:/.max_vruntime.stddev
889655 ? 2% +1037.3% 10118475 ? 12% sched_debug.cfs_rq:/.min_vruntime.avg
903843 ? 2% +1051.9% 10411690 ? 12% sched_debug.cfs_rq:/.min_vruntime.max
876105 ? 2% +1017.0% 9785960 ? 13% sched_debug.cfs_rq:/.min_vruntime.min
4900 ? 5% +4241.8% 212778 ? 40% sched_debug.cfs_rq:/.min_vruntime.stddev
0.38 ? 5% +72.0% 0.66 ? 3% sched_debug.cfs_rq:/.nr_running.avg
564.00 +68.6% 951.12 ? 4% sched_debug.cfs_rq:/.runnable_avg.avg
1263 ? 10% +41.2% 1784 ? 5% sched_debug.cfs_rq:/.runnable_avg.max
203.26 ? 7% +52.2% 309.32 ? 5% sched_debug.cfs_rq:/.runnable_avg.stddev
9113 ? 67% +1960.8% 187805 ? 61% sched_debug.cfs_rq:/.spread0.avg
23301 ? 26% +1963.7% 480870 ? 42% sched_debug.cfs_rq:/.spread0.max
-4421 +3175.2% -144798 sched_debug.cfs_rq:/.spread0.min
4878 ? 5% +4258.8% 212641 ? 40% sched_debug.cfs_rq:/.spread0.stddev
378.07 ? 3% +59.1% 601.51 ? 5% sched_debug.cfs_rq:/.util_avg.avg
939.33 ? 3% +23.3% 1158 ? 5% sched_debug.cfs_rq:/.util_avg.max
146.76 ? 4% +42.2% 208.76 ? 6% sched_debug.cfs_rq:/.util_avg.stddev
13.05 ? 7% +1272.9% 179.21 ? 22% sched_debug.cfs_rq:/.util_est_enqueued.avg
475.17 ? 12% +102.1% 960.48 ? 5% sched_debug.cfs_rq:/.util_est_enqueued.max
56.28 ? 8% +334.6% 244.56 ? 3% sched_debug.cfs_rq:/.util_est_enqueued.stddev
447090 -71.0% 129667 ? 38% sched_debug.cpu.avg_idle.avg
690047 ? 12% -66.2% 233542 ? 44% sched_debug.cpu.avg_idle.max
105039 ? 10% -67.7% 33934 ? 36% sched_debug.cpu.avg_idle.stddev
80235 +243.3% 275481 ? 8% sched_debug.cpu.clock.avg
80242 +243.3% 275492 ? 8% sched_debug.cpu.clock.max
80227 +243.4% 275469 ? 8% sched_debug.cpu.clock.min
4.22 ? 12% +61.6% 6.81 ? 17% sched_debug.cpu.clock.stddev
78554 +238.5% 265921 ? 8% sched_debug.cpu.clock_task.avg
78911 +239.8% 268136 ? 8% sched_debug.cpu.clock_task.max
63258 +293.4% 248843 ? 8% sched_debug.cpu.clock_task.min
1588 ? 2% +51.4% 2404 ? 15% sched_debug.cpu.clock_task.stddev
1756 ? 4% +284.1% 6747 ? 12% sched_debug.cpu.curr->pid.avg
4095 +160.2% 10654 ? 6% sched_debug.cpu.curr->pid.max
1614 +170.2% 4363 ? 12% sched_debug.cpu.curr->pid.stddev
669155 ? 19% -22.9% 515937 sched_debug.cpu.max_idle_balance_cost.max
20831 ? 59% -90.5% 1983 ? 28% sched_debug.cpu.max_idle_balance_cost.stddev
0.00 ? 13% +87.8% 0.00 ? 12% sched_debug.cpu.next_balance.stddev
0.53 ? 5% +83.7% 0.98 ? 5% sched_debug.cpu.nr_running.avg
0.59 ? 8% +30.4% 0.77 ? 7% sched_debug.cpu.nr_running.stddev
2227313 ? 2% +1137.4% 27560829 ? 16% sched_debug.cpu.nr_switches.avg
2457119 ? 3% +1230.5% 32693099 ? 15% sched_debug.cpu.nr_switches.max
2000173 ? 3% +1032.1% 22644429 ? 20% sched_debug.cpu.nr_switches.min
204086 ? 31% +2005.5% 4297102 ? 37% sched_debug.cpu.nr_switches.stddev
80228 +243.4% 275469 ? 8% sched_debug.cpu_clk
79513 +245.5% 274754 ? 8% sched_debug.ktime
81061 +240.8% 276272 ? 8% sched_debug.sched_clk
58611259 +100.0% 1.172e+08 sched_debug.sysctl_sched.sysctl_sched_features
43.00 -20.3 22.74 ? 68% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
42.55 -20.0 22.50 ? 68% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
42.53 -20.0 22.50 ? 68% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
42.46 -20.0 22.46 ? 68% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
20.34 -9.7 10.63 ? 70% perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
17.07 -8.2 8.89 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
16.28 -7.8 8.47 ? 70% perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle.cpu_startup_entry
16.15 -7.8 8.40 ? 70% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue.do_idle
15.06 -7.2 7.85 ? 70% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue
14.45 -7.1 7.33 ? 70% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
14.67 -7.1 7.60 ? 70% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
12.54 -6.1 6.48 ? 70% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
12.50 -6.0 6.46 ? 70% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
7.14 -3.8 3.38 ? 70% perf-profile.calltrace.cycles-pp.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
7.19 -3.4 3.77 ? 70% perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_entity.dequeue_task_fair.__schedule.schedule
5.84 -2.9 2.99 ? 70% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
5.73 -2.7 3.05 ? 70% perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
5.60 -2.6 2.98 ? 70% perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
7.18 -2.5 4.66 ? 47% perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
4.80 -2.1 2.66 ? 70% perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
6.55 -2.0 4.52 ? 31% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_write
4.77 -2.0 2.76 ? 70% perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
6.54 -2.0 4.54 ? 29% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_read
5.92 -1.7 4.26 ? 36% perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function
3.32 -1.6 1.69 ? 70% perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
6.75 -1.5 5.22 ? 26% perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common
3.12 -1.5 1.64 ? 71% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single
3.20 -1.4 1.76 ? 70% perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single
6.84 -1.4 5.42 ? 22% perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
2.90 -1.4 1.52 ? 71% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__sysvec_call_function_single
4.70 -1.3 3.42 ? 29% perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
2.32 -1.1 1.17 ? 70% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
1.38 -1.1 0.26 ?100% perf-profile.calltrace.cycles-pp.migrate_task_rq_fair.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common
2.25 -1.0 1.22 ? 70% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
1.25 -1.0 0.25 ?100% perf-profile.calltrace.cycles-pp.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common
2.14 -1.0 1.16 ? 70% perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.flush_smp_call_function_queue.do_idle.cpu_startup_entry
2.10 -1.0 1.13 ? 70% perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.flush_smp_call_function_queue.do_idle
2.09 -0.9 1.15 ? 70% perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
1.99 -0.9 1.06 ? 70% perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry
1.82 -0.9 0.90 ? 70% perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
1.97 -0.9 1.06 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.flush_smp_call_function_queue
1.82 -0.9 0.96 ? 70% perf-profile.calltrace.cycles-pp.set_next_entity.pick_next_task_fair.__schedule.schedule_idle.do_idle
1.58 -0.8 0.79 ? 70% perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
1.62 -0.8 0.84 ? 70% perf-profile.calltrace.cycles-pp.update_load_avg.set_next_entity.pick_next_task_fair.__schedule.schedule_idle
1.46 ? 2% -0.7 0.75 ? 70% perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
1.04 -0.5 0.52 ? 70% perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.flush_smp_call_function_queue
0.87 ? 2% -0.4 0.46 ? 70% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.82 -0.4 0.43 ? 70% perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.intel_idle_irq.cpuidle_enter_state.cpuidle_enter
0.75 ? 2% -0.4 0.36 ? 70% perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
0.79 -0.4 0.41 ? 70% perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.intel_idle_irq.cpuidle_enter_state
0.74 ? 2% -0.4 0.39 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.intel_idle_irq
0.78 -0.4 0.42 ? 70% perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.69 ? 2% -0.3 0.36 ? 70% perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
0.69 -0.3 0.38 ? 70% perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.78 ? 2% -0.3 0.48 ? 70% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.65 ? 2% -0.3 0.37 ? 70% perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.72 ? 2% -0.3 0.45 ? 70% perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.poll_idle.cpuidle_enter_state.cpuidle_enter
0.70 ? 2% -0.3 0.42 ? 70% perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.poll_idle.cpuidle_enter_state
0.65 ? 2% -0.2 0.40 ? 70% perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.poll_idle
0.56 -0.0 0.53 ? 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_write.vfs_write.ksys_write
0.51 +0.4 0.87 ? 49% perf-profile.calltrace.cycles-pp.copy_page_from_iter.pipe_write.vfs_write.ksys_write.do_syscall_64
0.92 +0.6 1.51 ? 44% perf-profile.calltrace.cycles-pp.restore_fpregs_from_fpstate.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
0.00 +0.8 0.79 ? 47% perf-profile.calltrace.cycles-pp._copy_from_iter.copy_page_from_iter.pipe_write.vfs_write.ksys_write
1.46 +0.9 2.33 ? 30% perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.51 +1.1 2.57 ? 34% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
11.60 +2.0 13.62 ? 6% perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_write.vfs_write.ksys_write
11.63 +2.1 13.74 ? 7% perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_read.vfs_read.ksys_read
11.65 +2.2 13.83 ? 7% perf-profile.calltrace.cycles-pp.schedule.pipe_write.vfs_write.ksys_write.do_syscall_64
11.70 +2.3 13.97 ? 8% perf-profile.calltrace.cycles-pp.schedule.pipe_read.vfs_read.ksys_read.do_syscall_64
7.84 +2.6 10.46 ? 5% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read
8.12 +2.6 10.76 ? 5% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read.ksys_read
7.90 +2.7 10.58 ? 6% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read
7.78 +2.7 10.53 ? 7% perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write
6.05 +2.8 8.80 ? 35% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +2.8 2.75 ? 68% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
8.07 +2.8 10.84 ? 7% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write.ksys_write
8.31 +2.8 11.09 ? 6% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_read.vfs_read.ksys_read.do_syscall_64
7.85 +2.8 10.66 ? 8% perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write
8.26 +2.9 11.18 ? 9% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.vfs_write.ksys_write.do_syscall_64
0.00 +3.7 3.72 ? 43% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
23.31 +6.5 29.83 ? 12% perf-profile.calltrace.cycles-pp.pipe_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
22.83 +6.6 29.45 ? 13% perf-profile.calltrace.cycles-pp.pipe_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.81 +7.0 30.77 ? 14% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.32 +7.1 30.44 ? 14% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
24.04 +7.2 31.19 ? 14% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
23.57 +7.3 30.90 ? 15% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.11 ? 3% +7.9 9.01 ? 30% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common
0.88 ? 44% +7.9 8.81 ? 30% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.autoremove_wake_function
1.13 ? 3% +8.7 9.85 ? 36% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
53.92 +17.5 71.42 ? 17% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
54.24 +18.3 72.56 ? 18% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
43.00 -20.3 22.74 ? 68% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
43.00 -20.3 22.74 ? 68% perf-profile.children.cycles-pp.cpu_startup_entry
42.94 -20.2 22.70 ? 68% perf-profile.children.cycles-pp.do_idle
42.55 -20.0 22.50 ? 68% perf-profile.children.cycles-pp.start_secondary
21.77 -10.2 11.58 ? 69% perf-profile.children.cycles-pp.sched_ttwu_pending
20.67 -9.8 10.87 ? 69% perf-profile.children.cycles-pp.flush_smp_call_function_queue
14.82 -7.1 7.75 ? 68% perf-profile.children.cycles-pp.cpuidle_idle_call
12.67 -6.1 6.60 ? 68% perf-profile.children.cycles-pp.cpuidle_enter
12.65 -6.1 6.59 ? 68% perf-profile.children.cycles-pp.cpuidle_enter_state
16.44 -5.1 11.38 ? 34% perf-profile.children.cycles-pp.enqueue_entity
13.12 -4.0 9.16 ? 29% perf-profile.children.cycles-pp.dequeue_entity
7.29 -3.8 3.46 ? 70% perf-profile.children.cycles-pp.intel_idle_irq
15.52 -3.4 12.13 ? 22% perf-profile.children.cycles-pp.update_load_avg
5.80 -2.7 3.11 ? 68% perf-profile.children.cycles-pp.schedule_idle
5.23 -2.3 2.96 ? 68% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
4.93 -2.1 2.78 ? 69% perf-profile.children.cycles-pp.sysvec_call_function_single
4.79 -2.1 2.69 ? 69% perf-profile.children.cycles-pp.__sysvec_call_function_single
4.86 -2.0 2.85 ? 68% perf-profile.children.cycles-pp.poll_idle
4.85 -1.9 2.97 ? 51% perf-profile.children.cycles-pp.select_idle_cpu
5.97 -1.6 4.33 ? 35% perf-profile.children.cycles-pp.select_idle_sibling
3.35 -1.6 1.72 ? 69% perf-profile.children.cycles-pp.select_idle_core
6.76 -1.5 5.24 ? 25% perf-profile.children.cycles-pp.select_task_rq_fair
3.39 -1.5 1.88 ? 65% perf-profile.children.cycles-pp.available_idle_cpu
6.85 -1.4 5.42 ? 22% perf-profile.children.cycles-pp.select_task_rq
2.10 -0.8 1.30 ? 47% perf-profile.children.cycles-pp.ttwu_queue_wakelist
1.58 -0.8 0.79 ? 70% perf-profile.children.cycles-pp.set_task_cpu
1.38 -0.7 0.67 ? 70% perf-profile.children.cycles-pp.migrate_task_rq_fair
1.48 ? 2% -0.7 0.78 ? 66% perf-profile.children.cycles-pp.menu_select
1.36 -0.6 0.74 ? 70% perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
1.27 -0.6 0.67 ? 70% perf-profile.children.cycles-pp.__smp_call_single_queue
1.22 -0.6 0.64 ? 70% perf-profile.children.cycles-pp.llist_add_batch
1.59 -0.5 1.12 ? 29% perf-profile.children.cycles-pp.finish_task_switch
2.62 -0.5 2.17 ? 15% perf-profile.children.cycles-pp._raw_spin_lock
0.96 -0.4 0.51 ? 70% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.76 ? 2% -0.4 0.37 ? 70% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.53 ? 2% -0.3 0.24 ? 70% perf-profile.children.cycles-pp.remove_entity_load_avg
0.49 -0.3 0.22 ? 70% perf-profile.children.cycles-pp.tick_nohz_next_event
0.56 ? 2% -0.2 0.31 ? 70% perf-profile.children.cycles-pp.wake_affine
0.54 -0.2 0.30 ? 70% perf-profile.children.cycles-pp._find_next_bit
0.54 ? 2% -0.2 0.31 ? 70% perf-profile.children.cycles-pp.nohz_run_idle_balance
0.46 ? 2% -0.2 0.24 ? 70% perf-profile.children.cycles-pp.llist_reverse_order
0.41 -0.2 0.21 ? 70% perf-profile.children.cycles-pp.send_call_function_single_ipi
0.37 ? 2% -0.2 0.18 ? 70% perf-profile.children.cycles-pp.get_next_timer_interrupt
0.35 -0.2 0.19 ? 70% perf-profile.children.cycles-pp.pick_next_task_idle
0.30 -0.2 0.15 ? 70% perf-profile.children.cycles-pp.attach_entity_load_avg
0.29 -0.1 0.15 ? 70% perf-profile.children.cycles-pp.task_h_load
0.30 ? 2% -0.1 0.17 ? 70% perf-profile.children.cycles-pp.__update_idle_core
0.28 ? 4% -0.1 0.15 ? 59% perf-profile.children.cycles-pp.__irq_exit_rcu
0.23 ? 3% -0.1 0.12 ? 70% perf-profile.children.cycles-pp.call_cpuidle
0.21 ? 3% -0.1 0.10 ? 70% perf-profile.children.cycles-pp.__do_softirq
0.21 -0.1 0.12 ? 70% perf-profile.children.cycles-pp.ct_kernel_exit_state
0.20 ? 7% -0.1 0.11 ? 71% perf-profile.children.cycles-pp.hrtimer_next_event_without
0.20 ? 4% -0.1 0.10 ? 70% perf-profile.children.cycles-pp.hrtimer_get_next_event
0.72 ? 5% -0.1 0.63 ? 4% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.18 ? 4% -0.1 0.10 ? 70% perf-profile.children.cycles-pp.__bitmap_andnot
0.63 ? 5% -0.1 0.56 ? 4% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.14 ? 3% -0.1 0.07 ? 70% perf-profile.children.cycles-pp.put_prev_task_fair
0.13 ? 5% -0.1 0.06 ? 70% perf-profile.children.cycles-pp.rebalance_domains
0.16 ? 3% -0.1 0.09 ? 70% perf-profile.children.cycles-pp.tick_nohz_idle_exit
0.14 ? 4% -0.1 0.08 ? 70% perf-profile.children.cycles-pp.cpuidle_governor_latency_req
0.14 ? 4% -0.1 0.08 ? 70% perf-profile.children.cycles-pp.local_clock
0.13 ? 3% -0.1 0.08 ? 70% perf-profile.children.cycles-pp.native_irq_return_iret
0.12 ? 3% -0.1 0.07 ? 70% perf-profile.children.cycles-pp.resched_curr
0.11 ? 6% -0.1 0.06 ? 70% perf-profile.children.cycles-pp.newidle_balance
0.11 ? 4% -0.0 0.06 ? 70% perf-profile.children.cycles-pp.ct_idle_exit
0.10 ? 3% -0.0 0.06 ? 71% perf-profile.children.cycles-pp.__task_rq_lock
0.07 ? 6% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.load_balance
0.09 ? 4% -0.0 0.05 ? 71% perf-profile.children.cycles-pp.ct_kernel_enter
0.08 ? 5% -0.0 0.05 ? 70% perf-profile.children.cycles-pp.__mutex_lock
0.06 +0.1 0.13 ? 63% perf-profile.children.cycles-pp.__get_task_ioprio
0.07 +0.1 0.16 ? 62% perf-profile.children.cycles-pp.clear_buddies
0.11 ? 6% +0.1 0.20 ? 45% perf-profile.children.cycles-pp.put_prev_entity
0.10 ? 5% +0.1 0.19 ? 54% perf-profile.children.cycles-pp.__rdgsbase_inactive
0.00 +0.1 0.11 ? 77% perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.06 ? 7% +0.1 0.20 ? 67% perf-profile.children.cycles-pp.__cond_resched
0.11 ? 3% +0.1 0.25 ? 63% perf-profile.children.cycles-pp.__list_add_valid
0.18 ? 2% +0.2 0.32 ? 52% perf-profile.children.cycles-pp.__wrgsbase_inactive
0.07 ? 6% +0.2 0.24 ? 70% perf-profile.children.cycles-pp.__list_del_entry_valid
0.10 ? 5% +0.2 0.26 ? 68% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.10 ? 5% +0.2 0.27 ? 71% perf-profile.children.cycles-pp.__entry_text_start
0.08 ? 11% +0.2 0.26 ? 72% perf-profile.children.cycles-pp.__cgroup_account_cputime
0.11 ? 4% +0.2 0.29 ? 72% perf-profile.children.cycles-pp.__might_sleep
0.23 ? 2% +0.2 0.42 ? 50% perf-profile.children.cycles-pp.finish_wait
0.28 ? 3% +0.2 0.47 ? 50% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.13 ? 2% +0.2 0.34 ? 68% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.10 ? 7% +0.2 0.31 ? 71% perf-profile.children.cycles-pp.pick_next_entity
0.24 ? 3% +0.2 0.45 ? 55% perf-profile.children.cycles-pp.atime_needs_update
0.14 ? 3% +0.2 0.35 ? 76% perf-profile.children.cycles-pp.current_time
0.05 ? 8% +0.2 0.27 ?104% perf-profile.children.cycles-pp.set_next_buddy
0.15 ? 7% +0.2 0.38 ? 66% perf-profile.children.cycles-pp.cpuacct_charge
0.46 +0.2 0.68 ? 41% perf-profile.children.cycles-pp.native_sched_clock
0.38 +0.2 0.61 ? 47% perf-profile.children.cycles-pp._copy_to_iter
0.32 ? 2% +0.3 0.57 ? 53% perf-profile.children.cycles-pp.touch_atime
0.00 +0.3 0.26 ? 61% perf-profile.children.cycles-pp.wake_on_current
0.40 ? 2% +0.3 0.67 ? 49% perf-profile.children.cycles-pp.copy_page_to_iter
0.32 ? 2% +0.3 0.59 ? 59% perf-profile.children.cycles-pp.apparmor_file_permission
0.13 +0.3 0.42 ? 78% perf-profile.children.cycles-pp.__might_fault
0.39 +0.3 0.71 ? 51% perf-profile.children.cycles-pp.sched_clock_cpu
0.48 +0.3 0.81 ? 48% perf-profile.children.cycles-pp._copy_from_iter
0.36 ? 3% +0.3 0.70 ? 61% perf-profile.children.cycles-pp.security_file_permission
0.18 ? 5% +0.4 0.54 ? 75% perf-profile.children.cycles-pp.update_min_vruntime
0.52 +0.4 0.88 ? 49% perf-profile.children.cycles-pp.copy_page_from_iter
0.30 ? 2% +0.4 0.67 ? 62% perf-profile.children.cycles-pp.__might_resched
0.34 +0.5 0.86 ? 65% perf-profile.children.cycles-pp.os_xsave
0.50 ? 2% +0.5 1.03 ? 57% perf-profile.children.cycles-pp.mutex_lock
0.00 +0.6 0.57 ?107% perf-profile.children.cycles-pp.check_preempt_wakeup
0.92 +0.6 1.51 ? 44% perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
0.46 ? 2% +0.6 1.07 ? 63% perf-profile.children.cycles-pp.update_rq_clock
0.19 ? 3% +0.6 0.84 ? 87% perf-profile.children.cycles-pp.check_preempt_curr
0.17 ? 2% +0.7 0.84 ? 89% perf-profile.children.cycles-pp.__calc_delta
0.63 ? 2% +0.8 1.42 ? 66% perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
0.38 ? 3% +0.8 1.19 ? 75% perf-profile.children.cycles-pp.__update_load_avg_se
1.46 +0.9 2.34 ? 30% perf-profile.children.cycles-pp.switch_fpu_return
1.52 +1.1 2.58 ? 35% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
28.96 +1.6 30.56 perf-profile.children.cycles-pp.__schedule
1.26 +1.6 2.88 ? 63% perf-profile.children.cycles-pp.switch_mm_irqs_off
0.46 ? 3% +1.8 2.22 ? 87% perf-profile.children.cycles-pp.reweight_entity
1.01 ? 3% +2.1 3.15 ? 79% perf-profile.children.cycles-pp.update_curr
6.06 +2.8 8.82 ? 35% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
23.36 +4.5 27.87 ? 8% perf-profile.children.cycles-pp.schedule
15.63 +5.4 21.03 ? 6% perf-profile.children.cycles-pp.try_to_wake_up
16.19 +5.4 21.61 ? 6% perf-profile.children.cycles-pp.__wake_up_common
15.75 +5.5 21.26 ? 7% perf-profile.children.cycles-pp.autoremove_wake_function
16.58 +5.7 22.29 ? 8% perf-profile.children.cycles-pp.__wake_up_common_lock
23.33 +6.6 29.91 ? 12% perf-profile.children.cycles-pp.pipe_read
22.85 +6.7 29.51 ? 13% perf-profile.children.cycles-pp.pipe_write
23.82 +7.0 30.78 ? 14% perf-profile.children.cycles-pp.vfs_read
23.33 +7.1 30.45 ? 14% perf-profile.children.cycles-pp.vfs_write
24.04 +7.2 31.19 ? 14% perf-profile.children.cycles-pp.ksys_read
23.57 +7.3 30.90 ? 15% perf-profile.children.cycles-pp.ksys_write
53.93 +17.5 71.46 ? 17% perf-profile.children.cycles-pp.do_syscall_64
54.24 +18.3 72.56 ? 18% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
13.80 -4.6 9.18 ? 46% perf-profile.self.cycles-pp.update_load_avg
6.40 -3.4 3.00 ? 69% perf-profile.self.cycles-pp.intel_idle_irq
3.97 -1.7 2.29 ? 68% perf-profile.self.cycles-pp.poll_idle
3.32 -1.5 1.84 ? 65% perf-profile.self.cycles-pp.available_idle_cpu
3.10 -1.1 1.96 ? 37% perf-profile.self.cycles-pp.try_to_wake_up
1.36 -0.6 0.74 ? 70% perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
1.20 -0.6 0.63 ? 70% perf-profile.self.cycles-pp.llist_add_batch
1.02 -0.5 0.54 ? 70% perf-profile.self.cycles-pp.select_idle_core
0.82 -0.4 0.42 ? 70% perf-profile.self.cycles-pp.migrate_task_rq_fair
2.06 -0.3 1.78 ? 4% perf-profile.self.cycles-pp._raw_spin_lock
0.54 ? 2% -0.3 0.28 ? 70% perf-profile.self.cycles-pp.menu_select
0.51 ? 2% -0.2 0.27 ? 70% perf-profile.self.cycles-pp.__flush_smp_call_function_queue
0.50 -0.2 0.28 ? 70% perf-profile.self.cycles-pp._find_next_bit
0.44 ? 3% -0.2 0.23 ? 70% perf-profile.self.cycles-pp.llist_reverse_order
0.40 -0.2 0.21 ? 70% perf-profile.self.cycles-pp.send_call_function_single_ipi
0.40 -0.2 0.22 ? 70% perf-profile.self.cycles-pp.sched_ttwu_pending
0.36 -0.2 0.19 ? 70% perf-profile.self.cycles-pp.flush_smp_call_function_queue
0.40 -0.2 0.23 ? 70% perf-profile.self.cycles-pp.do_idle
0.30 -0.1 0.15 ? 70% perf-profile.self.cycles-pp.attach_entity_load_avg
0.29 -0.1 0.15 ? 70% perf-profile.self.cycles-pp.task_h_load
0.25 ? 2% -0.1 0.14 ? 71% perf-profile.self.cycles-pp.nohz_run_idle_balance
0.58 -0.1 0.47 ? 8% perf-profile.self.cycles-pp.finish_task_switch
0.22 ? 3% -0.1 0.12 ? 70% perf-profile.self.cycles-pp.call_cpuidle
0.21 -0.1 0.12 ? 70% perf-profile.self.cycles-pp.ct_kernel_exit_state
0.22 ? 2% -0.1 0.12 ? 70% perf-profile.self.cycles-pp.__update_idle_core
0.44 ? 2% -0.1 0.35 ? 12% perf-profile.self.cycles-pp.__wake_up_common
0.20 -0.1 0.11 ? 70% perf-profile.self.cycles-pp.cpuidle_idle_call
0.20 ? 6% -0.1 0.12 ? 71% perf-profile.self.cycles-pp.set_task_cpu
0.17 ? 4% -0.1 0.09 ? 70% perf-profile.self.cycles-pp.__bitmap_andnot
0.13 ? 3% -0.1 0.08 ? 70% perf-profile.self.cycles-pp.native_irq_return_iret
0.11 ? 4% -0.1 0.06 ? 70% perf-profile.self.cycles-pp.cpuidle_enter_state
0.12 ? 3% -0.1 0.07 ? 70% perf-profile.self.cycles-pp.resched_curr
0.13 ? 3% -0.1 0.08 ? 70% perf-profile.self.cycles-pp.wake_affine
0.10 ? 4% -0.0 0.06 ? 71% perf-profile.self.cycles-pp.newidle_balance
0.08 ? 6% -0.0 0.03 ? 70% perf-profile.self.cycles-pp.remove_entity_load_avg
0.12 ? 4% -0.0 0.08 ? 25% perf-profile.self.cycles-pp.copyin
0.07 ? 6% -0.0 0.04 ? 71% perf-profile.self.cycles-pp.__hrtimer_next_event_base
0.08 ? 6% -0.0 0.04 ? 71% perf-profile.self.cycles-pp.schedule_idle
0.09 ? 7% +0.1 0.14 ? 35% perf-profile.self.cycles-pp.put_prev_entity
0.06 ? 6% +0.1 0.13 ? 56% perf-profile.self.cycles-pp.activate_task
0.13 ? 3% +0.1 0.20 ? 41% perf-profile.self.cycles-pp.atime_needs_update
0.06 ? 6% +0.1 0.14 ? 61% perf-profile.self.cycles-pp.clear_buddies
0.00 +0.1 0.08 ? 49% perf-profile.self.cycles-pp.sched_clock_cpu
0.05 ? 8% +0.1 0.14 ? 68% perf-profile.self.cycles-pp.__wake_up_common_lock
0.10 ? 5% +0.1 0.18 ? 52% perf-profile.self.cycles-pp.__rdgsbase_inactive
0.09 +0.1 0.18 ? 59% perf-profile.self.cycles-pp.select_task_rq
0.00 +0.1 0.10 ? 68% perf-profile.self.cycles-pp._copy_to_iter
0.00 +0.1 0.11 ? 71% perf-profile.self.cycles-pp.__cond_resched
0.00 +0.1 0.11 ? 72% perf-profile.self.cycles-pp.security_file_permission
0.06 ? 6% +0.1 0.18 ? 75% perf-profile.self.cycles-pp.ksys_read
0.08 ? 4% +0.1 0.20 ? 63% perf-profile.self.cycles-pp._copy_from_iter
0.06 ? 7% +0.1 0.20 ? 73% perf-profile.self.cycles-pp.__cgroup_account_cputime
0.10 ? 4% +0.1 0.25 ? 62% perf-profile.self.cycles-pp.__list_add_valid
0.07 ? 5% +0.1 0.21 ? 70% perf-profile.self.cycles-pp.__list_del_entry_valid
0.06 ? 6% +0.1 0.20 ? 84% perf-profile.self.cycles-pp.check_preempt_curr
0.00 +0.1 0.15 ? 64% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.18 ? 2% +0.2 0.32 ? 52% perf-profile.self.cycles-pp.__wrgsbase_inactive
0.06 ? 6% +0.2 0.21 ? 77% perf-profile.self.cycles-pp.ksys_write
0.09 ? 5% +0.2 0.25 ? 70% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.09 ? 4% +0.2 0.25 ? 70% perf-profile.self.cycles-pp.__might_sleep
0.12 ? 4% +0.2 0.29 ? 61% perf-profile.self.cycles-pp.set_next_entity
0.10 ? 5% +0.2 0.27 ? 71% perf-profile.self.cycles-pp.__entry_text_start
0.25 ? 3% +0.2 0.43 ? 54% perf-profile.self.cycles-pp.vfs_read
0.10 ? 5% +0.2 0.28 ? 70% perf-profile.self.cycles-pp.pick_next_entity
0.12 ? 4% +0.2 0.30 ? 72% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.48 +0.2 0.67 ? 9% perf-profile.self.cycles-pp.select_idle_sibling
0.13 ? 3% +0.2 0.32 ? 68% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.23 ? 2% +0.2 0.43 ? 52% perf-profile.self.cycles-pp.mutex_lock
0.05 ? 7% +0.2 0.26 ?104% perf-profile.self.cycles-pp.set_next_buddy
0.09 ? 4% +0.2 0.31 ? 79% perf-profile.self.cycles-pp.do_syscall_64
0.15 ? 7% +0.2 0.37 ? 65% perf-profile.self.cycles-pp.cpuacct_charge
0.22 ? 2% +0.2 0.45 ? 55% perf-profile.self.cycles-pp.update_rq_clock
0.44 +0.2 0.67 ? 41% perf-profile.self.cycles-pp.native_sched_clock
0.27 +0.2 0.50 ? 52% perf-profile.self.cycles-pp.vfs_write
0.00 +0.3 0.26 ? 61% perf-profile.self.cycles-pp.wake_on_current
0.10 ? 3% +0.3 0.37 ? 80% perf-profile.self.cycles-pp.schedule
0.54 +0.3 0.82 ? 5% perf-profile.self.cycles-pp.switch_fpu_return
0.20 ? 3% +0.3 0.50 ? 71% perf-profile.self.cycles-pp.pick_next_task_fair
0.17 ? 6% +0.3 0.50 ? 76% perf-profile.self.cycles-pp.update_min_vruntime
0.24 ? 2% +0.3 0.58 ? 63% perf-profile.self.cycles-pp.dequeue_entity
0.30 ? 3% +0.4 0.66 ? 62% perf-profile.self.cycles-pp.__might_resched
0.23 +0.4 0.60 ? 70% perf-profile.self.cycles-pp.select_task_rq_fair
0.41 +0.4 0.83 ? 64% perf-profile.self.cycles-pp.enqueue_task_fair
0.36 ? 2% +0.4 0.78 ? 60% perf-profile.self.cycles-pp.pipe_write
0.00 +0.5 0.46 ?107% perf-profile.self.cycles-pp.check_preempt_wakeup
0.60 ? 2% +0.5 1.06 ? 56% perf-profile.self.cycles-pp.enqueue_entity
0.34 +0.5 0.86 ? 65% perf-profile.self.cycles-pp.os_xsave
0.92 +0.6 1.51 ? 44% perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
0.36 +0.6 0.97 ? 64% perf-profile.self.cycles-pp.reweight_entity
0.17 ? 3% +0.7 0.83 ? 89% perf-profile.self.cycles-pp.__calc_delta
0.61 ? 2% +0.8 1.37 ? 66% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.46 ? 6% +0.8 1.24 ? 78% perf-profile.self.cycles-pp.update_curr
0.37 ? 3% +0.8 1.16 ? 76% perf-profile.self.cycles-pp.__update_load_avg_se
0.30 +0.8 1.10 ? 56% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
1.24 +1.6 2.85 ? 63% perf-profile.self.cycles-pp.switch_mm_irqs_off
4.50 +1.6 6.14 ? 34% perf-profile.self.cycles-pp.syscall_exit_to_user_mode




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


