2023-07-10 20:07:25

by David Vernet

Subject: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

Changes
-------

This is v2 of the shared wakequeue (now called shared runqueue)
patchset. The following are changes from the RFC v1 patchset
(https://lore.kernel.org/lkml/[email protected]/).

v1 -> v2 changes:
- Change name from swqueue to shared_runq (Peter)

- Sharded per-LLC shared runqueues to avoid contention on
scheduler-heavy workloads (Peter)

- Pull tasks from the shared_runq in newidle_balance() rather than in
pick_next_task_fair() (Peter and Vincent)

- Rename a few functions to reflect their actual purpose. For example,
shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)

- Expose move_queued_task() from core.c rather than migrate_task_to()
(Peter)

- Properly check is_cpu_allowed() when pulling a task from a shared_runq
to ensure it can actually be migrated (Peter and Gautham)

- Dropped RFC tag

This patch set is based off of commit ebb83d84e49b ("sched/core: Avoid
multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") on the
sched/core branch of tip.git.

Overview
========

The scheduler must constantly strike a balance between work
conservation, and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large a work conserving
scheduler, there are certain instances where the scheduler will choose
to keep a task local to a CPU, when it would have been more optimal to
migrate it to an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound, and
exhibit generally poor cache locality. To try and address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win; resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared runqueue" (shared_runq) feature proposed in this patch set. The
idea behind shared_runq is simple: it enables the scheduler to be more
aggressively work conserving by placing a waking task into a sharded,
per-LLC FIFO queue which can then be pulled from by another core in the
LLC before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above, and to other benchmarks
outlined below in the Results section.

Design
======

The design of shared_runq is quite simple. A shared_runq is simply an
array of struct shared_runq_shard objects:

struct shared_runq_shard {
        struct list_head list;
        spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
        u32 num_shards;
        struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs. We
also create some number of shards per struct shared_runq, where runnable
tasks are inserted and pulled from.

When a task becomes runnable, it enqueues itself in the
shared_runq_shard of its current core. Enqueues only happen if the task
is not pinned to a specific CPU.
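
In rough sketch form, the enqueue side amounts to a gating check
followed by a tail append onto the calling CPU's shard. The following is
condensed from the enqueue helper in the patches below; stats are
omitted and the exact flag checks are paraphrased:

static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
                                     int enq_flags)
{
        /*
         * Skip tasks that are pinned to a single CPU, and (when only
         * enqueueing at wakeup time) tasks that aren't waking up, or
         * that were already migrated by select_task_rq_fair().
         */
        if (p->nr_cpus_allowed == 1 ||
            !(enq_flags & ENQUEUE_WAKEUP) || (enq_flags & ENQUEUE_MIGRATED))
                return;

        shared_runq_push_task(rq_shared_runq_shard(rq), p);
}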

A core will pull a task from one of the shards in its LLC's shared_runq
at the beginning of newidle_balance().
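
The pull side is an O(# shards) walk over the LLC's shards, starting
with the calling CPU's own shard. The sketch below is condensed from the
loop added in the sharding patch later in the series; the rq locking,
the migration of the task onto the calling runqueue, and the exact
return convention are elided:

static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
{
        struct shared_runq *shared_runq = rq_shared_runq(rq);
        u32 i, idx, start = shared_runq_shard_idx(shared_runq, cpu_of(rq));
        struct task_struct *p = NULL;

        /* Check this CPU's shard first, then its neighbors in the LLC. */
        for (i = 0; i < shared_runq->num_shards; i++) {
                idx = (start + i) % shared_runq->num_shards;
                p = shared_runq_pop_task(&shared_runq->shards[idx], cpu_of(rq));
                if (p)
                        break;
        }
        if (!p)
                return 0;

        /* ... migrate p to this rq (see the full patch for details) ... */
        return 1;
}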

Difference between shared_runq and SIS_NODE
===========================================

In [0] Peter proposed a patch that addresses Tejun's observations that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas shared_runq causes cores about to go idle
to look for enqueued tasks. That said, in their current form, the two
features operate at different scopes: SIS_NODE searches for idle cores
across LLCs, while shared_runq enqueues tasks within a single LLC.

The patch was since removed in [1], and we compared the results to
shared_runq (previously called "swqueue") in [2]. SIS_NODE did not
outperform shared_runq on any of the benchmarks, so we elect to not
compare against it again for this v2 patch set.

[0]: https://lore.kernel.org/all/[email protected]/
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/

Results
=======

Note that the motivation for the shared runqueue feature was originally
arrived at using experiments in the sched_ext framework that's currently
being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
is similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile and hackbench, when running shared
runqueue.

On the other hand, some workloads suffer from shared_runq; namely,
workloads that hammer the runqueue hard, such as netperf UDP_RR, or
./schbench -L -m 52 -p 512 -r 10 -t 1. This can be mitigated somewhat by
sharding the shared data structures within a CCX, but it doesn't seem to
eliminate all contention in every scenario. On the positive side,
sharding does not seem to materially harm the benchmarks run for this
patch series; in fact, it seems to improve some workloads such as kernel
compile.

Note that for the kernel compile workloads below, the compilation was
done by running make -j$(nproc) built-in.a on several different types of
hosts configured with make allyesconfig on commit a27648c74210 ("afs:
Fix setting of mtime when creating a file/dir/symlink") on Linus' tree
(boost and turbo were disabled on all of these hosts when the
experiments were performed). Additionally, NO_SHARED_RUNQ refers to
SHARED_RUNQ being completely disabled, SHARED_RUNQ_WAKEUPS refers to
sharded SHARED_RUNQ where tasks are enqueued in the shared runqueue at
wakeup time, and SHARED_RUNQ_ALL refers to sharded SHARED_RUNQ where
tasks are enqueued in the shared runqueue on every enqueue. Results are
not included for the unsharded shared runqueue, as the results here
exceed the unsharded results already outlined in [2] as linked above.

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a
                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  582.46s   | 3101% |
SHARED_RUNQ_WAKEUPS: |  581.22s   | 3117% |
SHARED_RUNQ_ALL:     |  578.41s   | 3141% |
                     o------------o-------o

Takeaway: SHARED_RUNQ_WAKEUPS performs roughly the same as
NO_SHARED_RUNQ, but SHARED_RUNQ_ALL results in a statistically
significant ~.7% improvement over NO_SHARED_RUNQ. This suggests that
enqueuing tasks in the shared runqueue on every enqueue improves work
conservation, and thanks to sharding, does not result in contention.

Note that I didn't collect data for kernel compile with SHARED_RUNQ_ALL
_without_ sharding. The reason for this is that we know that CPUs with
sufficiently large LLCs will contend, so if we've decided to accommodate
those CPUs with sharding, there's not much point in measuring the
results of not sharding on CPUs that we know won't contend.

Command: hackbench --loops 10000
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  2.1912s   | 3117% |
SHARED_RUNQ_WAKEUP: |  2.1080s   | 3155% |
SHARED_RUNQ_ALL:    |  1.9830s   | 3144% |
                    o------------o-------o

Takeaway: SHARED_RUNQ in both forms performs exceptionally well compared
to NO_SHARED_RUNQ here, with SHARED_RUNQ_ALL beating NO_SHARED_RUNQ by
almost 10%. This was a surprising result, as it seems advantageous to
err on the side of avoiding migration in hackbench, given that tasks are
short lived and send only 10k bytes worth of messages. The results of
the benchmark suggest, however, that minimizing runqueue delays is
preferable.

Command:
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
                    o_______________________o
                    |   mean (throughput)   |
                    o-----------------------o
NO_SHARED_RUNQ:     |        25064.12       |
SHARED_RUNQ_WAKEUP: |        24862.16       |
SHARED_RUNQ_ALL:    |        25287.73       |
                    o-----------------------o

Takeaway: No statistical significance, though it is worth noting that
there is no regression for shared runqueue on the 7950X, while there is
a small regression on the Skylake and Milan hosts for SHARED_RUNQ_WAKEUP
as described below.


=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  1535.46s  | 3417% |
SHARED_RUNQ_WAKEUP: |  1534.56s  | 3428% |
SHARED_RUNQ_ALL:    |  1531.95s  | 3429% |
                    o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~.23% improvement over
NO_SHARED_RUNQ. Not a huge improvement, but consistently measurable.
The cause of this gain is presumably the same as the 7950X: improved
work conservation, with sharding preventing excessive contention on the
shard lock.

Command: hackbench --loops 10000
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  5.5750s   | 3369% |
SHARED_RUNQ_WAKEUP: |  5.5764s   | 3495% |
SHARED_RUNQ_ALL:    |  5.4760s   | 3481% |
                    o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~1.6% improvement over
NO_SHARED_RUNQ. Also statistically significant, but smaller than the
almost 10% improvement observed on the 7950X.

Command:
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
                    o_______________________o
                    |   mean (throughput)   |
                    o-----------------------o
NO_SHARED_RUNQ:     |        11963.08       |
SHARED_RUNQ_WAKEUP: |        11943.60       |
SHARED_RUNQ_ALL:    |        11554.32       |
                    o-----------------------o

Takeaway: NO_SHARED_RUNQ performs the same as SHARED_RUNQ_WAKEUP, but
beats SHARED_RUNQ_ALL by ~3.4%. This result makes sense -- the workload
is very heavy on the runqueue, so enqueuing tasks in the shared runqueue
in __enqueue_entity() would intuitively result in increased contention
on the shard lock. The fact that we're at parity with
SHARED_RUNQ_WAKEUP suggests that sharding the shared runqueue has
significantly improved the contention that was observed in v1, but that
__enqueue_entity() puts it over the edge.

NOTE: Parity for SHARED_RUNQ_WAKEUP relies on choosing the correct shard
size. If we chose, for example, a shard size of 16, there would still be
a regression between NO_SHARED_RUNQ and SHARED_RUNQ_WAKEUP. As described
below, this suggests that we may want to add a debugfs tunable for the
shard size.


=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  1601.81s  | 6476% |
SHARED_RUNQ_WAKEUP: |  1602.55s  | 6472% |
SHARED_RUNQ_ALL:    |  1602.49s  | 6475% |
                    o------------o-------o

Takeaway: No statistically significant variance. It might be worth
experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  5.2672s   | 6463% |
SHARED_RUNQ_WAKEUP: |  5.1476s   | 6583% |
SHARED_RUNQ_ALL:    |  5.1003s   | 6598% |
                    o------------o-------o

Takeaway: SHARED_RUNQ_ALL again wins, by about 3% over NO_SHARED_RUNQ in
this case.

Command:
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
                    o_______________________o
                    |   mean (throughput)   |
                    o-----------------------o
NO_SHARED_RUNQ:     |        13819.08       |
SHARED_RUNQ_WAKEUP: |        13907.74       |
SHARED_RUNQ_ALL:    |        13569.69       |
                    o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ_ALL, though by a slightly lower margin of ~1.8%.


Finally, let's look at how sharding affects the following schbench
incantation suggested by Chris in [3]:

schbench -L -m 52 -p 512 -r 10 -t 1

[3]: https://lore.kernel.org/lkml/[email protected]/

The TL;DR is that sharding improves things a lot, but doesn't completely
fix the problem. Here are the results from running the schbench command
on the 18 core / 36 thread single CCX, single-socket Skylake:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 31510503 31510711 0.08 19.98 168932319.64 5.36 31700383 31843851 0.03 17.50 10273968.33 0.32
------------
&shard->lock 15731657 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock 15756516 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock 21766 [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock 772 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock 23458 [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock 16505108 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock 14981310 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock 835 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

These results are when we create only 3 shards (16 logical cores per
shard), so the contention may be a result of overly-coarse sharding. If
we run the schbench incantation with no sharding whatsoever, we see the
following significantly worse lock stats contention:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 117868635 118361486 0.09 393.01 1250954097.25 10.57 119345882 119780601 0.05 343.35 38313419.51 0.32
------------
&shard->lock 59169196 [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock 59084239 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock 108051 [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock 60028355 [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock 119882 [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock 58213249 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the first workload run above.
If we make the shards even smaller, the contention is comparatively much
lower:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 13839849 13877596 0.08 13.23 5389564.95 0.39 46910241 48069307 0.06 16.40 16534469.35 0.34
------------
&shard->lock 3559 [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock 6992418 [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock 6881619 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock 6640140 [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock 3523 [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock 7233933 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan as well, but we contend more on the rq lock than the
shard lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock: 9617614 9656091 0.10 79.64 69665812.00 7.21 18092700 67652829 0.11 82.38 344524858.87 5.09
-----------
&rq->__lock 6301611 [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock 2530807 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 109360 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 178218 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock 3245506 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 1294355 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock 2837804 [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock 1627866 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock: 7338558 7343244 0.10 35.97 7173949.14 0.98 30200858 32679623 0.08 35.59 16270584.52 0.50
------------
&shard->lock 2004142 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock 2611264 [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock 2727838 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock 2737232 [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock 1693341 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock 2912671 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still
contends the most, but it's significantly less than with it enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock: 791277 791690 0.12 110.54 4889787.63 6.18 1575996 62390275 0.13 112.66 316262440.56 5.07
-----------
&rq->__lock 263343 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 19394 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 4143 [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock 51094 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock 23756 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 379048 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 677 [<000000003b542e83>] __task_rq_lock+0x51/0xf0

Worth noting is that increasing the granularity of the shards (i.e.
making each shard cover fewer CPUs) generally improves very
runqueue-heavy workloads such as netperf UDP_RR and this schbench
command, but it doesn't necessarily make a big difference for every
workload, or for sufficiently small CCXs such as the 7950X. It may make
sense to eventually allow users to control this with a debugfs knob, but
for now we'll elect to choose a default that resulted in good
performance for the benchmarks run for this patch series.
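
For illustration only, such a knob could be wired up with the standard
debugfs helpers. The sketch below is purely hypothetical and not part of
this series; the shared_runq_shard_sz variable, file name, and directory
are assumptions made for the example, and a real version would also need
to re-shard existing shared_runqs when the value changes:

#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/types.h>

/* Hypothetical: the shard size as a variable rather than a #define. */
static u32 shared_runq_shard_sz = 6;

static int __init shared_runq_debugfs_init(void)
{
        /* A real version would likely live under the sched debugfs dir. */
        struct dentry *dir = debugfs_create_dir("shared_runq", NULL);

        debugfs_create_u32("shard_sz", 0644, dir, &shared_runq_shard_sz);
        return 0;
}
late_initcall(shared_runq_debugfs_init);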

Conclusion
==========

shared_runq in this form provides statistically significant wins for
several types of workloads, and various CPU topologies. The reason for
this is roughly the same for all workloads: shared_runq encourages work
conservation inside of a CCX by having a CPU do an O(# per-LLC shards)
iteration over the shared_runq shards in an LLC. We could instead do an
O(n) iteration over all of the runqueues in the current LLC when a core
is going idle, but that's quite costly (especially for larger LLCs). The
sharded shared_runq provides a performant middle ground between doing
that full O(n) walk, and having a single unsharded queue that every CPU
in the LLC contends on. With the shard size of 6 chosen in this series,
for example, the 36-CPU Skylake LLC above is covered by 6 shards, so an
idling CPU checks at most 6 short lists rather than 36 runqueues.

For the workloads above, kernel compile and hackbench were clear winners
for shared_runq (especially in __enqueue_entity()). The reason for the
improvement in kernel compile is of course that we have a heavily
CPU-bound workload where cache locality doesn't mean much; getting a CPU
is the #1 goal. As mentioned above, while I didn't expect to see an
improvement in hackbench, the results of the benchmark suggest that
minimizing runqueue delays is preferable to optimizing for L1/L2
locality.

Not all workloads benefit from shared_runq, however. Workloads that
hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52
-p 512 -r 10 -t 1, tend to run into contention on the shard locks,
especially when enqueuing tasks in __enqueue_entity(). This can be
mitigated significantly by sharding the shared data structures within a
CCX, but it doesn't eliminate all contention, as described above.

Worth noting as well is that Gautham Shenoy ran some interesting
experiments on a few more ideas in [4], such as walking the shared_runq
on the pop path until a task is found that can be migrated to the
calling CPU. I didn't run those experiments in this patch set, but it
might be worth doing so.

[4]: https://lore.kernel.org/lkml/[email protected]/

Finally, while shared_runq in this form encourages work conservation, it
of course does not guarantee it given that we don't implement any kind
of work stealing between shared_runqs. In the future, we could
potentially push CPU utilization even higher by enabling work stealing
between shared_runqs, likely between CCXs on the same NUMA node.

Originally-by: Roman Gushchin <[email protected]>
Signed-off-by: David Vernet <[email protected]>

David Vernet (7):
sched: Expose move_queued_task() from core.c
sched: Move is_cpu_allowed() into sched.h
sched: Check cpu_active() earlier in newidle_balance()
sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
sched: Implement shared runqueue in CFS
sched: Shard per-LLC shared runqueues
sched: Move shared_runq to __{enqueue,dequeue}_entity()

include/linux/sched.h | 2 +
kernel/sched/core.c | 37 +-----
kernel/sched/fair.c | 254 +++++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 37 ++++++
5 files changed, 292 insertions(+), 39 deletions(-)

--
2.40.1


2023-07-10 20:07:38

by David Vernet

Subject: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls

For certain workloads in CFS, CPU utilization is of the utmost
importance. For example, at Meta, our main web workload benefits from a
1 - 1.5% improvement in RPS, and a 1 - 2% improvement in p99 latency,
when CPU utilization is pushed as high as possible.

This is likely something that would be useful for any workload with long
slices, or for which avoiding migration is unlikely to result in
improved cache locality.

We will soon be enabling more aggressive load balancing via a new
feature called shared_runq, which places tasks into a FIFO queue on the
wakeup path, and then dequeues them in newidle_balance(). We don't want
to enable the feature by default, so this patch defines and declares a
new scheduler feature called SHARED_RUNQ which is disabled by default.
In addition, we add some calls to empty / skeleton functions in the
relevant fair codepaths where shared_runq will be utilized.

A set of future patches will implement these functions, and enable
shared_runq for both single and multi socket / CCX architectures.

Note as well that in future patches, the location of these calls may
change. For example, if we end up enqueueing tasks in a shared runqueue
any time they become runnable, we'd move the calls from
enqueue_task_fair() and pick_next_task_fair() to __enqueue_entity() and
__dequeue_entity() respectively.

Originally-by: Roman Gushchin <[email protected]>
Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 33 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e882b7bf5b4..f7967be7646c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,6 +140,18 @@ static int __init setup_sched_thermal_decay_shift(char *str)
__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

#ifdef CONFIG_SMP
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
+ int enq_flags)
+{}
+
+static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+ return 0;
+}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
+
/*
* For asym packing, by default the lower numbered CPU has higher priority.
*/
@@ -162,6 +174,13 @@ int __weak arch_asym_cpu_priority(int cpu)
* (default: ~5%)
*/
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
+#else
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
+ int enq_flags)
+{}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
#endif

#ifdef CONFIG_CFS_BANDWIDTH
@@ -6386,6 +6405,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
update_overutilized_status(rq);

+ if (sched_feat(SHARED_RUNQ))
+ shared_runq_enqueue_task(rq, p, flags);
+
enqueue_throttle:
assert_list_leaf_cfs_rq(rq);

@@ -6467,6 +6489,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+
+ if (sched_feat(SHARED_RUNQ))
+ shared_runq_dequeue_task(p);
}

#ifdef CONFIG_SMP
@@ -8173,6 +8198,9 @@ done: __maybe_unused;

update_misfit_status(p, rq);

+ if (sched_feat(SHARED_RUNQ))
+ shared_runq_dequeue_task(p);
+
return p;

idle:
@@ -11843,6 +11871,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
if (!cpu_active(this_cpu))
return 0;

+ if (sched_feat(SHARED_RUNQ) && shared_runq_pick_next_task(this_rq, rf))
+ return -1;
+
/*
* We must set idle_stamp _before_ calling idle_balance(), such that we
* measure the duration of idle_balance() as idle time.
@@ -12343,6 +12374,7 @@ static void attach_task_cfs_rq(struct task_struct *p)

static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
+ shared_runq_dequeue_task(p);
detach_task_cfs_rq(p);
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..cd5db1a24181 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,4 @@ SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+SCHED_FEAT(SHARED_RUNQ, false)
--
2.40.1


2023-07-10 20:07:50

by David Vernet

Subject: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues

The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that
tasks are put into on enqueue, and pulled from when a core in that LLC
would otherwise go idle. For CPUs with large LLCs, this can sometimes
cause significant contention, as illustrated in [0].

[0]: https://lore.kernel.org/all/[email protected]/

So as to try and mitigate this contention, we can instead shard the
per-LLC runqueue into multiple per-LLC shards.

While this doesn't outright prevent all contention, it does somewhat
mitigate it. For example, if we run the following schbench command which
does almost nothing other than pound the runqueue:

schbench -L -m 52 -p 512 -r 10 -t 1

we observe with lockstats that sharding significantly decreases
contention.

3 shards:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 31510503 31510711 0.08 19.98 168932319.64 5.36 31700383 31843851 0.03 17.50 10273968.33 0.32
------------
&shard->lock 15731657 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock 15756516 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock 21766 [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock 772 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock 23458 [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock 16505108 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock 14981310 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock 835 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

No sharding:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 117868635 118361486 0.09 393.01 1250954097.25 10.57 119345882 119780601 0.05 343.35 38313419.51 0.32
------------
&shard->lock 59169196 [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock 59084239 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock 108051 [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock 60028355 [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock 119882 [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock 58213249 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the host where this was
collected. This could be addressed in future patch sets by adding a
debugfs knob to control the sharding granularity. If we make the shards
even smaller (what's in this patch, i.e. a size of 6), the contention
goes away almost entirely:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 13839849 13877596 0.08 13.23 5389564.95 0.39 46910241 48069307 0.06 16.40 16534469.35 0.34
------------
&shard->lock 3559 [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock 6992418 [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock 6881619 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock 6640140 [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock 3523 [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock 7233933 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan, but we contend even more on the rq lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock: 9617614 9656091 0.10 79.64 69665812.00 7.21 18092700 67652829 0.11 82.38 344524858.87 5.09
-----------
&rq->__lock 6301611 [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock 2530807 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 109360 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 178218 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock 3245506 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 1294355 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock 2837804 [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock 1627866 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock: 7338558 7343244 0.10 35.97 7173949.14 0.98 30200858 32679623 0.08 35.59 16270584.52 0.50
------------
&shard->lock 2004142 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock 2611264 [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock 2727838 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock 2737232 [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock 1693341 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock 2912671 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still
contends the most, but it's significantly less than with it enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock: 791277 791690 0.12 110.54 4889787.63 6.18 1575996 62390275 0.13 112.66 316262440.56 5.07
-----------
&rq->__lock 263343 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 19394 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 4143 [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock 51094 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock 23756 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 379048 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 677 [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock 47962 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170

In general, the takeaway here is that sharding does help with
contention, but it's not a one-size-fits-all solution, and its benefit
is workload dependent. For now, let's include sharding to try and avoid
contention, and because it doesn't seem to regress CPUs that don't need
it, such as the AMD 7950X.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/fair.c | 139 +++++++++++++++++++++++++++++--------------
kernel/sched/sched.h | 3 +-
2 files changed, 96 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff2491387201..97985f28a627 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -143,21 +143,28 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
* struct shared_runq - Per-LLC queue structure for enqueuing and pulling
* waking tasks.
*
+ * struct shared_runq_shard - A structure containing a task list and a spinlock
+ * for a subset of cores in a struct shared_runq.
+ *
* WHAT
* ====
*
* This structure enables the scheduler to be more aggressively work
- * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
- * pulled from when another core in the LLC is going to go idle.
- *
- * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
- * Waking tasks are enqueued in a shared_runq at the end of
- * enqueue_task_fair(), and are opportunistically pulled from the shared_runq
- * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
- * to being pulled from the shared_runq, in which case they're simply dequeued
- * from the shared_runq. A waking task is only enqueued to a shared_runq when
- * it was _not_ manually migrated to the current runqueue by
- * select_task_rq_fair().
+ * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can
+ * then be pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores two pointers in its struct cfs_rq:
+ *
+ * 1. The per-LLC struct shared_runq which contains one or more shards of
+ * enqueued tasks.
+ *
+ * 2. The shard inside of the per-LLC struct shared_runq which contains the
+ * list of runnable tasks for that shard.
+ *
+ * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard at
+ * the end of enqueue_task_fair(), and are opportunistically pulled from the
+ * shared_runq in newidle_balance(). Pulling from shards is an O(# shards)
+ * operation.
*
* There is currently no task-stealing between shared_runqs in different LLCs,
* which means that shared_runq is not fully work conserving. This could be
@@ -167,11 +174,12 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
* HOW
* ===
*
- * An shared_runq is comprised of a list, and a spinlock for synchronization.
- * Given that the critical section for a shared_runq is typically a fast list
- * operation, and that the shared_runq is localized to a single LLC, the
- * spinlock will typically only be contended on workloads that do little else
- * other than hammer the runqueue.
+ * A struct shared_runq_shard is comprised of a list, and a spinlock for
+ * synchronization. Given that the critical section for a shared_runq is
+ * typically a fast list operation, and that the shared_runq_shard is localized
+ * to a subset of cores on a single LLC (plus other cores in the LLC that pull
+ * from the shard in newidle_balance()), the spinlock will typically only be
+ * contended on workloads that do little else other than hammer the runqueue.
*
* WHY
* ===
@@ -185,48 +193,64 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
* it, as well as to strike a balance between work conservation, and L3 cache
* locality.
*/
-struct shared_runq {
+struct shared_runq_shard {
struct list_head list;
spinlock_t lock;
} ____cacheline_aligned;

+struct shared_runq {
+ u32 num_shards;
+ struct shared_runq_shard shards[];
+} ____cacheline_aligned;
+
+/* This would likely work better as a configurable knob via debugfs */
+#define SHARED_RUNQ_SHARD_SZ 6
+
#ifdef CONFIG_SMP
static struct shared_runq *rq_shared_runq(struct rq *rq)
{
return rq->cfs.shared_runq;
}

-static struct task_struct *shared_runq_pop_task(struct rq *rq)
+static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq)
+{
+ return rq->cfs.shard;
+}
+
+static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
+{
+ return cpu % runq->num_shards;
+}
+
+static struct task_struct *
+shared_runq_pop_task(struct shared_runq_shard *shard, int target)
{
unsigned long flags;
struct task_struct *p;
- struct shared_runq *shared_runq;

- shared_runq = rq_shared_runq(rq);
- if (list_empty(&shared_runq->list))
+ if (list_empty(&shard->list))
return NULL;

- spin_lock_irqsave(&shared_runq->lock, flags);
- p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
+ spin_lock_irqsave(&shard->lock, flags);
+ p = list_first_entry_or_null(&shard->list, struct task_struct,
shared_runq_node);
- if (p && is_cpu_allowed(p, cpu_of(rq)))
+ if (p && is_cpu_allowed(p, target))
list_del_init(&p->shared_runq_node);
else
p = NULL;
- spin_unlock_irqrestore(&shared_runq->lock, flags);
+ spin_unlock_irqrestore(&shard->lock, flags);

return p;
}

-static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
+static void shared_runq_push_task(struct shared_runq_shard *shard,
+ struct task_struct *p)
{
unsigned long flags;
- struct shared_runq *shared_runq;

- shared_runq = rq_shared_runq(rq);
- spin_lock_irqsave(&shared_runq->lock, flags);
- list_add_tail(&p->shared_runq_node, &shared_runq->list);
- spin_unlock_irqrestore(&shared_runq->lock, flags);
+ spin_lock_irqsave(&shard->lock, flags);
+ list_add_tail(&p->shared_runq_node, &shard->list);
+ spin_unlock_irqrestore(&shard->lock, flags);
}

static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
@@ -247,7 +271,7 @@ static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
return;

- shared_runq_push_task(rq, p);
+ shared_runq_push_task(rq_shared_runq_shard(rq), p);
}

static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
@@ -256,8 +280,21 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
struct rq *src_rq;
struct rq_flags src_rf;
int ret;
+ struct shared_runq *shared_runq;
+ struct shared_runq_shard *shard;
+ u32 i, starting_idx, curr_idx, num_shards;

- p = shared_runq_pop_task(rq);
+ shared_runq = rq_shared_runq(rq);
+ starting_idx = shared_runq_shard_idx(shared_runq, cpu_of(rq));
+ num_shards = shared_runq->num_shards;
+ for (i = 0; i < num_shards; i++) {
+ curr_idx = (starting_idx + i) % num_shards;
+ shard = &shared_runq->shards[curr_idx];
+
+ p = shared_runq_pop_task(shard, cpu_of(rq));
+ if (p)
+ break;
+ }
if (!p)
return 0;

@@ -287,13 +324,13 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
static void shared_runq_dequeue_task(struct task_struct *p)
{
unsigned long flags;
- struct shared_runq *shared_runq;
+ struct shared_runq_shard *shard;

if (!list_empty(&p->shared_runq_node)) {
- shared_runq = rq_shared_runq(task_rq(p));
- spin_lock_irqsave(&shared_runq->lock, flags);
+ shard = rq_shared_runq_shard(task_rq(p));
+ spin_lock_irqsave(&shard->lock, flags);
list_del_init(&p->shared_runq_node);
- spin_unlock_irqrestore(&shared_runq->lock, flags);
+ spin_unlock_irqrestore(&shard->lock, flags);
}
}

@@ -13003,19 +13040,31 @@ __init void init_sched_fair_class(void)
__init void init_sched_fair_class_late(void)
{
#ifdef CONFIG_SMP
- int i;
+ int i, j;
struct shared_runq *shared_runq;
+ struct shared_runq_shard *shard;
struct rq *rq;
struct rq *llc_rq;
+ size_t shared_runq_size;
+ u32 num_shards, shard_idx;

for_each_possible_cpu(i) {
if (per_cpu(sd_llc_id, i) == i) {
llc_rq = cpu_rq(i);
-
- shared_runq = kzalloc_node(sizeof(struct shared_runq),
- GFP_KERNEL, cpu_to_node(i));
- INIT_LIST_HEAD(&shared_runq->list);
- spin_lock_init(&shared_runq->lock);
+ num_shards = max(per_cpu(sd_llc_size, i) /
+ SHARED_RUNQ_SHARD_SZ, 1);
+ shared_runq_size = sizeof(struct shared_runq) +
+ num_shards * sizeof(struct shared_runq_shard);
+
+ shared_runq = kzalloc_node(shared_runq_size,
+ GFP_KERNEL, cpu_to_node(i));
+ shared_runq->num_shards = num_shards;
+ for (j = 0; j < num_shards; j++) {
+ shard = &shared_runq->shards[j];
+
+ INIT_LIST_HEAD(&shard->list);
+ spin_lock_init(&shard->lock);
+ }
llc_rq->cfs.shared_runq = shared_runq;
}
}
@@ -13024,9 +13073,9 @@ __init void init_sched_fair_class_late(void)
rq = cpu_rq(i);
llc_rq = cpu_rq(per_cpu(sd_llc_id, i));

- if (rq == llc_rq)
- continue;
rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
+ shard_idx = shared_runq_shard_idx(rq->cfs.shared_runq, i);
+ rq->cfs.shard = &rq->cfs.shared_runq->shards[shard_idx];
}
#endif /* SMP */
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8b573dfaba33..ca56a8120088 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -576,7 +576,8 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
- struct shared_runq *shared_runq;
+ struct shared_runq *shared_runq;
+ struct shared_runq_shard *shard;
/*
* CFS load tracking
*/
--
2.40.1


2023-07-10 20:08:26

by David Vernet

Subject: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

Overview
========

The scheduler must constantly strike a balance between work
conservation, and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large a work conserving
scheduler, there are certain instances where the scheduler will choose
to keep a task local to a CPU, when it would have been more optimal to
migrate it to an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound, and
exhibit generally poor cache locality. To try and address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win; resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared runqueue" (shared_runq) feature proposed in this patch set. The
idea behind shared_runq is simple: it enables the scheduler to be more
aggressively work conserving by placing a waking task into a sharded,
per-LLC FIFO queue which can then be pulled from by another core in the
LLC before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above, and to other benchmarks
outlined below in the Results section.

Design
======

Note that the design described here reflects sharding, which will be
added in a subsequent patch. The design is described that way in this
commit summary as the benchmarks described in the results section below
all include sharded shared_runq. The patches are not combined into one
to ease the burden of review.

The design of shared_runq is quite simple. A shared_runq is simply an
array of struct shared_runq_shard objects, each of which is simply a
struct list_head of tasks and a spinlock:

struct shared_runq_shard {
        struct list_head list;
        spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
        u32 num_shards;
        struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs, and we
create a number of struct shared_runq_shard objects that are housed
there.

When a task first wakes up, it enqueues itself in the shared_runq_shard
of its current LLC at the end of enqueue_task_fair(). Enqueues only
happen if the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.

A core will pull a task from the shards in its LLC's shared_runq at the
beginning of newidle_balance().

Difference between shared_runq and SIS_NODE
===========================================

In [0] Peter proposed a patch that addresses Tejun's observations that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas shared_runq causes cores about to go idle
to look for enqueued tasks. That said, in their current form, the two
features operate at different scopes: SIS_NODE searches for idle cores
across LLCs, while shared_runq enqueues tasks within a single LLC.

The patch was since removed in [1], and we compared the results to
shared_runq (previously called "swqueue") in [2]. SIS_NODE did not
outperform shared_runq on any of the benchmarks, so we elect to not
compare against it again for this v2 patch set.

[0]: https://lore.kernel.org/all/[email protected]/
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/

Results
=======

Note that the motivation for the shared runqueue feature was originally
arrived at using experiments in the sched_ext framework that's currently
being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
is similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile and hackbench, when running shared
runqueue.

On the other hand, some workloads suffer from shared_runq; namely,
workloads that hammer the runqueue hard, such as netperf UDP_RR, or
schbench -L -m 52 -p 512 -r 10 -t 1. This can be mitigated somewhat by
sharding the shared data structures within a CCX, but it doesn't seem to
eliminate all contention in every scenario. On the positive side,
sharding does not seem to materially harm the benchmarks run for this
patch series; in fact, it seems to improve some workloads such as kernel
compile.

Note that for the kernel compile workloads below, the compilation was
done by running make -j$(nproc) built-in.a on several different types of
hosts configured with make allyesconfig on commit a27648c74210 ("afs:
Fix setting of mtime when creating a file/dir/symlink") on Linus' tree
(boost and turbo were disabled on all of these hosts when the
experiments were performed). Additionally, NO_SHARED_RUNQ refers to
SHARED_RUNQ being completely disabled, SHARED_RUNQ_WAKEUPS refers to
sharded SHARED_RUNQ where tasks are enqueued in the shared runqueue at
wakeup time, and SHARED_RUNQ_ALL refers to sharded SHARED_RUNQ where
tasks are enqueued in the shared runqueue on every enqueue. Results are
not included for the unsharded shared runqueue, as the results here
exceed the unsharded results already outlined in [2] as linked above.

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a
                     o____________o_______o
                     |    mean    |  CPU  |
                     o------------o-------o
NO_SHARED_RUNQ:      |  582.46s   | 3101% |
SHARED_RUNQ_WAKEUPS: |  581.22s   | 3117% |
SHARED_RUNQ_ALL:     |  578.41s   | 3141% |
                     o------------o-------o

Takeaway: SHARED_RUNQ_WAKEUPS performs roughly the same as
NO_SHARED_RUNQ, but SHARED_RUNQ_ALL results in a statistically
significant ~.7% improvement over NO_SHARED_RUNQ. This suggests that
enqueuing tasks in the shared runqueue on every enqueue improves work
conservation, and thanks to sharding, does not result in contention.

Note that I didn't collect data for kernel compile with SHARED_RUNQ_ALL
_without_ sharding. The reason for this is that we know that CPUs with
sufficiently large LLCs will contend, so if we've decided to accommodate
those CPUs with sharding, there's not much point in measuring the
results of not sharding on CPUs that we know won't contend.

Command: hackbench --loops 10000
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  2.1912s   | 3117% |
SHARED_RUNQ_WAKEUP: |  2.1080s   | 3155% |
SHARED_RUNQ_ALL:    |  1.9830s   | 3144% |
                    o------------o-------o

Takeaway: SHARED_RUNQ in both forms performs exceptionally well compared
to NO_SHARED_RUNQ here, with SHARED_RUNQ_ALL beating NO_SHARED_RUNQ by
almost 10%. This was a surprising result, as it seems advantageous to
err on the side of avoiding migration in hackbench, given that tasks are
short lived and send only 10k bytes worth of messages. The results of
the benchmark suggest, however, that minimizing runqueue delays is
preferable.

Command:
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
                    o_______________________o
                    |   mean (throughput)   |
                    o-----------------------o
NO_SHARED_RUNQ:     |        25064.12       |
SHARED_RUNQ_WAKEUP: |        24862.16       |
SHARED_RUNQ_ALL:    |        25287.73       |
                    o-----------------------o

Takeaway: No statistical significance, though it is worth noting that
there is no regression for shared runqueue on the 7950X, while there is
a small regression on the Skylake and Milan hosts for SHARED_RUNQ_WAKEUP
as described below.

=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  1535.46s  | 3417% |
SHARED_RUNQ_WAKEUP: |  1534.56s  | 3428% |
SHARED_RUNQ_ALL:    |  1531.95s  | 3429% |
                    o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~.23% improvement over
NO_SHARED_RUNQ. Not a huge improvement, but consistently measurable.
The cause of this gain is presumably the same as the 7950X: improved
work conservation, with sharding preventing excessive contention on the
shard lock.

Command: hackbench --loops 10000
                    o____________o_______o
                    |    mean    |  CPU  |
                    o------------o-------o
NO_SHARED_RUNQ:     |  5.5750s   | 3369% |
SHARED_RUNQ_WAKEUP: |  5.5764s   | 3495% |
SHARED_RUNQ_ALL:    |  5.4760s   | 3481% |
                    o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~1.6% improvement over
NO_SHARED_RUNQ. Also statistically significant, but smaller than the
almost 10% improvement observed on the 7950X.

Command:
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
o______________________o
| mean (throughput) |
o----------------------o
NO_SHARED_RUNQ: | 11963.08 |
SHARED_RUNQ_WAKEUP: | 11943.60 |
SHARED_RUNQ_ALL: | 11554.32 |
o----------------------o

Takeaway: NO_SHARED_RUNQ performs the same as SHARED_RUNQ_WAKEUP, but
beats SHARED_RUNQ_ALL by ~3.4%. This result makes sense -- the workload
is very heavy on the runqueue, so enqueuing tasks in the shared runqueue
in __enqueue_entity() would intuitively result in increased contention
on the shard lock. The fact that we're at parity with
SHARED_RUNQ_WAKEUP suggests that sharding the shared runqueue has
significantly reduced the contention that was observed in v1, but that
__enqueue_entity() puts it over the edge.

NOTE: Parity for SHARED_RUNQ_WAKEUP relies on choosing the correct shard
size. If we chose, for example, a shard size of 16, there would still be
a regression between NO_SHARED_RUNQ and SHARED_RUNQ_WAKEUP. As described
below, this suggests that we may want to add a debugfs tunable for the
shard size.
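
For context on how the shard size maps to a shard count: the number of
shards is derived from the LLC size and a fixed shard-size constant. A
minimal sketch of that relationship, mirroring the formula and the
SHARED_RUNQ_SHARD_SZ default quoted from patch 6/7 later in this thread
(the helper name here is just for illustration):

/*
 * Sketch: shards per LLC. With the series' default SHARED_RUNQ_SHARD_SZ
 * of 6, a 16-CPU Zen3/Zen4 CCX gets 2 shards and an 8-CPU Zen2 CCX gets 1.
 */
#define SHARED_RUNQ_SHARD_SZ	6

static inline u32 shared_runq_nr_shards(u32 llc_size)
{
	return max(llc_size / SHARED_RUNQ_SHARD_SZ, 1U);
}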

=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 1601.81s | 6476% |
SHARED_RUNQ_WAKEUP: | 1602.55s | 6472% |
SHARED_RUNQ_ALL: | 1602.49s | 6475% |
o------------o-------o

Takeaway: No statistically significant variance. It might be worth
experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 5.2672s | 6463% |
SHARED_RUNQ_WAKEUP: | 5.1476s | 6583% |
SHARED_RUNQ_ALL: | 5.1003s | 6598% |
o------------o-------o

Takeaway: SHARED_RUNQ_ALL again wins, by about 3% over NO_SHARED_RUNQ in
this case.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
o_______________________o
| mean (throughput) |
o-----------------------o
NO_SHARED_RUNQ: | 13819.08 |
SHARED_RUNQ_WAKEUP: | 13907.74 |
SHARED_RUNQ_ALL: | 13569.69 |
o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ_ALL, though by a slightly lower margin of ~1.8%.

Finally, let's look at how sharding affects the following schbench
incantation suggested by Chris in [3]:

schbench -L -m 52 -p 512 -r 10 -t 1

[3]: https://lore.kernel.org/lkml/[email protected]/

The TL;DR is that sharding improves things a lot, but doesn't completely
fix the problem. Here are the results from running the schbench command
on the 18 core / 36 thread single CCX, single-socket Skylake:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 31510503 31510711 0.08 19.98 168932319.64 5.36 31700383 31843851 0.03 17.50 10273968.33 0.32
------------
&shard->lock 15731657 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock 15756516 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock 21766 [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock 772 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock 23458 [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock 16505108 [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock 14981310 [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock 835 [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

These results are when we create only 3 shards (16 logical cores per
shard), so the contention may be a result of overly-coarse sharding. If
we run the schbench incantation with no sharding whatsoever, we see the
following significantly worse lock stats contention:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 117868635 118361486 0.09 393.01 1250954097.25 10.57 119345882 119780601 0.05 343.35 38313419.51 0.32
------------
&shard->lock 59169196 [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock 59084239 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock 108051 [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock 60028355 [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock 119882 [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock 58213249 [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the first workload run above.
If we make the shards even smaller, the contention is considerably
lower:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock: 13839849 13877596 0.08 13.23 5389564.95 0.39 46910241 48069307 0.06 16.40 16534469.35 0.34
------------
&shard->lock 3559 [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock 6992418 [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock 6881619 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock 6640140 [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock 3523 [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock 7233933 [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan as well, but we contend more on the rq lock than the
shard lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock: 9617614 9656091 0.10 79.64 69665812.00 7.21 18092700 67652829 0.11 82.38 344524858.87 5.09
-----------
&rq->__lock 6301611 [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock 2530807 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 109360 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 178218 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock 3245506 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 1294355 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock 2837804 [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock 1627866 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock: 7338558 7343244 0.10 35.97 7173949.14 0.98 30200858 32679623 0.08 35.59 16270584.52 0.50
------------
&shard->lock 2004142 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock 2611264 [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock 2727838 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock 2737232 [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock 1693341 [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock 2912671 [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still
contends the most, but it's significantly less than with it enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock: 791277 791690 0.12 110.54 4889787.63 6.18 1575996 62390275 0.13 112.66 316262440.56 5.07
-----------
&rq->__lock 263343 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 19394 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 4143 [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock 51094 [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock 23756 [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock 379048 [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock 677 [<000000003b542e83>] __task_rq_lock+0x51/0xf0

Worth noting is that making the shards finer grained (i.e. fewer CPUs
per shard) generally improves very runqueue-heavy workloads such as
netperf UDP_RR and this schbench command, but it doesn't necessarily
make a big difference for every workload, or for sufficiently small CCXs
such as the 7950X. It may make sense to eventually allow users to
control this with a debugfs knob (a rough sketch of what that could look
like follows), but for now we've chosen a default that resulted in good
performance across the benchmarks run for this patch series.
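
Purely as an illustration of what such a knob could look like (the knob
name, its placement under the "sched" debugfs directory, and the fact
that re-sharding on write is left out are all assumptions, not part of
this series):

/*
 * Hypothetical sketch only: expose the shard size as a debugfs tunable.
 * Changing it at runtime would also require re-creating (and re-hashing)
 * the shards, which is not shown here.
 */
#include <linux/debugfs.h>
#include <linux/init.h>

static u32 sysctl_sched_shared_runq_shard_sz = SHARED_RUNQ_SHARD_SZ;

static int __init shared_runq_debugfs_init(void)
{
	struct dentry *sched_dir = debugfs_lookup("sched", NULL);

	debugfs_create_u32("shared_runq_shard_sz", 0644, sched_dir,
			   &sysctl_sched_shared_runq_shard_sz);
	return 0;
}
late_initcall(shared_runq_debugfs_init);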

Conclusion
==========

shared_runq in this form provides statistically significant wins for
several types of workloads and various CPU topologies. The reason is
roughly the same in each case: shared_runq encourages work conservation
inside of a CCX by having a CPU that is about to go idle do an
O(# per-LLC shards) iteration over the shared_runq shards in its LLC. We
could instead do an O(n) iteration over all of the runqueues in the
current LLC when a core is going idle, but that's quite costly
(especially for larger LLCs). The sharded shared_runq seems to provide a
performant middle ground between that O(n) walk and pulling from a
single, heavily contended per-LLC queue, as sketched below.
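
For reference, a minimal sketch of that O(# shards) walk. The structure
and helper names follow what's quoted elsewhere in this thread, except
shared_runq_shard_pop_task(), which is a stand-in for the per-shard pop
helper:

/*
 * Sketch: when a CPU is about to go idle, walk its LLC's shards starting
 * from its own shard and try to pop a runnable task from each.
 */
static struct task_struct *shared_runq_pick_from_shards(struct rq *rq)
{
	struct shared_runq *shared_runq = rq_shared_runq(rq);
	int start = shared_runq_shard_idx(shared_runq, cpu_of(rq));
	struct task_struct *p;
	u32 i;

	for (i = 0; i < shared_runq->num_shards; i++) {
		u32 idx = (start + i) % shared_runq->num_shards;

		p = shared_runq_shard_pop_task(&shared_runq->shards[idx],
					       cpu_of(rq));
		if (p)
			return p;
	}

	return NULL;
}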

For the workloads above, kernel compile and hackbench were clear winners
for shared_runq (especially in __enqueue_entity()). The reason for the
improvement in kernel compile is of course that we have a heavily
CPU-bound workload where cache locality doesn't mean much; getting a CPU
is the #1 goal. As mentioned above, while I didn't expect to see an
improvement in hackbench, the results of the benchmark suggest that
minimizing runqueue delays is preferable to optimizing for L1/L2
locality.

Not all workloads benefit from shared_runq, however. Workloads that
hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52
-p 512 -r 10 -t 1, tend to run into contention on the shard locks;
especially when enqueuing tasks in __enqueue_entity(). This can be
mitigated significantly by sharding the shared data structures within a
CCX, but it doesn't eliminate all contention, as described above.

Worth noting as well is that Gautham Shenoy ran some interesting
experiments on a few more ideas in [4], such as walking the shared_runq
on the pop path until a task is found that can be migrated to the
calling CPU. I didn't run those experiments in this patch set, but it
might be worth doing so.

[4]: https://lore.kernel.org/lkml/[email protected]/
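
For reference, a rough sketch of that pop-path idea from [4] (purely
illustrative, not code from this series; the list/lock layout mirrors
the unsharded struct shared_runq above):

/*
 * Illustrative only: rather than giving up when the task at the head of
 * the list can't run on the calling CPU, keep walking until a migratable
 * task is found.
 */
static struct task_struct *shared_runq_pop_migratable(struct shared_runq *shared_runq,
						      int dst_cpu)
{
	struct task_struct *p, *ret = NULL;
	unsigned long flags;

	spin_lock_irqsave(&shared_runq->lock, flags);
	list_for_each_entry(p, &shared_runq->list, shared_runq_node) {
		if (is_cpu_allowed(p, dst_cpu)) {
			list_del_init(&p->shared_runq_node);
			ret = p;
			break;
		}
	}
	spin_unlock_irqrestore(&shared_runq->lock, flags);

	return ret;
}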

Finally, while shared_runq in this form encourages work conservation, it
of course does not guarantee it given that we don't implement any kind
of work stealing between shared_runqs. In the future, we could
potentially push CPU utilization even higher by enabling work stealing
between shared_runqs, likely between CCXs on the same NUMA node.

Originally-by: Roman Gushchin <[email protected]>
Signed-off-by: David Vernet <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched/core.c | 2 +
kernel/sched/fair.c | 182 +++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 2 +
4 files changed, 185 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1292d38d66cc..5c05a3da3d50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -770,6 +770,8 @@ struct task_struct {
unsigned long wakee_flip_decay_ts;
struct task_struct *last_wakee;

+ struct list_head shared_runq_node;
+
/*
* recent_used_cpu is initially set as the last CPU used by a task
* that wakes affine another task. Waker/wakee relationships can
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1451f5aa82ac..3ad437d4ea3d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4503,6 +4503,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
+ INIT_LIST_HEAD(&p->shared_runq_node);
#endif
init_sched_mm_cid(p);
}
@@ -9842,6 +9843,7 @@ void __init sched_init_smp(void)

init_sched_rt_class();
init_sched_dl_class();
+ init_sched_fair_class_late();

sched_smp_initialized = true;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f7967be7646c..ff2491387201 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -139,18 +139,163 @@ static int __init setup_sched_thermal_decay_shift(char *str)
}
__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

+/**
+ * struct shared_runq - Per-LLC queue structure for enqueuing and pulling
+ * waking tasks.
+ *
+ * WHAT
+ * ====
+ *
+ * This structure enables the scheduler to be more aggressively work
+ * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
+ * pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
+ * Waking tasks are enqueued in a shared_runq at the end of
+ * enqueue_task_fair(), and are opportunistically pulled from the shared_runq
+ * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
+ * to being pulled from the shared_runq, in which case they're simply dequeued
+ * from the shared_runq. A waking task is only enqueued to a shared_runq when
+ * it was _not_ manually migrated to the current runqueue by
+ * select_task_rq_fair().
+ *
+ * There is currently no task-stealing between shared_runqs in different LLCs,
+ * which means that shared_runq is not fully work conserving. This could be
+ * added at a later time, with tasks likely only being stolen across
+ * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
+ *
+ * HOW
+ * ===
+ *
+ * A shared_runq is comprised of a list and a spinlock for synchronization.
+ * Given that the critical section for a shared_runq is typically a fast list
+ * operation, and that the shared_runq is localized to a single LLC, the
+ * spinlock will typically only be contended on workloads that do little else
+ * other than hammer the runqueue.
+ *
+ * WHY
+ * ===
+ *
+ * As mentioned above, the main benefit of shared_runq is that it enables more
+ * aggressive work conservation in the scheduler. This can benefit workloads
+ * that benefit more from CPU utilization than from L1/L2 cache locality.
+ *
+ * shared_runqs are segmented across LLCs both to avoid contention on the
+ * shared_runq spinlock by minimizing the number of CPUs that could contend on
+ * it, as well as to strike a balance between work conservation, and L3 cache
+ * locality.
+ */
+struct shared_runq {
+ struct list_head list;
+ spinlock_t lock;
+} ____cacheline_aligned;
+
#ifdef CONFIG_SMP
+static struct shared_runq *rq_shared_runq(struct rq *rq)
+{
+ return rq->cfs.shared_runq;
+}
+
+static struct task_struct *shared_runq_pop_task(struct rq *rq)
+{
+ unsigned long flags;
+ struct task_struct *p;
+ struct shared_runq *shared_runq;
+
+ shared_runq = rq_shared_runq(rq);
+ if (list_empty(&shared_runq->list))
+ return NULL;
+
+ spin_lock_irqsave(&shared_runq->lock, flags);
+ p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
+ shared_runq_node);
+ if (p && is_cpu_allowed(p, cpu_of(rq)))
+ list_del_init(&p->shared_runq_node);
+ else
+ p = NULL;
+ spin_unlock_irqrestore(&shared_runq->lock, flags);
+
+ return p;
+}
+
+static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
+{
+ unsigned long flags;
+ struct shared_runq *shared_runq;
+
+ shared_runq = rq_shared_runq(rq);
+ spin_lock_irqsave(&shared_runq->lock, flags);
+ list_add_tail(&p->shared_runq_node, &shared_runq->list);
+ spin_unlock_irqrestore(&shared_runq->lock, flags);
+}
+
static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
int enq_flags)
-{}
+{
+ bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
+ bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
+
+ /*
+ * Only enqueue the task in the shared runqueue if:
+ *
+ * - SHARED_RUNQ is enabled
+ * - The task is on the wakeup path
+ * - The task wasn't purposefully migrated to the current rq by
+ * select_task_rq()
+ * - The task isn't pinned to a specific CPU
+ */
+ if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+ return;
+
+ shared_runq_push_task(rq, p);
+}

static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
{
- return 0;
+ struct task_struct *p = NULL;
+ struct rq *src_rq;
+ struct rq_flags src_rf;
+ int ret;
+
+ p = shared_runq_pop_task(rq);
+ if (!p)
+ return 0;
+
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+
+ src_rq = task_rq_lock(p, &src_rf);
+
+ if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
+ update_rq_clock(src_rq);
+ src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
+ }
+
+ if (src_rq->cpu != rq->cpu)
+ ret = 1;
+ else
+ ret = -1;
+
+ task_rq_unlock(src_rq, p, &src_rf);
+
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+
+ return ret;
}

static void shared_runq_dequeue_task(struct task_struct *p)
-{}
+{
+ unsigned long flags;
+ struct shared_runq *shared_runq;
+
+ if (!list_empty(&p->shared_runq_node)) {
+ shared_runq = rq_shared_runq(task_rq(p));
+ spin_lock_irqsave(&shared_runq->lock, flags);
+ list_del_init(&p->shared_runq_node);
+ spin_unlock_irqrestore(&shared_runq->lock, flags);
+ }
+}

/*
* For asym packing, by default the lower numbered CPU has higher priority.
@@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void)
#endif /* SMP */

}
+
+__init void init_sched_fair_class_late(void)
+{
+#ifdef CONFIG_SMP
+ int i;
+ struct shared_runq *shared_runq;
+ struct rq *rq;
+ struct rq *llc_rq;
+
+ for_each_possible_cpu(i) {
+ if (per_cpu(sd_llc_id, i) == i) {
+ llc_rq = cpu_rq(i);
+
+ shared_runq = kzalloc_node(sizeof(struct shared_runq),
+ GFP_KERNEL, cpu_to_node(i));
+ INIT_LIST_HEAD(&shared_runq->list);
+ spin_lock_init(&shared_runq->lock);
+ llc_rq->cfs.shared_runq = shared_runq;
+ }
+ }
+
+ for_each_possible_cpu(i) {
+ rq = cpu_rq(i);
+ llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
+
+ if (rq == llc_rq)
+ continue;
+ rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
+ }
+#endif /* SMP */
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 187ad5da5ef6..8b573dfaba33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -576,6 +576,7 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
+ struct shared_runq *shared_runq;
/*
* CFS load tracking
*/
@@ -2440,6 +2441,7 @@ extern void update_max_interval(void);
extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);
+extern void init_sched_fair_class_late(void);

extern void reweight_task(struct task_struct *p, int prio);

--
2.40.1


2023-07-10 20:12:02

by David Vernet

[permalink] [raw]
Subject: [PATCH v2 2/7] sched: Move is_cpu_allowed() into sched.h

is_cpu_allowed() exists as a static inline function in core.c. The
functionality offered by is_cpu_allowed() is useful to scheduling
policies as well, e.g. to determine whether a runnable task can be
migrated to another core that would otherwise go idle.

Let's move it to sched.h.

Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/core.c | 31 -------------------------------
kernel/sched/sched.h | 31 +++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 167cd9f11ed0..1451f5aa82ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -48,7 +48,6 @@
#include <linux/kcov.h>
#include <linux/kprobes.h>
#include <linux/llist_api.h>
-#include <linux/mmu_context.h>
#include <linux/mmzone.h>
#include <linux/mutex_api.h>
#include <linux/nmi.h>
@@ -2444,36 +2443,6 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
return rq->nr_pinned;
}

-/*
- * Per-CPU kthreads are allowed to run on !active && online CPUs, see
- * __set_cpus_allowed_ptr() and select_fallback_rq().
- */
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
-{
- /* When not in the task's cpumask, no point in looking further. */
- if (!cpumask_test_cpu(cpu, p->cpus_ptr))
- return false;
-
- /* migrate_disabled() must be allowed to finish. */
- if (is_migration_disabled(p))
- return cpu_online(cpu);
-
- /* Non kernel threads are not allowed during either online or offline. */
- if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu) && task_cpu_possible(cpu, p);
-
- /* KTHREAD_IS_PER_CPU is always allowed. */
- if (kthread_is_per_cpu(p))
- return cpu_online(cpu);
-
- /* Regular kernel threads don't get to stay during offline. */
- if (cpu_dying(cpu))
- return false;
-
- /* But are allowed during online. */
- return cpu_online(cpu);
-}
-
/*
* This is how migration works:
*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 94846c947d6e..187ad5da5ef6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
#include <linux/lockdep.h>
#include <linux/minmax.h>
#include <linux/mm.h>
+#include <linux/mmu_context.h>
#include <linux/module.h>
#include <linux/mutex_api.h>
#include <linux/plist.h>
@@ -1199,6 +1200,36 @@ static inline bool is_migration_disabled(struct task_struct *p)
#endif
}

+/*
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
+ * __set_cpus_allowed_ptr() and select_fallback_rq().
+ */
+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+{
+ /* When not in the task's cpumask, no point in looking further. */
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+ return false;
+
+ /* migrate_disabled() must be allowed to finish. */
+ if (is_migration_disabled(p))
+ return cpu_online(cpu);
+
+ /* Non kernel threads are not allowed during either online or offline. */
+ if (!(p->flags & PF_KTHREAD))
+ return cpu_active(cpu) && task_cpu_possible(cpu, p);
+
+ /* KTHREAD_IS_PER_CPU is always allowed. */
+ if (kthread_is_per_cpu(p))
+ return cpu_online(cpu);
+
+ /* Regular kernel threads don't get to stay during offline. */
+ if (cpu_dying(cpu))
+ return false;
+
+ /* But are allowed during online. */
+ return cpu_online(cpu);
+}
+
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
--
2.40.1


2023-07-10 20:16:24

by David Vernet

[permalink] [raw]
Subject: [PATCH v2 3/7] sched: Check cpu_active() earlier in newidle_balance()

In newidle_balance(), we check if the current CPU is inactive, and then
decline to pull any remote tasks to the core if so. Before this check,
however, we're currently updating rq->idle_stamp. If a core is offline,
setting its idle stamp is not useful. The core won't be chosen by any
task in select_task_rq_fair(), and setting the rq->idle_stamp is
misleading anyway, given that an inactive core should have a very cold
cache.

Let's set rq->idle_stamp in newidle_balance() only if the cpu is active.

Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/fair.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..6e882b7bf5b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11837,18 +11837,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
if (this_rq->ttwu_pending)
return 0;

- /*
- * We must set idle_stamp _before_ calling idle_balance(), such that we
- * measure the duration of idle_balance() as idle time.
- */
- this_rq->idle_stamp = rq_clock(this_rq);
-
/*
* Do not pull tasks towards !active CPUs...
*/
if (!cpu_active(this_cpu))
return 0;

+ /*
+ * We must set idle_stamp _before_ calling idle_balance(), such that we
+ * measure the duration of idle_balance() as idle time.
+ */
+ this_rq->idle_stamp = rq_clock(this_rq);
+
/*
* This is OK, because current is on_cpu, which avoids it being picked
* for load-balance and preemption/IRQs are still disabled avoiding
--
2.40.1


2023-07-10 20:19:21

by David Vernet

[permalink] [raw]
Subject: [PATCH v2 1/7] sched: Expose move_queued_task() from core.c

The migrate_task_to() function exposed from kernel/sched/core.c migrates
the current task, which is silently assumed to also be its first
argument, to the specified CPU. The function uses stop_one_cpu() to
migrate the task to the target CPU, which won't work if @p is not the
current task as the stop_one_cpu() callback isn't invoked on remote
CPUs.

While this operation is useful for task_numa_migrate() in fair.c, it
would be more broadly applicable if move_queued_task() in core.c were
given external linkage, as it can be used to migrate any task to a CPU.

A follow-on patch will call move_queued_task() from fair.c when
migrating a task in a shared runqueue to a remote CPU.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/sched.h | 3 +++
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7db597e8175..167cd9f11ed0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2493,8 +2493,8 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
*
* Returns (locked) new rq. Old rq's lock is released.
*/
-static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
- struct task_struct *p, int new_cpu)
+struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int new_cpu)
{
lockdep_assert_rq_held(rq);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50d4b61aef3a..94846c947d6e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1759,6 +1759,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)

#ifdef CONFIG_SMP

+
+extern struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int new_cpu);
static inline void
queue_balance_callback(struct rq *rq,
struct balance_callback *head,
--
2.40.1


2023-07-10 20:24:21

by David Vernet

[permalink] [raw]
Subject: [PATCH v2 7/7] sched: Move shared_runq to __{enqueue,dequeue}_entity()

In the thread at [0], Peter suggested experimenting with putting the
shared_runq enqueue and dequeue calls in __enqueue_entity() and
__dequeue_entity() respectively. This has the advantages of improving
work conservation by utilizing the shared runq on all enqueues, as well
as automagically causing shared_runq to work for SCHED_CORE. The
possible disadvantage was that we're doing more enqueues/dequeues, which
could result in untenable overhead or contention on the shard lock(s).

[0]: https://lore.kernel.org/lkml/[email protected]/

It turns out that this appears to improve shared_runq quite a bit:

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 582.46s | 3101% |
SHARED_RUNQ_WAKEUPS: | 581.22s | 3117% |
SHARED_RUNQ_ALL: | 578.41s | 3141% |
o------------o-------o

Takeaway: SHARED_RUNQ_WAKEUPS performs roughly the same as
NO_SHARED_RUNQ, but SHARED_RUNQ_ALL results in a statistically
significant ~.7% improvement over NO_SHARED_RUNQ. This suggests that
enqueuing tasks in the shared runqueue on every enqueue improves work
conservation, and thanks to sharding, does not result in contention.

Note that I didn't collect data for kernel compile with SHARED_RUNQ_ALL
_without_ sharding. The reason for this is that we know that CPUs with
sufficiently large LLCs will contend, so if we've decided to accommodate
those CPUs with sharding, there's not much point in measuring the
results of not sharding on CPUs that we know won't contend.

Command: hackbench --loops 10000
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 2.1912s | 3117% |
SHARED_RUNQ_WAKEUP: | 2.1080s | 3155% |
SHARED_RUNQ_ALL: | 1.9830s | 3144% |
o------------o-------o

Takeaway: SHARED_RUNQ in both forms performs exceptionally well compared
to NO_SHARED_RUNQ here, with SHARED_RUNQ_ALL beating NO_SHARED_RUNQ by
almost 10%. This was a surprising result: it seems advantageous to err
on the side of avoiding migration in hackbench, given that its tasks are
short lived and send only 10k bytes worth of messages, but the results
of the benchmark suggest that minimizing runqueue delays is preferable.

Command:
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
o_______________________o
| mean (throughput) |
o-----------------------o
NO_SHARED_RUNQ: | 25064.12 |
SHARED_RUNQ_WAKEUP: | 24862.16 |
SHARED_RUNQ_ALL: | 25287.73 |
o-----------------------o

Takeaway: No statistical significance, though it is worth noting that
there is no regression for shared runqueue on the 7950X, while there is
a small regression on the Skylake and Milan hosts for SHARED_RUNQ_WAKEUP
as described below.

=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 1535.46s | 3417% |
SHARED_RUNQ_WAKEUP: | 1534.56s | 3428% |
SHARED_RUNQ_ALL: | 1531.95s | 3429% |
o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~.23% improvement over
NO_SHARED_RUNQ. Not a huge improvement, but consistently measurable.
The cause of this gain is presumably the same as the 7950X: improved
work conservation, with sharding preventing excessive contention on the
shard lock.

Command: hackbench --loops 10000
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 5.5750s | 3369% |
SHARED_RUNQ_WAKEUP: | 5.5764s | 3495% |
SHARED_RUNQ_ALL: | 5.4760s | 3481% |
o------------o-------o

Takeaway: SHARED_RUNQ_ALL results in a ~1.6% improvement over
NO_SHARED_RUNQ. Also statistically significant, but smaller than the
almost 10% improvement observed on the 7950X.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
o______________________o
| mean (throughput) |
o----------------------o
NO_SHARED_RUNQ: | 11963.08 |
SHARED_RUNQ_WAKEUP: | 11943.60 |
SHARED_RUNQ_ALL: | 11554.32 |
o----------------------o

Takeaway: NO_SHARED_RUNQ performs the same as SHARED_RUNQ_WAKEUP, but
beats SHARED_RUNQ_ALL by ~3.4%. This result makes sense -- the workload
is very heavy on the runqueue, so enqueuing tasks in the shared runqueue
in __enqueue_entity() would intuitively result in increased contention
on the shard lock. The fact that we're at parity with
SHARED_RUNQ_WAKEUP suggests that sharding the shared runqueue has
significantly reduced the contention that was observed in v1, but that
__enqueue_entity() puts it over the edge.

NOTE: Parity for SHARED_RUNQ_WAKEUP relies on choosing the correct shard
size. If we chose, for example, a shard size of 16, there would still be
a regression between NO_SHARED_RUNQ and SHARED_RUNQ_WAKEUP. As described
below, this suggests that we may want to add a debugfs tunable for the
shard size.

=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 1601.81s | 6476% |
SHARED_RUNQ_WAKEUP: | 1602.55s | 6472% |
SHARED_RUNQ_ALL: | 1602.49s | 6475% |
o------------o-------o

Takeaway: No statistically significant variance. It might be worth
experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SHARED_RUNQ: | 5.2672s | 6463% |
SHARED_RUNQ_WAKEUP: | 5.1476s | 6583% |
SHARED_RUNQ_ALL: | 5.1003s | 6598% |
o------------o-------o

Takeaway: SHARED_RUNQ_ALL again wins, by about 3% over NO_SHARED_RUNQ in
this case.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
netperf -6 -t UDP_RR -c -C -l $runtime &
done
o_______________________o
| mean (throughput) |
o-----------------------o
NO_SHARED_RUNQ: | 13819.08 |
SHARED_RUNQ_WAKEUP: | 13907.74 |
SHARED_RUNQ_ALL: | 13569.69 |
o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ_ALL, though by a slightly lower margin of ~1.8%.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/fair.c | 38 ++++++++++++--------------------------
1 file changed, 12 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97985f28a627..bddb2bed4297 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,8 +140,8 @@ static int __init setup_sched_thermal_decay_shift(char *str)
__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

/**
- * struct shared_runq - Per-LLC queue structure for enqueuing and pulling
- * waking tasks.
+ * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
+ * runnable tasks within an LLC.
*
* struct shared_runq_shard - A structure containing a task list and a spinlock
* for a subset of cores in a struct shared_runq.
@@ -161,10 +161,9 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
* 2. The shard inside of the per-LLC struct shared_runq which contains the
* list of runnable tasks for that shard.
*
- * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard at
- * the end of enqueue_task_fair(), and are opportunistically pulled from the
- * shared_runq in newidle_balance(). Pulling from shards is an O(# shards)
- * operation.
+ * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in
+ * __enqueue_entity(), and are opportunistically pulled from the shared_runq in
+ * newidle_balance(). Pulling from shards is an O(# shards) operation.
*
* There is currently no task-stealing between shared_runqs in different LLCs,
* which means that shared_runq is not fully work conserving. This could be
@@ -253,22 +252,15 @@ static void shared_runq_push_task(struct shared_runq_shard *shard,
spin_unlock_irqrestore(&shard->lock, flags);
}

-static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
- int enq_flags)
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{
- bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
- bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
-
/*
* Only enqueue the task in the shared runqueue if:
*
* - SHARED_RUNQ is enabled
- * - The task is on the wakeup path
- * - The task wasn't purposefully migrated to the current rq by
- * select_task_rq()
* - The task isn't pinned to a specific CPU
*/
- if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+ if (p->nr_cpus_allowed == 1)
return;

shared_runq_push_task(rq_shared_runq_shard(rq), p);
@@ -357,8 +349,7 @@ int __weak arch_asym_cpu_priority(int cpu)
*/
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
#else
-static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
- int enq_flags)
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{}

static void shared_runq_dequeue_task(struct task_struct *p)
@@ -843,11 +834,15 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
*/
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ if (sched_feat(SHARED_RUNQ) && entity_is_task(se))
+ shared_runq_enqueue_task(rq_of(cfs_rq), task_of(se));
rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ if (sched_feat(SHARED_RUNQ) && entity_is_task(se))
+ shared_runq_dequeue_task(task_of(se));
rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
}

@@ -6587,9 +6582,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
update_overutilized_status(rq);

- if (sched_feat(SHARED_RUNQ))
- shared_runq_enqueue_task(rq, p, flags);
-
enqueue_throttle:
assert_list_leaf_cfs_rq(rq);

@@ -6671,9 +6663,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
-
- if (sched_feat(SHARED_RUNQ))
- shared_runq_dequeue_task(p);
}

#ifdef CONFIG_SMP
@@ -8380,9 +8369,6 @@ done: __maybe_unused;

update_misfit_status(p, rq);

- if (sched_feat(SHARED_RUNQ))
- shared_runq_dequeue_task(p);
-
return p;

idle:
--
2.40.1


2023-07-11 10:04:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls

On Mon, Jul 10, 2023 at 03:03:39PM -0500, David Vernet wrote:

> @@ -11843,6 +11871,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> if (!cpu_active(this_cpu))
> return 0;
>
> + if (sched_feat(SHARED_RUNQ) && shared_runq_pick_next_task(this_rq, rf))
> + return -1;
> +

Next patch implements shared_runq_pick_next_task() with the same return
values as newidle_balance(), which would seem to suggest the following
instead:

if (sched_feat(SHARED_RUNQ)) {
pulled_task = shared_runq_pick_next_task(this_rq, rf);
if (pulled_task)
return pulled_task;
}

Additionally, I think we wants something like:

sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sched_feat(SHARED_RUNQ)) {
... /* see above */

sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
sd = sd->parent;
}

to ensure we skip <=LLC domains, since those have already been covered
by this shiny new thing, no?

2023-07-11 10:35:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On Mon, Jul 10, 2023 at 03:03:40PM -0500, David Vernet wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1451f5aa82ac..3ad437d4ea3d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c

> @@ -9842,6 +9843,7 @@ void __init sched_init_smp(void)
>
> init_sched_rt_class();
> init_sched_dl_class();
> + init_sched_fair_class_late();
>
> sched_smp_initialized = true;
> }

> @@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void)
> #endif /* SMP */
>
> }
> +
> +__init void init_sched_fair_class_late(void)
> +{
> +#ifdef CONFIG_SMP
> + int i;
> + struct shared_runq *shared_runq;
> + struct rq *rq;
> + struct rq *llc_rq;
> +
> + for_each_possible_cpu(i) {
> + if (per_cpu(sd_llc_id, i) == i) {
> + llc_rq = cpu_rq(i);
> +
> + shared_runq = kzalloc_node(sizeof(struct shared_runq),
> + GFP_KERNEL, cpu_to_node(i));
> + INIT_LIST_HEAD(&shared_runq->list);
> + spin_lock_init(&shared_runq->lock);
> + llc_rq->cfs.shared_runq = shared_runq;
> + }
> + }
> +
> + for_each_possible_cpu(i) {
> + rq = cpu_rq(i);
> + llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> +
> + if (rq == llc_rq)
> + continue;
> + rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
> + }
> +#endif /* SMP */
> +}

I don't think this is right; the llc_id thing depends on the online
mask, not on possible mask. So if you boot with a number of CPUs offline
and late bring them online, you're screwy (IIRC this is actually a very
common thing in virt land).

Additionally, llc_id depends on the sched_domain tree, if someone were
to go create partitions, they can split an LLC and llc_id would split
right along with it.

I think you need to move this into sched/topology.c and handle hotplug /
domain (re) creation.

And yes, that's going to be a pain, because you might need to re-hash
existing lists.

2023-07-11 11:02:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues

On Mon, Jul 10, 2023 at 03:03:41PM -0500, David Vernet wrote:

> +struct shared_runq_shard {
> struct list_head list;
> spinlock_t lock;
> } ____cacheline_aligned;
>
> +struct shared_runq {
> + u32 num_shards;
> + struct shared_runq_shard shards[];
> +} ____cacheline_aligned;
> +
> +/* This would likely work better as a configurable knob via debugfs */
> +#define SHARED_RUNQ_SHARD_SZ 6
> +
> #ifdef CONFIG_SMP
> static struct shared_runq *rq_shared_runq(struct rq *rq)
> {
> return rq->cfs.shared_runq;
> }
>
> -static struct task_struct *shared_runq_pop_task(struct rq *rq)
> +static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq)
> +{
> + return rq->cfs.shard;
> +}
> +
> +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
> +{
> + return cpu % runq->num_shards;

I would suggest either:

(cpu >> 1) % num_shards

or keeping num_shards even, to give SMT siblings a fighting chance to
hit the same bucket.

(I've no idea how SMT4 (or worse SMT8) is typically enumerated, so
someone from the Power/Sparc/MIPS world would have to go play with that
if they so care)

> +}

> + num_shards = max(per_cpu(sd_llc_size, i) /
> + SHARED_RUNQ_SHARD_SZ, 1);

> + shared_runq->num_shards = num_shards;



2023-07-11 11:28:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] sched: Move shared_runq to __{enqueue,dequeue}_entity()



Ufff.. so I see how you ended up with the series in this form, but I
typically prefer to have less back and forth. Perhaps fold back at least
this last patch?

2023-07-11 11:53:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> Difference between shared_runq and SIS_NODE
> ===========================================
>
> In [0] Peter proposed a patch that addresses Tejun's observations that
> when workqueues are targeted towards a specific LLC on his Zen2 machine
> with small CCXs, that there would be significant idle time due to
> select_idle_sibling() not considering anything outside of the current
> LLC.
>
> This patch (SIS_NODE) is essentially the complement to the proposal
> here. SIS_NODE causes waking tasks to look for idle cores in neighboring
> LLCs on the same die, whereas shared_runq causes cores about to go idle
> to look for enqueued tasks. That said, in its current form, the two
> features are at a different scope, as SIS_NODE searches for idle cores
> between LLCs, while shared_runq enqueues tasks within a single LLC.
>
> The patch was since removed in [1], and we compared the results to
> shared_runq (previously called "swqueue") in [2]. SIS_NODE did not
> outperform shared_runq on any of the benchmarks, so we elect to not
> compare against it again for this v2 patch set.

Right, so SIS is search-idle-on-wakeup, while you do
search-task-on-newidle, and they are indeed complementary actions.

As to SIS_NODE, I really want that to happen, but we need a little more
work for the Epyc things, they have a few too many CCXs per node :-)

Anyway, the same thing that motivated SIS_NODE should also be relevant
here, those Zen2 things have only 3/4 cores per LLC, would it not also
make sense to include multiple of them into the shared runqueue thing?

(My brain is still processing the shared_runq name...)

2023-07-11 16:28:24

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On Tue, Jul 11, 2023 at 12:18:32PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:03:40PM -0500, David Vernet wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1451f5aa82ac..3ad437d4ea3d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
>
> > @@ -9842,6 +9843,7 @@ void __init sched_init_smp(void)
> >
> > init_sched_rt_class();
> > init_sched_dl_class();
> > + init_sched_fair_class_late();
> >
> > sched_smp_initialized = true;
> > }
>
> > @@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void)
> > #endif /* SMP */
> >
> > }
> > +
> > +__init void init_sched_fair_class_late(void)
> > +{
> > +#ifdef CONFIG_SMP
> > + int i;
> > + struct shared_runq *shared_runq;
> > + struct rq *rq;
> > + struct rq *llc_rq;
> > +
> > + for_each_possible_cpu(i) {
> > + if (per_cpu(sd_llc_id, i) == i) {
> > + llc_rq = cpu_rq(i);
> > +
> > + shared_runq = kzalloc_node(sizeof(struct shared_runq),
> > + GFP_KERNEL, cpu_to_node(i));
> > + INIT_LIST_HEAD(&shared_runq->list);
> > + spin_lock_init(&shared_runq->lock);
> > + llc_rq->cfs.shared_runq = shared_runq;
> > + }
> > + }
> > +
> > + for_each_possible_cpu(i) {
> > + rq = cpu_rq(i);
> > + llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> > +
> > + if (rq == llc_rq)
> > + continue;
> > + rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
> > + }
> > +#endif /* SMP */
> > +}
>
> I don't think this is right; the llc_id thing depends on the online
> mask, not on possible mask. So if you boot with a number of CPUs offline
> and late bring them online, you're screwy (IIRC this is actually a very
> common thing in virt land).
>
> Additionally, llc_id depends on the sched_domain tree, if someone were
> to go create partitions, they can split an LLC and llc_id would split
> right along with it.
>
> I think you need to move this into sched/topology.c and handle hotplug /
> domain (re) creation.

Yeah, you're right. This falls apart if we hotplug or when we do domain
recreation. I'll address this in v3.

> And yes, that's going to be a pain, because you might need to re-hash
> existing lists.

Eh, it needs to be done. I played around with this for a bit before
sending the v1 RFC code but ended up keeping the series simple because
it was RFC, and fixing this is pretty involved. I should have taken care
of it regardless before dropping the RFC tag, so glad you pointed it
out.

2023-07-11 16:44:08

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] sched: Move shared_runq to __{enqueue,dequeue}_entity()

On Tue, Jul 11, 2023 at 12:51:36PM +0200, Peter Zijlstra wrote:
>
>
> Ufff.. so I see how you ended up with the series in this form, but I
> typically prefer to have less back and forth. Perhaps fold back at least
> this last patch?

Sure, will do. I kept them separate so we could drop this patch from the
series if we didn't want it, but given that benchmarks seem to only
improve with this (if we also shard), I'll just fold this in with [0]
and the patch following it.

[0]: https://lore.kernel.org/all/[email protected]/

2023-07-11 16:46:00

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls

On Tue, Jul 11, 2023 at 11:45:47AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:03:39PM -0500, David Vernet wrote:
>
> > @@ -11843,6 +11871,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > if (!cpu_active(this_cpu))
> > return 0;
> >
> > + if (sched_feat(SHARED_RUNQ) && shared_runq_pick_next_task(this_rq, rf))
> > + return -1;
> > +
>
> Next patch implements shared_runq_pick_next_task() with the same return
> values as newidle_balance(), which would seem to suggest the following
> instead:
>
> if (sched_feat(SHARED_RUNQ)) {
> pulled_task = shared_runq_pick_next_task(this_rq, rf);
> if (pulled_task)
> return pulled_task;
> }

Yep, that's cleaner.

> Additionally, I think we wants something like:
>
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sched_feat(SHARED_RUNQ)) {
> ... /* see above */
>
> sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> sd = sd->parent;
> }
>
> to ensure we skip <=LLC domains, since those have already been covered
> by this shiny new thing, no?

Indeed, another nice optimization. I'll incorporate both of these into
the next version.

2023-07-11 20:03:34

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues

On Tue, Jul 11, 2023 at 12:49:58PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:03:41PM -0500, David Vernet wrote:
>
> > +struct shared_runq_shard {
> > struct list_head list;
> > spinlock_t lock;
> > } ____cacheline_aligned;
> >
> > +struct shared_runq {
> > + u32 num_shards;
> > + struct shared_runq_shard shards[];
> > +} ____cacheline_aligned;
> > +
> > +/* This would likely work better as a configurable knob via debugfs */
> > +#define SHARED_RUNQ_SHARD_SZ 6
> > +
> > #ifdef CONFIG_SMP
> > static struct shared_runq *rq_shared_runq(struct rq *rq)
> > {
> > return rq->cfs.shared_runq;
> > }
> >
> > -static struct task_struct *shared_runq_pop_task(struct rq *rq)
> > +static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq)
> > +{
> > + return rq->cfs.shard;
> > +}
> > +
> > +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
> > +{
> > + return cpu % runq->num_shards;
>
> I would suggest either:
>
> (cpu >> 1) % num_shards
>
> or keeping num_shards even, to give SMT siblings a fighting chance to
> hit the same bucket.

Given that neither of these approaches guarantees that the SMT siblings
are in the same bucket, I'll just go with your suggestion which is
simpler.

Seems inevitable that we'll want to have another debugfs knob to adjust
the number of shards, but IMO it's preferable to just apply your
suggestion in v3 and hold off on adding that complexity until we know we
need it.

> (I've no idea how SMT4 (or worse SMT8) is typically enumerated, so
> someone from the Power/Sparc/MIPS world would have to go play with that
> if they so care)

Yeah, no idea either. If these things end up varying a lot across
different architectures then we can look into making shard assignment
architecture specific.

>
> > +}
>
> > + num_shards = max(per_cpu(sd_llc_size, i) /
> > + SHARED_RUNQ_SHARD_SZ, 1);
>
> > + shared_runq->num_shards = num_shards;
>
>

2023-07-11 21:58:57

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

On Tue, Jul 11, 2023 at 01:42:07PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> > Difference between shared_runq and SIS_NODE
> > ===========================================
> >
> > In [0] Peter proposed a patch that addresses Tejun's observations that
> > when workqueues are targeted towards a specific LLC on his Zen2 machine
> > with small CCXs, that there would be significant idle time due to
> > select_idle_sibling() not considering anything outside of the current
> > LLC.
> >
> > This patch (SIS_NODE) is essentially the complement to the proposal
> > here. SIS_NODE causes waking tasks to look for idle cores in neighboring
> > LLCs on the same die, whereas shared_runq causes cores about to go idle
> > to look for enqueued tasks. That said, in its current form, the two
> > features are at a different scope, as SIS_NODE searches for idle cores
> > between LLCs, while shared_runq enqueues tasks within a single LLC.
> >
> > The patch was since removed in [1], and we compared the results to
> > shared_runq (previously called "swqueue") in [2]. SIS_NODE did not
> > outperform shared_runq on any of the benchmarks, so we elect to not
> > compare against it again for this v2 patch set.
>
> Right, so SIS is search-idle-on-wakeup, while you do
> search-task-on-newidle, and they are indeed complementary actions.
>
> As to SIS_NODE, I really want that to happen, but we need a little more
> work for the Epyc things, they have a few too many CCXs per node :-)
>
> Anyway, the same thing that motivated SIS_NODE should also be relevant
> here, those Zen2 things have only 3/4 cores per LLC, would it not also
> make sense to include multiple of them into the shared runqueue thing?

It's probably worth experimenting with this, but whether it would help
or hurt is workload dependent. I would imagine there are
workloads where having a single shared runq for the whole socket is
advantageous even for larger LLCs like on Milan. But for many use cases
(including e.g. HHVM), the cache-line bouncing makes it untenable.

But yes, if we deem SIS_NODE to be useful for small CCXs like Zen2, I
don't see any reason to not apply that to shared_runq as well. I don't
have a Zen2 but I'll prototype this idea and hopefully can get some help
from Tejun or someone else to run some benchmarks on it.

> (My brain is still processing the shared_runq name...)

Figured this would be the most contentious part of v2.

2023-07-12 06:30:56

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

Hello David,

On Mon, Jul 10, 2023 at 03:03:40PM -0500, David Vernet wrote:

[..snip..]

> ---

> +
> +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> +{
> + unsigned long flags;
> + struct task_struct *p;
> + struct shared_runq *shared_runq;
> +
> + shared_runq = rq_shared_runq(rq);
> + if (list_empty(&shared_runq->list))
> + return NULL;
> +
> + spin_lock_irqsave(&shared_runq->lock, flags);
> + p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> + shared_runq_node);


Apologies for the bikeshedding comment: here you are attempting to
remove the task from the "head", while in shared_runq_push_task below,
you are adding a task to the tail. Which is the usual queue
semantics. Then why call them shared_runq_pop_task() and
shared_runq_push_task() ?

Can we name them __shared_runq_enqueue_task() and
__shared_runq_pick_next_task() instead ?

> + if (p && is_cpu_allowed(p, cpu_of(rq)))
> + list_del_init(&p->shared_runq_node);
> + else
> + p = NULL;
> + spin_unlock_irqrestore(&shared_runq->lock, flags);
> +
> + return p;
> +}
> +
> +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> +{
> + unsigned long flags;
> + struct shared_runq *shared_runq;
> +
> + shared_runq = rq_shared_runq(rq);
> + spin_lock_irqsave(&shared_runq->lock, flags);
> + list_add_tail(&p->shared_runq_node, &shared_runq->list);
> + spin_unlock_irqrestore(&shared_runq->lock, flags);
> +}
> +
> static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
> int enq_flags)
> -{}
> +{
> + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> + /*
> + * Only enqueue the task in the shared runqueue if:
> + *
> + * - SHARED_RUNQ is enabled
> + * - The task is on the wakeup path
> + * - The task wasn't purposefully migrated to the current rq by
> + * select_task_rq()
> + * - The task isn't pinned to a specific CPU
> + */
> + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> + return;
> +
> + shared_runq_push_task(rq, p);
> +}
>
> static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
> {
> - return 0;
> + struct task_struct *p = NULL;
> + struct rq *src_rq;
> + struct rq_flags src_rf;
> + int ret;
> +
> + p = shared_runq_pop_task(rq);
> + if (!p)
> + return 0;
> +
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + src_rq = task_rq_lock(p, &src_rf);
> +
> + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
> + update_rq_clock(src_rq);
> + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> + }
> +
> + if (src_rq->cpu != rq->cpu)
> + ret = 1;
> + else
> + ret = -1;


So if src_rq->cpu != rq->cpu, then the task has _not_ been moved to
rq. But you return 1.

While in the else case, since src_rq->cpu == rq->cpu, the task has
been successfully moved to rq. But you are returning -1,

If newidle_balance() were to interpret this return value as the number
of tasks pulled, then, shouldn't it be the other way around ?

> +
> + task_rq_unlock(src_rq, p, &src_rf);
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> +
> + return ret;
> }
>

--
Thanks and Regards
gautham.

2023-07-12 08:57:37

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls

On 7/11/23 4:03 AM, David Vernet wrote:
> @@ -6467,6 +6489,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> dequeue_throttle:
> util_est_update(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> +
> + if (sched_feat(SHARED_RUNQ))
> + shared_runq_dequeue_task(p);

Would disabling SHARED_RUNQ cause task list nodes to be left
in the shared stateful runqueue?

2023-07-12 11:06:56

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues

On Tue, Jul 11, 2023 at 02:57:57PM -0500, David Vernet wrote:
> On Tue, Jul 11, 2023 at 12:49:58PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 10, 2023 at 03:03:41PM -0500, David Vernet wrote:

[..snip..]

> > > +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
> > > +{
> > > + return cpu % runq->num_shards;
> >
> > I would suggest either:
> >
> > (cpu >> 1) % num_shards
> >
> > or keeping num_shards even, to give SMT siblings a fighting chance to
> > hit the same bucket.
>
> Given that neither of these approaches guarantees that the SMT siblings
> are in the same bucket, I'll just go with your suggestion which is
> simpler.
>
> Seems inevitable that we'll want to have another debugfs knob to adjust
> the number of shards, but IMO it's preferable to just apply your
> suggestion in v3 and hold off on adding that complexity until we know we
> need it.
>
> > (I've no idea how SMT4 (or worse SMT8) is typically enumerated, so
> > someone from the Power/Sparc/MIPS world would have to go play with that
> > if they so care)
>
> Yeah, no idea either. If these things end up varying a lot across
> different architectures then we can look into making shard assignment
> architecture specific.

On POWER, the SMT siblings are enumerated in a sequential fashion, i.e

CPU id of a thread = Core_id * threads_per_core + thread_id_within_core.

But IIRC, POWER sets L2 domain as the LLC. On POWER8 (with SMT8) and
POWER9 (with SMT4 on Baremetal and SMT8 on VMs), LLC size is 8. Even
with SHARED_RUNQ_SHARD_SZ = 6, there will only be 1 shard with the
current formula:

num_shards = max(per_cpu(sd_llc_size, i)/SHARED_RUNQ_SHARD_SZ, 1);

(Aside: with the above formula, on a topology with 6 < sd_llc_size <
12, num_shards will remain 1, with the shard size exceeding the
intended SHARD_SZ. Was this the intention ?)

Even on x86, there is no uniformity in how the SMT threads are
numbered. On AMD EPYC Baremetal, the first threads of all the cores
are enumerated first and then the sibling threads. So, on an EPYC
server with 128 cores in total, the SMT siblings are {0,128}, {1, 129}, ...

With SHARED_RUNQ_SHARD_SZ = 6,

On Zen2 EPYC Baremetal, with LLC size = 8, num_shards = 1. This
simplifies stuff!

On Zen3, Zen4 EPYC Baremetal, with LLC size = 16, num_shards = 2.

Here, (cpu % num_shards) ensures that the SMT siblings belong to the
same shard along with 3 other cores.

On some Intel servers, it is possible that the CPU numbers are
interleaved across the two sockets. On my 2 socket, 32Cores per socket
Ice Lake Server, all the even numbered CPUs are in one socket and all
the odd numbered CPUs in the other socket.

The SMT siblings are {0,64}, {2, 66}, .... on one socket and {1, 65},
{3, 67}, .. on the other.

On this system, LLC size = 64. With SHARED_RUNQ_SHARD_SZ = 6,
num_shards = 10.

So with (cpu % num_shards) the siblings {0, 64} ... will belong to
different shards.

What would be good to have is

1. shard_size determined by individual architectures. If none is
provided, we pick the default shard_size.

2. A sharding scheme which guarantees that SMT siblings will belong
to the same shard as long as shard_size is at least as big as the SMT
size.
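
For (2), something along the following lines might be enough. This is
only an untested sketch, but hashing on the core id at least guarantees
that SMT siblings land in the same shard regardless of how the BIOS
enumerates CPU ids (assuming topology_core_id() gives a reasonably
dense numbering within the LLC):

        static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
        {
                /*
                 * SMT siblings share a core id, so they always land in the
                 * same shard, independent of the CPU numbering scheme.
                 */
                return topology_core_id(cpu) % runq->num_shards;
        }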

--
Thanks and Regards
gautham.



2023-07-12 11:39:25

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

Hi David, interesting patch!

On 7/11/23 4:03 AM, David Vernet wrote:
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1292d38d66cc..5c05a3da3d50 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -770,6 +770,8 @@ struct task_struct {
> unsigned long wakee_flip_decay_ts;
> struct task_struct *last_wakee;
>
> + struct list_head shared_runq_node;
> +
> /*
> * recent_used_cpu is initially set as the last CPU used by a task
> * that wakes affine another task. Waker/wakee relationships can
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1451f5aa82ac..3ad437d4ea3d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4503,6 +4503,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> #ifdef CONFIG_SMP
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> + INIT_LIST_HEAD(&p->shared_runq_node);
> #endif
> init_sched_mm_cid(p);
> }
> @@ -9842,6 +9843,7 @@ void __init sched_init_smp(void)
>
> init_sched_rt_class();
> init_sched_dl_class();
> + init_sched_fair_class_late();
>
> sched_smp_initialized = true;
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f7967be7646c..ff2491387201 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -139,18 +139,163 @@ static int __init setup_sched_thermal_decay_shift(char *str)
> }
> __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>
> +/**
> + * struct shared_runq - Per-LLC queue structure for enqueuing and pulling
> + * waking tasks.
> + *
> + * WHAT
> + * ====
> + *
> + * This structure enables the scheduler to be more aggressively work
> + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> + * pulled from when another core in the LLC is going to go idle.
> + *
> + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
> + * Waking tasks are enqueued in a shared_runq at the end of
> + * enqueue_task_fair(), and are opportunistically pulled from the shared_runq
> + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
> + * to being pulled from the shared_runq, in which case they're simply dequeued
> + * from the shared_runq. A waking task is only enqueued to a shared_runq when
> + * it was _not_ manually migrated to the current runqueue by
> + * select_task_rq_fair().
> + *
> + * There is currently no task-stealing between shared_runqs in different LLCs,
> + * which means that shared_runq is not fully work conserving. This could be
> + * added at a later time, with tasks likely only being stolen across
> + * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
> + *
> + * HOW
> + * ===
> + *
> + * An shared_runq is comprised of a list, and a spinlock for synchronization.
> + * Given that the critical section for a shared_runq is typically a fast list
> + * operation, and that the shared_runq is localized to a single LLC, the
> + * spinlock will typically only be contended on workloads that do little else
> + * other than hammer the runqueue.

Would there be scalability issues on large LLCs?

> + *
> + * WHY
> + * ===
> + *
> + * As mentioned above, the main benefit of shared_runq is that it enables more
> + * aggressive work conservation in the scheduler. This can benefit workloads
> + * that benefit more from CPU utilization than from L1/L2 cache locality.
> + *
> + * shared_runqs are segmented across LLCs both to avoid contention on the
> + * shared_runq spinlock by minimizing the number of CPUs that could contend on
> + * it, as well as to strike a balance between work conservation, and L3 cache
> + * locality.
> + */
> +struct shared_runq {
> + struct list_head list;
> + spinlock_t lock;
> +} ____cacheline_aligned;
> +
> #ifdef CONFIG_SMP
> +static struct shared_runq *rq_shared_runq(struct rq *rq)
> +{
> + return rq->cfs.shared_runq;
> +}
> +
> +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> +{
> + unsigned long flags;
> + struct task_struct *p;
> + struct shared_runq *shared_runq;
> +
> + shared_runq = rq_shared_runq(rq);
> + if (list_empty(&shared_runq->list))
> + return NULL;
> +
> + spin_lock_irqsave(&shared_runq->lock, flags);
> + p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> + shared_runq_node);
> + if (p && is_cpu_allowed(p, cpu_of(rq)))
> + list_del_init(&p->shared_runq_node);
> + else
> + p = NULL;
> + spin_unlock_irqrestore(&shared_runq->lock, flags);
> +
> + return p;
> +}
> +
> +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> +{
> + unsigned long flags;
> + struct shared_runq *shared_runq;
> +
> + shared_runq = rq_shared_runq(rq);
> + spin_lock_irqsave(&shared_runq->lock, flags);
> + list_add_tail(&p->shared_runq_node, &shared_runq->list);
> + spin_unlock_irqrestore(&shared_runq->lock, flags);
> +}
> +
> static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
> int enq_flags)
> -{}
> +{
> + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> + /*
> + * Only enqueue the task in the shared runqueue if:
> + *
> + * - SWQUEUE is enabled
> + * - The task is on the wakeup path
> + * - The task wasn't purposefully migrated to the current rq by
> + * select_task_rq()
> + * - The task isn't pinned to a specific CPU
> + */
> + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> + return;
> +
> + shared_runq_push_task(rq, p);
> +}
>
> static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
> {
> - return 0;
> + struct task_struct *p = NULL;
> + struct rq *src_rq;
> + struct rq_flags src_rf;
> + int ret;
> +
> + p = shared_runq_pop_task(rq);
> + if (!p)
> + return 0;
> +
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);

It would be better to use rq_unlock(rq, rf) for simplicity.
But it's absolutely OK if you want to keep it as it is to
correspond with the lock & repin part below :)

> +
> + src_rq = task_rq_lock(p, &src_rf);
> +
> + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {

IMHO it should be:

if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {

> + update_rq_clock(src_rq);
> + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> + }
> +
> + if (src_rq->cpu != rq->cpu)

Why not just 'if (src_rq != rq)'?

> + ret = 1;
> + else
> + ret = -1;

What about making @ret default to -1 and changing to 1 right
after move_queued_task()? Both for better readability and to align
with the behavior of newidle_balance().

> +
> + task_rq_unlock(src_rq, p, &src_rf);
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);

By making it look more ugly, we can save some cycles...

if (src_rq != rq) {
task_rq_unlock(src_rq, p, &src_rf);
} else {
rq_unpin_lock(src_rq, src_rf);
raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
rq_repin_lock(rq, rf);
}

> +
> + return ret;
> }
>
> static void shared_runq_dequeue_task(struct task_struct *p)
> -{}
> +{
> + unsigned long flags;
> + struct shared_runq *shared_runq;
> +
> + if (!list_empty(&p->shared_runq_node)) {
> + shared_runq = rq_shared_runq(task_rq(p));
> + spin_lock_irqsave(&shared_runq->lock, flags);
> + list_del_init(&p->shared_runq_node);
> + spin_unlock_irqrestore(&shared_runq->lock, flags);
> + }
> +}
>
> /*
> * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void)
> #endif /* SMP */
>
> }
> +
> +__init void init_sched_fair_class_late(void)
> +{
> +#ifdef CONFIG_SMP
> + int i;
> + struct shared_runq *shared_runq;
> + struct rq *rq;
> + struct rq *llc_rq;
> +
> + for_each_possible_cpu(i) {
> + if (per_cpu(sd_llc_id, i) == i) {
> + llc_rq = cpu_rq(i);
> +
> + shared_runq = kzalloc_node(sizeof(struct shared_runq),
> + GFP_KERNEL, cpu_to_node(i));
> + INIT_LIST_HEAD(&shared_runq->list);
> + spin_lock_init(&shared_runq->lock);
> + llc_rq->cfs.shared_runq = shared_runq;
> + }
> + }
> +
> + for_each_possible_cpu(i) {
> + rq = cpu_rq(i);
> + llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> +
> + if (rq == llc_rq)
> + continue;
> + rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
> + }
> +#endif /* SMP */
> +}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 187ad5da5ef6..8b573dfaba33 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -576,6 +576,7 @@ struct cfs_rq {
> #endif
>
> #ifdef CONFIG_SMP
> + struct shared_runq *shared_runq;

I would suggest moving shared_runq into the shared LLC sched domain,
which is pointed to by sd_llc_shared, as you might eventually need to
put the retrieval of shared_runq under RCU critical sections to handle
domain re-building cases.

Best Regards,
Abel

> /*
> * CFS load tracking
> */
> @@ -2440,6 +2441,7 @@ extern void update_max_interval(void);
> extern void init_sched_dl_class(void);
> extern void init_sched_rt_class(void);
> extern void init_sched_fair_class(void);
> +extern void init_sched_fair_class_late(void);
>
> extern void reweight_task(struct task_struct *p, int prio);
>

2023-07-12 13:24:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues

On Wed, Jul 12, 2023 at 03:36:27PM +0530, Gautham R. Shenoy wrote:

> On some Intel servers, it is possible that the CPU numbers are
> interleaved across the two sockets. On my 2 socket, 32Cores per socket
> Ice Lake Server, all the even numbered CPUs are in one socket and all
> the odd numbered CPUs in the other socket.
>
> The SMT siblings are {0,64}, {2, 66}, .... on one socket and {1, 65},
> {3, 67}, .. on the other.

Yeah, Intel SMT enumeration is a mess. There's a random mix of {n,n+1}
and {0..n-1} {n..2n-1}. And then there's the fun hybrid stuff. Those
appear to do {n,n+1} for the big cores and then continue with the small
cores in a dense set. My 8+8 ADL has:

{0,1} {2,3} {4,5} {6,7} {8,9} {10,11} {12,13} {14,15} {16} {17} {18} {19} {20} {21} {22} {23}

I suspect it might be easier to re-number the whole show at boot to fit
a sane pattern rather than trying to match the various random garbage
gifted to us by the BIOS.


I wouldn't worry about it too much at this point.

2023-07-12 19:31:03

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On Wed, Jul 12, 2023 at 11:30:23AM +0530, Gautham R. Shenoy wrote:
> Hello David,
>
> On Mon, Jul 10, 2023 at 03:03:40PM -0500, David Vernet wrote:
>
> [..snip..]
>
> > ---
>
> > +
> > +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> > +{
> > + unsigned long flags;
> > + struct task_struct *p;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + if (list_empty(&shared_runq->list))
> > + return NULL;
> > +
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> > + shared_runq_node);
>
>
> Apologies for the bikeshedding comment : Here you are attempting to
> remove the task from the "head", while in shared_runq_push_task below,
> you are adding a task to the tail. Which is the usual queue
> semantics. Then why call them shared_runq_pop_task() and
> shared_runq_push_task() ?
>
> Can we name them __shared_runq_enqueue_task() and
> __shared_runq_pick_next_task() instead ?

Hello Gautham,

So this was previously discussed in [0]. I'm fine with changing the
names if that's others' preferences as well. I think what we have now is
nice in that push and pop are list operations whereas enqueue / dequeue
are scheduler operations, but yeah, push / pop are more-so stack than
queue ops. Using __ to make the list ops "private" is fine with me.

[0]: https://lore.kernel.org/lkml/[email protected]/

> > + if (p && is_cpu_allowed(p, cpu_of(rq)))
> > + list_del_init(&p->shared_runq_node);
> > + else
> > + p = NULL;
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +
> > + return p;
> > +}
> > +
> > +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> > +{
> > + unsigned long flags;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + list_add_tail(&p->shared_runq_node, &shared_runq->list);
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +}
> > +
> > static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
> > int enq_flags)
> > -{}
> > +{
> > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > + /*
> > + * Only enqueue the task in the shared runqueue if:
> > + *
> > + * - SWQUEUE is enabled
> > + * - The task is on the wakeup path
> > + * - The task wasn't purposefully migrated to the current rq by
> > + * select_task_rq()
> > + * - The task isn't pinned to a specific CPU
> > + */
> > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > + return;
> > +
> > + shared_runq_push_task(rq, p);
> > +}
> >
> > static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
> > {
> > - return 0;
> > + struct task_struct *p = NULL;
> > + struct rq *src_rq;
> > + struct rq_flags src_rf;
> > + int ret;
> > +
> > + p = shared_runq_pop_task(rq);
> > + if (!p)
> > + return 0;
> > +
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
> > +
> > + src_rq = task_rq_lock(p, &src_rf);
> > +
> > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
> > + update_rq_clock(src_rq);
> > + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> > + }
> > +
> > + if (src_rq->cpu != rq->cpu)
> > + ret = 1;
> > + else
> > + ret = -1;
>
>
> So if src_rq->cpu != rq->cpu, then the task has _not_ been moved to
> rq. But you return 1.
>
> While in the else case, since src_rq->cpu == rq->cpu, the task has
> been successfully moved to rq. But you are returning -1,
>
> If newidle_balance() were to interpret this return value as the number
> of tasks pulled, then, shouldn't it be the other way around ?

Yeah, good call. Will incorporate this into v3.

> > +
> > + task_rq_unlock(src_rq, p, &src_rf);
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
> > +
> > + return ret;
> > }
> >
>
> --
> Thanks and Regards
> gautham.

2023-07-12 22:01:36

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls

On Wed, Jul 12, 2023 at 04:39:03PM +0800, Abel Wu wrote:
> On 7/11/23 4:03 AM, David Vernet wrote:
> > @@ -6467,6 +6489,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > dequeue_throttle:
> > util_est_update(&rq->cfs, p, task_sleep);
> > hrtick_update(rq);
> > +
> > + if (sched_feat(SHARED_RUNQ))
> > + shared_runq_dequeue_task(p);
>
> Would disabling SHARED_RUNQ cause task list nodes to be left
> in the shared stateful runqueue?

Hi Abel,

Yes, good call, there will be some stale tasks. The obvious way to get
around that would of course be to always call
shared_runq_dequeue_task(p) on the __dequeue_entity() path, but it would
be silly to tax a hot path in the scheduler in support of a feature
that's disabled by default.

At first I was thinking that the only issue would be some overhead in
clearing stale tasks once it was re-enabled, but that we'd be OK because
of this check in shared_runq_pick_next_task():

        if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
                update_rq_clock(src_rq);
                src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
        }

So we wouldn't migrate tasks that weren't actually suitable. But that's
obviously wrong. It's not safe to keep stale tasks in that list for (at
least) two reasons.

- A task could exit (which would be easy to fix by just adding a dequeue
call in task_dead_fair())
- We could have a double-add if a task is re-enqueued in the list after
having been previously enqueued, but then never dequeued due to the
timing of disabling SHARED_RUNQ.

Not sure what the best solution is here. We could always address this by
draining the list when the feature is disabled, but there's not yet a
mechanism to hook in a callback to be invoked when a scheduler feature
is enabled/disabled. It shouldn't be too hard to add that, assuming
other sched folks are amenable to it. It should just be a matter of
adding another __SCHED_FEAT_NR-sized table of NULL-able callbacks that
are invoked on enable / disable state change, and which can be specified
in a new SCHED_FEAT_CALLBACK or something macro.
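
Roughly something like the below, completely untested and hand-wavy,
with a plain callback table instead of a new SCHED_FEAT_CALLBACK()
macro (shared_runq_toggle(), sched_feat_cbs[] and sched_feat_notify()
are all made-up names):

        /* Hypothetical callback invoked when SHARED_RUNQ is toggled. */
        static void shared_runq_toggle(bool enabled)
        {
                struct task_struct *p, *tmp;
                unsigned long flags;
                int cpu;

                if (enabled)
                        return;

                /*
                 * Drain any stale tasks when the feature is disabled. Each
                 * LLC's queue is visited once per CPU, but draining an
                 * already-empty list is a no-op, and this is a rare path.
                 */
                for_each_possible_cpu(cpu) {
                        struct shared_runq *shared_runq = rq_shared_runq(cpu_rq(cpu));

                        spin_lock_irqsave(&shared_runq->lock, flags);
                        list_for_each_entry_safe(p, tmp, &shared_runq->list,
                                                 shared_runq_node)
                                list_del_init(&p->shared_runq_node);
                        spin_unlock_irqrestore(&shared_runq->lock, flags);
                }
        }

        typedef void (*sched_feat_cb)(bool enabled);

        /* Hypothetical table, parallel to sched_feat_names[] in debug.c. */
        static const sched_feat_cb sched_feat_cbs[__SCHED_FEAT_NR] = {
                [__SCHED_FEAT_SHARED_RUNQ] = shared_runq_toggle,
        };

        /* Hypothetical hook, called from sched_feat_set() after a feature flips. */
        static void sched_feat_notify(int i, bool enabled)
        {
                if (sched_feat_cbs[i])
                        sched_feat_cbs[i](enabled);
        }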

Peter -- what do you think?

Thanks,
David

2023-07-12 23:02:18

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On Wed, Jul 12, 2023 at 06:47:26PM +0800, Abel Wu wrote:
> Hi David, interesting patch!

Hi Abel,

Thanks for the helpful review!

> On 7/11/23 4:03 AM, David Vernet wrote:
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 1292d38d66cc..5c05a3da3d50 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -770,6 +770,8 @@ struct task_struct {
> > unsigned long wakee_flip_decay_ts;
> > struct task_struct *last_wakee;
> > + struct list_head shared_runq_node;
> > +
> > /*
> > * recent_used_cpu is initially set as the last CPU used by a task
> > * that wakes affine another task. Waker/wakee relationships can
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1451f5aa82ac..3ad437d4ea3d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4503,6 +4503,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > #ifdef CONFIG_SMP
> > p->wake_entry.u_flags = CSD_TYPE_TTWU;
> > p->migration_pending = NULL;
> > + INIT_LIST_HEAD(&p->shared_runq_node);
> > #endif
> > init_sched_mm_cid(p);
> > }
> > @@ -9842,6 +9843,7 @@ void __init sched_init_smp(void)
> > init_sched_rt_class();
> > init_sched_dl_class();
> > + init_sched_fair_class_late();
> > sched_smp_initialized = true;
> > }
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f7967be7646c..ff2491387201 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -139,18 +139,163 @@ static int __init setup_sched_thermal_decay_shift(char *str)
> > }
> > __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
> > +/**
> > + * struct shared_runq - Per-LLC queue structure for enqueuing and pulling
> > + * waking tasks.
> > + *
> > + * WHAT
> > + * ====
> > + *
> > + * This structure enables the scheduler to be more aggressively work
> > + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> > + * pulled from when another core in the LLC is going to go idle.
> > + *
> > + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
> > + * Waking tasks are enqueued in a shared_runq at the end of
> > + * enqueue_task_fair(), and are opportunistically pulled from the shared_runq
> > + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
> > + * to being pulled from the shared_runq, in which case they're simply dequeued
> > + * from the shared_runq. A waking task is only enqueued to a shared_runq when
> > + * it was _not_ manually migrated to the current runqueue by
> > + * select_task_rq_fair().
> > + *
> > + * There is currently no task-stealing between shared_runqs in different LLCs,
> > + * which means that shared_runq is not fully work conserving. This could be
> > + * added at a later time, with tasks likely only being stolen across
> > + * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
> > + *
> > + * HOW
> > + * ===
> > + *
> > + * An shared_runq is comprised of a list, and a spinlock for synchronization.
> > + * Given that the critical section for a shared_runq is typically a fast list
> > + * operation, and that the shared_runq is localized to a single LLC, the
> > + * spinlock will typically only be contended on workloads that do little else
> > + * other than hammer the runqueue.
>
> Would there be scalability issues on large LLCs?

See the next patch in the series [0] where we shard the per-LLC shared
runqueues to avoid contention.

[0]: https://lore.kernel.org/lkml/[email protected]/

>
> > + *
> > + * WHY
> > + * ===
> > + *
> > + * As mentioned above, the main benefit of shared_runq is that it enables more
> > + * aggressive work conservation in the scheduler. This can benefit workloads
> > + * that benefit more from CPU utilization than from L1/L2 cache locality.
> > + *
> > + * shared_runqs are segmented across LLCs both to avoid contention on the
> > + * shared_runq spinlock by minimizing the number of CPUs that could contend on
> > + * it, as well as to strike a balance between work conservation, and L3 cache
> > + * locality.
> > + */
> > +struct shared_runq {
> > + struct list_head list;
> > + spinlock_t lock;
> > +} ____cacheline_aligned;
> > +
> > #ifdef CONFIG_SMP
> > +static struct shared_runq *rq_shared_runq(struct rq *rq)
> > +{
> > + return rq->cfs.shared_runq;
> > +}
> > +
> > +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> > +{
> > + unsigned long flags;
> > + struct task_struct *p;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + if (list_empty(&shared_runq->list))
> > + return NULL;
> > +
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> > + shared_runq_node);
> > + if (p && is_cpu_allowed(p, cpu_of(rq)))
> > + list_del_init(&p->shared_runq_node);
> > + else
> > + p = NULL;
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +
> > + return p;
> > +}
> > +
> > +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> > +{
> > + unsigned long flags;
> > + struct shared_runq *shared_runq;
> > +
> > + shared_runq = rq_shared_runq(rq);
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + list_add_tail(&p->shared_runq_node, &shared_runq->list);
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > +}
> > +
> > static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
> > int enq_flags)
> > -{}
> > +{
> > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > + /*
> > + * Only enqueue the task in the shared runqueue if:
> > + *
> > + * - SWQUEUE is enabled
> > + * - The task is on the wakeup path
> > + * - The task wasn't purposefully migrated to the current rq by
> > + * select_task_rq()
> > + * - The task isn't pinned to a specific CPU
> > + */
> > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > + return;
> > +
> > + shared_runq_push_task(rq, p);
> > +}
> > static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
> > {
> > - return 0;
> > + struct task_struct *p = NULL;
> > + struct rq *src_rq;
> > + struct rq_flags src_rf;
> > + int ret;
> > +
> > + p = shared_runq_pop_task(rq);
> > + if (!p)
> > + return 0;
> > +
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
>
> It would be better to use rq_unlock(rq, rf) for simplicity.
> But it's absolutely OK if you want to keep it as it is to
> correspond with the lock & repin part below :)

Yeah, personally I'd prefer to keep the style consistent between
here and below.

> > +
> > + src_rq = task_rq_lock(p, &src_rf);
> > +
> > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
>
> IMHO it should be:
>
> if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {

Agreed, will change in v3.

> > + update_rq_clock(src_rq);
> > + src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> > + }
> > +
> > + if (src_rq->cpu != rq->cpu)
>
> Why not just 'if (src_rq != rq)'?

Ack, will change.

>
> > + ret = 1;
> > + else
> > + ret = -1;
>
> What about making @ret default to -1 and changing to 1 right
> after move_queued_task()? Both for better readability and to align
> with the behavior of newidle_balance().

Yep. This aligns with Gautham's suggestion in [1] as well. Will combine
your input for v3.

[1]: https://lore.kernel.org/lkml/ZK5BdysC0lxKQ%[email protected]/

>
> > +
> > + task_rq_unlock(src_rq, p, &src_rf);
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
>
> By making it look more ugly, we can save some cycles...
>
> if (src_rq != rq) {
> task_rq_unlock(src_rq, p, &src_rf);
> } else {
> rq_unpin_lock(src_rq, src_rf);
> raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
> rq_repin_lock(rq, rf);
> }

I like it. Thanks for the suggestion. I'll apply it for v3.

> > +
> > + return ret;
> > }
> > static void shared_runq_dequeue_task(struct task_struct *p)
> > -{}
> > +{
> > + unsigned long flags;
> > + struct shared_runq *shared_runq;
> > +
> > + if (!list_empty(&p->shared_runq_node)) {
> > + shared_runq = rq_shared_runq(task_rq(p));
> > + spin_lock_irqsave(&shared_runq->lock, flags);
> > + list_del_init(&p->shared_runq_node);
> > + spin_unlock_irqrestore(&shared_runq->lock, flags);
> > + }
> > +}
> > /*
> > * For asym packing, by default the lower numbered CPU has higher priority.
> > @@ -12854,3 +12999,34 @@ __init void init_sched_fair_class(void)
> > #endif /* SMP */
> > }
> > +
> > +__init void init_sched_fair_class_late(void)
> > +{
> > +#ifdef CONFIG_SMP
> > + int i;
> > + struct shared_runq *shared_runq;
> > + struct rq *rq;
> > + struct rq *llc_rq;
> > +
> > + for_each_possible_cpu(i) {
> > + if (per_cpu(sd_llc_id, i) == i) {
> > + llc_rq = cpu_rq(i);
> > +
> > + shared_runq = kzalloc_node(sizeof(struct shared_runq),
> > + GFP_KERNEL, cpu_to_node(i));
> > + INIT_LIST_HEAD(&shared_runq->list);
> > + spin_lock_init(&shared_runq->lock);
> > + llc_rq->cfs.shared_runq = shared_runq;
> > + }
> > + }
> > +
> > + for_each_possible_cpu(i) {
> > + rq = cpu_rq(i);
> > + llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> > +
> > + if (rq == llc_rq)
> > + continue;
> > + rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
> > + }
> > +#endif /* SMP */
> > +}
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 187ad5da5ef6..8b573dfaba33 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -576,6 +576,7 @@ struct cfs_rq {
> > #endif
> > #ifdef CONFIG_SMP
> > + struct shared_runq *shared_runq;
>
> I would suggest moving shared_runq into the shared LLC sched domain,
> which is pointed to by sd_llc_shared, as you might eventually need to
> put the retrieval of shared_runq under RCU critical sections to handle
> domain re-building cases.

Yep, I have plans to add support for domain re-building in v3.
I'll see whether it makes sense to put them into the shared LLC sched
domain, but at first glance it seems like a sane choice.
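
If it does work out, I'd imagine the lookup would end up looking
something like the below sketch, assuming a shared_runq (or an array of
shards) embedded in struct sched_domain_shared, which of course doesn't
exist today:

        static struct shared_runq *rq_shared_runq(struct rq *rq)
        {
                struct sched_domain_shared *sds;

                /*
                 * Look up the LLC's shared domain under RCU so that
                 * concurrent domain rebuilds are handled. Callers would
                 * need to be in an RCU read-side critical section, and
                 * would have to tolerate a NULL return while the domains
                 * are being rebuilt.
                 */
                sds = rcu_dereference(per_cpu(sd_llc_shared, cpu_of(rq)));
                if (!sds)
                        return NULL;

                return &sds->shared_runq;
        }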

>
> Best Regards,
> Abel

Thanks for the review.

- David

>
> > /*
> > * CFS load tracking
> > */
> > @@ -2440,6 +2441,7 @@ extern void update_max_interval(void);
> > extern void init_sched_dl_class(void);
> > extern void init_sched_rt_class(void);
> > extern void init_sched_fair_class(void);
> > +extern void init_sched_fair_class_late(void);
> > extern void reweight_task(struct task_struct *p, int prio);

2023-07-13 04:09:15

by Abel Wu

[permalink] [raw]
Subject: Re: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On 7/13/23 6:16 AM, David Vernet wrote:
> On Wed, Jul 12, 2023 at 06:47:26PM +0800, Abel Wu wrote:
>>> + *
>>> + * HOW
>>> + * ===
>>> + *
>>> + * An shared_runq is comprised of a list, and a spinlock for synchronization.
>>> + * Given that the critical section for a shared_runq is typically a fast list
>>> + * operation, and that the shared_runq is localized to a single LLC, the
>>> + * spinlock will typically only be contended on workloads that do little else
>>> + * other than hammer the runqueue.
>>
>> Would there be scalability issues on large LLCs?
>
> See the next patch in the series [0] where we shard the per-LLC shared
> runqueues to avoid contention.
>
> [0]: https://lore.kernel.org/lkml/[email protected]/

Sorry, I should have read the cover letter more carefully. By sharding,
the LLC is partitioned into several zones, hence contention is relieved.
But sharding itself might be tricky. Making the SMT siblings not cross
shards, as suggested by Peter, is generally a good thing. But I wonder
if there is any workload that might benefit from a different sharding scheme.

>>
>>> +
>>> + task_rq_unlock(src_rq, p, &src_rf);
>>> +
>>> + raw_spin_rq_lock(rq);
>>> + rq_repin_lock(rq, rf);
>>
>> By making it look more ugly, we can save some cycles...
>>
>> if (src_rq != rq) {
>> task_rq_unlock(src_rq, p, &src_rf);
>> } else {
>> rq_unpin_lock(src_rq, src_rf);
>> raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>> rq_repin_lock(rq, rf);
>> }

I forgot the repin part when src_rq != rq, but I'm sure you already got
my point :)

Cheers,
Abel

2023-07-13 04:18:49

by David Vernet

[permalink] [raw]
Subject: Re: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On Thu, Jul 13, 2023 at 11:43:50AM +0800, Abel Wu wrote:
> On 7/13/23 6:16 AM, David Vernet wrote:
> > On Wed, Jul 12, 2023 at 06:47:26PM +0800, Abel Wu wrote:
> > > > + *
> > > > + * HOW
> > > > + * ===
> > > > + *
> > > > + * An shared_runq is comprised of a list, and a spinlock for synchronization.
> > > > + * Given that the critical section for a shared_runq is typically a fast list
> > > > + * operation, and that the shared_runq is localized to a single LLC, the
> > > > + * spinlock will typically only be contended on workloads that do little else
> > > > + * other than hammer the runqueue.
> > >
> > > Would there be scalability issues on large LLCs?
> >
> > See the next patch in the series [0] where we shard the per-LLC shared
> > runqueues to avoid contention.
> >
> > [0]: https://lore.kernel.org/lkml/[email protected]/
>
> Sorry, I should have read the cover letter more carefully. By sharding,
> the LLC is partitioned into several zones, hence contention is relieved.
> But sharding itself might be tricky. Making the SMT siblings not cross
> shards, as suggested by Peter, is generally a good thing. But I wonder
> if there is any workload might benefit from other sharding form.

IMO for now the best thing to do is keep things simple until we observe
it being a problem in practice. Or to at least plan to address it in a
follow-on patch set when we add support for dynamic shard sizing.
Proper shard sizing is required to do optimal SMT placement anyways, and
in general will likely have more of an impact on performance.

> > >
> > > > +
> > > > + task_rq_unlock(src_rq, p, &src_rf);
> > > > +
> > > > + raw_spin_rq_lock(rq);
> > > > + rq_repin_lock(rq, rf);
> > >
> > > By making it look more ugly, we can save some cycles...
> > >
> > > if (src_rq != rq) {
> > > task_rq_unlock(src_rq, p, &src_rf);
> > > } else {
> > > rq_unpin_lock(src_rq, src_rf);
> > > raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
> > > rq_repin_lock(rq, rf);
> > > }
>
> I forgot the repin part when src_rq != rq, but I'm sure you already got
> my point :)

Yep, I got your main point which was to avoid the extra dropping and
acquiring of the same rq spinlock. FYI for posterity / anyone else
reading, the missing repin isn't sufficient on its own for the src_rq !=
rq path. We also need to reacquire the rq lock.
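
Putting your suggestions together with Gautham's point about the return
value, I expect the v3 pull path to look roughly like this (untested
sketch):

        static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
        {
                struct task_struct *p;
                struct rq *src_rq;
                struct rq_flags src_rf;
                int ret = -1;

                p = shared_runq_pop_task(rq);
                if (!p)
                        return 0;

                rq_unpin_lock(rq, rf);
                raw_spin_rq_unlock(rq);

                src_rq = task_rq_lock(p, &src_rf);

                if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
                        update_rq_clock(src_rq);
                        src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
                        /* Report a successful pull to newidle_balance(). */
                        ret = 1;
                }

                if (src_rq != rq) {
                        /* Task wasn't moved here; drop src_rq and retake our rq lock. */
                        task_rq_unlock(src_rq, p, &src_rf);
                        raw_spin_rq_lock(rq);
                } else {
                        /* Already holding rq's lock; just swap the pin over to rf. */
                        rq_unpin_lock(src_rq, &src_rf);
                        raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
                }
                rq_repin_lock(rq, rf);

                return ret;
        }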

Thanks again for the helpful reviews.

- David

2023-07-13 08:36:29

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On Mon, Jul 10, 2023 at 03:03:40PM -0500, David Vernet wrote:
> static void shared_runq_dequeue_task(struct task_struct *p)
> -{}
> +{
> + unsigned long flags;
> + struct shared_runq *shared_runq;
> +
> + if (!list_empty(&p->shared_runq_node)) {
> + shared_runq = rq_shared_runq(task_rq(p));
> + spin_lock_irqsave(&shared_runq->lock, flags);

After taking the lock, p may have been popped by another newidle cpu.
Running list_del_init() doesn't seem to hurt, but is this intended?

> + list_del_init(&p->shared_runq_node);
> + spin_unlock_irqrestore(&shared_runq->lock, flags);
> + }
> +}

2023-07-13 08:48:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] sched: Implement shared runqueue in CFS

On Mon, Jul 10, 2023 at 03:03:40PM -0500, David Vernet wrote:
> +struct shared_runq {
> + struct list_head list;
> + spinlock_t lock;

FWIW, this really needs to be raw_spinlock_t.

> +} ____cacheline_aligned;
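
I.e. something like the below, with the spin_lock_irqsave() /
spin_unlock_irqrestore() calls switched to their raw_ counterparts:

        struct shared_runq {
                struct list_head        list;
                raw_spinlock_t          lock;
        } ____cacheline_aligned;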

2023-07-21 09:34:48

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

Hello David,

On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> Changes
> -------
>
> This is v2 of the shared wakequeue (now called shared runqueue)
> patchset. The following are changes from the RFC v1 patchset
> (https://lore.kernel.org/lkml/[email protected]/).
>
> v1 -> v2 changes:
> - Change name from swqueue to shared_runq (Peter)
>
> - Sharded per-LLC shared runqueues to avoid contention on
> scheduler-heavy workloads (Peter)
>
> - Pull tasks from the shared_runq in newidle_balance() rather than in
> pick_next_task_fair() (Peter and Vincent)
>
> - Rename a few functions to reflect their actual purpose. For example,
> shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
>
> - Expose move_queued_task() from core.c rather than migrate_task_to()
> (Peter)
>
> - Properly check is_cpu_allowed() when pulling a task from a shared_runq
> to ensure it can actually be migrated (Peter and Gautham)
>
> - Dropped RFC tag
>
> This patch set is based off of commit ebb83d84e49b ("sched/core: Avoid
> multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") on the
> sched/core branch of tip.git.

I have evaluated this v2 patchset on AMD Zen3 and Zen4 servers.

tldr:

* We see non-trivial improvements on hackbench on both Zen3 and Zen4
until the system is super-overloaded, at which point we see
regressions.

* tbench shows regressions on Zen3 with each client
configuration. tbench on Zen4 shows some improvements when the system is
overloaded.

* netperf shows minor improvements on Zen3 when the system is under
low to moderate load. netperf regresses on Zen3 at high load, and at
all load-points on Zen4.

* Stream, SPECjbb2015 and Mongodb show no significant difference compared
to the current tip.

* With netperf and tbench, using the shared-runqueue during
enqueue_entity performs badly.


Server configurations used:

AMD Zen3 Server:
* 2 sockets,
* 64 cores per socket,
* SMT2 enabled
* Total of 256 threads.
* Configured in Nodes-Per-Socket(NPS)=1

AMD Zen4 Server:
* 2 sockets,
* 128 cores per socket,
* SMT2 enabled
* Total of 512 threads.
* Configured in Nodes-Per-Socket(NPS)=1


The trends on NPS=2 and NPS=4 are similar. So I am not posting those.


Legend:
tip : Tip kernel with top commit ebb83d84e49b
("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")

swqueue_v1 : Your v1 patches applied on top of the aforementioned tip commit.

noshard : shared-runqueue v2 patches 1-5. This uses a shared-runqueue
during wakeup. No sharding.

shard_wakeup : shared-runqueue v2 patches 1-6. This uses a
shared-runqueue during wakeup and has shards with
shard size = 6 (default)

shard_all : v2 patches 1-7. This uses a sharded shared-runqueue during
enqueue_entity


==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
==================================================================

Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Case: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1-groups 1.00 [ 0.00]( 4.39) 0.89 [ 11.39]( 6.86) 0.90 [ 9.87](12.51) 0.92 [ 7.85]( 6.29) 0.91 [ 8.86]( 4.67)
2-groups 1.00 [ 0.00]( 1.64) 0.81 [ 18.90]( 2.67) 0.84 [ 15.71]( 5.91) 0.87 [ 12.74]( 2.26) 0.87 [ 12.53]( 2.40)
4-groups 1.00 [ 0.00]( 1.27) 0.89 [ 10.85]( 2.38) 0.94 [ 6.20]( 2.99) 0.97 [ 2.71]( 2.38) 0.96 [ 4.46]( 1.06)
8-groups 1.00 [ 0.00]( 0.72) 0.94 [ 6.37]( 2.25) 0.96 [ 3.61]( 3.39) 1.09 [ -8.78]( 1.14) 1.06 [ -5.68]( 1.55)
16-groups 1.00 [ 0.00]( 4.72) 0.96 [ 3.64]( 1.11) 1.01 [ -1.26]( 1.68) 1.07 [ -7.41]( 2.06) 1.10 [ -9.55]( 1.48)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Case: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1-groups 1.00 [ 0.00](13.13) 0.90 [ 10.34](10.27) 0.94 [ 6.44](10.99) 0.94 [ 6.44](11.45) 0.91 [ 9.20](13.31)
2-groups 1.00 [ 0.00](10.27) 0.97 [ 2.86]( 6.51) 0.94 [ 5.71]( 4.65) 1.00 [ -0.48](10.25) 0.95 [ 4.52]( 6.60)
4-groups 1.00 [ 0.00]( 2.73) 0.89 [ 10.80]( 7.75) 0.82 [ 17.61]( 9.57) 0.83 [ 16.67](10.64) 0.81 [ 18.56]( 9.58)
8-groups 1.00 [ 0.00]( 1.75) 0.85 [ 15.22]( 5.16) 0.83 [ 17.01]( 4.28) 0.90 [ 10.45](10.05) 0.79 [ 21.04]( 2.84)
16-groups 1.00 [ 0.00]( 0.54) 1.16 [-15.84]( 2.17) 1.16 [-16.09]( 3.59) 1.24 [-24.26]( 4.22) 1.13 [-12.87]( 3.76)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
==================================================================

Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.46) 0.99 [ -1.33]( 0.94) 0.98 [ -1.65]( 0.37) 1.00 [ -0.23]( 0.15) 0.95 [ -4.77]( 0.93)
2 1.00 [ 0.00]( 0.35) 0.99 [ -1.00]( 0.41) 1.00 [ -0.12]( 0.45) 0.99 [ -0.62]( 1.43) 0.92 [ -8.48]( 5.19)
4 1.00 [ 0.00]( 2.37) 0.99 [ -0.70]( 0.39) 0.98 [ -2.49]( 0.79) 0.98 [ -1.92]( 0.85) 0.91 [ -8.54]( 0.56)
8 1.00 [ 0.00]( 0.35) 1.01 [ 0.88]( 0.21) 0.97 [ -3.13]( 0.96) 0.99 [ -1.44]( 0.96) 0.89 [-11.48]( 1.31)
16 1.00 [ 0.00]( 2.50) 1.00 [ 0.34]( 1.34) 0.97 [ -2.96]( 1.17) 0.97 [ -2.88]( 1.85) 0.84 [-16.41]( 1.31)
32 1.00 [ 0.00]( 1.79) 0.98 [ -1.97]( 3.81) 0.99 [ -1.17]( 1.89) 0.95 [ -4.83]( 2.08) 0.52 [-48.23]( 1.11)
64 1.00 [ 0.00]( 5.75) 1.17 [ 17.41]( 0.79) 0.97 [ -3.14](10.67) 1.07 [ 6.94]( 3.08) 0.41 [-59.17]( 1.88)
128 1.00 [ 0.00]( 3.16) 0.87 [-13.37]( 7.98) 0.73 [-26.87]( 1.07) 0.74 [-25.81]( 0.97) 0.35 [-64.93]( 0.68)
256 1.00 [ 0.00]( 0.21) 0.93 [ -6.86]( 2.81) 0.85 [-15.26]( 3.17) 0.91 [ -9.39]( 1.05) 0.90 [ -9.94]( 0.97)
512 1.00 [ 0.00]( 0.23) 0.88 [-11.79](15.25) 0.83 [-17.35]( 1.21) 0.87 [-12.96]( 2.63) 0.99 [ -1.18]( 0.80)
1024 1.00 [ 0.00]( 0.44) 0.99 [ -0.98]( 0.43) 0.77 [-23.18]( 5.24) 0.82 [-17.83]( 0.70) 0.96 [ -3.82]( 1.61)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.19) 0.98 [ -1.72]( 0.19) 0.99 [ -1.15]( 0.28) 0.98 [ -1.79]( 0.28) 0.99 [ -1.49]( 0.10)
2 1.00 [ 0.00]( 0.63) 0.98 [ -2.28]( 0.63) 0.98 [ -1.91]( 0.26) 0.97 [ -3.14]( 0.25) 0.98 [ -1.77]( 0.32)
4 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 1.13) 0.99 [ -0.69]( 0.57) 0.98 [ -1.59]( 0.35) 0.99 [ -0.64]( 0.18)
8 1.00 [ 0.00]( 1.14) 0.99 [ -0.73]( 0.61) 0.98 [ -2.28]( 2.61) 0.97 [ -2.56]( 0.34) 0.98 [ -1.77]( 0.70)
16 1.00 [ 0.00]( 0.98) 0.97 [ -2.54]( 1.24) 0.98 [ -1.71]( 1.86) 0.98 [ -1.53]( 0.62) 0.96 [ -3.56]( 0.93)
32 1.00 [ 0.00]( 0.76) 0.98 [ -2.31]( 1.35) 0.98 [ -2.06]( 0.77) 0.96 [ -3.53]( 1.63) 0.88 [-11.72]( 2.77)
64 1.00 [ 0.00]( 0.96) 0.96 [ -4.45]( 3.53) 0.97 [ -3.44]( 1.53) 0.96 [ -3.52]( 0.89) 0.31 [-69.03]( 0.64)
128 1.00 [ 0.00]( 3.03) 0.95 [ -4.78]( 0.56) 0.98 [ -2.48]( 0.47) 0.92 [ -7.73]( 0.16) 0.20 [-79.75]( 0.24)
256 1.00 [ 0.00]( 0.04) 0.93 [ -7.21]( 1.00) 0.94 [ -5.90]( 0.63) 0.59 [-41.29]( 1.76) 0.16 [-83.71]( 0.07)
512 1.00 [ 0.00]( 3.08) 1.07 [ 7.07](17.78) 1.15 [ 15.49]( 2.65) 0.82 [-17.53](29.11) 0.93 [ -7.18](32.23)
1024 1.00 [ 0.00]( 0.60) 1.16 [ 15.61]( 0.07) 1.16 [ 15.92]( 0.06) 1.12 [ 11.57]( 1.86) 1.12 [ 11.97]( 0.21)
2048 1.00 [ 0.00]( 0.16) 1.15 [ 14.62]( 0.90) 1.15 [ 15.20]( 0.29) 1.08 [ 7.64]( 1.44) 1.15 [ 14.57]( 0.23)

==================================================================
Test : stream (10 runs)
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
==================================================================

Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
Copy 1.00 [ 0.00]( 5.30) 1.00 [ -0.09]( 5.26) 1.03 [ 2.79]( 2.06) 1.02 [ 1.73]( 5.29) 1.03 [ 2.92]( 1.82)
Scale 1.00 [ 0.00]( 0.27) 0.98 [ -1.98]( 1.82) 0.99 [ -0.62]( 1.24) 1.00 [ -0.26]( 1.42) 1.00 [ 0.10]( 0.17)
Add 1.00 [ 0.00]( 0.48) 0.99 [ -0.90]( 1.71) 1.00 [ 0.12]( 0.95) 1.00 [ 0.18]( 1.45) 1.01 [ 0.56]( 0.14)
Triad 1.00 [ 0.00]( 1.02) 1.03 [ 2.80]( 0.60) 1.02 [ 2.18]( 1.96) 1.03 [ 2.56]( 2.25) 1.03 [ 2.58]( 0.56)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
Copy 1.00 [ 0.00]( 0.79) 0.99 [ -0.65]( 0.83) 1.00 [ 0.39]( 0.47) 0.99 [ -1.41]( 1.16) 0.98 [ -1.73]( 1.17)
Scale 1.00 [ 0.00]( 0.25) 1.00 [ -0.21]( 0.23) 1.01 [ 0.63]( 0.64) 0.99 [ -0.72]( 0.25) 1.00 [ -0.40]( 0.70)
Add 1.00 [ 0.00]( 0.25) 1.00 [ -0.15]( 0.27) 1.00 [ 0.44]( 0.73) 0.99 [ -0.82]( 0.40) 0.99 [ -0.73]( 0.78)
Triad 1.00 [ 0.00]( 0.23) 1.00 [ -0.28]( 0.28) 1.00 [ 0.34]( 0.74) 0.99 [ -0.91]( 0.48) 0.99 [ -0.86]( 0.98)


==================================================================
Test : stream (100 runs)
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
Copy 1.00 [ 0.00]( 0.58) 0.98 [ -2.05]( 6.08) 1.00 [ 0.46]( 0.19) 0.99 [ -0.58]( 1.96) 1.01 [ 0.50]( 0.20)
Scale 1.00 [ 0.00]( 0.47) 0.97 [ -2.71]( 4.94) 1.00 [ -0.01]( 0.23) 1.00 [ -0.35]( 1.39) 1.00 [ 0.10]( 0.20)
Add 1.00 [ 0.00]( 0.18) 0.97 [ -2.77]( 7.30) 1.00 [ 0.27]( 0.13) 1.00 [ 0.05]( 0.59) 1.00 [ 0.19]( 0.15)
Triad 1.00 [ 0.00]( 0.77) 0.99 [ -0.59]( 7.73) 1.03 [ 2.80]( 0.36) 1.03 [ 2.92]( 0.43) 1.02 [ 2.23]( 0.38)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
Copy 1.00 [ 0.00]( 1.45) 0.97 [ -2.65]( 2.02) 1.01 [ 0.85]( 0.96) 1.00 [ -0.24]( 0.67) 0.97 [ -2.77]( 1.09)
Scale 1.00 [ 0.00]( 0.67) 0.99 [ -1.27]( 1.01) 1.01 [ 0.93]( 0.95) 1.00 [ -0.08]( 0.46) 0.99 [ -1.23]( 0.35)
Add 1.00 [ 0.00]( 0.46) 0.99 [ -1.28]( 0.54) 1.01 [ 0.88]( 0.61) 1.00 [ 0.00]( 0.31) 0.98 [ -1.89]( 0.33)
Triad 1.00 [ 0.00]( 0.46) 0.98 [ -1.93]( 0.55) 1.01 [ 0.80]( 0.59) 1.00 [ 0.00]( 0.37) 0.98 [ -2.22]( 0.09)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.25) 0.99 [ -1.22]( 0.88) 0.99 [ -1.34]( 0.26) 1.01 [ 0.65]( 0.44) 0.95 [ -4.71]( 1.14)
2-clients 1.00 [ 0.00]( 0.37) 1.02 [ 1.59]( 1.02) 1.03 [ 2.62]( 1.77) 1.02 [ 2.39]( 1.59) 0.96 [ -4.47]( 0.77)
4-clients 1.00 [ 0.00]( 0.57) 1.03 [ 2.79]( 1.21) 1.03 [ 2.59]( 1.72) 1.01 [ 1.49]( 2.27) 0.92 [ -8.30]( 2.95)
8-clients 1.00 [ 0.00]( 1.09) 1.05 [ 4.84]( 0.99) 1.03 [ 3.13]( 1.70) 1.02 [ 1.53]( 2.25) 0.92 [ -8.23]( 1.64)
16-clients 1.00 [ 0.00]( 1.34) 1.06 [ 5.88]( 0.96) 1.04 [ 3.52]( 1.99) 0.99 [ -1.10]( 2.28) 0.77 [-22.58]( 0.96)
32-clients 1.00 [ 0.00]( 5.77) 1.08 [ 8.30]( 2.26) 1.00 [ 0.06]( 2.18) 1.02 [ 2.12]( 2.31) 0.44 [-56.08]( 1.50)
64-clients 1.00 [ 0.00]( 3.14) 0.94 [ -5.93]( 3.19) 0.76 [-24.05]( 1.65) 0.85 [-15.44]( 3.05) 0.33 [-66.71]( 7.28)
128-clients 1.00 [ 0.00]( 1.74) 0.73 [-26.70]( 3.10) 0.64 [-35.94]( 3.64) 0.73 [-26.93]( 5.07) 0.36 [-63.97]( 7.73)
256-clients 1.00 [ 0.00]( 1.50) 0.61 [-38.62]( 1.21) 0.56 [-43.88]( 5.72) 0.59 [-40.60]( 2.26) 0.83 [-17.18]( 5.95)
512-clients 1.00 [ 0.00](50.23) 0.66 [-33.66](51.96) 0.47 [-53.21](47.91) 0.50 [-50.22](42.87) 0.89 [-10.89](48.44)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.60) 0.99 [ -0.65]( 0.14) 0.99 [ -1.26]( 0.26) 0.99 [ -1.37]( 0.40) 0.99 [ -0.80]( 0.53)
2-clients 1.00 [ 0.00]( 0.57) 0.99 [ -1.07]( 0.40) 0.99 [ -1.46]( 0.22) 0.98 [ -2.07]( 0.15) 0.99 [ -1.38]( 0.25)
4-clients 1.00 [ 0.00]( 0.32) 0.99 [ -0.81]( 0.31) 0.99 [ -1.32]( 0.49) 0.98 [ -1.87]( 0.31) 0.99 [ -1.40]( 0.62)
8-clients 1.00 [ 0.00]( 0.45) 0.99 [ -1.26]( 0.88) 0.98 [ -2.01]( 0.90) 0.98 [ -2.23]( 0.42) 0.98 [ -2.04]( 1.03)
16-clients 1.00 [ 0.00]( 0.49) 0.98 [ -1.91]( 1.54) 0.97 [ -2.86]( 2.41) 0.97 [ -2.70]( 1.39) 0.97 [ -3.30]( 0.78)
32-clients 1.00 [ 0.00]( 1.95) 0.98 [ -2.10]( 2.09) 0.98 [ -2.17]( 1.24) 0.97 [ -2.73]( 1.58) 0.44 [-56.47](12.66)
64-clients 1.00 [ 0.00]( 3.05) 0.96 [ -4.00]( 2.43) 0.95 [ -4.84]( 2.82) 0.97 [ -3.43]( 2.06) 0.24 [-76.24]( 1.15)
128-clients 1.00 [ 0.00]( 2.63) 0.86 [-14.02]( 2.49) 0.80 [-19.86]( 2.16) 0.75 [-25.02]( 3.24) 0.14 [-85.91]( 3.76)
256-clients 1.00 [ 0.00]( 2.02) 0.75 [-25.02]( 2.59) 0.52 [-47.93]( 2.60) 0.42 [-58.38]( 2.18) 0.11 [-88.59]( 9.61)
512-clients 1.00 [ 0.00]( 5.67) 1.20 [ 19.91]( 4.99) 1.22 [ 21.57]( 3.89) 0.92 [ -7.65]( 4.84) 1.07 [ 7.22](14.77)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.00) 1.79 [-78.57]( 4.00) 1.93 [-92.86](13.86) 1.93 [-92.86](17.44) 1.14 [-14.29]( 9.75)
2 1.00 [ 0.00]( 7.37) 1.87 [-86.67]( 5.52) 1.87 [-86.67]( 9.10) 1.87 [-86.67]( 7.14) 1.07 [ -6.67]( 3.53)
4 1.00 [ 0.00]( 4.00) 1.28 [-28.00]( 8.57) 1.40 [-40.00]( 0.00) 1.48 [-48.00]( 6.47) 1.08 [ -8.00]( 7.41)
8 1.00 [ 0.00]( 3.33) 1.21 [-20.59]( 1.42) 1.18 [-17.65]( 3.79) 1.15 [-14.71]( 3.88) 1.09 [ -8.82]( 4.81)
16 1.00 [ 0.00]( 2.11) 1.00 [ 0.00]( 2.81) 1.00 [ 0.00]( 4.69) 1.04 [ -3.70]( 4.22) 1.04 [ -3.70]( 6.57)
32 1.00 [ 0.00]( 4.51) 0.92 [ 7.61]( 1.37) 0.92 [ 7.61]( 5.90) 0.93 [ 6.52]( 2.91) 0.96 [ 4.35]( 1.74)
64 1.00 [ 0.00]( 0.92) 0.95 [ 5.45]( 0.37) 0.95 [ 5.45]( 0.98) 0.96 [ 3.64]( 3.14) 1.00 [ 0.00]( 9.05)
128 1.00 [ 0.00]( 1.35) 0.93 [ 7.16]( 3.88) 0.92 [ 7.90]( 2.01) 0.94 [ 5.68]( 1.14) 0.94 [ 6.42]( 0.95)
256 1.00 [ 0.00]( 0.59) 1.00 [ 0.00]( 1.53) 1.01 [ -1.16]( 6.79) 0.96 [ 4.42]( 5.96) 0.91 [ 9.31]( 2.22)
512 1.00 [ 0.00]( 4.53) 1.03 [ -3.44]( 2.14) 1.08 [ -7.56]( 0.74) 1.04 [ -3.89]( 2.91) 0.90 [ 10.31](11.51)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00](13.36) 0.90 [ 10.00](21.95) 0.87 [ 13.33]( 2.25) 0.83 [ 16.67](25.24) 0.97 [ 3.33](20.80)
2 1.00 [ 0.00](15.57) 1.12 [-11.54]( 2.01) 1.04 [ -3.85](11.04) 1.08 [ -7.69](10.66) 1.08 [ -7.69]( 2.09)
4 1.00 [ 0.00]( 3.33) 1.03 [ -3.33]( 9.94) 1.17 [-16.67](16.38) 0.97 [ 3.33]( 2.01) 0.97 [ 3.33]( 2.01)
8 1.00 [ 0.00]( 3.18) 1.08 [ -8.11]( 5.29) 0.97 [ 2.70]( 2.78) 1.00 [ 0.00]( 9.79) 1.08 [ -8.11]( 4.22)
16 1.00 [ 0.00](14.63) 0.96 [ 3.85]( 7.33) 1.15 [-15.38](13.09) 0.98 [ 1.92]( 1.96) 0.96 [ 3.85]( 3.40)
32 1.00 [ 0.00]( 1.49) 1.03 [ -2.60]( 3.80) 1.01 [ -1.30]( 1.95) 0.99 [ 1.30]( 1.32) 0.99 [ 1.30]( 2.25)
64 1.00 [ 0.00]( 0.00) 1.01 [ -0.80]( 1.64) 1.02 [ -1.60]( 1.63) 1.02 [ -2.40]( 0.78) 1.02 [ -1.60]( 1.21)
128 1.00 [ 0.00]( 1.10) 0.99 [ 0.87]( 0.88) 1.00 [ 0.44]( 0.00) 0.98 [ 2.18]( 0.77) 1.02 [ -1.75]( 1.96)
256 1.00 [ 0.00]( 0.96) 0.99 [ 0.76]( 0.22) 0.99 [ 1.14]( 1.17) 0.97 [ 3.43]( 0.20) 1.18 [-17.90](10.57)
512 1.00 [ 0.00]( 0.73) 1.16 [-16.40]( 0.46) 1.17 [-17.45]( 1.38) 1.16 [-16.40]( 0.87) 1.08 [ -8.03]( 1.98)


==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 0.25) 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 0.12)
2 1.00 [ 0.00]( 0.21) 1.00 [ 0.24]( 0.50) 1.00 [ 0.00]( 0.12) 1.00 [ 0.00]( 0.12) 1.00 [ 0.24]( 0.12)
4 1.00 [ 0.00]( 0.12) 1.00 [ 0.00]( 0.12) 1.00 [ 0.24]( 0.12) 1.00 [ 0.24]( 0.12) 1.00 [ 0.00]( 0.12)
8 1.00 [ 0.00]( 0.12) 1.00 [ 0.00]( 0.12) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.12) 1.00 [ 0.00]( 0.12)
16 1.00 [ 0.00]( 0.12) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.12) 1.00 [ -0.24]( 0.12) 1.00 [ -0.48]( 0.12)
32 1.00 [ 0.00]( 3.00) 0.99 [ -1.38]( 0.25) 0.98 [ -1.93]( 0.14) 0.98 [ -2.20]( 0.15) 0.96 [ -4.13]( 0.39)
64 1.00 [ 0.00]( 3.74) 0.97 [ -3.11]( 0.49) 0.94 [ -5.87]( 1.53) 0.93 [ -7.25]( 2.01) 0.91 [ -8.64]( 0.39)
128 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 0.99 [ -1.11]( 0.19)
256 1.00 [ 0.00]( 0.26) 1.13 [ 12.58]( 0.34) 1.12 [ 11.84]( 0.20) 1.08 [ 7.89]( 0.82) 1.02 [ 1.73]( 1.31)
512 1.00 [ 0.00]( 0.11) 1.00 [ 0.43]( 0.33) 1.00 [ 0.21]( 0.19) 1.00 [ 0.21]( 0.29) 1.00 [ -0.43]( 0.11)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.23) 1.00 [ 0.00]( 0.00) 0.99 [ -1.32]( 0.23) 0.98 [ -1.76]( 0.23) 1.00 [ 0.00]( 0.00)
2 1.00 [ 0.00]( 0.11) 1.00 [ 0.44]( 0.11) 0.99 [ -0.66]( 0.23) 0.98 [ -1.54]( 0.12) 1.00 [ 0.22]( 0.11)
4 1.00 [ 0.00]( 0.00) 1.00 [ 0.22]( 0.00) 0.99 [ -0.88]( 0.20) 0.99 [ -1.32]( 0.11) 1.00 [ 0.22]( 0.00)
8 1.00 [ 0.00]( 0.11) 1.00 [ 0.22]( 0.00) 0.99 [ -0.88]( 0.00) 0.98 [ -1.54]( 0.12) 1.00 [ 0.22]( 0.00)
16 1.00 [ 0.00]( 0.11) 1.00 [ 0.22]( 0.11) 0.99 [ -0.66]( 0.11) 0.99 [ -1.10]( 0.23) 1.00 [ 0.22]( 0.00)
32 1.00 [ 0.00]( 0.00) 1.00 [ 0.22]( 0.11) 1.00 [ -0.22]( 0.00) 0.99 [ -1.32]( 0.11) 1.00 [ 0.44]( 0.11)
64 1.00 [ 0.00]( 0.00) 1.00 [ 0.22]( 0.00) 1.00 [ -0.22]( 0.00) 0.99 [ -1.32]( 0.11) 1.00 [ 0.22]( 0.00)
128 1.00 [ 0.00]( 4.48) 1.03 [ 3.03]( 0.12) 1.01 [ 1.17]( 0.24) 1.00 [ -0.47]( 0.32) 0.99 [ -0.70]( 0.21)
256 1.00 [ 0.00]( 0.00) 1.01 [ 1.06]( 0.00) 1.01 [ 0.84]( 0.11) 0.99 [ -1.48]( 0.29) 1.00 [ 0.42]( 0.00)
512 1.00 [ 0.00]( 0.36) 1.08 [ 8.48]( 0.13) 1.08 [ 7.68]( 0.13) 1.03 [ 2.65]( 0.40) 1.03 [ 2.65]( 1.16)

==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#workers: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00](22.13) 1.00 [ 0.00](15.49) 1.00 [ 0.00]( 8.15) 1.17 [-16.67]( 0.00) 1.50 [-50.00](19.36)
2 1.00 [ 0.00]( 0.00) 0.75 [ 25.00](15.49) 1.00 [ 0.00](14.08) 1.00 [ 0.00]( 6.74) 1.00 [ 0.00]( 6.20)
4 1.00 [ 0.00](14.08) 1.00 [ 0.00](14.08) 1.00 [ 0.00](14.08) 1.00 [ 0.00](14.08) 1.00 [ 0.00]( 0.00)
8 1.00 [ 0.00]( 6.74) 0.75 [ 25.00](15.49) 0.88 [ 12.50](12.78) 0.75 [ 25.00](15.49) 1.12 [-12.50]( 5.96)
16 1.00 [ 0.00]( 0.00) 0.78 [ 22.22]( 0.00) 0.78 [ 22.22](13.47) 1.00 [ 0.00](12.39) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 9.94) 1.11 [-11.11](11.07) 1.11 [-11.11]( 5.34) 1.11 [-11.11](11.07) 1.11 [-11.11]( 5.34)
64 1.00 [ 0.00]( 8.37) 1.00 [ 0.00]( 4.08) 1.00 [ 0.00]( 4.08) 1.00 [ 0.00]( 4.08) 1.08 [ -7.69]( 0.00)
128 1.00 [ 0.00]( 1.27) 1.05 [ -4.88]( 1.19) 1.05 [ -4.88]( 2.08) 1.05 [ -4.88]( 0.00) 3.95 [-295.12]( 4.51)
256 1.00 [ 0.00]( 0.22) 0.56 [ 43.91]( 1.05) 0.58 [ 42.42]( 1.24) 0.60 [ 40.06]( 9.74) 0.80 [ 20.12]( 1.06)
512 1.00 [ 0.00](11.19) 1.14 [-14.13](18.63) 1.30 [-30.50](59.87) 0.98 [ 2.25]( 3.40) 1.67 [-66.61](37.87)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00](25.98) 0.85 [ 15.38](38.73) 0.85 [ 15.38](16.26) 0.69 [ 30.77](47.67) 0.85 [ 15.38]( 4.84)
2 1.00 [ 0.00]( 4.43) 1.00 [ 0.00]( 4.19) 0.92 [ 8.33]( 0.00) 1.00 [ 0.00]( 4.43) 1.00 [ 0.00]( 4.43)
4 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 4.19) 0.92 [ 8.33]( 0.00) 0.92 [ 8.33]( 0.00) 1.00 [ 0.00]( 0.00)
8 1.00 [ 0.00]( 0.00) 1.08 [ -8.33]( 0.00) 0.92 [ 8.33]( 4.84) 0.92 [ 8.33]( 0.00) 1.00 [ 0.00]( 0.00)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 0.92 [ 8.33]( 0.00) 0.92 [ 8.33]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 0.91 [ 9.09]( 0.00) 1.00 [ 0.00]( 4.84) 1.00 [ 0.00]( 0.00)
64 1.00 [ 0.00]( 0.00) 1.10 [-10.00]( 0.00) 1.10 [-10.00]( 4.84) 1.10 [-10.00]( 0.00) 1.20 [-20.00]( 0.00)
128 1.00 [ 0.00]( 7.75) 1.64 [-64.29]( 0.00) 1.50 [-50.00]( 2.42) 1.57 [-57.14]( 4.07) 1.71 [-71.43]( 2.18)
256 1.00 [ 0.00]( 1.52) 1.05 [ -5.08]( 2.97) 1.03 [ -3.39](19.10) 1.10 [-10.17](16.79) 3.20 [-220.34](22.22)
512 1.00 [ 0.00]( 0.26) 0.63 [ 37.01](48.00) 0.63 [ 36.51]( 4.25) 0.72 [ 28.50]( 4.44) 0.90 [ 9.58]( 7.28)

==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.17) 1.00 [ -0.33]( 0.29) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.17) 1.00 [ 0.00]( 0.17)
2 1.00 [ 0.00]( 0.17) 1.00 [ 0.00]( 0.17) 1.00 [ 0.00]( 0.17) 1.00 [ 0.33]( 0.17) 1.00 [ 0.00]( 0.00)
4 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.29) 1.00 [ 0.00]( 0.17) 1.00 [ 0.33]( 0.17) 1.00 [ 0.00]( 0.17)
8 1.00 [ 0.00]( 0.17) 1.00 [ 0.00]( 0.00) 1.00 [ 0.33]( 0.17) 1.00 [ 0.33]( 0.17) 1.00 [ 0.00]( 0.17)
16 1.00 [ 0.00]( 3.93) 0.95 [ 4.71]( 0.17) 0.95 [ 4.71](11.00) 1.04 [ -3.77]( 4.58) 1.18 [-18.21]( 7.53)
32 1.00 [ 0.00]( 9.53) 1.47 [-46.67](12.05) 1.37 [-36.84](14.04) 1.34 [-34.04]( 5.00) 1.63 [-62.57]( 4.20)
64 1.00 [ 0.00]( 6.01) 1.09 [ -9.29]( 0.22) 1.10 [ -9.76]( 0.11) 1.10 [ -9.99]( 0.11) 1.10 [-10.22]( 0.11)
128 1.00 [ 0.00]( 0.28) 1.00 [ 0.21]( 0.32) 1.00 [ 0.00]( 0.11) 1.00 [ 0.21]( 0.32) 1.85 [-84.79]( 0.11)
256 1.00 [ 0.00]( 2.80) 1.02 [ -2.26]( 1.76) 1.07 [ -6.54]( 1.12) 1.19 [-19.37]( 7.04) 1.06 [ -5.79]( 1.55)
512 1.00 [ 0.00]( 0.36) 1.02 [ -1.79]( 0.90) 1.00 [ 0.20]( 1.63) 1.01 [ -1.39]( 0.37) 0.99 [ 1.00]( 1.40)

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip[pct imp](CV) swqueue_v1[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.18) 1.01 [ -0.71]( 0.18) 1.01 [ -0.71]( 0.18) 1.00 [ 0.35]( 0.32)
2 1.00 [ 0.00]( 0.18) 0.99 [ 0.71]( 0.18) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.18) 1.00 [ 0.35]( 0.00)
4 1.00 [ 0.00]( 0.18) 1.00 [ 0.35]( 0.00) 1.00 [ -0.35]( 0.00) 1.00 [ -0.35]( 0.00) 1.00 [ 0.35]( 0.18)
8 1.00 [ 0.00]( 0.18) 0.99 [ 0.71]( 0.18) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.35]( 0.00)
16 1.00 [ 0.00]( 0.00) 0.99 [ 0.71]( 0.18) 1.00 [ 0.00]( 0.18) 1.00 [ 0.00]( 0.18) 1.00 [ 0.35]( 0.00)
32 1.00 [ 0.00]( 0.00) 0.99 [ 0.71]( 0.18) 1.00 [ 0.00]( 0.00) 1.00 [ -0.35]( 0.00) 1.00 [ 0.35]( 0.00)
64 1.00 [ 0.00]( 0.00) 1.00 [ 0.35](31.00) 1.00 [ -0.35](29.77) 1.00 [ -0.35]( 0.18) 1.05 [ -5.11](30.08)
128 1.00 [ 0.00]( 2.35) 0.98 [ 1.52]( 0.35) 1.00 [ 0.38]( 0.34) 1.03 [ -3.05]( 0.19) 1.01 [ -0.76]( 0.59)
256 1.00 [ 0.00]( 0.18) 0.98 [ 2.14]( 0.19) 0.99 [ 1.43]( 0.19) 1.01 [ -1.07]( 0.18) 1.45 [-44.56](18.01)
512 1.00 [ 0.00]( 0.55) 1.09 [ -9.45]( 1.79) 1.12 [-12.35]( 2.39) 1.15 [-14.93]( 1.68) 1.21 [-21.37]( 8.03)


==================================================================
Test : SPECjbb2015
Units : maxJOPs and critJOPs
Interpretation: Higher is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Metric: tip[pct imp] swqueue_v1[pct imp] noshard[pct imp] shard_wakeup[pct imp] shard_all[pct imp]
maxJOPS 1.00 [ 0.00] 1.00 [ 0.00] 1.00 [ 0.08] 1.01 [ 1.42] 1.00 [ 0.08]
critJOPS 1.00 [ 0.00] 0.97 [ -3.43] 0.98 [ -2.38] 0.98 [ -1.82] 0.99 [ -0.96]

==================================================================
Test : YCSB + Mongodb
Units : Throughput
Interpretation: Higher is better
==================================================================
Zen3, 2 Sockets, 64 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Metric: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
Throughput 1.00 [ 0.00]( 0.33) 0.99 [ -0.77]( 0.81) 0.99 [ -1.04](2.14) 0.99 [ -1.04]( 0.36) 0.99 [ -1.46]( 1.29)


Please let me know if you need any other data.

--
Thanks and Regards
gautham.

2023-07-25 20:37:29

by David Vernet

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

On Fri, Jul 21, 2023 at 02:42:57PM +0530, Gautham R. Shenoy wrote:
> Hello David,

Hello Gautham,

Thank you for taking the time to run these benchmarks. Apologies for the
delayed response -- I've been traveling this week.

> On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> > Changes
> > -------
> >
> > This is v2 of the shared wakequeue (now called shared runqueue)
> > patchset. The following are changes from the RFC v1 patchset
> > (https://lore.kernel.org/lkml/[email protected]/).
> >
> > v1 -> v2 changes:
> > - Change name from swqueue to shared_runq (Peter)
> >
> > - Sharded per-LLC shared runqueues to avoid contention on
> > scheduler-heavy workloads (Peter)
> >
> > - Pull tasks from the shared_runq in newidle_balance() rather than in
> > pick_next_task_fair() (Peter and Vincent)
> >
> > - Rename a few functions to reflect their actual purpose. For example,
> > shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
> >
> > - Expose move_queued_task() from core.c rather than migrate_task_to()
> > (Peter)
> >
> > - Properly check is_cpu_allowed() when pulling a task from a shared_runq
> > to ensure it can actually be migrated (Peter and Gautham)
> >
> > - Dropped RFC tag
> >
> > This patch set is based off of commit ebb83d84e49b ("sched/core: Avoid
> > multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") on the
> > sched/core branch of tip.git.
>
> I have evaluated this v2 patchset on AMD Zen3 and Zen4 servers.
>
> tldr:
>
> * We see non-trivial improvements on hackbench on both Zen3 and Zen4
> until the system is super-overloaded, at which point we see
> regressions.

This makes sense to me. SHARED_RUNQ is more likely to help performance
when the system is not over-utilized, as it has more of a chance to
actually increase work conservation. If the system is over-utilized,
it's likely that a core will be able to find a task regardless of
whether it looks at the shared runq.

That said, I wasn't able to reproduce the regressions (with --groups 16)
on my 7950X, presumably because it only has 8 cores / CCX.

> * tbench shows regressions on Zen3 with each client
> configuration. tbench on Zen4 shows some improvements when the system is
> overloaded.

Hmm, I also observed tbench not performing well with SHARED_RUNQ on my
Zen4 / 7950X, but only with heavy load. It also seems that sharding
helps a lot for tbench on Zen3, whereas Zen4 performs fine without it.
I'm having trouble reasoning about why Zen4 wouldn't require sharding
whereas Zen3 would given that Zen4 has more cores per CCX.

Just to verify -- these benchmarks were run with boost disabled,
correct? Otherwise, there could be a lot of run-to-run variance
depending on thermal throttling.

>
> * netperf shows minor improvements on Zen3 when the system is under
> low to moderate load. netperf regresses on Zen3 at high load, and at
> all load-points on Zen4.

netperf in general seems to regress as the size of the LLC increases due
to it relentlessly hammering the runqueue, though it's still surprising
to me that your Zen4 test showed regressions under low / moderate load
as well. Was this with -t TCP_RR, or -t UDP_RR? I observed SHARED_RUNQ
improving performance on my 7950X for -t TCP_RR as described on [0], so
I'd be curious to better understand where the slowdowns are coming from
(presumably it's just contending on the shard lock due to having a
larger CCX?)

[0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

> * Stream, SPECjbb2015 and Mongodb show no significant difference compared
> to the current tip.
>
> * With netperf and tbench, using the shared-runqueue during
> enqueue_entity performs badly.

My reading of your Zen4 numbers on tbench seems to imply that it actually
performs well under heavy load. Copying here for convenience:

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
1 1.00 [ 0.00]( 0.19) 0.98 [ -1.72]( 0.19) 0.99 [ -1.15]( 0.28) 0.98 [ -1.79]( 0.28) 0.99 [ -1.49]( 0.10)
2 1.00 [ 0.00]( 0.63) 0.98 [ -2.28]( 0.63) 0.98 [ -1.91]( 0.26) 0.97 [ -3.14]( 0.25) 0.98 [ -1.77]( 0.32)
4 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 1.13) 0.99 [ -0.69]( 0.57) 0.98 [ -1.59]( 0.35) 0.99 [ -0.64]( 0.18)
8 1.00 [ 0.00]( 1.14) 0.99 [ -0.73]( 0.61) 0.98 [ -2.28]( 2.61) 0.97 [ -2.56]( 0.34) 0.98 [ -1.77]( 0.70)
16 1.00 [ 0.00]( 0.98) 0.97 [ -2.54]( 1.24) 0.98 [ -1.71]( 1.86) 0.98 [ -1.53]( 0.62) 0.96 [ -3.56]( 0.93)
32 1.00 [ 0.00]( 0.76) 0.98 [ -2.31]( 1.35) 0.98 [ -2.06]( 0.77) 0.96 [ -3.53]( 1.63) 0.88 [-11.72]( 2.77)
64 1.00 [ 0.00]( 0.96) 0.96 [ -4.45]( 3.53) 0.97 [ -3.44]( 1.53) 0.96 [ -3.52]( 0.89) 0.31 [-69.03]( 0.64)
128 1.00 [ 0.00]( 3.03) 0.95 [ -4.78]( 0.56) 0.98 [ -2.48]( 0.47) 0.92 [ -7.73]( 0.16) 0.20 [-79.75]( 0.24)
256 1.00 [ 0.00]( 0.04) 0.93 [ -7.21]( 1.00) 0.94 [ -5.90]( 0.63) 0.59 [-41.29]( 1.76) 0.16 [-83.71]( 0.07)
512 1.00 [ 0.00]( 3.08) 1.07 [ 7.07](17.78) 1.15 [ 15.49]( 2.65) 0.82 [-17.53](29.11) 0.93 [ -7.18](32.23)
1024 1.00 [ 0.00]( 0.60) 1.16 [ 15.61]( 0.07) 1.16 [ 15.92]( 0.06) 1.12 [ 11.57]( 1.86) 1.12 [ 11.97]( 0.21)
2048 1.00 [ 0.00]( 0.16) 1.15 [ 14.62]( 0.90) 1.15 [ 15.20]( 0.29) 1.08 [ 7.64]( 1.44) 1.15 [ 14.57]( 0.23)

I'm also struggling to come up with an explanation for why Zen4 would
operate well with SHARED_RUNQ under heavy load. Do you have a theory?

> Server configurations used:
>
> AMD Zen3 Server:
> * 2 sockets,
> * 64 cores per socket,
> * SMT2 enabled
> * Total of 256 threads.
> * Configured in Nodes-Per-Socket(NPS)=1
>
> AMD Zen4 Server:
> * 2 sockets,
> * 128 cores per socket,
> * SMT2 enabled
> * Total of 512 threads.
> * Configured in Nodes-Per-Socket(NPS)=1
>
> The trends on NPS=2 and NPS=4 are similar. So I am not posting those.
>
>
> Legend:
> tip : Tip kernel with top commit ebb83d84e49b
> ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
>
> swqueue_v1 : Your v1 patches applied on top of the aforementioned tip commit.
>
> noshard : shared-runqueue v2 patches 1-5. This uses a shared-runqueue
> during wakeup. No sharding.
>
> shard_wakeup : shared-runqueue v2 patches 1-6. This uses a
> shared-runqueue during wakeup and has shards with
> shard size = 6 (default)
>
> shard_all : v2 patches 1-7. This uses a sharded shared-runqueue during
> enqueue_entity

So, what's your overall impression from these numbers? My general
impression so far is the following:

- SHARED_RUNQ works best when the system would otherwise be
under-utilized. If the system is going to be overloaded, it's unlikely
to provide a significant benefit over CFS, and may even just add
overhead with no benefit (or just cause worse cache locality).

- SHARED_RUNQ isn't well-suited to workloads such as netperf which
pummel the scheduler. Sharding helps a lot here, but doesn't
completely fix the problem depending on how aggressively tasks are
hammering the runqueue.

- To the point above, using SHARED_RUNQ in __enqueue_entity() /
__dequeue_entity(), rather than just on the wakeup path, is a net
positive. Workloads which hammer the runq such as netperf or schbench
-L -m 52 -p 512 -r 10 -t 1 will do poorly in both scenarios, so we may
as well get the better work conservation from __enqueue_entity() /
__dequeue_entity(). hackbench is one example of a workload that
benefits from this, another is kernel compile, and I strongly suspect
that HHVM would similarly benefit.

- Sharding in general doesn't seem to regress performance by much when
it wouldn't have otherwise been necessary to avoid contention.
hackbench is better without sharding on Zen3, but it's also better
with shard_all on Zen4.

In general, if our goal is to better support hosts with large CCXs, I
think we'll just need to support sharding.

Thoughts?

I have the v3 version of the patch set which properly supports domain
recreation and hotplug, but I still need to get updated benchmark
numbers on it, as well as benchmark spreading a shared_runq over
multiple CCXs per Peter's comment in [1] about the initial motivation
behind SIS_NODE also applying to SHARED_RUNQ.

[1]: https://lore.kernel.org/all/[email protected]/

Given the points above, I would ideally like to just run the shard_all
variant and compare that to the numbers I collected on v2 and shared in
[2]. What do you think? There will be tradeoffs no matter what we choose
to do, but enqueuing / dequeuing in __enqueue_entity() /
__dequeue_entity() seems to perform the best for workloads that don't
hammer the runqueue, and sharding seems like a given if we do decide to
do that.

[2]: https://lore.kernel.org/all/[email protected]/

Thanks,
David

2023-08-02 07:56:58

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

Hello David,


On Tue, Jul 25, 2023 at 03:22:55PM -0500, David Vernet wrote:
> On Fri, Jul 21, 2023 at 02:42:57PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
>
> Hello Gautham,
>
> Thank you for taking the time to run these benchmarks. Apologies for the
> delayed response -- I've been traveling this week.

No issues. As you can see, there has been a delay from my end as well.


>
> > On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> > > Changes
> > > -------
> > >
> > > This is v2 of the shared wakequeue (now called shared runqueue)
> > > patchset. The following are changes from the RFC v1 patchset
> > > (https://lore.kernel.org/lkml/[email protected]/).
> > >
> > > v1 -> v2 changes:
> > > - Change name from swqueue to shared_runq (Peter)
> > >
> > > - Sharded per-LLC shared runqueues to avoid contention on
> > > scheduler-heavy workloads (Peter)
> > >
> > > - Pull tasks from the shared_runq in newidle_balance() rather than in
> > > pick_next_task_fair() (Peter and Vincent)
> > >
> > > - Rename a few functions to reflect their actual purpose. For example,
> > > shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
> > >
> > > - Expose move_queued_task() from core.c rather than migrate_task_to()
> > > (Peter)
> > >
> > > - Properly check is_cpu_allowed() when pulling a task from a shared_runq
> > > to ensure it can actually be migrated (Peter and Gautham)
> > >
> > > - Dropped RFC tag
> > >
> > > This patch set is based off of commit ebb83d84e49b ("sched/core: Avoid
> > > multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") on the
> > > sched/core branch of tip.git.
> >
> > I have evaluated this v2 patchset on AMD Zen3 and Zen4 servers.
> >
> > tldr:
> >
> > * We see non-trivial improvements on hackbench on both Zen3 and Zen4
> > until the system is super-overloaded, at which point we see
> > regressions.
>
> This makes sense to me. SHARED_RUNQ is more likely to help performance
> when the system is not over-utilized, as it has more of a chance to
> actually increase work conservation. If the system is over-utilized,
> it's likely that a core will be able to find a task regardless of
> whether it looks at the shared runq.
>
> That said, I wasn't able to reproduce the regressions (with --groups 16)
> on my 7950X, presumably because it only has 8 cores / CCX.
>
> > * tbench shows regressions on Zen3 with each client
> > configuration. tbench on Zen4 shows some improvements when the system is
> > overloaded.
>
> Hmm, I also observed tbench not performing well with SHARED_RUNQ on my
> Zen4 / 7950X, but only with heavy load. It also seems that sharding
> helps a lot for tbench on Zen3, whereas Zen4 performs fine without it.
> I'm having trouble reasoning about why Zen4 wouldn't require sharding
> whereas Zen3 would given that Zen4 has more cores per CCX.

Yes, I have been thinking about it as well. Both the Zen3 (Milan) and
Zen4 (Bergamo) servers that I ran these tests on have 8 cores per
CCX. Bergamo has 2 CCXes per CCD, while Milan has 1 CCX per CCD. We
don't model the CCD in the sched-domain hierarchy currently, so from
the point of view of the LLC domain (which is the CCX), the number of
cores per LLC is identical between the two systems.

>
> Just to verify -- these benchmarks were run with boost disabled,
> correct? Otherwise, there could be a lot of run-to-run variance
> depending on thermal throttling.


Checking my scripts, these benchmarks were run with C2 disabled, the
performance governor enabled, and acpi-cpufreq as the scaling
driver. Boost was enabled, so yes, there could be run-to-run
variance. I can rerun them this weekend with boost disabled. I also
need to understand the overloaded cases of tbench and netperf where
the shared-runq is performing better.


>
> >
> > * netperf shows minor improvements on Zen3 when the system is under
> > low to moderate load. netperf regresses on Zen3 at high load, and at
> > all load-points on Zen4.
>
> netperf in general seems to regress as the size of the LLC increases due
> to it relentlessly hammering the runqueue, though it's still surprising
> to me that your Zen4 test showed regressions under low / moderate load
> as well. Was this with -t TCP_RR, or -t UDP_RR? I observed SHARED_RUNQ
> improving performance on my 7950X for -t TCP_RR as described on [0], so
> I'd be curious to better understand where the slowdowns are coming from
> (presumably it's just contending on the shard lock due to having a
> larger CCX?)


I ran netperf with TCP_RR with the server running on localhost. The
exact command is:

netperf -H 127.0.0.1 -t TCP_RR -l 100 -- -r 100 \
-k REQUEST_SIZE,RESPONSE_SIZE,ELAPSED_TIME,THROUGHPUT,THROUGHPUT_UNITS,MIN_LATENCY,MEAN_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MAX_LATENCY,STDDEV_LATENCY

I am yet to debug why we are seeing a performance drop in the
low-utilization cases.


>
> [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/
>
> > * Stream, SPECjbb2015 and Mongodb show no significant difference compared
> > to the current tip.
> >
> > * With netperf and tbench, using the shared-runqueue during
> > enqueue_entity performs badly.
>
> My reading of your Zen4 numbers on tbench seems to imply that it actually
> performs well under heavy load. Copying here for convenience:

>
> Zen4, 2 Sockets, 128 cores per socket, SMT2:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Clients: tip[pct imp](CV) swqueue[pct imp](CV) noshard[pct imp](CV) shard_wakeup[pct imp](CV) shard_all[pct imp](CV)
> 1 1.00 [ 0.00]( 0.19) 0.98 [ -1.72]( 0.19) 0.99 [ -1.15]( 0.28) 0.98 [ -1.79]( 0.28) 0.99 [ -1.49]( 0.10)
> 2 1.00 [ 0.00]( 0.63) 0.98 [ -2.28]( 0.63) 0.98 [ -1.91]( 0.26) 0.97 [ -3.14]( 0.25) 0.98 [ -1.77]( 0.32)
> 4 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 1.13) 0.99 [ -0.69]( 0.57) 0.98 [ -1.59]( 0.35) 0.99 [ -0.64]( 0.18)
> 8 1.00 [ 0.00]( 1.14) 0.99 [ -0.73]( 0.61) 0.98 [ -2.28]( 2.61) 0.97 [ -2.56]( 0.34) 0.98 [ -1.77]( 0.70)
> 16 1.00 [ 0.00]( 0.98) 0.97 [ -2.54]( 1.24) 0.98 [ -1.71]( 1.86) 0.98 [ -1.53]( 0.62) 0.96 [ -3.56]( 0.93)
> 32 1.00 [ 0.00]( 0.76) 0.98 [ -2.31]( 1.35) 0.98 [ -2.06]( 0.77) 0.96 [ -3.53]( 1.63) 0.88 [-11.72]( 2.77)
> 64 1.00 [ 0.00]( 0.96) 0.96 [ -4.45]( 3.53) 0.97 [ -3.44]( 1.53) 0.96 [ -3.52]( 0.89) 0.31 [-69.03]( 0.64)
> 128 1.00 [ 0.00]( 3.03) 0.95 [ -4.78]( 0.56) 0.98 [ -2.48]( 0.47) 0.92 [ -7.73]( 0.16) 0.20 [-79.75]( 0.24)
> 256 1.00 [ 0.00]( 0.04) 0.93 [ -7.21]( 1.00) 0.94 [ -5.90]( 0.63) 0.59 [-41.29]( 1.76) 0.16 [-83.71]( 0.07)
> 512 1.00 [ 0.00]( 3.08) 1.07 [ 7.07](17.78) 1.15 [ 15.49]( 2.65) 0.82 [-17.53](29.11) 0.93 [ -7.18](32.23)
> 1024 1.00 [ 0.00]( 0.60) 1.16 [ 15.61]( 0.07) 1.16 [ 15.92]( 0.06) 1.12 [ 11.57]( 1.86) 1.12 [ 11.97]( 0.21)
> 2048 1.00 [ 0.00]( 0.16) 1.15 [ 14.62]( 0.90) 1.15 [ 15.20]( 0.29) 1.08 [ 7.64]( 1.44) 1.15 [ 14.57]( 0.23)
>
> I'm also struggling to come up with an explanation for why Zen4 would
> operate well with SHARED_RUNQ under heavy load. Do you have a theory?

Yes. So, my theory is that with SHARED_RUNQ, we delay entering the
idle state because we check the shared runqueue, which involves
acquiring a lock. So perhaps this is helping in an unintended
manner. I want to rerun those parts while collecting the idle
statistics.


>
> > Server configurations used:
> >
> > AMD Zen3 Server:
> > * 2 sockets,
> > * 64 cores per socket,
> > * SMT2 enabled
> > * Total of 256 threads.
> > * Configured in Nodes-Per-Socket(NPS)=1
> >
> > AMD Zen4 Server:
> > * 2 sockets,
> > * 128 cores per socket,
> > * SMT2 enabled
> > * Total of 512 threads.
> > * Configured in Nodes-Per-Socket(NPS)=1
> >
> > The trends on NPS=2 and NPS=4 are similar. So I am not posting those.
> >
> >
> > Legend:
> > tip : Tip kernel with top commit ebb83d84e49b
> > ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
> >
> > swqueue_v1 : Your v1 patches applied on top of the aforementioned tip commit.
> >
> > noshard : shared-runqueue v2 patches 1-5. This uses a shared-runqueue
> > during wakeup. No sharding.
> >
> > shard_wakeup : shared-runqueue v2 patches 1-6. This uses a
> > shared-runqueue during wakeup and has shards with
> > shard size = 6 (default)
> >
> > shard_all : v2 patches 1-7. This uses a sharded shared-runqueue during
> > enqueue_entity
>
> So, what's your overall impression from these numbers? My general
> impression so far is the following:
>
> - SHARED_RUNQ works best when the system would otherwise be
> under-utilized. If the system is going to be overloaded, it's unlikely
> to provide a significant benefit over CFS, and may even just add
> overhead with no benefit (or just cause worse cache locality).


I agree with you here. The only thing that saw a consistent benefit
was hackbench under moderate load.

>
> - SHARED_RUNQ isn't well-suited to workloads such as netperf which
> pummel the scheduler. Sharding helps a lot here, but doesn't
> completely fix the problem depending on how aggressively tasks are
> hammering the runqueue.

Yeah. Even with sharding (which I would assume would definitely help
on platforms with a larger LLC domain), each idle entry would result
in searching the shared runqueue, and the probability of finding
something there is very low if the workload runs for a very short
duration.

>
> - To the point above, using SHARED_RUNQ in __enqueue_entity() /
> __dequeue_entity(), rather than just on the wakeup path, is a net
> positive. Workloads which hammer the runq such as netperf or schbench
> -L -m 52 -p 512 -r 10 -t 1 will do poorly in both scenarios, so we may
> as well get the better work conservation from __enqueue_entity() /
> __dequeue_entity(). hackbench is one example of a workload that
> benefits from this, another is kernel compile, and I strongly suspect
> that HHVM would similarly benefit.

Well, the magnitude of the performance degradation is much higher for
tbench and netperf when shared_runq is used in the
__enqueue_entity()/__dequeue_entity() path, so it is very workload
dependent. I would like to try out a variant that uses shared_runq in
the __enqueue_entity()/__dequeue_entity() path without sharding,
though, just to see if it makes any difference.
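
For clarity, the difference between the two configurations boils down
to roughly the following (a toy sketch; ENQUEUE_WAKEUP mirrors the
real enqueue flag, but the helper and the enqueue_all switch are
stand-ins made up for illustration, not the patch's actual code):

/* Toy sketch of where the shared queue gets populated. */
#define ENQUEUE_WAKEUP          0x01

struct task_struct;                             /* opaque stand-in */
extern void push_to_shared_queue(struct task_struct *p);

static void enqueue_sketch(struct task_struct *p, int flags, int enqueue_all)
{
        /*
         * Wakeup-only variants (noshard, shard_wakeup): only waking
         * tasks become visible to the rest of the LLC.
         *
         * shard_all: every enqueue (e.g. a preempted task being put
         * back on the runqueue) becomes visible, which maximizes work
         * conservation but also maximizes traffic on the shared queue.
         */
        if (enqueue_all || (flags & ENQUEUE_WAKEUP))
                push_to_shared_queue(p);

        /* ... the normal per-CPU enqueue continues as usual ... */
}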



>
> - Sharding in general doesn't seem to regress performance by much when
> it wouldn't have otherwise been necessary to avoid contention.
> hackbench is better without sharding on Zen3, but it's also better
> with shard_all on Zen4.


>
> In general, if our goal is to better support hosts with large CCXs, I
> think we'll just need to support sharding.

I think the shard size needs to be determined as a function of the LLC
size, or the arch-specific code should pick a size that suits a
particular generation. At least on Zen3 and Zen4 servers with 8 cores
per LLC domain, creating shards was not providing additional benefit.
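
One way that could look, just as an illustration (this is a sketch,
not what the current patches do): derive the shard count from the
number of cores in the LLC, so that with the default shard size of 6
an 8-core CCX keeps a single shard and sharding only kicks in on
larger LLCs.

/*
 * Illustrative shard-count calculation: split the LLC into shards of
 * roughly shard_size cores, but keep a single shard until the LLC is
 * big enough for at least two full shards.
 */
static int shared_runq_nr_shards(int llc_nr_cores, int shard_size)
{
        int nr;

        if (shard_size <= 0)
                return 1;

        /* Floor division: don't shard until a whole extra shard fits. */
        nr = llc_nr_cores / shard_size;
        return nr > 0 ? nr : 1;
}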


>
> Thoughts?
>
> I have the v3 version of the patch set which properly supports domain
> recreation and hotplug, but I still need to get updated benchmark
> numbers on it, as well as benchmark spreading a shared_runq over
> multiple CCXs per Peter's comment in [1] about the initial motivation
> behind SIS_NODE also applying to SHARED_RUNQ.

> [1]: https://lore.kernel.org/all/[email protected]/

Based on the various flavors of SIS_NODE that we have experimented
with on the EPYC servers, it seems to work very well when the
probability of finding an idle core/cpu in the wider sched-domain is
higher. In that case, the extra time spent searching for that idle
core/cpu is justified by the fact that the task gets to run
immediately. However, as the utilization on the system increases, we
are less likely to find an idle core/cpu, and the additional time
spent searching shows up as a regression. What we need is a way to
limit the downside in this latter case without losing the upside that
we see in the low-to-moderate utilization cases.


>
> Given the points above, I would ideally like to just run the shard_all
> variant and compare that to the numbers I collected on v2 and shared in
> [2]. What do you think?

Would that be a fair comparison? SIS_NODE only performs a wider search
during wakeups, while shard_all would add a task to the shared_runq
even during a regular enqueue.

> There will be tradeoffs no matter what we choose
> to do, but enqueuing / dequeuing in __enqueue_entity() /
> __dequeue_entity() seems to perform the best for workloads that don't
> hammer the runqueue, and sharding seems like a given if we do decide to
> do that.

Or see if we can avoid using shared_runq/SIS_NODE when the probability
of reducing the scheduling latency and improving utilization is
low. In such cases, the default scheduling strategy should just work
fine. However, I don't know of any clean way to detect such a
situation. That quest is still on :-)

>
> [2]: https://lore.kernel.org/all/[email protected]/
>
> Thanks,
> David

--
Thanks and Regards
gautham.