2023-06-13 05:28:24

by David Vernet

[permalink] [raw]
Subject: [RFC PATCH 0/3] sched: Implement shared wakequeue in CFS

Overview
========

The scheduler must constantly strike a balance between work
conservation, and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large mostly a work conserving
scheduler, there are certain instances where the scheduler will choose
to keep a task local to a CPU, when it would have been more optimal to
migrate it to an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound, and
exhibit generally poor cache locality. To try and address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win; resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared wakequeue" (swqueue) feature proposed in this patch set. The
idea behind swqueue is simple: it enables the scheduler to be
aggressively work conserving by placing a waking task into a per-LLC
FIFO queue that can be pulled from by another core in the LLC FIFO queue
which can then be pulled from before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above.

Design
======

The design of swqueue is quite simple. An swqueue is simply a struct
list_head, and a spinlock:

struct swqueue {
struct list_head list;
spinlock_t lock;
} ____cacheline_aligned;

We create a struct swqueue per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs.

When a task first wakes up, it enqueues itself in the swqueue of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if
the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.

A core will pull a task from its LLC's swqueue before calling
newidle_balance().

Difference between SIS_NODE
===========================

In [0] Peter proposed a patch that addresses Tejun's observations that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, that there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SID_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas swqueue causes cores about to go idle to
look for enqueued tasks. That said, in its current form, the two
features at are a different scope as SIS_NODE searches for idle cores
between LLCs, while swqueue enqueues tasks within a single LLC.

The patch was since removed in [1], but we elect to compare its
performance to swqueue given that as described above, it's conceptually
complementary.

[0]: https://lore.kernel.org/all/[email protected]/
[1]: https://lore.kernel.org/all/[email protected]/

I observed that while SIS_NODE works quite well for hosts with small
CCXs, it can result in degraded performance on machines either with a
large number of total cores in a CCD, or for which the cache miss
penalty of migrating between CCXs is high, even on the same die.

For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
through a single link to the IO die, and thus have similar cache miss
latencies as cores in remote CCDs.

Such subtleties could be taken into account with SIS_NODE, but
regardless, both features are conceptually complementary sides of the
same coin. SIS_NODE searches for idle cores for waking threads, whereas
swqueue searches for available work before a core goes idle.

Results
=======

Note that the motivation for the shared wakequeue feature was originally
arrived at using experiments in the sched_ext framework that's currently
being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
is similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile when running shared wakequeue. Here are
the results of running make -j$(nproc) built-in.a on several different
types of hosts configured with make allyesconfig on commit a27648c74210
("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
tree (boost was disabled on all of these hosts when the experiments were
performed):

Single-socket | 32-core | 2-CCX | AMD 7950X Zen4

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SWQUEUE + NO_SIS_NODE: | 590.52s | 3103% |
NO_SWQUEUE + SIS_NODE: | 590.80s | 3102% |
SWQUEUE + NO_SIS_NODE: | 589.65s | 3116% |
SWQUEUE + SIS_NODE: | 589.99s | 3115% |
o------------o-------o

Takeaway: swqueue doesn't seem to provide a statistically significant
improvement for kernel compile on my 7950X. SIS_NODE similarly does not
have a noticeable effect on performance.

-------------------------------------------------------------------------------

Single-socket | 72-core | 6-CCX | AMD Milan Zen3

CPU max MHz: 3245.0190
CPU min MHz: 700.0000
o_____________o_______o
| mean | CPU |
o-------------o-------o
NO_SWQUEUE + NO_SIS_NODE: | 1608.69s | 6488% |
NO_SWQUEUE + SIS_NODE: | 1610.24s | 6473% |
SWQUEUE + NO_SIS_NODE: | 1605.80s | 6504% |
SWQUEUE + SIS_NODE: | 1606.96s | 6488% |
o-------------o-------o

Takeaway: swqueue does provide a small statistically significant
improvement on Milan, but the compile times in general were quite long
relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock
frequency. Milan also has larger CCXs than Bergamo, so it stands to
reason that select_idle_sibling() will have an easier time finding idle
cores inside the current CCX.

It also seems logical that SIS_NODE would hurt performance a bit here,
as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
has to iterate over 72 cores; delaying task wakeup. That said, I'm not
sure that's a viable theory if total CPU% is lower with SIS_NODE.

-------------------------------------------------------------------------------

Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c

CPU max MHz: 1200.0000
CPU min MHz: 1000.0000

o____________o________o
| mean | CPU |
o------------o--------o
NO_SWQUEUE + NO_SIS_NODE: | 322.44s | 15534% |
NO_SWQUEUE + SIS_NODE: | 324.39s | 15508% |
SWQUEUE + NO_SIS_NODE: | 321.54s | 15603% |
SWQUEUE + SIS_NODE: | 321.88s | 15622% |
o------------o--------o

Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point
that it's arguably not statistically significant.
SIS_NODE results in a ~.9% performance degradation, for likely the same
reason as Milan: the host has a large number of LLCs within a single
socket, so task wakeup latencies suffer due to select_idle_node()
searching up to 11 CCXs.

Conclusion
==========

swqueue in this form seems to provide a small, but noticeable win for
front-end CPU-bound workloads spread over multiple CCXs. The reason
seems fairly straightforward: swqueue encourages work conservation
inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of
runnable tasks. As mentioned above, it is complementary to SIS_NODE,
which searches for idle cores on the wakeup path.

While swqueue in this form encourages work conservation, it of course
does not guarantee it given that we don't implement any kind of work
stealing between swqueues. In the future, we could potentially push CPU
utilization even higher by enabling work stealing between swqueues,
likely between CCXs on the same NUMA node.

Originally-by: Roman Gushchin <[email protected]>
Signed-off-by: David Vernet <[email protected]>

David Vernet (3):
sched: Make migrate_task_to() take any task
sched/fair: Add SWQUEUE sched feature and skeleton calls
sched: Implement shared wakequeue in CFS

include/linux/sched.h | 2 +
kernel/sched/core.c | 52 ++++++-----
kernel/sched/fair.c | 200 +++++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 6 +-
5 files changed, 234 insertions(+), 27 deletions(-)

--
2.40.1


2023-06-13 05:29:48

by David Vernet

[permalink] [raw]
Subject: [RFC PATCH 2/3] sched/fair: Add SWQUEUE sched feature and skeleton calls

For certain workloads in CFS, CPU utilization is of the upmost
importance. For example, at Meta, our main web workload benefits from a
1 - 1.5% improvement in RPS, and a 1 - 2% improvement in p99 latency,
when CPU utilization is pushed as high as possible.

This is likely something that would be useful for any workload with long
slices, or for which avoiding migration is unlikely to result in
improved cache locality.

We will soon be enabling more aggressive load balancing via a new
feature called swqueue, which places tasks into a FIFO queue on the
wakeup path, and then dequeues them when a core goes idle before
invoking newidle_balance(). We don't want to enable the feature by
default, so this patch defines and declares a new scheduler feature
called SWQUEUE which is disabled by default. In addition, we add some
calls to empty / skeleton functions in the relevant fair codepaths where
swqueue will be utilized.

A set of future patches will implement these functions, and enable
swqueue for both single and multi socket / CCX architectures.

Originally-by: Roman Gushchin <[email protected]>
Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 36 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 292c593fc84f..807986bd6ea6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,6 +140,17 @@ static int __init setup_sched_thermal_decay_shift(char *str)
__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

#ifdef CONFIG_SMP
+static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
+ int enq_flags)
+{}
+static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+ return 0;
+}
+
+static void swqueue_remove_task(struct task_struct *p)
+{}
+
/*
* For asym packing, by default the lower numbered CPU has higher priority.
*/
@@ -162,6 +173,17 @@ int __weak arch_asym_cpu_priority(int cpu)
* (default: ~5%)
*/
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
+#else
+static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
+ int enq_flags)
+{}
+static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+ return 0;
+}
+
+static void swqueue_remove_task(struct task_struct *p)
+{}
#endif

#ifdef CONFIG_CFS_BANDWIDTH
@@ -6368,6 +6390,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
update_overutilized_status(rq);

+ if (sched_feat(SWQUEUE))
+ swqueue_enqueue(rq, p, flags);
+
enqueue_throttle:
assert_list_leaf_cfs_rq(rq);

@@ -6449,6 +6474,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+
+ if (sched_feat(SWQUEUE))
+ swqueue_remove_task(p);
}

#ifdef CONFIG_SMP
@@ -8155,12 +8183,18 @@ done: __maybe_unused;

update_misfit_status(p, rq);

+ if (sched_feat(SWQUEUE))
+ swqueue_remove_task(p);
+
return p;

idle:
if (!rf)
return NULL;

+ if (sched_feat(SWQUEUE) && swqueue_pick_next_task(rq, rf))
+ return RETRY_TASK;
+
new_tasks = newidle_balance(rq, rf);

/*
@@ -12325,6 +12359,7 @@ static void attach_task_cfs_rq(struct task_struct *p)

static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
+ swqueue_remove_task(p);
detach_task_cfs_rq(p);
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..57b19bc70cd4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,4 @@ SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+SCHED_FEAT(SWQUEUE, false)
--
2.40.1


2023-06-13 05:29:58

by David Vernet

[permalink] [raw]
Subject: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Overview
========

The scheduler must constantly strike a balance between work
conservation, and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large mostly a work conserving
scheduler, there are certain instances where the scheduler will choose
to keep a task local to a CPU, when it would have been more optimal to
migrate it to an idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound, and
exhibit generally poor cache locality. To try and address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win; resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared wakequeue" (swqueue) feature proposed in this patch set. The
idea behind swqueue is simple: it enables the scheduler to be
aggressively work conserving by placing a waking task into a per-LLC
FIFO queue that can be pulled from by another core in the LLC FIFO queue
which can then be pulled from before it goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above.

Design
======

The design of swqueue is quite simple. An swqueue is simply a struct
list_head, and a spinlock:

struct swqueue {
struct list_head list;
spinlock_t lock;
} ____cacheline_aligned;

We create a struct swqueue per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs.

When a task first wakes up, it enqueues itself in the swqueue of its
current LLC at the end of enqueue_task_fair(). Enqueues only happen if
the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.

A core will pull a task from its LLC's swqueue before calling
newidle_balance().

Difference between SIS_NODE
===========================

In [0] Peter proposed a patch that addresses Tejun's observations that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, that there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SID_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas swqueue causes cores about to go idle to
look for enqueued tasks. That said, in its current form, the two
features at are a different scope as SIS_NODE searches for idle cores
between LLCs, while swqueue enqueues tasks within a single LLC.

The patch was since removed in [1], but we elect to compare its
performance to swqueue given that as described above, it's conceptually
complementary.

[0]: https://lore.kernel.org/all/[email protected]/
[1]: https://lore.kernel.org/all/[email protected]/

I observed that while SIS_NODE works quite well for hosts with small
CCXs, it can result in degraded performance on machines either with a
large number of total cores in a CCD, or for which the cache miss
penalty of migrating between CCXs is high, even on the same die.

For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
through a single link to the IO die, and thus have similar cache miss
latencies as cores in remote CCDs.

Such subtleties could be taken into account with SIS_NODE, but
regardless, both features are conceptually complementary sides of the
same coin. SIS_NODE searches for idle cores for waking threads, whereas
swqueue searches for available work before a core goes idle.

Results
=======

Note that the motivation for the shared wakequeue feature was originally
arrived at using experiments in the sched_ext framework that's currently
being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
is similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile when running shared wakequeue. Here are
the results of running make -j$(nproc) built-in.a on several different
types of hosts configured with make allyesconfig on commit a27648c74210
("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
tree (boost was disabled on all of these hosts when the experiments were
performed):

Single-socket | 32-core | 2-CCX | AMD 7950X Zen4

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000
o____________o_______o
| mean | CPU |
o------------o-------o
NO_SWQUEUE + NO_SIS_NODE: | 590.52s | 3103% |
NO_SWQUEUE + SIS_NODE: | 590.80s | 3102% |
SWQUEUE + NO_SIS_NODE: | 589.65s | 3116% |
SWQUEUE + SIS_NODE: | 589.99s | 3115% |
o------------o-------o

Takeaway: swqueue doesn't seem to provide a statistically significant
improvement for kernel compile on my 7950X. SIS_NODE similarly does not
have a noticeable effect on performance.

-------------------------------------------------------------------------------

Single-socket | 72-core | 6-CCX | AMD Milan Zen3

CPU max MHz: 3245.0190
CPU min MHz: 700.0000
o_____________o_______o
| mean | CPU |
o-------------o-------o
NO_SWQUEUE + NO_SIS_NODE: | 1608.69s | 6488% |
NO_SWQUEUE + SIS_NODE: | 1610.24s | 6473% |
SWQUEUE + NO_SIS_NODE: | 1605.80s | 6504% |
SWQUEUE + SIS_NODE: | 1606.96s | 6488% |
o-------------o-------o

Takeaway: swqueue does provide a small statistically significant
improvement on Milan, but the compile times in general were quite long
relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock
frequency. Milan also has larger CCXs than Bergamo, so it stands to
reason that select_idle_sibling() will have an easier time finding idle
cores inside the current CCX.

It also seems logical that SIS_NODE would hurt performance a bit here,
as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
has to iterate over 72 cores; delaying task wakeup. That said, I'm not
sure that's a viable theory if total CPU% is lower with SIS_NODE.

-------------------------------------------------------------------------------

Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c

CPU max MHz: 1200.0000
CPU min MHz: 1000.0000

o____________o________o
| mean | CPU |
o------------o--------o
NO_SWQUEUE + NO_SIS_NODE: | 322.44s | 15534% |
NO_SWQUEUE + SIS_NODE: | 324.39s | 15508% |
SWQUEUE + NO_SIS_NODE: | 321.54s | 15603% |
SWQUEUE + SIS_NODE: | 321.88s | 15622% |
o------------o--------o

Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point
that it's arguably not statistically significant.
SIS_NODE results in a ~.9% performance degradation, for likely the same
reason as Milan: the host has a large number of LLCs within a single
socket, so task wakeup latencies suffer due to select_idle_node()
searching up to 11 CCXs.

Conclusion
==========

swqueue in this form seems to provide a small, but noticeable win for
front-end CPU-bound workloads spread over multiple CCXs. The reason
seems fairly straightforward: swqueue encourages work conservation
inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of
runnable tasks. As mentioned above, it is complementary to SIS_NODE,
which searches for idle cores on the wakeup path.

While swqueue in this form encourages work conservation, it of course
does not guarantee it given that we don't implement any kind of work
stealing between swqueues. In the future, we could potentially push CPU
utilization even higher by enabling work stealing between swqueues,
likely between CCXs on the same NUMA node.

Originally-by: Roman Gushchin <[email protected]>
Signed-off-by: David Vernet <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched/core.c | 2 +
kernel/sched/fair.c | 175 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 2 +
4 files changed, 176 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1292d38d66cc..1f4fd22f88a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -770,6 +770,8 @@ struct task_struct {
unsigned long wakee_flip_decay_ts;
struct task_struct *last_wakee;

+ struct list_head swqueue_node;
+
/*
* recent_used_cpu is initially set as the last CPU used by a task
* that wakes affine another task. Waker/wakee relationships can
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d911b0631e7b..e04f0daf1f05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4533,6 +4533,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
+ INIT_LIST_HEAD(&p->swqueue_node);
#endif
init_sched_mm_cid(p);
}
@@ -9872,6 +9873,7 @@ void __init sched_init_smp(void)

init_sched_rt_class();
init_sched_dl_class();
+ init_sched_fair_class_late();

sched_smp_initialized = true;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 807986bd6ea6..29fe25794884 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -139,17 +139,151 @@ static int __init setup_sched_thermal_decay_shift(char *str)
}
__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

+/**
+ * struct swqueue - Per-LLC queue structure for enqueuing and pulling waking
+ * tasks.
+ *
+ * WHAT
+ * ====
+ *
+ * This structure enables the scheduler to be more aggressively work
+ * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
+ * pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores a pointer to its LLC's swqueue via struct cfs_rq. Waking
+ * tasks are enqueued in a swqueue at the end of enqueue_task_fair(), and are
+ * opportunistically pulled from the swqueue in pick_next_task_fair() prior to
+ * invoking newidle_balance(). Tasks enqueued in a swqueue be scheduled prior
+ * to being pulled from the swqueue, in which case they're simply removed from
+ * the swqueue. A waking task is only enqueued to a swqueue when it was _not_
+ * manually migrated to the current runqueue by select_task_rq_fair().
+ *
+ * There is currently no task-stealing between swqueues in different LLCs,
+ * which means that swqueue is not fully work conserving. This could be added
+ * at a later time, with tasks likely only being stolen across swqueues on the
+ * same NUMA node to avoid violating NUMA affinities.
+ *
+ * HOW
+ * ===
+ *
+ * An swqueue is comprised of a list, and a spinlock for synchronization. Given
+ * that the critical section for a swqueue is typically a fast list operation,
+ * and that the swqueue is localized to a single LLC, the spinlock does not
+ * seem to be contended; even on a heavily utilized host. struct swqueues are
+ * also cacheline aligned to prevent false sharing between CPUs manipulating
+ * swqueues in other LLCs.
+ *
+ * WHY
+ * ===
+ *
+ * As mentioned above, the main benefit of swqueue is that it enables more
+ * aggressive work conservation in the scheduler. This can benefit workloads
+ * that benefit more from CPU utilization than from L1/L2 cache locality.
+ *
+ * swqueues are segmented across LLCs both to avoid contention on the swqueue
+ * spinlock by minimizing the number of CPUs that could contend on it, as well
+ * as to strike a balance between work conservation, and L3 cache locality.
+ */
+struct swqueue {
+ struct list_head list;
+ spinlock_t lock;
+} ____cacheline_aligned;
+
#ifdef CONFIG_SMP
-static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
- int enq_flags)
-{}
+static struct swqueue *rq_swqueue(struct rq *rq)
+{
+ return rq->cfs.swqueue;
+}
+
+static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
+{
+ unsigned long flags;
+
+ struct task_struct *p;
+
+ spin_lock_irqsave(&swqueue->lock, flags);
+ p = list_first_entry_or_null(&swqueue->list, struct task_struct,
+ swqueue_node);
+ if (p)
+ list_del_init(&p->swqueue_node);
+ spin_unlock_irqrestore(&swqueue->lock, flags);
+
+ return p;
+}
+
+static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
+{
+ unsigned long flags;
+ struct swqueue *swqueue;
+ bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
+ bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
+
+ /*
+ * Only enqueue the task in the shared wakequeue if:
+ *
+ * - SWQUEUE is enabled
+ * - The task is on the wakeup path
+ * - The task wasn't purposefully migrated to the current rq by
+ * select_task_rq()
+ * - The task isn't pinned to a specific CPU
+ */
+ if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+ return;
+
+ swqueue = rq_swqueue(rq);
+ spin_lock_irqsave(&swqueue->lock, flags);
+ list_add_tail(&p->swqueue_node, &swqueue->list);
+ spin_unlock_irqrestore(&swqueue->lock, flags);
+}
+
static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
{
- return 0;
+ struct swqueue *swqueue;
+ struct task_struct *p = NULL;
+ struct rq *src_rq;
+ struct rq_flags src_rf;
+ int ret;
+
+ swqueue = rq_swqueue(rq);
+ if (!list_empty(&swqueue->list))
+ p = swqueue_pull_task(swqueue);
+
+ if (!p)
+ return 0;
+
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+
+ src_rq = task_rq_lock(p, &src_rf);
+
+ if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
+ src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
+
+ if (src_rq->cpu != rq->cpu)
+ ret = 1;
+ else
+ ret = -1;
+
+ task_rq_unlock(src_rq, p, &src_rf);
+
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+
+ return ret;
}

static void swqueue_remove_task(struct task_struct *p)
-{}
+{
+ unsigned long flags;
+ struct swqueue *swqueue;
+
+ if (!list_empty(&p->swqueue_node)) {
+ swqueue = rq_swqueue(task_rq(p));
+ spin_lock_irqsave(&swqueue->lock, flags);
+ list_del_init(&p->swqueue_node);
+ spin_unlock_irqrestore(&swqueue->lock, flags);
+ }
+}

/*
* For asym packing, by default the lower numbered CPU has higher priority.
@@ -12839,3 +12973,34 @@ __init void init_sched_fair_class(void)
#endif /* SMP */

}
+
+__init void init_sched_fair_class_late(void)
+{
+#ifdef CONFIG_SMP
+ int i;
+ struct swqueue *swqueue;
+ struct rq *rq;
+ struct rq *llc_rq;
+
+ for_each_possible_cpu(i) {
+ if (per_cpu(sd_llc_id, i) == i) {
+ llc_rq = cpu_rq(i);
+
+ swqueue = kzalloc_node(sizeof(struct swqueue),
+ GFP_KERNEL, cpu_to_node(i));
+ INIT_LIST_HEAD(&swqueue->list);
+ spin_lock_init(&swqueue->lock);
+ llc_rq->cfs.swqueue = swqueue;
+ }
+ }
+
+ for_each_possible_cpu(i) {
+ rq = cpu_rq(i);
+ llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
+
+ if (rq == llc_rq)
+ continue;
+ rq->cfs.swqueue = llc_rq->cfs.swqueue;
+ }
+#endif /* SMP */
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5a86e9795731..daee5c64af87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,7 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
+ struct swqueue *swqueue;
/*
* CFS load tracking
*/
@@ -2380,6 +2381,7 @@ extern void update_max_interval(void);
extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);
+extern void init_sched_fair_class_late(void);

extern void reweight_task(struct task_struct *p, int prio);

--
2.40.1


2023-06-13 05:58:32

by David Vernet

[permalink] [raw]
Subject: [RFC PATCH 1/3] sched: Make migrate_task_to() take any task

The migrate_task_to() function exposed from kernel/sched/core.c migrates
the current task, which is silently assumed to also be its first
argument, to the specified CPU. The function uses stop_one_cpu() to
migrate the task to the target CPU, which won't work if @p is not the
current task as the stop_one_cpu() callback isn't invoked on remote
CPUs.

While this operation is useful for task_numa_migrate() in fair.c, it
would be useful if __migrate_task() in core.c was given external
linkage, as it actually can be used to migrate any task to a CPU.

This patch therefore:
1. Renames the existing migrate_task_to() be called
numa_migrate_current_task_to().
2. Renames __migrate_task() to migrate_task_to(), gives it global
linkage, and updates all callers accordingly.

A follow-on patch will call the new migrate_task_to() from fair.c when
migrating a task in a shared wakequeue to a remote CPU.

Signed-off-by: David Vernet <[email protected]>
---
kernel/sched/core.c | 16 ++++++++--------
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 4 +++-
3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ac38225e6d09..d911b0631e7b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2539,8 +2539,8 @@ struct set_affinity_pending {
* So we race with normal scheduler movements, but that's OK, as long
* as the task is no longer on this CPU.
*/
-static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
- struct task_struct *p, int dest_cpu)
+struct rq *migrate_task_to(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int dest_cpu)
{
/* Affinity changed (again). */
if (!is_cpu_allowed(p, dest_cpu))
@@ -2573,7 +2573,7 @@ static int migration_cpu_stop(void *data)
local_irq_save(rf.flags);
/*
* We need to explicitly wake pending tasks before running
- * __migrate_task() such that we will not miss enforcing cpus_ptr
+ * migrate_task_to() such that we will not miss enforcing cpus_ptr
* during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
*/
flush_smp_call_function_queue();
@@ -2605,12 +2605,12 @@ static int migration_cpu_stop(void *data)
}

if (task_on_rq_queued(p))
- rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
+ rq = migrate_task_to(rq, &rf, p, arg->dest_cpu);
else
p->wake_cpu = arg->dest_cpu;

/*
- * XXX __migrate_task() can fail, at which point we might end
+ * XXX migrate_task_to() can fail, at which point we might end
* up running on a dodgy CPU, AFAICT this can only happen
* during CPU hotplug, at which point we'll get pushed out
* anyway, so it's probably not a big deal.
@@ -3259,7 +3259,7 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
alloc_cpumask_var(&new_mask, GFP_KERNEL);

/*
- * __migrate_task() can fail silently in the face of concurrent
+ * migrate_task_to() can fail silently in the face of concurrent
* offlining of the chosen destination CPU, so take the hotplug
* lock to ensure that the migration succeeds.
*/
@@ -9359,7 +9359,7 @@ bool sched_smp_initialized __read_mostly;

#ifdef CONFIG_NUMA_BALANCING
/* Migrate current task p to target_cpu */
-int migrate_task_to(struct task_struct *p, int target_cpu)
+int numa_migrate_current_task_to(struct task_struct *p, int target_cpu)
{
struct migration_arg arg = { p, target_cpu };
int curr_cpu = task_cpu(p);
@@ -9439,7 +9439,7 @@ static int __balance_push_cpu_stop(void *arg)

if (task_rq(p) == rq && task_on_rq_queued(p)) {
cpu = select_fallback_rq(rq->cpu, p);
- rq = __migrate_task(rq, &rf, p, cpu);
+ rq = migrate_task_to(rq, &rf, p, cpu);
}

rq_unlock(rq, &rf);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6189d1a45635..292c593fc84f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2278,7 +2278,7 @@ static int task_numa_migrate(struct task_struct *p)

best_rq = cpu_rq(env.best_cpu);
if (env.best_task == NULL) {
- ret = migrate_task_to(p, env.best_cpu);
+ ret = numa_migrate_current_task_to(p, env.best_cpu);
WRITE_ONCE(best_rq->numa_migrate_on, 0);
if (ret != 0)
trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 556496c77dc2..5a86e9795731 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1718,7 +1718,7 @@ enum numa_faults_stats {
NUMA_CPUBUF
};
extern void sched_setnuma(struct task_struct *p, int node);
-extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int numa_migrate_current_task_to(struct task_struct *p, int target_cpu);
extern int migrate_swap(struct task_struct *p, struct task_struct *t,
int cpu, int scpu);
extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
@@ -1731,6 +1731,8 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)

#ifdef CONFIG_SMP

+extern struct rq *migrate_task_to(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int dest_cpu);
static inline void
queue_balance_callback(struct rq *rq,
struct balance_callback *head,
--
2.40.1


2023-06-13 08:49:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> +struct swqueue {
> + struct list_head list;
> + spinlock_t lock;
> +} ____cacheline_aligned;
> +
> #ifdef CONFIG_SMP
> +static struct swqueue *rq_swqueue(struct rq *rq)
> +{
> + return rq->cfs.swqueue;
> +}
> +
> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +{
> + unsigned long flags;
> +
> + struct task_struct *p;
> +
> + spin_lock_irqsave(&swqueue->lock, flags);
> + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> + swqueue_node);
> + if (p)
> + list_del_init(&p->swqueue_node);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> +
> + return p;
> +}
> +
> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> + unsigned long flags;
> + struct swqueue *swqueue;
> + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> + /*
> + * Only enqueue the task in the shared wakequeue if:
> + *
> + * - SWQUEUE is enabled
> + * - The task is on the wakeup path
> + * - The task wasn't purposefully migrated to the current rq by
> + * select_task_rq()
> + * - The task isn't pinned to a specific CPU
> + */
> + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> + return;
> +
> + swqueue = rq_swqueue(rq);
> + spin_lock_irqsave(&swqueue->lock, flags);
> + list_add_tail(&p->swqueue_node, &swqueue->list);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +
> static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> {
> - return 0;
> + struct swqueue *swqueue;
> + struct task_struct *p = NULL;
> + struct rq *src_rq;
> + struct rq_flags src_rf;
> + int ret;
> +
> + swqueue = rq_swqueue(rq);
> + if (!list_empty(&swqueue->list))
> + p = swqueue_pull_task(swqueue);
> +
> + if (!p)
> + return 0;
> +
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + src_rq = task_rq_lock(p, &src_rf);
> +
> + if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
> +
> + if (src_rq->cpu != rq->cpu)
> + ret = 1;
> + else
> + ret = -1;
> +
> + task_rq_unlock(src_rq, p, &src_rf);
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> +
> + return ret;
> }
>
> static void swqueue_remove_task(struct task_struct *p)
> -{}
> +{
> + unsigned long flags;
> + struct swqueue *swqueue;
> +
> + if (!list_empty(&p->swqueue_node)) {
> + swqueue = rq_swqueue(task_rq(p));
> + spin_lock_irqsave(&swqueue->lock, flags);
> + list_del_init(&p->swqueue_node);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> + }
> +}
>
> /*
> * For asym packing, by default the lower numbered CPU has higher priority.

*sigh*... pretty much all, if not all of this is called with rq->lock
held. So why the irqsave and big fat fail for using spinlock :-(

2023-06-13 08:50:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS


Still gotta read it properly, however:

On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c

Could you please also benchmark on something Intel that has these stupid
large LLCs ?

Because the last time I tried something like this, it came apart real
quick. And AMD has these relatively small 8-core LLCs.


2023-06-14 04:47:07

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
>
> Still gotta read it properly, however:
>
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
>
> Could you please also benchmark on something Intel that has these stupid
> large LLCs ?
>
> Because the last time I tried something like this, it came apart real
> quick. And AMD has these relatively small 8-core LLCs.

I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each
socket has a single LLC with 32 cores/64threads.

When running netperf with nr_thread=128, runtime=60:

"
netserver -4

for i in `seq $nr_threads`; do
netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime &
done

wait
"

The lock contention due to the per-LLC swqueue->lock is quite serious:

83.39% 83.33% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
|
|--42.86%--__libc_sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --42.81%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet_sendmsg
| udp_sendmsg
| udp_send_skb
| ip_send_skb
| ip_output
| ip_finish_output
| __ip_finish_output
| ip_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| |
| --42.81%--net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ip_rcv
| ip_local_deliver
| ip_local_deliver_finish
| ip_protocol_deliver_rcu
| udp_rcv
| __udp4_lib_rcv
| udp_unicast_rcv_skb
| udp_queue_rcv_skb
| udp_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --42.81%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| _raw_spin_lock_irqsave
| |
| --42.81%--native_queued_spin_lock_slowpath
|
|--20.39%--0
| __libc_recvfrom
| entry_SYSCALL_64
| do_syscall_64
| __x64_sys_recvfrom
| __sys_recvfrom
| sock_recvmsg
| inet_recvmsg
| udp_recvmsg
| __skb_recv_udp
| __skb_wait_for_more_packets
| schedule_timeout
| schedule
| __schedule
| pick_next_task_fair
| |
| --20.39%--swqueue_remove_task
| _raw_spin_lock_irqsave
| |
| --20.39%--native_queued_spin_lock_slowpath
|
--20.07%--__libc_recvfrom
entry_SYSCALL_64
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
sock_recvmsg
inet_recvmsg
udp_recvmsg
__skb_recv_udp
__skb_wait_for_more_packets
schedule_timeout
schedule
__schedule
|
--20.06%--pick_next_task_fair
swqueue_remove_task
_raw_spin_lock_irqsave
|
--20.06%--native_queued_spin_lock_slowpath

I suppose that is because there are too many CPUs in a single LLC on
this machine and when all these CPUs try to queue/pull tasks in this
per-LLC shared wakequeue, it just doesn't scale well.

2023-06-14 09:32:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 14, 2023 at 12:35:29PM +0800, Aaron Lu wrote:
> On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
> >
> > Still gotta read it properly, however:
> >
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
> >
> > Could you please also benchmark on something Intel that has these stupid
> > large LLCs ?
> >
> > Because the last time I tried something like this, it came apart real
> > quick. And AMD has these relatively small 8-core LLCs.
>
> I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each
> socket has a single LLC with 32 cores/64threads.

> The lock contention due to the per-LLC swqueue->lock is quite serious:

Yep, that's what I was expecting -- complete meltdown that.

2023-06-14 20:59:42

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 13, 2023 at 10:41:11AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > +struct swqueue {
> > + struct list_head list;
> > + spinlock_t lock;
> > +} ____cacheline_aligned;
> > +
> > #ifdef CONFIG_SMP
> > +static struct swqueue *rq_swqueue(struct rq *rq)
> > +{
> > + return rq->cfs.swqueue;
> > +}
> > +
> > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > +{
> > + unsigned long flags;
> > +
> > + struct task_struct *p;
> > +
> > + spin_lock_irqsave(&swqueue->lock, flags);
> > + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > + swqueue_node);
> > + if (p)
> > + list_del_init(&p->swqueue_node);
> > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > +
> > + return p;
> > +}
> > +
> > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > +{
> > + unsigned long flags;
> > + struct swqueue *swqueue;
> > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > + /*
> > + * Only enqueue the task in the shared wakequeue if:
> > + *
> > + * - SWQUEUE is enabled
> > + * - The task is on the wakeup path
> > + * - The task wasn't purposefully migrated to the current rq by
> > + * select_task_rq()
> > + * - The task isn't pinned to a specific CPU
> > + */
> > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > + return;
> > +
> > + swqueue = rq_swqueue(rq);
> > + spin_lock_irqsave(&swqueue->lock, flags);
> > + list_add_tail(&p->swqueue_node, &swqueue->list);
> > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > +}
> > +
> > static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> > {
> > - return 0;
> > + struct swqueue *swqueue;
> > + struct task_struct *p = NULL;
> > + struct rq *src_rq;
> > + struct rq_flags src_rf;
> > + int ret;
> > +
> > + swqueue = rq_swqueue(rq);
> > + if (!list_empty(&swqueue->list))
> > + p = swqueue_pull_task(swqueue);
> > +
> > + if (!p)
> > + return 0;
> > +
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
> > +
> > + src_rq = task_rq_lock(p, &src_rf);
> > +
> > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> > + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
> > +
> > + if (src_rq->cpu != rq->cpu)
> > + ret = 1;
> > + else
> > + ret = -1;
> > +
> > + task_rq_unlock(src_rq, p, &src_rf);
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
> > +
> > + return ret;
> > }
> >
> > static void swqueue_remove_task(struct task_struct *p)
> > -{}
> > +{
> > + unsigned long flags;
> > + struct swqueue *swqueue;
> > +
> > + if (!list_empty(&p->swqueue_node)) {
> > + swqueue = rq_swqueue(task_rq(p));
> > + spin_lock_irqsave(&swqueue->lock, flags);
> > + list_del_init(&p->swqueue_node);
> > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > + }
> > +}
> >
> > /*
> > * For asym packing, by default the lower numbered CPU has higher priority.
>
> *sigh*... pretty much all, if not all of this is called with rq->lock
> held. So why the irqsave and big fat fail for using spinlock :-(

Hi Peter,

Thanks for the quick review. Yeah good call about the irq's -- looks
like we're holding an rq lock on all swqueue paths so we can just use a
raw spinlock. I'll make that change for v2.

Regarding the per-swqueue spinlock being a potential bottleneck, I'll
reply to Aaron's thread on [0] with some numbers I collected locally on
a 26 core / 52 thread Cooperlake host, and a 20/40 x 2 Skylake. The
TL;DR is that I'm not observing the spinlock be contended on either
netperf or kernel compile workloads, with swqueue actually performing
~1 - 2% better than non-swqueue on both of these hosts for netperf.

[0]: https://lore.kernel.org/all/20230614043529.GA1942@ziqianlu-dell/

Thanks,
David

2023-06-15 00:07:01

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 14, 2023 at 12:35:29PM +0800, Aaron Lu wrote:
> On Tue, Jun 13, 2023 at 10:32:03AM +0200, Peter Zijlstra wrote:
> >
> > Still gotta read it properly, however:
> >
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> > > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> > > Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
> >
> > Could you please also benchmark on something Intel that has these stupid
> > large LLCs ?

Sure, please see below.

> > Because the last time I tried something like this, it came apart real
> > quick. And AMD has these relatively small 8-core LLCs.
>
> I tested on Intel(R) Xeon(R) Platinum 8358, which has 2 sockets and each
> socket has a single LLC with 32 cores/64threads.
>
> When running netperf with nr_thread=128, runtime=60:
>
> "
> netserver -4
>
> for i in `seq $nr_threads`; do
> netperf -4 -H 127.0.0.1 -t UDP_RR -c -C -l $runtime &
> done
>
> wait
> "
>
> The lock contention due to the per-LLC swqueue->lock is quite serious:
>
> 83.39% 83.33% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
> |
> |--42.86%--__libc_sendto
> | entry_SYSCALL_64
> | do_syscall_64
> | |
> | --42.81%--__x64_sys_sendto
> | __sys_sendto
> | sock_sendmsg
> | inet_sendmsg
> | udp_sendmsg
> | udp_send_skb
> | ip_send_skb
> | ip_output
> | ip_finish_output
> | __ip_finish_output
> | ip_finish_output2
> | __dev_queue_xmit
> | __local_bh_enable_ip
> | do_softirq.part.0
> | __do_softirq
> | |
> | --42.81%--net_rx_action
> | __napi_poll
> | process_backlog
> | __netif_receive_skb
> | __netif_receive_skb_one_core
> | ip_rcv
> | ip_local_deliver
> | ip_local_deliver_finish
> | ip_protocol_deliver_rcu
> | udp_rcv
> | __udp4_lib_rcv
> | udp_unicast_rcv_skb
> | udp_queue_rcv_skb
> | udp_queue_rcv_one_skb
> | __udp_enqueue_schedule_skb
> | sock_def_readable
> | __wake_up_sync_key
> | __wake_up_common_lock
> | |
> | --42.81%--__wake_up_common
> | receiver_wake_function
> | autoremove_wake_function
> | default_wake_function
> | try_to_wake_up
> | ttwu_do_activate
> | enqueue_task
> | enqueue_task_fair
> | _raw_spin_lock_irqsave
> | |
> | --42.81%--native_queued_spin_lock_slowpath
> |
> |--20.39%--0
> | __libc_recvfrom
> | entry_SYSCALL_64
> | do_syscall_64
> | __x64_sys_recvfrom
> | __sys_recvfrom
> | sock_recvmsg
> | inet_recvmsg
> | udp_recvmsg
> | __skb_recv_udp
> | __skb_wait_for_more_packets
> | schedule_timeout
> | schedule
> | __schedule
> | pick_next_task_fair
> | |
> | --20.39%--swqueue_remove_task
> | _raw_spin_lock_irqsave
> | |
> | --20.39%--native_queued_spin_lock_slowpath
> |
> --20.07%--__libc_recvfrom
> entry_SYSCALL_64
> do_syscall_64
> __x64_sys_recvfrom
> __sys_recvfrom
> sock_recvmsg
> inet_recvmsg
> udp_recvmsg
> __skb_recv_udp
> __skb_wait_for_more_packets
> schedule_timeout
> schedule
> __schedule
> |
> --20.06%--pick_next_task_fair
> swqueue_remove_task
> _raw_spin_lock_irqsave
> |
> --20.06%--native_queued_spin_lock_slowpath
>
> I suppose that is because there are too many CPUs in a single LLC on
> this machine and when all these CPUs try to queue/pull tasks in this
> per-LLC shared wakequeue, it just doesn't scale well.

Hi Aaron,

Thanks for taking a look and running some benchmarks. I tried to
reproduce your results on a 26 core / 52 thread Cooperlake host, and a
20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the
contention on the per-swqueue spinlock you were saw on your Ice Lake.

I ran the following netperf benchmarks:

- netperf -l 60 -n $(nproc) -6
- netperf -l 60 -n $(nproc) -6 -t UDP_RR

Here are the results I'm seeing from taking the average of running those
benchmarks three times on each host (all results are from the "Throughput"
column of the below table).

Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput <<<
bytes bytes bytes secs. 10^6bits/sec


26 / 52 x 1 Cooperlake
----------------------

Default workload:

NO_SWQUEUE: 34103.76
SWQUEUE: 34707.46
Delta: +1.77%

UDP_RR:

NO_SWQUEUE: 57695.29
SWQUEUE: 54827.36
Delta: -4.97%

There's clearly a statistically significant decline in performance for
UDP_RR here, but surprisingly, I don't see any contention on swqueue
locks when I profile with perf. Rather, we seem to be contending on the
rq lock (presumably in swqueue_pick_next_task() which is unexpected (to
me, at least):

7.81% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath
|
--7.81%--queued_spin_lock_slowpath
|
--7.55%--task_rq_lock
pick_next_task_fair
schedule
schedule_timeout
__skb_wait_for_more_packets
__skb_recv_udp
udpv6_recvmsg
inet6_recvmsg
__x64_sys_recvfrom
do_syscall_64
entry_SYSCALL_64
__libc_recvfrom
0x7f923fdfacf0
0
4.97% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave
|
--4.96%--_raw_spin_lock_irqsave
|
|--1.13%--prepare_to_wait_exclusive
| __skb_wait_for_more_packets
| __skb_recv_udp
| udpv6_recvmsg
| inet6_recvmsg
| __x64_sys_recvfrom
| do_syscall_64
| entry_SYSCALL_64
| __libc_recvfrom
| 0x7f923fdfacf0
| 0
|
|--0.92%--default_wake_function
| autoremove_wake_function
| __wake_up_sync_key
| sock_def_readable
| __udp_enqueue_schedule_skb
| udpv6_queue_rcv_one_skb
| __udp6_lib_rcv
| ip6_input
| ipv6_rcv
| process_backlog
| net_rx_action
| __softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
| __libc_sendto
|
|--0.81%--__wake_up_sync_key
| sock_def_readable
| __udp_enqueue_schedule_skb
| udpv6_queue_rcv_one_skb
| __udp6_lib_rcv
| ip6_input
| ipv6_rcv
| process_backlog
| net_rx_action
| __softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
| __libc_sendto
|
|--0.73%--enqueue_task_fair
| |
| --0.72%--default_wake_function
| autoremove_wake_function
| __wake_up_sync_key
| sock_def_readable
| __udp_enqueue_schedule_skb
| udpv6_queue_rcv_one_skb
| __udp6_lib_rcv
| ip6_input
| ipv6_rcv
| process_backlog
| net_rx_action
| __softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
| __libc_sendto
|
--0.68%--task_rq_lock
pick_next_task_fair
schedule
schedule_timeout
__skb_wait_for_more_packets
__skb_recv_udp
udpv6_recvmsg
inet6_recvmsg
__x64_sys_recvfrom
do_syscall_64
entry_SYSCALL_64
__libc_recvfrom
0x7f923fdfacf0
0

The profile without swqueue doesn't show any contention on the rq lock,
but does show us spending a good amount of time in the scheduler:

4.03% netperf [kernel.vmlinux] [k] default_wake_function
|
--3.98%--default_wake_function
autoremove_wake_function
__wake_up_sync_key
sock_def_readable
__udp_enqueue_schedule_skb
udpv6_queue_rcv_one_skb
__udp6_lib_rcv
ip6_input
ipv6_rcv
process_backlog
net_rx_action
__softirqentry_text_start
__local_bh_enable_ip
ip6_output
ip6_local_out
ip6_send_skb
udp_v6_send_skb
udpv6_sendmsg
__sys_sendto
__x64_sys_sendto
do_syscall_64
entry_SYSCALL_64
__libc_sendto
3.70% netperf [kernel.vmlinux] [k] enqueue_entity
|
--3.66%--enqueue_entity
enqueue_task_fair
|
--3.66%--default_wake_function
autoremove_wake_function
__wake_up_sync_key
sock_def_readable
__udp_enqueue_schedule_skb
udpv6_queue_rcv_one_skb
__udp6_lib_rcv
ip6_input
ipv6_rcv
process_backlog
net_rx_action
__softirqentry_text_start
__local_bh_enable_ip
ip6_output
ip6_local_out
ip6_send_skb
udp_v6_send_skb
udpv6_sendmsg
__sys_sendto
__x64_sys_sendto
do_syscall_64
entry_SYSCALL_64
__libc_sendto

There's clearly a measurable impact on performance here due to swqueue
(negative for UDP_RR, positive for the default workload), but this looks
quite different than what you were observing.

20 / 40 x 2 Skylake
-------------------

Default workload:

NO_SWQUEUE: 57437.45
SWQUEUE: 58801.11
Delta: +2.37%

UDP_RR:

NO_SWQUEUE: 75932.28
SWQUEUE: 75232.77
Delta: -.92%

Same observation here. I didn't collect a profile, but the trend seems
consistent, and there's clearly a tradeoff. Despite the small drop in
perf for UDP_RR, it seems quite a bit less drastic than what would be
expected with the contention you showed in your profile.

7950X (8 cores / 16 threads per CCX, 2 CCXs)
--------------------------------------------

Default workload:

NO_SWQUEUE: 77615.08
SWQUEUE: 77465.73
Delta: -.19%

UDP_RR:

NO_SWQUEUE: 230258.75
SWQUEUE: 230280.34
Delta: ~0%

I'd call this essentially a no-op.

Milan (6 cores / 12 threads per CCX, 6 CCXs)
--------------------------------------------

Default workload:

NO_SWQUEUE: 23571.39
SWQUEUE: 23791.47
Delta: +.93%

UDP_RR:

NO_SWQUEUE: 40954.66
SWQUEUE: 40385.08
Delta: -1.39%

Same story here as above. Interesting to note though that Milan has a
small numer of cores per-LLC. I collected a profile, but I'm not sure
how useful it is given that so much of the profile seems to be indicting
perf as the bottleneck:

24.93% 664 netperf [kernel.vmlinux] [k] x86_pmu_disable_all
|
---x86_pmu_disable_all
amd_pmu_disable_all
|
|--23.87%--__perf_event_task_sched_out
| schedule
| |
| --23.84%--schedule_timeout
| __skb_wait_for_more_packets
| __skb_recv_udp
| udpv6_recvmsg
| inet6_recvmsg
| __x64_sys_recvfrom
| do_syscall_64
| entry_SYSCALL_64
| __libc_recvfrom
| 0x7fbfa69facf0
| 0
|
--0.62%--perf_mux_hrtimer_handler
hrtimer_interrupt
__sysvec_apic_timer_interrupt
sysvec_apic_timer_interrupt
asm_sysvec_apic_timer_interrupt
5.33% 135 netperf [kernel.vmlinux] [k] amd_pmu_test_overflow_topbit
|
---amd_pmu_test_overflow_topbit
amd_pmu_check_overflow
|
--5.18%--__perf_event_task_sched_out
schedule
schedule_timeout
__skb_wait_for_more_packets
__skb_recv_udp
udpv6_recvmsg
inet6_recvmsg
__x64_sys_recvfrom
do_syscall_64
entry_SYSCALL_64
__libc_recvfrom
0x7fbfa69facf0
0
3.73% 102 netperf [kernel.vmlinux] [k] native_sched_clock
|
---native_sched_clock
|
|--1.18%--sched_clock_cpu
| |
| --0.75%--irqtime_account_irq
| |
| --0.71%--__softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
| __libc_sendto
|
|--1.18%--vtime_user_exit
| __context_tracking_exit
| syscall_enter_from_user_mode
| do_syscall_64
| entry_SYSCALL_64
| |
| |--0.61%--__libc_recvfrom
| | 0x7fbfa69facf0
| | 0
| |
| --0.57%--__libc_sendto
|
--0.51%--vtime_user_enter
__context_tracking_enter
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64
3.53% 107 netperf [kernel.vmlinux] [k] copy_user_generic_string
|
---copy_user_generic_string
|
|--1.45%--_copy_to_iter
| udpv6_recvmsg
| inet6_recvmsg
| __x64_sys_recvfrom
| do_syscall_64
| entry_SYSCALL_64
| __libc_recvfrom
| 0x7fbfa69facf0
| 0
|
|--0.91%--_copy_to_user
| move_addr_to_user
| __x64_sys_recvfrom
| do_syscall_64
| entry_SYSCALL_64
| __libc_recvfrom
| 0x7fbfa69facf0
| 0
|
--0.73%--_copy_from_user
__sys_sendto
__x64_sys_sendto
do_syscall_64
entry_SYSCALL_64
__libc_sendto
2.98% 79 netperf [kernel.vmlinux] [k] default_wake_function
|
---default_wake_function
autoremove_wake_function
__wake_up_sync_key
sock_def_readable
__udp_enqueue_schedule_skb
udpv6_queue_rcv_one_skb
__udp6_lib_rcv
ip6_input
ipv6_rcv
process_backlog
net_rx_action
__softirqentry_text_start
__local_bh_enable_ip
ip6_output
ip6_local_out
ip6_send_skb
udp_v6_send_skb
udpv6_sendmsg
__sys_sendto
__x64_sys_sendto
do_syscall_64
entry_SYSCALL_64
__libc_sendto

In any case, as expected, the swqueue lock doesn't seem to be causing
any issues here.


I was wondering if the contention you were seeing could be due to
something like TTWU_QUEUE (though I wouldn't have expected that because
we don't use the swqueue if a task is migrated following
select_task_rq()), so I also tried collecting data with NO_TTWU_QUEUE
(only on the 2-socket SKL), but the results matched what I observed
without it:

Default workload:

NO_SWQUEUE: 57552.69
SWQUEUE: 58320.29
Delta: +1.33%

UDP_RR:

NO_SWQUEUE: 76296.46
SWQUEUE: 75418.35
Delta: -1.15%


I also tried running kernel compile workloads just to verify they didn't
regress per Peter's request:


26 / 52 x 1 Cooperlake
----------------------

NO_SWQUEUE: 1187.72s
SWQUEUE: 1185.39s
Delta: +.20%


20 / 40 x 2 Skylake
-------------------

NO_SWQUEUE: 418.15s
SWQUEUE: 418.87s
Delta: -.17%

I'd call this essentially a no-op for kernel compile.


In general, the results indicate some wins and some losses.

For kernel compile, larger LLCs with more cores was a no-op. For
netperf, the default workload had some gains, and for UPD_RR, some
losses.

With all that said, I have a few thoughts.

Firstly, would you mind please sharing your .config? It's possible that
the hosts I'm testing on just don't have big enough LLCs to observe the
contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host
is smaller than a single socket of your 32/64 ICL host. On the other
hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to
see if there's a .config difference here that's causing the contention.
Also, the fact that Milan (which has only 6 cores / 12 threads per LLC)
also saw a performance hit with swqueue for UDP_RR suggests to me that
the issue with UDP_RR is not the scalability of the per-LLC swqueue
spinlock.

Regardless, even if there are some SKUs that have so many cores per LLC
that we do end up contending, I don't think that should preclude adding
a feature that which does benefit plenty of other configurations that
have fewer cores per LLC.

Thanks,
David

2023-06-15 05:02:18

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Hi David,

On Wed, Jun 14, 2023 at 07:01:03PM -0500, David Vernet wrote:
> Hi Aaron,
>
> Thanks for taking a look and running some benchmarks. I tried to
> reproduce your results on a 26 core / 52 thread Cooperlake host, and a
> 20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the
> contention on the per-swqueue spinlock you were saw on your Ice Lake.
>
> I ran the following netperf benchmarks:
>
> - netperf -l 60 -n $(nproc) -6
> - netperf -l 60 -n $(nproc) -6 -t UDP_RR

Just to confirm: did you run multiple of the above or just one?

>
> Here are the results I'm seeing from taking the average of running those
> benchmarks three times on each host (all results are from the "Throughput"
> column of the below table).
>
> Recv Send Send
> Socket Socket Message Elapsed
> Size Size Size Time Throughput <<<
> bytes bytes bytes secs. 10^6bits/sec
>
>
> 26 / 52 x 1 Cooperlake
> ----------------------
>
> Default workload:
>
> NO_SWQUEUE: 34103.76
> SWQUEUE: 34707.46
> Delta: +1.77%
>
> UDP_RR:
>
> NO_SWQUEUE: 57695.29
> SWQUEUE: 54827.36
> Delta: -4.97%
>
> There's clearly a statistically significant decline in performance for
> UDP_RR here, but surprisingly, I don't see any contention on swqueue
> locks when I profile with perf. Rather, we seem to be contending on the
> rq lock (presumably in swqueue_pick_next_task() which is unexpected (to
> me, at least):

I don't see rq lock contention though and since you have seen some
contention here, I suppose you should have started multiple instances of
netperf client?

>
> 7.81% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath
> |
> --7.81%--queued_spin_lock_slowpath
> |
> --7.55%--task_rq_lock
> pick_next_task_fair
> schedule
> schedule_timeout
> __skb_wait_for_more_packets
> __skb_recv_udp
> udpv6_recvmsg
> inet6_recvmsg
> __x64_sys_recvfrom
> do_syscall_64
> entry_SYSCALL_64
> __libc_recvfrom
> 0x7f923fdfacf0
> 0
> 4.97% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> |
> --4.96%--_raw_spin_lock_irqsave
> |
> |--1.13%--prepare_to_wait_exclusive
> | __skb_wait_for_more_packets
> | __skb_recv_udp
> | udpv6_recvmsg
> | inet6_recvmsg
> | __x64_sys_recvfrom
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_recvfrom
> | 0x7f923fdfacf0
> | 0
> |
> |--0.92%--default_wake_function
> | autoremove_wake_function
> | __wake_up_sync_key
> | sock_def_readable
> | __udp_enqueue_schedule_skb
> | udpv6_queue_rcv_one_skb
> | __udp6_lib_rcv
> | ip6_input
> | ipv6_rcv
> | process_backlog
> | net_rx_action
> | __softirqentry_text_start
> | __local_bh_enable_ip
> | ip6_output
> | ip6_local_out
> | ip6_send_skb
> | udp_v6_send_skb
> | udpv6_sendmsg
> | __sys_sendto
> | __x64_sys_sendto
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_sendto
> |
> |--0.81%--__wake_up_sync_key
> | sock_def_readable
> | __udp_enqueue_schedule_skb
> | udpv6_queue_rcv_one_skb
> | __udp6_lib_rcv
> | ip6_input
> | ipv6_rcv
> | process_backlog
> | net_rx_action
> | __softirqentry_text_start
> | __local_bh_enable_ip
> | ip6_output
> | ip6_local_out
> | ip6_send_skb
> | udp_v6_send_skb
> | udpv6_sendmsg
> | __sys_sendto
> | __x64_sys_sendto
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_sendto
> |
> |--0.73%--enqueue_task_fair
> | |
> | --0.72%--default_wake_function
> | autoremove_wake_function
> | __wake_up_sync_key
> | sock_def_readable
> | __udp_enqueue_schedule_skb
> | udpv6_queue_rcv_one_skb
> | __udp6_lib_rcv
> | ip6_input
> | ipv6_rcv
> | process_backlog
> | net_rx_action
> | __softirqentry_text_start
> | __local_bh_enable_ip
> | ip6_output
> | ip6_local_out
> | ip6_send_skb
> | udp_v6_send_skb
> | udpv6_sendmsg
> | __sys_sendto
> | __x64_sys_sendto
> | do_syscall_64
> | entry_SYSCALL_64
> | __libc_sendto
> |
> --0.68%--task_rq_lock
> pick_next_task_fair
> schedule
> schedule_timeout
> __skb_wait_for_more_packets
> __skb_recv_udp
> udpv6_recvmsg
> inet6_recvmsg
> __x64_sys_recvfrom
> do_syscall_64
> entry_SYSCALL_64
> __libc_recvfrom
> 0x7f923fdfacf0
> 0
>
> The profile without swqueue doesn't show any contention on the rq lock,
> but does show us spending a good amount of time in the scheduler:
>
> 4.03% netperf [kernel.vmlinux] [k] default_wake_function
> |
> --3.98%--default_wake_function
> autoremove_wake_function
> __wake_up_sync_key
> sock_def_readable
> __udp_enqueue_schedule_skb
> udpv6_queue_rcv_one_skb
> __udp6_lib_rcv
> ip6_input
> ipv6_rcv
> process_backlog
> net_rx_action
> __softirqentry_text_start
> __local_bh_enable_ip
> ip6_output
> ip6_local_out
> ip6_send_skb
> udp_v6_send_skb
> udpv6_sendmsg
> __sys_sendto
> __x64_sys_sendto
> do_syscall_64
> entry_SYSCALL_64
> __libc_sendto
> 3.70% netperf [kernel.vmlinux] [k] enqueue_entity
> |
> --3.66%--enqueue_entity
> enqueue_task_fair
> |
> --3.66%--default_wake_function
> autoremove_wake_function
> __wake_up_sync_key
> sock_def_readable
> __udp_enqueue_schedule_skb
> udpv6_queue_rcv_one_skb
> __udp6_lib_rcv
> ip6_input
> ipv6_rcv
> process_backlog
> net_rx_action
> __softirqentry_text_start
> __local_bh_enable_ip
> ip6_output
> ip6_local_out
> ip6_send_skb
> udp_v6_send_skb
> udpv6_sendmsg
> __sys_sendto
> __x64_sys_sendto
> do_syscall_64
> entry_SYSCALL_64
> __libc_sendto
>
> There's clearly a measurable impact on performance here due to swqueue
> (negative for UDP_RR, positive for the default workload), but this looks
> quite different than what you were observing.

Yes.

>
> 20 / 40 x 2 Skylake
> -------------------
>
> Default workload:
>
> NO_SWQUEUE: 57437.45
> SWQUEUE: 58801.11
> Delta: +2.37%
>
> UDP_RR:
>
> NO_SWQUEUE: 75932.28
> SWQUEUE: 75232.77
> Delta: -.92%
>
> Same observation here. I didn't collect a profile, but the trend seems
> consistent, and there's clearly a tradeoff. Despite the small drop in
> perf for UDP_RR, it seems quite a bit less drastic than what would be
> expected with the contention you showed in your profile.

Indeed.

>
> 7950X (8 cores / 16 threads per CCX, 2 CCXs)
> --------------------------------------------
>
> Default workload:
>
> NO_SWQUEUE: 77615.08
> SWQUEUE: 77465.73
> Delta: -.19%
>
> UDP_RR:
>
> NO_SWQUEUE: 230258.75
> SWQUEUE: 230280.34
> Delta: ~0%
>
> I'd call this essentially a no-op.
>

> With all that said, I have a few thoughts.
>
> Firstly, would you mind please sharing your .config? It's possible that
> the hosts I'm testing on just don't have big enough LLCs to observe the
> contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host
> is smaller than a single socket of your 32/64 ICL host. On the other
> hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to

Agreed. Your 26cores/52threads isn't that smaller than mine and I had
expected to see something similar.

> see if there's a .config difference here that's causing the contention.

Attached my .config.
I also pushed the branch I used for testing to github just in case you
want to take a look: https://github.com/aaronlu/linux.git swqueue

> Also, the fact that Milan (which has only 6 cores / 12 threads per LLC)
> also saw a performance hit with swqueue for UDP_RR suggests to me that
> the issue with UDP_RR is not the scalability of the per-LLC swqueue
> spinlock.

I've tested again today and I still saw serious contention on
swqueue->lock with your cmdline. I did it this way:
"
$ netserver
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC

$ for i in `seq 128`; do netperf -l 60 -n 128 -6 -t UDP_RR & done
"
And the profile is about the same as my last posting:

83.23% 83.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
|
|--40.23%--sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --40.22%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet6_sendmsg
| udpv6_sendmsg
| udp_v6_send_skb
| ip6_send_skb
| ip6_local_out
| ip6_output
| ip6_finish_output
| ip6_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ipv6_rcv
| ip6_input
| ip6_input_finish
| ip6_protocol_deliver_rcu
| udpv6_rcv
| __udp6_lib_rcv
| udp6_unicast_rcv_skb
| udpv6_queue_rcv_skb
| udpv6_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --40.22%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| |
| --40.22%--_raw_spin_lock_irqsave
| |
| --40.21%--native_queued_spin_lock_slowpath
|
|--38.59%--recvfrom
| |
| --38.59%--entry_SYSCALL_64
| do_syscall_64
| __x64_sys_recvfrom
| __sys_recvfrom
| sock_recvmsg
| inet6_recvmsg
| udpv6_recvmsg
| __skb_recv_udp
| __skb_wait_for_more_packets
| schedule_timeout
| schedule
| __schedule
| |
| --38.59%--pick_next_task_fair
| |
| --38.59%--swqueue_remove_task
| |
| --38.59%--_raw_spin_lock_irqsave
| |
| --38.58%--native_queued_spin_lock_slowpath
|
|--2.25%--sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --2.25%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet6_sendmsg
| udpv6_sendmsg
| udp_v6_send_skb
| ip6_send_skb
| ip6_local_out
| ip6_output
| ip6_finish_output
| ip6_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ipv6_rcv
| ip6_input
| ip6_input_finish
| ip6_protocol_deliver_rcu
| udpv6_rcv
| __udp6_lib_rcv
| udp6_unicast_rcv_skb
| udpv6_queue_rcv_skb
| udpv6_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --2.25%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| _raw_spin_lock_irqsave
| |
| --2.25%--native_queued_spin_lock_slowpath
|
--2.10%--recvfrom
entry_SYSCALL_64
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
sock_recvmsg
inet6_recvmsg
udpv6_recvmsg
__skb_recv_udp
__skb_wait_for_more_packets
schedule_timeout
schedule
__schedule
|
--2.09%--pick_next_task_fair
swqueue_remove_task
_raw_spin_lock_irqsave
|
--2.09%--native_queued_spin_lock_slowpath

During the test, vmstat showed lines like this:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
185 0 0 128555784 47348 1791244 0 0 0 0 32387 4380484 1 98 0 0 0

When swqueue is disabled, vmstat showed lines like this and there is no
lock contention:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
116 0 0 128977136 58324 1338760 0 0 0 16 599677 9734395 5 72 23 0 0

The runnable tasks are a lot more when swqueue is in use and sys% also
increased, perhaps due to the lock contention.

I'll see if I can find a smaller machine and give it a run there too.

Thanks,
Aaron


Attachments:
(No filename) (21.56 kB)
config-6.4.0-rc5-00237-g06b8769b15ae.gz (65.83 kB)
Download all attachments

2023-06-15 07:38:46

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote:
> I'll see if I can find a smaller machine and give it a run there too.

Found a Skylake with 18cores/36threads on each socket/LLC and with
netperf, the contention is still serious.

"
$ netserver
$ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
$ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done
"

53.61% 53.61% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
|
|--27.93%--sendto
| entry_SYSCALL_64
| do_syscall_64
| |
| --27.93%--__x64_sys_sendto
| __sys_sendto
| sock_sendmsg
| inet6_sendmsg
| udpv6_sendmsg
| udp_v6_send_skb
| ip6_send_skb
| ip6_local_out
| ip6_output
| ip6_finish_output
| ip6_finish_output2
| __dev_queue_xmit
| __local_bh_enable_ip
| do_softirq.part.0
| __do_softirq
| net_rx_action
| __napi_poll
| process_backlog
| __netif_receive_skb
| __netif_receive_skb_one_core
| ipv6_rcv
| ip6_input
| ip6_input_finish
| ip6_protocol_deliver_rcu
| udpv6_rcv
| __udp6_lib_rcv
| udp6_unicast_rcv_skb
| udpv6_queue_rcv_skb
| udpv6_queue_rcv_one_skb
| __udp_enqueue_schedule_skb
| sock_def_readable
| __wake_up_sync_key
| __wake_up_common_lock
| |
| --27.85%--__wake_up_common
| receiver_wake_function
| autoremove_wake_function
| default_wake_function
| try_to_wake_up
| |
| --27.85%--ttwu_do_activate
| enqueue_task
| enqueue_task_fair
| |
| --27.85%--_raw_spin_lock_irqsave
| |
| --27.85%--native_queued_spin_lock_slowpath
|
--25.67%--recvfrom
entry_SYSCALL_64
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
sock_recvmsg
inet6_recvmsg
udpv6_recvmsg
__skb_recv_udp
|
--25.67%--__skb_wait_for_more_packets
schedule_timeout
schedule
__schedule
|
--25.66%--pick_next_task_fair
|
--25.65%--swqueue_remove_task
|
--25.65%--_raw_spin_lock_irqsave
|
--25.65%--native_queued_spin_lock_slowpath

I didn't aggregate the throughput(Trans. Rate per sec) from all these
clients, but a glimpse from the result showed that the throughput of
each client dropped from 4xxxx(NO_SWQUEUE) to 2xxxx(SWQUEUE).

Thanks,
Aaron

2023-06-16 00:18:28

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Thu, Jun 15, 2023 at 03:31:53PM +0800, Aaron Lu wrote:
> On Thu, Jun 15, 2023 at 12:49:17PM +0800, Aaron Lu wrote:
> > I'll see if I can find a smaller machine and give it a run there too.
>
> Found a Skylake with 18cores/36threads on each socket/LLC and with
> netperf, the contention is still serious.
>
> "
> $ netserver
> $ sudo sh -c "echo SWQUEUE > /sys/kernel/debug/sched/features"
> $ for i in `seq 72`; do netperf -l 60 -n 72 -6 -t UDP_RR & done
> "
>
> 53.61% 53.61% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath - -
> |
> |--27.93%--sendto
> | entry_SYSCALL_64
> | do_syscall_64
> | |
> | --27.93%--__x64_sys_sendto
> | __sys_sendto
> | sock_sendmsg
> | inet6_sendmsg
> | udpv6_sendmsg
> | udp_v6_send_skb
> | ip6_send_skb
> | ip6_local_out
> | ip6_output
> | ip6_finish_output
> | ip6_finish_output2
> | __dev_queue_xmit
> | __local_bh_enable_ip
> | do_softirq.part.0
> | __do_softirq
> | net_rx_action
> | __napi_poll
> | process_backlog
> | __netif_receive_skb
> | __netif_receive_skb_one_core
> | ipv6_rcv
> | ip6_input
> | ip6_input_finish
> | ip6_protocol_deliver_rcu
> | udpv6_rcv
> | __udp6_lib_rcv
> | udp6_unicast_rcv_skb
> | udpv6_queue_rcv_skb
> | udpv6_queue_rcv_one_skb
> | __udp_enqueue_schedule_skb
> | sock_def_readable
> | __wake_up_sync_key
> | __wake_up_common_lock
> | |
> | --27.85%--__wake_up_common
> | receiver_wake_function
> | autoremove_wake_function
> | default_wake_function
> | try_to_wake_up
> | |
> | --27.85%--ttwu_do_activate
> | enqueue_task
> | enqueue_task_fair
> | |
> | --27.85%--_raw_spin_lock_irqsave
> | |
> | --27.85%--native_queued_spin_lock_slowpath
> |
> --25.67%--recvfrom
> entry_SYSCALL_64
> do_syscall_64
> __x64_sys_recvfrom
> __sys_recvfrom
> sock_recvmsg
> inet6_recvmsg
> udpv6_recvmsg
> __skb_recv_udp
> |
> --25.67%--__skb_wait_for_more_packets
> schedule_timeout
> schedule
> __schedule
> |
> --25.66%--pick_next_task_fair
> |
> --25.65%--swqueue_remove_task
> |
> --25.65%--_raw_spin_lock_irqsave
> |
> --25.65%--native_queued_spin_lock_slowpath
>
> I didn't aggregate the throughput(Trans. Rate per sec) from all these
> clients, but a glimpse from the result showed that the throughput of
> each client dropped from 4xxxx(NO_SWQUEUE) to 2xxxx(SWQUEUE).
>
> Thanks,
> Aaron

Ok, it seems that the issue is that I wasn't creating enough netperf
clients. I assumed that -n $(nproc) was sufficient. I was able to repro
the contention on my 26 core / 52 thread skylake client as well:


41.01% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath
|
--41.01%--queued_spin_lock_slowpath
|
--40.63%--_raw_spin_lock_irqsave
|
|--21.18%--enqueue_task_fair
| |
| --21.09%--default_wake_function
| |
| --21.09%--autoremove_wake_function
| |
| --21.09%--__wake_up_sync_key
| sock_def_readable
| __udp_enqueue_schedule_skb
| udpv6_queue_rcv_one_skb
| __udp6_lib_rcv
| ip6_input
| ipv6_rcv
| process_backlog
| net_rx_action
| |
| --21.09%--__softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
|
--19.44%--swqueue_remove_task
|
--19.42%--pick_next_task_fair
|
--19.42%--schedule
|
--19.21%--schedule_timeout
__skb_wait_for_more_packets
__skb_recv_udp
udpv6_recvmsg
inet6_recvmsg
__x64_sys_recvfrom
do_syscall_64
entry_SYSCALL_64
40.87% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath
|
--40.87%--queued_spin_lock_slowpath
|
--40.51%--_raw_spin_lock_irqsave
|
|--21.03%--enqueue_task_fair
| |
| --20.94%--default_wake_function
| |
| --20.94%--autoremove_wake_function
| |
| --20.94%--__wake_up_sync_key
| sock_def_readable
| __udp_enqueue_schedule_skb
| udpv6_queue_rcv_one_skb
| __udp6_lib_rcv
| ip6_input
| ipv6_rcv
| process_backlog
| net_rx_action
| |
| --20.94%--__softirqentry_text_start
| __local_bh_enable_ip
| ip6_output
| ip6_local_out
| ip6_send_skb
| udp_v6_send_skb
| udpv6_sendmsg
| __sys_sendto
| __x64_sys_sendto
| do_syscall_64
| entry_SYSCALL_64
|
--19.48%--swqueue_remove_task
|
--19.47%--pick_next_task_fair
schedule
|
--19.38%--schedule_timeout
__skb_wait_for_more_packets
__skb_recv_udp
udpv6_recvmsg
inet6_recvmsg
__x64_sys_recvfrom
do_syscall_64
entry_SYSCALL_64

Thanks for the help in getting the repro on my end.

So yes, there is certainly a scalability concern to bear in mind for
swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly
e.g. blocking and waking on futexes in a tight loop, I expect a similar
issue would be observed.

On the other hand, the issue did not occur on my 7950X. I also wasn't
able to repro the contention on the Skylake if I ran with the default
netperf workload rather than UDP_RR (even with the additional clients).
I didn't bother to take the mean of all of the throughput results
between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.

So swqueue isn't ideal for every configuration, but I'll echo my
sentiment from [0] that this shouldn't on its own necessarily preclude
it from being merged given that it does help a large class of
configurations and workloads, and it's disabled by default.

[0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

Thanks,
David

2023-06-16 01:17:00

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Thu, Jun 15, 2023 at 06:26:05PM -0500, David Vernet wrote:

> Ok, it seems that the issue is that I wasn't creating enough netperf
> clients. I assumed that -n $(nproc) was sufficient. I was able to repro

Yes that switch is confusing.

> the contention on my 26 core / 52 thread skylake client as well:
>
>

> Thanks for the help in getting the repro on my end.

You are welcome.

> So yes, there is certainly a scalability concern to bear in mind for
> swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly
> e.g. blocking and waking on futexes in a tight loop, I expect a similar
> issue would be observed.
>
> On the other hand, the issue did not occur on my 7950X. I also wasn't

Using netperf/UDP_RR?

> able to repro the contention on the Skylake if I ran with the default
> netperf workload rather than UDP_RR (even with the additional clients).

I also tried that on the 18cores/36threads/LLC Skylake and the contention
is indeed much smaller than UDP_RR:

7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

But I wouldn't say it's entirely gone. Also consider Skylake has a lot
fewer cores per LLC than later Intel servers like Icelake and Sapphire
Rapids and I expect things would be worse on those two machines.

> I didn't bother to take the mean of all of the throughput results
> between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.
>
> So swqueue isn't ideal for every configuration, but I'll echo my
> sentiment from [0] that this shouldn't on its own necessarily preclude
> it from being merged given that it does help a large class of
> configurations and workloads, and it's disabled by default.
>
> [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

I was wondering: does it make sense to do some divide on machines with
big LLCs? Like converting the per-LLC swqueue to per-group swqueue where
the group can be made of ~8 cpus of the same LLC. This will have a
similar effect of reducing the number of CPUs in a single LLC so the
scalability issue can hopefully be fixed while at the same time, it
might still help some workloads. I realized this isn't ideal in that
wakeup happens at LLC scale so the group thing may not fit very well
here.

Just a thought, feel free to ignore it if you don't think this is
feasible :-)

Thanks,
Aaron

2023-06-16 08:24:52

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, 13 Jun 2023 at 07:20, David Vernet <[email protected]> wrote:
>
> Overview
> ========
>
> The scheduler must constantly strike a balance between work
> conservation, and avoiding costly migrations which harm performance due
> to e.g. decreased cache locality. The matter is further complicated by
> the topology of the system. Migrating a task between cores on the same
> LLC may be more optimal than keeping a task local to the CPU, whereas
> migrating a task between LLCs or NUMA nodes may tip the balance in the
> other direction.
>
> With that in mind, while CFS is by and large mostly a work conserving
> scheduler, there are certain instances where the scheduler will choose
> to keep a task local to a CPU, when it would have been more optimal to
> migrate it to an idle core.
>
> An example of such a workload is the HHVM / web workload at Meta. HHVM
> is a VM that JITs Hack and PHP code in service of web requests. Like
> other JIT / compilation workloads, it tends to be heavily CPU bound, and
> exhibit generally poor cache locality. To try and address this, we set
> several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
>
> - migration_cost_ns -> 0
> - latency_ns -> 20000000
> - min_granularity_ns -> 10000000
> - wakeup_granularity_ns -> 12000000
>
> These knobs are intended both to encourage the scheduler to be as work
> conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> running for relatively long time slices so as to avoid the overhead of
> context switching (the other knobs). Collectively, these knobs provide a
> substantial performance win; resulting in roughly a 20% improvement in
> throughput. Worth noting, however, is that this improvement is _not_ at
> full machine saturation.
>
> That said, even with these knobs, we noticed that CPUs were still going
> idle even when the host was overcommitted. In response, we wrote the
> "shared wakequeue" (swqueue) feature proposed in this patch set. The
> idea behind swqueue is simple: it enables the scheduler to be
> aggressively work conserving by placing a waking task into a per-LLC
> FIFO queue that can be pulled from by another core in the LLC FIFO queue
> which can then be pulled from before it goes idle.

This seems to be just another newly idle load balance outside the current one !

The knobs above are not the only thing preventing a rq to pull a new
task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost
stuff which might be one main root cause for one of your cpu not
pulling a waiting task

It's not clear in your explanation why fixing newly_idle_load_balance
was not possible instead of adding outside code and what prevents
newly_idle_load balance from picking a task in your case ?

For example, have you tried to disable the early break because of avg_idle ?

>
> With this simple change, we were able to achieve a 1 - 1.6% improvement
> in throughput, as well as a small, consistent improvement in p95 and p99
> latencies, in HHVM. These performance improvements were in addition to
> the wins from the debugfs knobs mentioned above.
>
> Design
> ======
>
> The design of swqueue is quite simple. An swqueue is simply a struct
> list_head, and a spinlock:
>
> struct swqueue {
> struct list_head list;
> spinlock_t lock;
> } ____cacheline_aligned;
>
> We create a struct swqueue per LLC, ensuring they're in their own
> cachelines to avoid false sharing between CPUs on different LLCs.
>
> When a task first wakes up, it enqueues itself in the swqueue of its
> current LLC at the end of enqueue_task_fair(). Enqueues only happen if
> the task was not manually migrated to the current core by
> select_task_rq(), and is not pinned to a specific CPU.
>
> A core will pull a task from its LLC's swqueue before calling
> newidle_balance().
>
> Difference between SIS_NODE
> ===========================
>
> In [0] Peter proposed a patch that addresses Tejun's observations that
> when workqueues are targeted towards a specific LLC on his Zen2 machine
> with small CCXs, that there would be significant idle time due to
> select_idle_sibling() not considering anything outside of the current
> LLC.
>
> This patch (SIS_NODE) is essentially the complement to the proposal
> here. SID_NODE causes waking tasks to look for idle cores in neighboring
> LLCs on the same die, whereas swqueue causes cores about to go idle to
> look for enqueued tasks. That said, in its current form, the two
> features at are a different scope as SIS_NODE searches for idle cores
> between LLCs, while swqueue enqueues tasks within a single LLC.
>
> The patch was since removed in [1], but we elect to compare its
> performance to swqueue given that as described above, it's conceptually
> complementary.
>
> [0]: https://lore.kernel.org/all/[email protected]/
> [1]: https://lore.kernel.org/all/[email protected]/
>
> I observed that while SIS_NODE works quite well for hosts with small
> CCXs, it can result in degraded performance on machines either with a
> large number of total cores in a CCD, or for which the cache miss
> penalty of migrating between CCXs is high, even on the same die.
>
> For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
> through a single link to the IO die, and thus have similar cache miss
> latencies as cores in remote CCDs.
>
> Such subtleties could be taken into account with SIS_NODE, but
> regardless, both features are conceptually complementary sides of the
> same coin. SIS_NODE searches for idle cores for waking threads, whereas
> swqueue searches for available work before a core goes idle.
>
> Results
> =======
>
> Note that the motivation for the shared wakequeue feature was originally
> arrived at using experiments in the sched_ext framework that's currently
> being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
> is similarly visible using work-conserving sched_ext schedulers (even
> very simple ones like global FIFO).
>
> In both single and multi socket / CCX hosts, this can measurably improve
> performance. In addition to the performance gains observed on our
> internal web workloads, we also observed an improvement in common
> workloads such as kernel compile when running shared wakequeue. Here are
> the results of running make -j$(nproc) built-in.a on several different
> types of hosts configured with make allyesconfig on commit a27648c74210
> ("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
> tree (boost was disabled on all of these hosts when the experiments were
> performed):
>
> Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
>
> CPU max MHz: 5879.8818
> CPU min MHz: 3000.0000
> o____________o_______o
> | mean | CPU |
> o------------o-------o
> NO_SWQUEUE + NO_SIS_NODE: | 590.52s | 3103% |
> NO_SWQUEUE + SIS_NODE: | 590.80s | 3102% |
> SWQUEUE + NO_SIS_NODE: | 589.65s | 3116% |
> SWQUEUE + SIS_NODE: | 589.99s | 3115% |
> o------------o-------o
>
> Takeaway: swqueue doesn't seem to provide a statistically significant
> improvement for kernel compile on my 7950X. SIS_NODE similarly does not
> have a noticeable effect on performance.
>
> -------------------------------------------------------------------------------
>
> Single-socket | 72-core | 6-CCX | AMD Milan Zen3
>
> CPU max MHz: 3245.0190
> CPU min MHz: 700.0000
> o_____________o_______o
> | mean | CPU |
> o-------------o-------o
> NO_SWQUEUE + NO_SIS_NODE: | 1608.69s | 6488% |
> NO_SWQUEUE + SIS_NODE: | 1610.24s | 6473% |
> SWQUEUE + NO_SIS_NODE: | 1605.80s | 6504% |
> SWQUEUE + SIS_NODE: | 1606.96s | 6488% |
> o-------------o-------o
>
> Takeaway: swqueue does provide a small statistically significant
> improvement on Milan, but the compile times in general were quite long
> relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock
> frequency. Milan also has larger CCXs than Bergamo, so it stands to
> reason that select_idle_sibling() will have an easier time finding idle
> cores inside the current CCX.
>
> It also seems logical that SIS_NODE would hurt performance a bit here,
> as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
> has to iterate over 72 cores; delaying task wakeup. That said, I'm not
> sure that's a viable theory if total CPU% is lower with SIS_NODE.
>
> -------------------------------------------------------------------------------
>
> Single-socket | 176-core | 11-CCX | 2-CCX per CCD | AMD Bergamo Zen4c
>
> CPU max MHz: 1200.0000
> CPU min MHz: 1000.0000
>
> o____________o________o
> | mean | CPU |
> o------------o--------o
> NO_SWQUEUE + NO_SIS_NODE: | 322.44s | 15534% |
> NO_SWQUEUE + SIS_NODE: | 324.39s | 15508% |
> SWQUEUE + NO_SIS_NODE: | 321.54s | 15603% |
> SWQUEUE + SIS_NODE: | 321.88s | 15622% |
> o------------o--------o
>
> Takeaway: swqueue barely beats NO_SWQUEUE + NO_SIS_NODE, to the point
> that it's arguably not statistically significant.
> SIS_NODE results in a ~.9% performance degradation, for likely the same
> reason as Milan: the host has a large number of LLCs within a single
> socket, so task wakeup latencies suffer due to select_idle_node()
> searching up to 11 CCXs.
>
> Conclusion
> ==========
>
> swqueue in this form seems to provide a small, but noticeable win for
> front-end CPU-bound workloads spread over multiple CCXs. The reason
> seems fairly straightforward: swqueue encourages work conservation
> inside of a CCX by having a CPU do an O(1) pull from a per-LLC queue of
> runnable tasks. As mentioned above, it is complementary to SIS_NODE,
> which searches for idle cores on the wakeup path.
>
> While swqueue in this form encourages work conservation, it of course
> does not guarantee it given that we don't implement any kind of work
> stealing between swqueues. In the future, we could potentially push CPU
> utilization even higher by enabling work stealing between swqueues,
> likely between CCXs on the same NUMA node.
>
> Originally-by: Roman Gushchin <[email protected]>
> Signed-off-by: David Vernet <[email protected]>
> ---
> include/linux/sched.h | 2 +
> kernel/sched/core.c | 2 +
> kernel/sched/fair.c | 175 ++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 2 +
> 4 files changed, 176 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1292d38d66cc..1f4fd22f88a8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -770,6 +770,8 @@ struct task_struct {
> unsigned long wakee_flip_decay_ts;
> struct task_struct *last_wakee;
>
> + struct list_head swqueue_node;
> +
> /*
> * recent_used_cpu is initially set as the last CPU used by a task
> * that wakes affine another task. Waker/wakee relationships can
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d911b0631e7b..e04f0daf1f05 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4533,6 +4533,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> #ifdef CONFIG_SMP
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> + INIT_LIST_HEAD(&p->swqueue_node);
> #endif
> init_sched_mm_cid(p);
> }
> @@ -9872,6 +9873,7 @@ void __init sched_init_smp(void)
>
> init_sched_rt_class();
> init_sched_dl_class();
> + init_sched_fair_class_late();
>
> sched_smp_initialized = true;
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 807986bd6ea6..29fe25794884 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -139,17 +139,151 @@ static int __init setup_sched_thermal_decay_shift(char *str)
> }
> __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>
> +/**
> + * struct swqueue - Per-LLC queue structure for enqueuing and pulling waking
> + * tasks.
> + *
> + * WHAT
> + * ====
> + *
> + * This structure enables the scheduler to be more aggressively work
> + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> + * pulled from when another core in the LLC is going to go idle.
> + *
> + * struct rq stores a pointer to its LLC's swqueue via struct cfs_rq. Waking
> + * tasks are enqueued in a swqueue at the end of enqueue_task_fair(), and are
> + * opportunistically pulled from the swqueue in pick_next_task_fair() prior to
> + * invoking newidle_balance(). Tasks enqueued in a swqueue be scheduled prior
> + * to being pulled from the swqueue, in which case they're simply removed from
> + * the swqueue. A waking task is only enqueued to a swqueue when it was _not_
> + * manually migrated to the current runqueue by select_task_rq_fair().
> + *
> + * There is currently no task-stealing between swqueues in different LLCs,
> + * which means that swqueue is not fully work conserving. This could be added
> + * at a later time, with tasks likely only being stolen across swqueues on the
> + * same NUMA node to avoid violating NUMA affinities.
> + *
> + * HOW
> + * ===
> + *
> + * An swqueue is comprised of a list, and a spinlock for synchronization. Given
> + * that the critical section for a swqueue is typically a fast list operation,
> + * and that the swqueue is localized to a single LLC, the spinlock does not
> + * seem to be contended; even on a heavily utilized host. struct swqueues are
> + * also cacheline aligned to prevent false sharing between CPUs manipulating
> + * swqueues in other LLCs.
> + *
> + * WHY
> + * ===
> + *
> + * As mentioned above, the main benefit of swqueue is that it enables more
> + * aggressive work conservation in the scheduler. This can benefit workloads
> + * that benefit more from CPU utilization than from L1/L2 cache locality.
> + *
> + * swqueues are segmented across LLCs both to avoid contention on the swqueue
> + * spinlock by minimizing the number of CPUs that could contend on it, as well
> + * as to strike a balance between work conservation, and L3 cache locality.
> + */
> +struct swqueue {
> + struct list_head list;
> + spinlock_t lock;
> +} ____cacheline_aligned;
> +
> #ifdef CONFIG_SMP
> -static void swqueue_enqueue(struct rq *rq, struct task_struct *p,
> - int enq_flags)
> -{}
> +static struct swqueue *rq_swqueue(struct rq *rq)
> +{
> + return rq->cfs.swqueue;
> +}
> +
> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +{
> + unsigned long flags;
> +
> + struct task_struct *p;
> +
> + spin_lock_irqsave(&swqueue->lock, flags);
> + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> + swqueue_node);
> + if (p)
> + list_del_init(&p->swqueue_node);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> +
> + return p;
> +}
> +
> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> + unsigned long flags;
> + struct swqueue *swqueue;
> + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> + /*
> + * Only enqueue the task in the shared wakequeue if:
> + *
> + * - SWQUEUE is enabled
> + * - The task is on the wakeup path
> + * - The task wasn't purposefully migrated to the current rq by
> + * select_task_rq()
> + * - The task isn't pinned to a specific CPU
> + */
> + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> + return;
> +
> + swqueue = rq_swqueue(rq);
> + spin_lock_irqsave(&swqueue->lock, flags);
> + list_add_tail(&p->swqueue_node, &swqueue->list);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +
> static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> {
> - return 0;
> + struct swqueue *swqueue;
> + struct task_struct *p = NULL;
> + struct rq *src_rq;
> + struct rq_flags src_rf;
> + int ret;
> +
> + swqueue = rq_swqueue(rq);
> + if (!list_empty(&swqueue->list))
> + p = swqueue_pull_task(swqueue);
> +
> + if (!p)
> + return 0;
> +
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + src_rq = task_rq_lock(p, &src_rf);
> +
> + if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
> +
> + if (src_rq->cpu != rq->cpu)
> + ret = 1;
> + else
> + ret = -1;
> +
> + task_rq_unlock(src_rq, p, &src_rf);
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> +
> + return ret;
> }
>
> static void swqueue_remove_task(struct task_struct *p)
> -{}
> +{
> + unsigned long flags;
> + struct swqueue *swqueue;
> +
> + if (!list_empty(&p->swqueue_node)) {
> + swqueue = rq_swqueue(task_rq(p));
> + spin_lock_irqsave(&swqueue->lock, flags);
> + list_del_init(&p->swqueue_node);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> + }
> +}
>
> /*
> * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -12839,3 +12973,34 @@ __init void init_sched_fair_class(void)
> #endif /* SMP */
>
> }
> +
> +__init void init_sched_fair_class_late(void)
> +{
> +#ifdef CONFIG_SMP
> + int i;
> + struct swqueue *swqueue;
> + struct rq *rq;
> + struct rq *llc_rq;
> +
> + for_each_possible_cpu(i) {
> + if (per_cpu(sd_llc_id, i) == i) {
> + llc_rq = cpu_rq(i);
> +
> + swqueue = kzalloc_node(sizeof(struct swqueue),
> + GFP_KERNEL, cpu_to_node(i));
> + INIT_LIST_HEAD(&swqueue->list);
> + spin_lock_init(&swqueue->lock);
> + llc_rq->cfs.swqueue = swqueue;
> + }
> + }
> +
> + for_each_possible_cpu(i) {
> + rq = cpu_rq(i);
> + llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
> +
> + if (rq == llc_rq)
> + continue;
> + rq->cfs.swqueue = llc_rq->cfs.swqueue;
> + }
> +#endif /* SMP */
> +}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5a86e9795731..daee5c64af87 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -575,6 +575,7 @@ struct cfs_rq {
> #endif
>
> #ifdef CONFIG_SMP
> + struct swqueue *swqueue;
> /*
> * CFS load tracking
> */
> @@ -2380,6 +2381,7 @@ extern void update_max_interval(void);
> extern void init_sched_dl_class(void);
> extern void init_sched_rt_class(void);
> extern void init_sched_fair_class(void);
> +extern void init_sched_fair_class_late(void);
>
> extern void reweight_task(struct task_struct *p, int prio);
>
> --
> 2.40.1
>

2023-06-19 06:40:03

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Hello David,


On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
[..snip..]

> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> + unsigned long flags;
> + struct swqueue *swqueue;
> + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> + /*
> + * Only enqueue the task in the shared wakequeue if:
> + *
> + * - SWQUEUE is enabled
> + * - The task is on the wakeup path
> + * - The task wasn't purposefully migrated to the current rq by
> + * select_task_rq()
> + * - The task isn't pinned to a specific CPU
> + */
> + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> + return;

In select_task_rq_fair(), having determined if the target of task
wakeup should be the task's previous CPU vs the waker's current CPU,
we spend quite a bit of time already to determine if there is an idle
core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
result of that scan or if no idle CPU exists, @rq corresponds to the
target CPU determined by wake_affine_idle()/wake_affine_weight().

So if the CPU of @rq is idle here, can we not simply return here?

Or if the idea is to avoid the scan for an idle core/CPU in
select_task_rq_fair(), then

Perhaps I am missing something...

> +
> + swqueue = rq_swqueue(rq);
> + spin_lock_irqsave(&swqueue->lock, flags);
> + list_add_tail(&p->swqueue_node, &swqueue->list);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +

--
Thanks and Regards
gautham.

2023-06-20 18:08:30

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> On Thu, Jun 15, 2023 at 06:26:05PM -0500, David Vernet wrote:
>
> > Ok, it seems that the issue is that I wasn't creating enough netperf
> > clients. I assumed that -n $(nproc) was sufficient. I was able to repro
>
> Yes that switch is confusing.
>
> > the contention on my 26 core / 52 thread skylake client as well:
> >
> >
>
> > Thanks for the help in getting the repro on my end.
>
> You are welcome.
>
> > So yes, there is certainly a scalability concern to bear in mind for
> > swqueue for LLCs with a lot of cores. If you have a lot of tasks quickly
> > e.g. blocking and waking on futexes in a tight loop, I expect a similar
> > issue would be observed.
> >
> > On the other hand, the issue did not occur on my 7950X. I also wasn't
>
> Using netperf/UDP_RR?

Correct

> > able to repro the contention on the Skylake if I ran with the default
> > netperf workload rather than UDP_RR (even with the additional clients).
>
> I also tried that on the 18cores/36threads/LLC Skylake and the contention
> is indeed much smaller than UDP_RR:
>
> 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> fewer cores per LLC than later Intel servers like Icelake and Sapphire
> Rapids and I expect things would be worse on those two machines.

I cannot reproduce this contention locally, even on a slightly larger
Skylake. Not really sure what to make of the difference here. Perhaps
it's because you're running with CONFIG_SCHED_CORE=y? What is the
change in throughput when you run the default workload on your SKL?

> > I didn't bother to take the mean of all of the throughput results
> > between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.
> >
> > So swqueue isn't ideal for every configuration, but I'll echo my
> > sentiment from [0] that this shouldn't on its own necessarily preclude
> > it from being merged given that it does help a large class of
> > configurations and workloads, and it's disabled by default.
> >
> > [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/
>
> I was wondering: does it make sense to do some divide on machines with
> big LLCs? Like converting the per-LLC swqueue to per-group swqueue where
> the group can be made of ~8 cpus of the same LLC. This will have a
> similar effect of reducing the number of CPUs in a single LLC so the
> scalability issue can hopefully be fixed while at the same time, it
> might still help some workloads. I realized this isn't ideal in that
> wakeup happens at LLC scale so the group thing may not fit very well
> here.
>
> Just a thought, feel free to ignore it if you don't think this is
> feasible :-)

That's certainly an idea we could explore, but my inclination would be
to keep everything at a per-LLC granularity. It makes it easier to
reason about performance; both in terms of work conservation per-LLC
(again, not every workload suffers from having large LLCs even if others
do, and halving the size of a swqueue in an LLC could harm other
workloads which benefit from the increased work conservation), and in
terms of contention. To the latter point, I think it would be difficult
to choose an LLC size that wasn't somewhat artificial and workload
specific. If someone has that requirement, I think sched_ext would be a
better alternative.

Thanks,
David

2023-06-20 20:14:39

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote:
> On Tue, 13 Jun 2023 at 07:20, David Vernet <[email protected]> wrote:
> >
> > Overview
> > ========
> >
> > The scheduler must constantly strike a balance between work
> > conservation, and avoiding costly migrations which harm performance due
> > to e.g. decreased cache locality. The matter is further complicated by
> > the topology of the system. Migrating a task between cores on the same
> > LLC may be more optimal than keeping a task local to the CPU, whereas
> > migrating a task between LLCs or NUMA nodes may tip the balance in the
> > other direction.
> >
> > With that in mind, while CFS is by and large mostly a work conserving
> > scheduler, there are certain instances where the scheduler will choose
> > to keep a task local to a CPU, when it would have been more optimal to
> > migrate it to an idle core.
> >
> > An example of such a workload is the HHVM / web workload at Meta. HHVM
> > is a VM that JITs Hack and PHP code in service of web requests. Like
> > other JIT / compilation workloads, it tends to be heavily CPU bound, and
> > exhibit generally poor cache locality. To try and address this, we set
> > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
> >
> > - migration_cost_ns -> 0
> > - latency_ns -> 20000000
> > - min_granularity_ns -> 10000000
> > - wakeup_granularity_ns -> 12000000
> >
> > These knobs are intended both to encourage the scheduler to be as work
> > conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> > running for relatively long time slices so as to avoid the overhead of
> > context switching (the other knobs). Collectively, these knobs provide a
> > substantial performance win; resulting in roughly a 20% improvement in
> > throughput. Worth noting, however, is that this improvement is _not_ at
> > full machine saturation.
> >
> > That said, even with these knobs, we noticed that CPUs were still going
> > idle even when the host was overcommitted. In response, we wrote the
> > "shared wakequeue" (swqueue) feature proposed in this patch set. The
> > idea behind swqueue is simple: it enables the scheduler to be
> > aggressively work conserving by placing a waking task into a per-LLC
> > FIFO queue that can be pulled from by another core in the LLC FIFO queue
> > which can then be pulled from before it goes idle.
>
> This seems to be just another newly idle load balance outside the current one !

Hi Vincent,

I can bring the swqueue logic inside of newidle_balance(). In hindsight
I think it makes more sense there.

To answer your point more generally though, yes, this is a new approach
to load balancing that eschews tracking migration costs, scanning
runqueues, etc in favor of optimizing for work conservation, and
tracking runnable tasks in a shared data structure. More on this below
in response to your other points.

>
> The knobs above are not the only thing preventing a rq to pull a new
> task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost
> stuff which might be one main root cause for one of your cpu not
> pulling a waiting task
>
> It's not clear in your explanation why fixing newly_idle_load_balance
> was not possible instead of adding outside code and what prevents
> newly_idle_load balance from picking a task in your case ?
>
> For example, have you tried to disable the early break because of avg_idle ?

The goal of swqueue is to enable work conservation using a shared, per-LLC data
structure. The shared data structure is really the salient point as to why just
updating newidle_balance() wouldn't achieve the same result. newidle_balance()
finds tasks to migrate by (sometimes) iterating over runqueues and scanning for
tasks. It's an expensive operation, which doesn't scale to large machines or to
being performed on every idle path. swqueue, on the other hand, is shared
across all cores in an LLC, so pulling a runnable task is simply a matter of a
spinlock acquire and a list operation. This doesn't scale to every single
configuration as Aaron pointed out in [0], but it works well in many other
configurations.

[0]: https://lore.kernel.org/all/20230614043529.GA1942@ziqianlu-dell/

Another consideration is that even if we could adjust newidle_balance()
to load balance well enough for our specific purpose, we're still
relying on heuristics to determine when it's appropriate to load
balance; and that will

a) Inevitably be suboptimal for certain workloads and configurations. For
example, if we got rid of the following check:

12021 for_each_domain(this_cpu, sd) {
12022 int continue_balancing = 1;
12023 u64 domain_cost;
12024
12025 update_next_balance(sd, &next_balance);
12026
12027 if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
12028 break;

we may end up load balancing too frequently, which could be a regression in and
of itself per the scalability concerns mentioned above. Or, for certain
workloads, we'll load balance too aggressively and miss out on L1/L2 locality.

b) Be harder for users to effectively tune or use. At OSPM, Peter made it quite
clear that users should not be tuning any of the debugfs knobs. Relying on
heuristics and knobs like this feels antithetical to that policy. swqueue feels
like a reasonable, self-contained alternative to that.

Thanks,
David

2023-06-20 20:32:58

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> Hello David,
>
>
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> [..snip..]
>
> > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > +{
> > + unsigned long flags;
> > + struct swqueue *swqueue;
> > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > + /*
> > + * Only enqueue the task in the shared wakequeue if:
> > + *
> > + * - SWQUEUE is enabled
> > + * - The task is on the wakeup path
> > + * - The task wasn't purposefully migrated to the current rq by
> > + * select_task_rq()
> > + * - The task isn't pinned to a specific CPU
> > + */
> > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > + return;
>
> In select_task_rq_fair(), having determined if the target of task
> wakeup should be the task's previous CPU vs the waker's current CPU,
> we spend quite a bit of time already to determine if there is an idle
> core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> result of that scan or if no idle CPU exists, @rq corresponds to the
> target CPU determined by wake_affine_idle()/wake_affine_weight().
>
> So if the CPU of @rq is idle here, can we not simply return here?

Hi Gautum,

Sorry, I'm not sure I'm quite following the issue you're pointing out.
We don't use swqueue if the task was migrated following
select_task_rq_fair(). That's the idea with us returning if the task was
migrated (the second conditional in that if). If I messed up that logic
somehow, it should be fixed.

> Or if the idea is to avoid the scan for an idle core/CPU in
> select_task_rq_fair(), then

No, swqueue_enqueue() is called from enqueue_task_fair(), not
select_task_rq_fair(). If select_task_rq_fair() (which is of course
called beforehand for a waking task) found an idle core/CPU, we don't
bother using swqueue, as mentioned above.

Thanks,
David

2023-06-20 22:28:15

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 20, 2023 at 02:54:23PM -0500, David Vernet wrote:
> On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote:
> > On Tue, 13 Jun 2023 at 07:20, David Vernet <[email protected]> wrote:
> > >
> > > Overview
> > > ========
> > >
> > > The scheduler must constantly strike a balance between work
> > > conservation, and avoiding costly migrations which harm performance due
> > > to e.g. decreased cache locality. The matter is further complicated by
> > > the topology of the system. Migrating a task between cores on the same
> > > LLC may be more optimal than keeping a task local to the CPU, whereas
> > > migrating a task between LLCs or NUMA nodes may tip the balance in the
> > > other direction.
> > >
> > > With that in mind, while CFS is by and large mostly a work conserving
> > > scheduler, there are certain instances where the scheduler will choose
> > > to keep a task local to a CPU, when it would have been more optimal to
> > > migrate it to an idle core.
> > >
> > > An example of such a workload is the HHVM / web workload at Meta. HHVM
> > > is a VM that JITs Hack and PHP code in service of web requests. Like
> > > other JIT / compilation workloads, it tends to be heavily CPU bound, and
> > > exhibit generally poor cache locality. To try and address this, we set
> > > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
> > >
> > > - migration_cost_ns -> 0
> > > - latency_ns -> 20000000
> > > - min_granularity_ns -> 10000000
> > > - wakeup_granularity_ns -> 12000000
> > >
> > > These knobs are intended both to encourage the scheduler to be as work
> > > conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> > > running for relatively long time slices so as to avoid the overhead of
> > > context switching (the other knobs). Collectively, these knobs provide a
> > > substantial performance win; resulting in roughly a 20% improvement in
> > > throughput. Worth noting, however, is that this improvement is _not_ at
> > > full machine saturation.
> > >
> > > That said, even with these knobs, we noticed that CPUs were still going
> > > idle even when the host was overcommitted. In response, we wrote the
> > > "shared wakequeue" (swqueue) feature proposed in this patch set. The
> > > idea behind swqueue is simple: it enables the scheduler to be
> > > aggressively work conserving by placing a waking task into a per-LLC
> > > FIFO queue that can be pulled from by another core in the LLC FIFO queue
> > > which can then be pulled from before it goes idle.
> >
> > This seems to be just another newly idle load balance outside the current one !
>
> Hi Vincent,
>
> I can bring the swqueue logic inside of newidle_balance(). In hindsight
> I think it makes more sense there.
>
> To answer your point more generally though, yes, this is a new approach
> to load balancing that eschews tracking migration costs, scanning
> runqueues, etc in favor of optimizing for work conservation, and
> tracking runnable tasks in a shared data structure. More on this below
> in response to your other points.
>
> >
> > The knobs above are not the only thing preventing a rq to pull a new
> > task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost
> > stuff which might be one main root cause for one of your cpu not
> > pulling a waiting task
> >
> > It's not clear in your explanation why fixing newly_idle_load_balance
> > was not possible instead of adding outside code and what prevents
> > newly_idle_load balance from picking a task in your case ?
> >
> > For example, have you tried to disable the early break because of avg_idle ?
>
> The goal of swqueue is to enable work conservation using a shared, per-LLC data
> structure. The shared data structure is really the salient point as to why just
> updating newidle_balance() wouldn't achieve the same result. newidle_balance()
> finds tasks to migrate by (sometimes) iterating over runqueues and scanning for
> tasks. It's an expensive operation, which doesn't scale to large machines or to
> being performed on every idle path. swqueue, on the other hand, is shared
> across all cores in an LLC, so pulling a runnable task is simply a matter of a
> spinlock acquire and a list operation. This doesn't scale to every single
> configuration as Aaron pointed out in [0], but it works well in many other
> configurations.

+1

Additionally, swqueue minimizes the number of races when two tasks pick
the same cpu for being woken up.
It's also serializing the wakeup order, which effectively pessimises tasks
which are woken very often and promotes cpu-bound tasks, which usually
positively affects the throughput.

I agree that swqueue can be seen as an alternative form of the idle load balancing,
I thought this way when I was working on it.
Can it replace it completely? Idk, maybe. But maybe we need some sort of a balancing
between multiple wait queues. E.g. if there are two swqueues on the system, one is empty
and another is long, load idle balancing can borrow some tasks from the "foreign" swqueue.
Just some ideas.

Thanks!

2023-06-21 02:54:38

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > is indeed much smaller than UDP_RR:
> >
> > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >
> > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > Rapids and I expect things would be worse on those two machines.
>
> I cannot reproduce this contention locally, even on a slightly larger

With netperf client number equal to nr_cpu?

> Skylake. Not really sure what to make of the difference here. Perhaps
> it's because you're running with CONFIG_SCHED_CORE=y? What is the

Yes I had that config on but I didn't tag any tasks or groups.

> change in throughput when you run the default workload on your SKL?

The throughput dropped a little with SWQUEUE:

avg_throughput native_queued_spin_lock_slowpath%
NO_SWQUEUE: 9528.061111111108 0.09%
SWQUEUE: 8984.369722222222 8.05%

avg_throughput: average throughput of all netperf client's throughput,
higher is better.

I run this workload like this:
"
netserver

for i in `seq 72`; do
netperf -l 60 -n 72 -6 &
done

sleep 30
perf record -ag -e cycles:pp -- sleep 5 &

wait
"
(the '-n 72' should be redundant but I just keep it there)

Thanks,
Aaron

2023-06-21 03:15:48

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > is indeed much smaller than UDP_RR:
> > >
> > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> > >
> > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > Rapids and I expect things would be worse on those two machines.
> >
> > I cannot reproduce this contention locally, even on a slightly larger
>
> With netperf client number equal to nr_cpu?

No, that confusion was only the first time around. See below though, I'm
not sure what insights are to be gained by continuing to tinker with
netperf runs.

> > Skylake. Not really sure what to make of the difference here. Perhaps
> > it's because you're running with CONFIG_SCHED_CORE=y? What is the
>
> Yes I had that config on but I didn't tag any tasks or groups.
>
> > change in throughput when you run the default workload on your SKL?
>
> The throughput dropped a little with SWQUEUE:
>
> avg_throughput native_queued_spin_lock_slowpath%
> NO_SWQUEUE: 9528.061111111108 0.09%
> SWQUEUE: 8984.369722222222 8.05%
>
> avg_throughput: average throughput of all netperf client's throughput,
> higher is better.
>
> I run this workload like this:
> "
> netserver
>
> for i in `seq 72`; do
> netperf -l 60 -n 72 -6 &
> done
>
> sleep 30
> perf record -ag -e cycles:pp -- sleep 5 &
>
> wait
> "
> (the '-n 72' should be redundant but I just keep it there)

At this point I'd say we've spent quite a bit of time discussing netperf
results. We understand where the contention is coming from, and yes,
we've established that there are going to be some configurations where
swqueue is not well suited. We've also established that there are
configurations where it will and does perform well, including on
netperf.

I'm not sure what we're hoping to gain by continuing to run various
netperf workloads with your specific parameters?

Thanks,
David

2023-06-21 05:07:45

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > > is indeed much smaller than UDP_RR:
> > > >
> > > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> > > >
> > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > > Rapids and I expect things would be worse on those two machines.
> > >
> > > I cannot reproduce this contention locally, even on a slightly larger
> >
> > With netperf client number equal to nr_cpu?
>
> No, that confusion was only the first time around. See below though, I'm
> not sure what insights are to be gained by continuing to tinker with
> netperf runs.
>
> > > Skylake. Not really sure what to make of the difference here. Perhaps
> > > it's because you're running with CONFIG_SCHED_CORE=y? What is the
> >
> > Yes I had that config on but I didn't tag any tasks or groups.
> >
> > > change in throughput when you run the default workload on your SKL?
> >
> > The throughput dropped a little with SWQUEUE:
> >
> > avg_throughput native_queued_spin_lock_slowpath%
> > NO_SWQUEUE: 9528.061111111108 0.09%
> > SWQUEUE: 8984.369722222222 8.05%
> >
> > avg_throughput: average throughput of all netperf client's throughput,
> > higher is better.
> >
> > I run this workload like this:
> > "
> > netserver
> >
> > for i in `seq 72`; do
> > netperf -l 60 -n 72 -6 &
> > done
> >
> > sleep 30
> > perf record -ag -e cycles:pp -- sleep 5 &
> >
> > wait
> > "
> > (the '-n 72' should be redundant but I just keep it there)
>
> At this point I'd say we've spent quite a bit of time discussing netperf
> results. We understand where the contention is coming from, and yes,
> we've established that there are going to be some configurations where
> swqueue is not well suited. We've also established that there are
> configurations where it will and does perform well, including on
> netperf.
>
> I'm not sure what we're hoping to gain by continuing to run various
> netperf workloads with your specific parameters?

I don't quite follow you.

I thought we were in the process of figuring out why for the same
workload(netperf/default_mode/nr_client=nr_cpu) on two similar
machines(both are Skylake) you saw no contention while I saw some so I
tried to be exact on how I run the workload.

If that's not the case, then yes there is no much value continuing this
discussion.

Thanks,
Aaron

2023-06-21 06:02:57

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote:
> On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
> > On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> > > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > > > is indeed much smaller than UDP_RR:
> > > > >
> > > > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> > > > >
> > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > > > Rapids and I expect things would be worse on those two machines.
> > > >
> > > > I cannot reproduce this contention locally, even on a slightly larger
> > >
> > > With netperf client number equal to nr_cpu?
> >
> > No, that confusion was only the first time around. See below though, I'm
> > not sure what insights are to be gained by continuing to tinker with
> > netperf runs.
> >
> > > > Skylake. Not really sure what to make of the difference here. Perhaps
> > > > it's because you're running with CONFIG_SCHED_CORE=y? What is the
> > >
> > > Yes I had that config on but I didn't tag any tasks or groups.
> > >
> > > > change in throughput when you run the default workload on your SKL?
> > >
> > > The throughput dropped a little with SWQUEUE:
> > >
> > > avg_throughput native_queued_spin_lock_slowpath%
> > > NO_SWQUEUE: 9528.061111111108 0.09%
> > > SWQUEUE: 8984.369722222222 8.05%
> > >
> > > avg_throughput: average throughput of all netperf client's throughput,
> > > higher is better.
> > >
> > > I run this workload like this:
> > > "
> > > netserver
> > >
> > > for i in `seq 72`; do
> > > netperf -l 60 -n 72 -6 &
> > > done
> > >
> > > sleep 30
> > > perf record -ag -e cycles:pp -- sleep 5 &
> > >
> > > wait
> > > "
> > > (the '-n 72' should be redundant but I just keep it there)
> >
> > At this point I'd say we've spent quite a bit of time discussing netperf
> > results. We understand where the contention is coming from, and yes,
> > we've established that there are going to be some configurations where
> > swqueue is not well suited. We've also established that there are
> > configurations where it will and does perform well, including on
> > netperf.
> >
> > I'm not sure what we're hoping to gain by continuing to run various
> > netperf workloads with your specific parameters?
>
> I don't quite follow you.
>
> I thought we were in the process of figuring out why for the same
> workload(netperf/default_mode/nr_client=nr_cpu) on two similar
> machines(both are Skylake) you saw no contention while I saw some so I
> tried to be exact on how I run the workload.

I just reran the workload on a 26 core / 52 thread Cooper Lake using
your exact command below and still don't observe any contention
whatsoever on the swqueue lock:

for i in `seq 72`; do
netperf -l 60 -n 72 -6 &
done

> If that's not the case, then yes there is no much value continuing this
> discussion.

We can iterate until we find out why we're seeing slightly different
contention (different configs, different amount of RAM, maybe you have
turbo enabled or other things running on your host, etc), but I don't
see what that would tell us that would meaningfully drive the discussion
forward for the patch set. Is there anything in particular you're trying
to determine and/or do you have reason to think that the contention
you're observing is due to something other than a lot of tasks waking up
at the same time, just as it was with UDP_RR?

Thanks,
David

2023-06-21 06:20:18

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 21, 2023 at 12:43:52AM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote:
> > On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
> > > On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
> > > > On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:
> > > > > On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> > > > > > I also tried that on the 18cores/36threads/LLC Skylake and the contention
> > > > > > is indeed much smaller than UDP_RR:
> > > > > >
> > > > > > 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> > > > > >
> > > > > > But I wouldn't say it's entirely gone. Also consider Skylake has a lot
> > > > > > fewer cores per LLC than later Intel servers like Icelake and Sapphire
> > > > > > Rapids and I expect things would be worse on those two machines.
> > > > >
> > > > > I cannot reproduce this contention locally, even on a slightly larger
> > > >
> > > > With netperf client number equal to nr_cpu?
> > >
> > > No, that confusion was only the first time around. See below though, I'm
> > > not sure what insights are to be gained by continuing to tinker with
> > > netperf runs.
> > >
> > > > > Skylake. Not really sure what to make of the difference here. Perhaps
> > > > > it's because you're running with CONFIG_SCHED_CORE=y? What is the
> > > >
> > > > Yes I had that config on but I didn't tag any tasks or groups.
> > > >
> > > > > change in throughput when you run the default workload on your SKL?
> > > >
> > > > The throughput dropped a little with SWQUEUE:
> > > >
> > > > avg_throughput native_queued_spin_lock_slowpath%
> > > > NO_SWQUEUE: 9528.061111111108 0.09%
> > > > SWQUEUE: 8984.369722222222 8.05%
> > > >
> > > > avg_throughput: average throughput of all netperf client's throughput,
> > > > higher is better.
> > > >
> > > > I run this workload like this:
> > > > "
> > > > netserver
> > > >
> > > > for i in `seq 72`; do
> > > > netperf -l 60 -n 72 -6 &
> > > > done
> > > >
> > > > sleep 30
> > > > perf record -ag -e cycles:pp -- sleep 5 &
> > > >
> > > > wait
> > > > "
> > > > (the '-n 72' should be redundant but I just keep it there)
> > >
> > > At this point I'd say we've spent quite a bit of time discussing netperf
> > > results. We understand where the contention is coming from, and yes,
> > > we've established that there are going to be some configurations where
> > > swqueue is not well suited. We've also established that there are
> > > configurations where it will and does perform well, including on
> > > netperf.
> > >
> > > I'm not sure what we're hoping to gain by continuing to run various
> > > netperf workloads with your specific parameters?
> >
> > I don't quite follow you.
> >
> > I thought we were in the process of figuring out why for the same
> > workload(netperf/default_mode/nr_client=nr_cpu) on two similar
> > machines(both are Skylake) you saw no contention while I saw some so I
> > tried to be exact on how I run the workload.
>
> I just reran the workload on a 26 core / 52 thread Cooper Lake using
> your exact command below and still don't observe any contention
> whatsoever on the swqueue lock:

Well, it's a puzzle to me.

But as you said below, I guess I'll just move on.

Thanks,
Aaron

>
> for i in `seq 72`; do
> netperf -l 60 -n 72 -6 &
> done
>
> > If that's not the case, then yes there is no much value continuing this
> > discussion.
>
> We can iterate until we find out why we're seeing slightly different
> contention (different configs, different amount of RAM, maybe you have
> turbo enabled or other things running on your host, etc), but I don't
> see what that would tell us that would meaningfully drive the discussion
> forward for the patch set. Is there anything in particular you're trying
> to determine and/or do you have reason to think that the contention
> you're observing is due to something other than a lot of tasks waking up
> at the same time, just as it was with UDP_RR?
>
> Thanks,
> David

2023-06-21 08:53:08

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Hello David,

On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote:
> On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> > Hello David,
> >
> >
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > [..snip..]
> >
> > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > > +{
> > > + unsigned long flags;
> > > + struct swqueue *swqueue;
> > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > > +
> > > + /*
> > > + * Only enqueue the task in the shared wakequeue if:
> > > + *
> > > + * - SWQUEUE is enabled
> > > + * - The task is on the wakeup path
> > > + * - The task wasn't purposefully migrated to the current rq by
> > > + * select_task_rq()
> > > + * - The task isn't pinned to a specific CPU
> > > + */
> > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > + return;
> >
> > In select_task_rq_fair(), having determined if the target of task
> > wakeup should be the task's previous CPU vs the waker's current CPU,
> > we spend quite a bit of time already to determine if there is an idle
> > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> > result of that scan or if no idle CPU exists, @rq corresponds to the
> > target CPU determined by wake_affine_idle()/wake_affine_weight().
> >
> > So if the CPU of @rq is idle here, can we not simply return here?
>
> Hi Gautum,
>
> Sorry, I'm not sure I'm quite following the issue you're pointing out.
> We don't use swqueue if the task was migrated following
> select_task_rq_fair(). That's the idea with us returning if the task was
> migrated (the second conditional in that if). If I messed up that logic
> somehow, it should be fixed.

Sorry, my bad. I see it now.

So as per this patch, the only time we enqueue the task on the shared
wakeup is if the target of try_to_wake_up() is the same CPU where the
task ran previously.

When wake_affine logic fails and the previous CPU is chosen as the
target, and when there are no other idle cores/threads in the LLC of
the previous CPU, it makes sense to queue the task on the
shared-wakequeue instead of on a busy previous CPU.

And when that previous CPU is idle, the try_to_wake_up() would have
woken it up via ttwu_queue(), so before going idle the next time it
will check the shared queue for the task and find it. We should be
good in this case.

Now, it is possible that select_task_rq_fair() ended up selecting the
waker's CPU as the target based on the
wake_affine_idle()/wake_affine_weight() logic. And if there is no idle
core/thread on the waker's LLC, the target would be the busy waker
CPU. In the case when the waker CPU is different from the task's
previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be
queued on the shared wakequeue and instead has to wait on the busy
waker CPU.

I wonder if it makes sense to enqueue the task on the shared wakequeue
in this scenario as well.

>
> > Or if the idea is to avoid the scan for an idle core/CPU in
> > select_task_rq_fair(), then
>
> No, swqueue_enqueue() is called from enqueue_task_fair(), not
> select_task_rq_fair(). If select_task_rq_fair() (which is of course
> called beforehand for a waking task) found an idle core/CPU, we don't
> bother using swqueue, as mentioned above.

Got it. Thanks!

>
> Thanks,
> David

--
Thanks and Regards
gautham.

2023-06-21 13:20:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: Make migrate_task_to() take any task

On Tue, Jun 13, 2023 at 12:20:02AM -0500, David Vernet wrote:
> The migrate_task_to() function exposed from kernel/sched/core.c migrates
> the current task, which is silently assumed to also be its first
> argument, to the specified CPU. The function uses stop_one_cpu() to
> migrate the task to the target CPU, which won't work if @p is not the
> current task as the stop_one_cpu() callback isn't invoked on remote
> CPUs.
>
> While this operation is useful for task_numa_migrate() in fair.c, it
> would be useful if __migrate_task() in core.c was given external
> linkage, as it actually can be used to migrate any task to a CPU.
>
> This patch therefore:
> 1. Renames the existing migrate_task_to() be called
> numa_migrate_current_task_to().
> 2. Renames __migrate_task() to migrate_task_to(), gives it global
> linkage, and updates all callers accordingly.
>
> A follow-on patch will call the new migrate_task_to() from fair.c when
> migrating a task in a shared wakequeue to a remote CPU.

I would suggest simply exposing move_queued_task(). You can actually do
is_cpu_allowed() before you do the whole lock dance, or even before
pull.

And then you don't have to create silly long function names either.

2023-06-21 13:24:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched/fair: Add SWQUEUE sched feature and skeleton calls

On Tue, Jun 13, 2023 at 12:20:03AM -0500, David Vernet wrote:

I can't help but read this thing as software-queue :/ Can we please pick
a better name?

> @@ -6368,6 +6390,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!task_new)
> update_overutilized_status(rq);
>
> + if (sched_feat(SWQUEUE))
> + swqueue_enqueue(rq, p, flags);
> +
> enqueue_throttle:
> assert_list_leaf_cfs_rq(rq);
>
> @@ -6449,6 +6474,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> dequeue_throttle:
> util_est_update(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> +
> + if (sched_feat(SWQUEUE))
> + swqueue_remove_task(p);
> }
>
> #ifdef CONFIG_SMP

_enqueue() should obviously be complemented by _dequeue(). This naming
is offensive :-)

> @@ -8155,12 +8183,18 @@ done: __maybe_unused;
>
> update_misfit_status(p, rq);
>
> + if (sched_feat(SWQUEUE))
> + swqueue_remove_task(p);
> +
> return p;
>
> idle:
> if (!rf)
> return NULL;
>
> + if (sched_feat(SWQUEUE) && swqueue_pick_next_task(rq, rf))
> + return RETRY_TASK;
> +
> new_tasks = newidle_balance(rq, rf);
>
> /*

That's either not correct or insufficient or both.

It fails to consider the whole core-scheduling mess. But it also fails
to consider the regular (non optimized) pick case that should do newidle
through put_prev_task_balance() -> balance_fair().

I think placing the pick call in newidle_balance() itself is the
simplest solution.


2023-06-21 14:34:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> +struct swqueue {
> + struct list_head list;
> + spinlock_t lock;
> +} ____cacheline_aligned;

I'm thinking you can shard this just fine, it makes that pop() needs to
iterate all shards, but that shouldn't be a problem, and it would still
only need to take a single lock.

I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.

> #ifdef CONFIG_SMP

> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +{
> + unsigned long flags;
> +
> + struct task_struct *p;
> +
> + spin_lock_irqsave(&swqueue->lock, flags);
> + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> + swqueue_node);
> + if (p)
> + list_del_init(&p->swqueue_node);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> +
> + return p;
> +}

Would this not normally be called pop() or somesuch?

> +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> + unsigned long flags;
> + struct swqueue *swqueue;
> + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> +
> + /*
> + * Only enqueue the task in the shared wakequeue if:
> + *
> + * - SWQUEUE is enabled
> + * - The task is on the wakeup path
> + * - The task wasn't purposefully migrated to the current rq by
> + * select_task_rq()
> + * - The task isn't pinned to a specific CPU
> + */
> + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> + return;

Elsewhere you mentioned heuristics, this smells like them. This and the
is_cpus_allowed() thing makes you loose plenty of opportunities.

> + swqueue = rq_swqueue(rq);
> + spin_lock_irqsave(&swqueue->lock, flags);
> + list_add_tail(&p->swqueue_node, &swqueue->list);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> +}
> +
> static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> {
> - return 0;
> + struct swqueue *swqueue;
> + struct task_struct *p = NULL;
> + struct rq *src_rq;
> + struct rq_flags src_rf;
> + int ret;
> +
> + swqueue = rq_swqueue(rq);
> + if (!list_empty(&swqueue->list))
> + p = swqueue_pull_task(swqueue);
> +
> + if (!p)
> + return 0;

At this point you can do the whole is_cpu_allowed() and avoid the whole
lock dance if not.

> +
> + rq_unpin_lock(rq, rf);
> + raw_spin_rq_unlock(rq);
> +
> + src_rq = task_rq_lock(p, &src_rf);
> +
> + if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));

And then this becomes move_queued_task().

> + if (src_rq->cpu != rq->cpu)
> + ret = 1;
> + else
> + ret = -1;
> +
> + task_rq_unlock(src_rq, p, &src_rf);
> +
> + raw_spin_rq_lock(rq);
> + rq_repin_lock(rq, rf);
> +
> + return ret;
> }
>
> static void swqueue_remove_task(struct task_struct *p)
> +{
> + unsigned long flags;
> + struct swqueue *swqueue;
> +
> + if (!list_empty(&p->swqueue_node)) {
> + swqueue = rq_swqueue(task_rq(p));
> + spin_lock_irqsave(&swqueue->lock, flags);
> + list_del_init(&p->swqueue_node);
> + spin_unlock_irqrestore(&swqueue->lock, flags);
> + }
> +}

dequeue()

2023-06-21 14:36:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Tue, Jun 20, 2023 at 02:54:23PM -0500, David Vernet wrote:

> Another consideration is that even if we could adjust newidle_balance()
> to load balance well enough for our specific purpose, we're still
> relying on heuristics to determine when it's appropriate to load
> balance; and that will

Your thing has heuristics too.

But if we're going to do this, then it should replace newidle <= LLC
domains.

2023-06-21 20:56:07

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > +struct swqueue {
> > + struct list_head list;
> > + spinlock_t lock;
> > +} ____cacheline_aligned;
>
> I'm thinking you can shard this just fine, it makes that pop() needs to
> iterate all shards, but that shouldn't be a problem, and it would still
> only need to take a single lock.

Is the idea here to have multiple sharded queues per LLC, with a single
lock protecting them? Assuming so, is the idea that it would avoid
bouncing the list heads amongst all of the cores' cachelines? I could
see that being useful in some scenarios, but it also feels a bit
complicated for what it gives you. If you have tasks being pulled
quickly I don't think that will help much because the list heads will
just bounce to the pullers. Also, if the lock is already heavily
contended, it seems possible that the cores inside a single shard could
still bounce the head back and forth amongst them, or the cache line
bouncing will be a small-order overhead compared to the lock itself.

Or did I misunderstand your suggestion entirely?

> I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.
>
> > #ifdef CONFIG_SMP
>
> > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > +{
> > + unsigned long flags;
> > +
> > + struct task_struct *p;
> > +
> > + spin_lock_irqsave(&swqueue->lock, flags);
> > + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > + swqueue_node);
> > + if (p)
> > + list_del_init(&p->swqueue_node);
> > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > +
> > + return p;
> > +}
>
> Would this not normally be called pop() or somesuch?

Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
swqueue_enqueue() seem like the most canonical. Let me know if you have another
preference.

>
> > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > +{
> > + unsigned long flags;
> > + struct swqueue *swqueue;
> > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > +
> > + /*
> > + * Only enqueue the task in the shared wakequeue if:
> > + *
> > + * - SWQUEUE is enabled
> > + * - The task is on the wakeup path
> > + * - The task wasn't purposefully migrated to the current rq by
> > + * select_task_rq()
> > + * - The task isn't pinned to a specific CPU
> > + */
> > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > + return;
>
> Elsewhere you mentioned heuristics, this smells like them. This and the
> is_cpus_allowed() thing makes you loose plenty of opportunities.

Yeah fair enough, these certainly are heuristics as well.

I thought it best to try and avoid swqueue getting in the way of
select_task_rq_fair() (at least to start out with), but we could always
remove that and run other experiments to see how it does.

> > + swqueue = rq_swqueue(rq);
> > + spin_lock_irqsave(&swqueue->lock, flags);
> > + list_add_tail(&p->swqueue_node, &swqueue->list);
> > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > +}
> > +
> > static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> > {
> > - return 0;
> > + struct swqueue *swqueue;
> > + struct task_struct *p = NULL;
> > + struct rq *src_rq;
> > + struct rq_flags src_rf;
> > + int ret;
> > +
> > + swqueue = rq_swqueue(rq);
> > + if (!list_empty(&swqueue->list))
> > + p = swqueue_pull_task(swqueue);
> > +
> > + if (!p)
> > + return 0;
>
> At this point you can do the whole is_cpu_allowed() and avoid the whole
> lock dance if not.

Good idea, will incorporate into the next iteration.

> > +
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
> > +
> > + src_rq = task_rq_lock(p, &src_rf);
> > +
> > + if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> > + src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
>
> And then this becomes move_queued_task().

Yep, will make this change per your suggestion in [0].

[0]: https://lore.kernel.org/all/[email protected]/

> > + if (src_rq->cpu != rq->cpu)
> > + ret = 1;
> > + else
> > + ret = -1;
> > +
> > + task_rq_unlock(src_rq, p, &src_rf);
> > +
> > + raw_spin_rq_lock(rq);
> > + rq_repin_lock(rq, rf);
> > +
> > + return ret;
> > }
> >
> > static void swqueue_remove_task(struct task_struct *p)
> > +{
> > + unsigned long flags;
> > + struct swqueue *swqueue;
> > +
> > + if (!list_empty(&p->swqueue_node)) {
> > + swqueue = rq_swqueue(task_rq(p));
> > + spin_lock_irqsave(&swqueue->lock, flags);
> > + list_del_init(&p->swqueue_node);
> > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > + }
> > +}
>
> dequeue()

Ack

2023-06-22 02:05:48

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 21, 2023 at 01:47:00PM +0530, Gautham R. Shenoy wrote:
> Hello David,
> On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote:
> > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> > > Hello David,
> > >
> > >
> > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > [..snip..]
> > >
> > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > > > +{
> > > > + unsigned long flags;
> > > > + struct swqueue *swqueue;
> > > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > > > +
> > > > + /*
> > > > + * Only enqueue the task in the shared wakequeue if:
> > > > + *
> > > > + * - SWQUEUE is enabled
> > > > + * - The task is on the wakeup path
> > > > + * - The task wasn't purposefully migrated to the current rq by
> > > > + * select_task_rq()
> > > > + * - The task isn't pinned to a specific CPU
> > > > + */
> > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > + return;
> > >
> > > In select_task_rq_fair(), having determined if the target of task
> > > wakeup should be the task's previous CPU vs the waker's current CPU,
> > > we spend quite a bit of time already to determine if there is an idle
> > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> > > result of that scan or if no idle CPU exists, @rq corresponds to the
> > > target CPU determined by wake_affine_idle()/wake_affine_weight().
> > >
> > > So if the CPU of @rq is idle here, can we not simply return here?
> >
> > Hi Gautum,
> >
> > Sorry, I'm not sure I'm quite following the issue you're pointing out.
> > We don't use swqueue if the task was migrated following
> > select_task_rq_fair(). That's the idea with us returning if the task was
> > migrated (the second conditional in that if). If I messed up that logic
> > somehow, it should be fixed.
>
> Sorry, my bad. I see it now.
>
> So as per this patch, the only time we enqueue the task on the shared
> wakeup is if the target of try_to_wake_up() is the same CPU where the
> task ran previously.
>
> When wake_affine logic fails and the previous CPU is chosen as the
> target, and when there are no other idle cores/threads in the LLC of
> the previous CPU, it makes sense to queue the task on the
> shared-wakequeue instead of on a busy previous CPU.
>
> And when that previous CPU is idle, the try_to_wake_up() would have
> woken it up via ttwu_queue(), so before going idle the next time it
> will check the shared queue for the task and find it. We should be
> good in this case.
>
> Now, it is possible that select_task_rq_fair() ended up selecting the
> waker's CPU as the target based on the
> wake_affine_idle()/wake_affine_weight() logic. And if there is no idle
> core/thread on the waker's LLC, the target would be the busy waker
> CPU. In the case when the waker CPU is different from the task's
> previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be
> queued on the shared wakequeue and instead has to wait on the busy
> waker CPU.
>
> I wonder if it makes sense to enqueue the task on the shared wakequeue
> in this scenario as well.

Hello Gautham,

That's a good point. My original intention with opting out of using
swqueue if select_task_rq_fair() caused us to migrate is so that it
wouldn't interfere with decisions made with other select_task_rq_fair()
heuristics like wake_affine_*(). Basically just minimizing the possible
impact of swqueue. That said, I think it probably does make sense to
just enqueue in the swqueue regardless of whether ENQUEUE_MIGRATED is
set. One of the main goals of swqueue is work conservation, and in
hindsight it does feel somewhat artificial to add a heuristic that works
against that.

I'd like to hear what others think. In my opinion it's worth at least
running some tests on workloads that heavily utilize the CPU such as
kernel compile, and seeing what the outcomes are.

Thanks,
David

2023-06-22 03:11:31

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: Make migrate_task_to() take any task

On Wed, Jun 21, 2023 at 03:04:39PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:20:02AM -0500, David Vernet wrote:
> > The migrate_task_to() function exposed from kernel/sched/core.c migrates
> > the current task, which is silently assumed to also be its first
> > argument, to the specified CPU. The function uses stop_one_cpu() to
> > migrate the task to the target CPU, which won't work if @p is not the
> > current task as the stop_one_cpu() callback isn't invoked on remote
> > CPUs.
> >
> > While this operation is useful for task_numa_migrate() in fair.c, it
> > would be useful if __migrate_task() in core.c was given external
> > linkage, as it actually can be used to migrate any task to a CPU.
> >
> > This patch therefore:
> > 1. Renames the existing migrate_task_to() be called
> > numa_migrate_current_task_to().
> > 2. Renames __migrate_task() to migrate_task_to(), gives it global
> > linkage, and updates all callers accordingly.
> >
> > A follow-on patch will call the new migrate_task_to() from fair.c when
> > migrating a task in a shared wakequeue to a remote CPU.
>
> I would suggest simply exposing move_queued_task(). You can actually do
> is_cpu_allowed() before you do the whole lock dance, or even before
> pull.

Good call, I'll make that improvement for v2. Also, ack on exposing
move_queued_task(). That makes more sense.

> And then you don't have to create silly long function names either.

Yeah, the function name I chose is admittedly crap, but IMO it is pretty
unintuitive that p == current. But yeah, it'd probably be better to just
do something like remove the @p parameter and use current directly. That
whole call chain passes p == current around though, so *shrug*.

2023-06-22 09:34:41

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 21, 2023 at 08:43:29PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 01:47:00PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
> > On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote:
> > > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> > > > Hello David,
> > > >
> > > >
> > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > > [..snip..]
> > > >
> > > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > > > > +{
> > > > > + unsigned long flags;
> > > > > + struct swqueue *swqueue;
> > > > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > > > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > > > > +
> > > > > + /*
> > > > > + * Only enqueue the task in the shared wakequeue if:
> > > > > + *
> > > > > + * - SWQUEUE is enabled
> > > > > + * - The task is on the wakeup path
> > > > > + * - The task wasn't purposefully migrated to the current rq by
> > > > > + * select_task_rq()
> > > > > + * - The task isn't pinned to a specific CPU
> > > > > + */
> > > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > > + return;
> > > >
> > > > In select_task_rq_fair(), having determined if the target of task
> > > > wakeup should be the task's previous CPU vs the waker's current CPU,
> > > > we spend quite a bit of time already to determine if there is an idle
> > > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> > > > result of that scan or if no idle CPU exists, @rq corresponds to the
> > > > target CPU determined by wake_affine_idle()/wake_affine_weight().
> > > >
> > > > So if the CPU of @rq is idle here, can we not simply return here?
> > >
> > > Hi Gautum,
> > >
> > > Sorry, I'm not sure I'm quite following the issue you're pointing out.
> > > We don't use swqueue if the task was migrated following
> > > select_task_rq_fair(). That's the idea with us returning if the task was
> > > migrated (the second conditional in that if). If I messed up that logic
> > > somehow, it should be fixed.
> >
> > Sorry, my bad. I see it now.
> >
> > So as per this patch, the only time we enqueue the task on the shared
> > wakeup is if the target of try_to_wake_up() is the same CPU where the
> > task ran previously.
> >
> > When wake_affine logic fails and the previous CPU is chosen as the
> > target, and when there are no other idle cores/threads in the LLC of
> > the previous CPU, it makes sense to queue the task on the
> > shared-wakequeue instead of on a busy previous CPU.
> >
> > And when that previous CPU is idle, the try_to_wake_up() would have
> > woken it up via ttwu_queue(), so before going idle the next time it
> > will check the shared queue for the task and find it. We should be
> > good in this case.
> >
> > Now, it is possible that select_task_rq_fair() ended up selecting the
> > waker's CPU as the target based on the
> > wake_affine_idle()/wake_affine_weight() logic. And if there is no idle
> > core/thread on the waker's LLC, the target would be the busy waker
> > CPU. In the case when the waker CPU is different from the task's
> > previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be
> > queued on the shared wakequeue and instead has to wait on the busy
> > waker CPU.
> >
> > I wonder if it makes sense to enqueue the task on the shared wakequeue
> > in this scenario as well.
>
> Hello Gautham,
>
> That's a good point. My original intention with opting out of using
> swqueue if select_task_rq_fair() caused us to migrate is so that it
> wouldn't interfere with decisions made with other select_task_rq_fair()
> heuristics like wake_affine_*(). Basically just minimizing the possible
> impact of swqueue.
> That said, I think it probably does make sense to
> just enqueue in the swqueue regardless of whether ENQUEUE_MIGRATED is
> set. One of the main goals of swqueue is work conservation, and in
> hindsight it does feel somewhat artificial to add a heuristic that works
> against that.

In that case we can perhaps have an explicit flag that is passed by
try_to_wake_up() when it cannot find an idle CPU and the chosen target
is just a fallback. The task gets enqueued on the swqueue of the
target only in such cases. Something like the following entirely
untested:

---------------------------- x8----------------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b64fec55a381..38005262a7fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -910,6 +910,12 @@ struct task_struct {
*/
unsigned sched_remote_wakeup:1;

+ /*
+ * Bit used by select_idle_sibling() to signal enqueuing the
+ * task on a shared wakequeue.
+ */
+ unsigned sched_add_to_swq:1;
+
/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
unsigned in_iowait:1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e311d1c3b992..f4246c33f3c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
{
unsigned long flags;
struct swqueue *swqueue;
- bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
- bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;

/*
* Only enqueue the task in the shared wakequeue if:
*
* - SWQUEUE is enabled
- * - The task is on the wakeup path
- * - The task wasn't purposefully migrated to the current rq by
- * select_task_rq()
- * - The task isn't pinned to a specific CPU
+ * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
*/
- if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+ if (!READ_ONCE(p->sched_add_to_swq))
return;

+ WRITE_ONCE(p->sched_add_to_swq, 0);
swqueue = rq_swqueue(rq);
spin_lock_irqsave(&swqueue->lock, flags);
list_add_tail(&p->swqueue_node, &swqueue->list);
@@ -7361,6 +7357,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;

+ /*
+ * No idle sibling was found. Ok to queue this task on the
+ * shared wakequeue of the target.
+ */
+ WRITE_ONCE(p->sched_add_to_swq, 1);
return target;
}


>
> I'd like to hear what others think. In my opinion it's worth at least
> running some tests on workloads that heavily utilize the CPU such as
> kernel compile, and seeing what the outcomes are.

I will try and get some numbers for such workloads on our setup over
this weekend.

>
> Thanks,
> David

--
Thanks and Regards
gautham.

2023-06-22 11:04:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> {
> unsigned long flags;
> struct swqueue *swqueue;
> - bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> - bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
>
> /*
> * Only enqueue the task in the shared wakequeue if:
> *
> * - SWQUEUE is enabled
> - * - The task is on the wakeup path
> - * - The task wasn't purposefully migrated to the current rq by
> - * select_task_rq()
> - * - The task isn't pinned to a specific CPU
> + * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
> */
> - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)

You lost the nr_cpus_allowed == 1 case, that actually made sense, except
he didn't don't have a hook in set_cpus_allowed() to fix it back up.

> + if (!READ_ONCE(p->sched_add_to_swq))
> return;

So suppose we fill all CPUs with tasks T_{0..n-n1}, sis finds them a
nice idle cpu and we don't get queued.

Then wake up another n tasks T_{n...2n-1}, none of them find an idle
CPU, as per the above them all taken. These all do get queued.

However, suppose T>=n does a wakeup-preemption, and we end up with T>=n
running while T<n get queued on the rq, but not the 'queue' (I so hate
the swqueue name).

Then one CPU has both it's tasks go 'away' and becomes idle.

At this point the 'queue' contains only running tasks that are not
migratable and all the tasks that could be migrated are not on the queue
-- IOW it is completely useless.

Is this the intention?



2023-06-22 11:22:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > +struct swqueue {
> > > + struct list_head list;
> > > + spinlock_t lock;
> > > +} ____cacheline_aligned;
> >
> > I'm thinking you can shard this just fine, it makes that pop() needs to
> > iterate all shards, but that shouldn't be a problem, and it would still
> > only need to take a single lock.
>
> Is the idea here to have multiple sharded queues per LLC, with a single
> lock protecting them?

No, each shard will have it's own lock, each shard it's own cacheline.

> Or did I misunderstand your suggestion entirely?

Yup, so each shard is basically the above struct, whole cacheline with a
lock and list. Then on enqueue we hash to a shard (possibly based on cpu
-- this gives the least amount of bounces) and stick it on.

Then on pop we iterate the shards, the first non-empty shard gets the
lock taken, and pop'ed. If you start iteration based on the cpu->shard
map used for enqueue you can even get a sense of 'locality' depending on
the mapping.

So pop is O(n) in shards, but on average will only take a single lock --
there is an obvious race where it might see a non-empty queue, but race
on lock and find the queue empty once acquired, but meh.

> > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.
> >
> > > #ifdef CONFIG_SMP
> >
> > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > > +{
> > > + unsigned long flags;
> > > +
> > > + struct task_struct *p;
> > > +
> > > + spin_lock_irqsave(&swqueue->lock, flags);
> > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > > + swqueue_node);
> > > + if (p)
> > > + list_del_init(&p->swqueue_node);
> > > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > > +
> > > + return p;
> > > +}
> >
> > Would this not normally be called pop() or somesuch?
>
> Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
> swqueue_enqueue() seem like the most canonical. Let me know if you have another
> preference.

Well, there's two layers intermixed here, which is, I think, what gave
rise to the whole wonky naming.

If you implement the queue primitives as: push, pop, del
and then build, using them, the scheduler primitives: enqueue, dequeue,
pick, things should become saner.

> > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > + return;
> >
> > Elsewhere you mentioned heuristics, this smells like them. This and the
> > is_cpus_allowed() thing makes you loose plenty of opportunities.
>
> Yeah fair enough, these certainly are heuristics as well.
>
> I thought it best to try and avoid swqueue getting in the way of
> select_task_rq_fair() (at least to start out with), but we could always
> remove that and run other experiments to see how it does.

So, part of the problem is that you might be hooking at the wrong level
(see also my earlier email about the queue only containing running tasks
and none of the ready tasks).

If you were to hook at the __{en,de}queue_entity() level you'll
automagically dequeue running tasks and enqueue any ready tasks. OTOH
you get to deal with the problem that not all entities are tasks, but
that shouldn't be too hard.

Also, you end up with more enqueue/dequeue -- so that might hurt.

Also, at this point you're nr_cpus/N away from simply scanning the
runqueues and pulling things from non-empty runqueues. But for large
nr_cpus and smallish N it might be a win over-all. Dunno..



2023-06-22 14:55:17

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Thu, Jun 22, 2023 at 12:58:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote:
> > On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > > +struct swqueue {
> > > > + struct list_head list;
> > > > + spinlock_t lock;
> > > > +} ____cacheline_aligned;
> > >
> > > I'm thinking you can shard this just fine, it makes that pop() needs to
> > > iterate all shards, but that shouldn't be a problem, and it would still
> > > only need to take a single lock.
> >
> > Is the idea here to have multiple sharded queues per LLC, with a single
> > lock protecting them?
>
> No, each shard will have it's own lock, each shard it's own cacheline.
>
> > Or did I misunderstand your suggestion entirely?
>
> Yup, so each shard is basically the above struct, whole cacheline with a
> lock and list. Then on enqueue we hash to a shard (possibly based on cpu
> -- this gives the least amount of bounces) and stick it on.
>
> Then on pop we iterate the shards, the first non-empty shard gets the
> lock taken, and pop'ed. If you start iteration based on the cpu->shard
> map used for enqueue you can even get a sense of 'locality' depending on
> the mapping.

That's nifty. Even if we don't get locality from it, it seems prudent to
do to avoid contention.

> So pop is O(n) in shards, but on average will only take a single lock --

Ah, so it's 1 lock on average -- thanks for clarifying.

I like this idea and will implement it for v2. I'm not keen on the idea
of artificially having multiple independent swqueues (yes, I'll think of
a less evil name) per-LLC to avoid contending, but sharding like this
seems like a simple but effective solution to the scalability issue;
assuming having a sufficiently high N doesn't slow down the pop path by
too much as you pointed out below.

> there is an obvious race where it might see a non-empty queue, but race
> on lock and find the queue empty once acquired, but meh.

Yep, I would be surprised if this race caused a problem in practice.
Especially if we spread the poppers by starting iteration based on the
cpu->shared map as you suggested.

> > > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.

I'll try out a few combinations and see what works.

> > >
> > > > #ifdef CONFIG_SMP
> > >
> > > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > > > +{
> > > > + unsigned long flags;
> > > > +
> > > > + struct task_struct *p;
> > > > +
> > > > + spin_lock_irqsave(&swqueue->lock, flags);
> > > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > > > + swqueue_node);
> > > > + if (p)
> > > > + list_del_init(&p->swqueue_node);
> > > > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > > > +
> > > > + return p;
> > > > +}
> > >
> > > Would this not normally be called pop() or somesuch?
> >
> > Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
> > swqueue_enqueue() seem like the most canonical. Let me know if you have another
> > preference.
>
> Well, there's two layers intermixed here, which is, I think, what gave
> rise to the whole wonky naming.
>
> If you implement the queue primitives as: push, pop, del
> and then build, using them, the scheduler primitives: enqueue, dequeue,
> pick, things should become saner.

Right, will do

> > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > + return;
> > >
> > > Elsewhere you mentioned heuristics, this smells like them. This and the
> > > is_cpus_allowed() thing makes you loose plenty of opportunities.
> >
> > Yeah fair enough, these certainly are heuristics as well.
> >
> > I thought it best to try and avoid swqueue getting in the way of
> > select_task_rq_fair() (at least to start out with), but we could always
> > remove that and run other experiments to see how it does.
>
> So, part of the problem is that you might be hooking at the wrong level
> (see also my earlier email about the queue only containing running tasks
> and none of the ready tasks).
>
> If you were to hook at the __{en,de}queue_entity() level you'll
> automagically dequeue running tasks and enqueue any ready tasks. OTOH
> you get to deal with the problem that not all entities are tasks, but
> that shouldn't be too hard.
>
> Also, you end up with more enqueue/dequeue -- so that might hurt.

If this doesn't cause untenable contention then it seems preferable if
the overall goal is work conservation. Using your shard idea would
hopefully mitigate that concern somewhat as well. I'll try this out for
v2 and compare it to keeping the enqueues to only be on the wakeup path.

If there's too much contention we can always just go back to the wakeup
case, but I'm hopeful this works out.

> Also, at this point you're nr_cpus/N away from simply scanning the
> runqueues and pulling things from non-empty runqueues. But for large
> nr_cpus and smallish N it might be a win over-all. Dunno..

Given that we're not seeing any contention on smaller LLCs but still
observing perf wins, I'm optimistic.

Thanks,
David

2023-06-22 14:59:05

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched/fair: Add SWQUEUE sched feature and skeleton calls

On Wed, Jun 21, 2023 at 02:49:33PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:20:03AM -0500, David Vernet wrote:
>
> I can't help but read this thing as software-queue :/ Can we please pick
> a better name?

Yes, I'll think of a better one. Suggestions welcome if you have a
preference.

> > @@ -6368,6 +6390,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!task_new)
> > update_overutilized_status(rq);
> >
> > + if (sched_feat(SWQUEUE))
> > + swqueue_enqueue(rq, p, flags);
> > +
> > enqueue_throttle:
> > assert_list_leaf_cfs_rq(rq);
> >
> > @@ -6449,6 +6474,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > dequeue_throttle:
> > util_est_update(&rq->cfs, p, task_sleep);
> > hrtick_update(rq);
> > +
> > + if (sched_feat(SWQUEUE))
> > + swqueue_remove_task(p);
> > }
> >
> > #ifdef CONFIG_SMP
>
> _enqueue() should obviously be complemented by _dequeue(). This naming
> is offensive :-)

Ack

> > @@ -8155,12 +8183,18 @@ done: __maybe_unused;
> >
> > update_misfit_status(p, rq);
> >
> > + if (sched_feat(SWQUEUE))
> > + swqueue_remove_task(p);
> > +
> > return p;
> >
> > idle:
> > if (!rf)
> > return NULL;
> >
> > + if (sched_feat(SWQUEUE) && swqueue_pick_next_task(rq, rf))
> > + return RETRY_TASK;
> > +
> > new_tasks = newidle_balance(rq, rf);
> >
> > /*
>
> That's either not correct or insufficient or both.
>
> It fails to consider the whole core-scheduling mess. But it also fails
> to consider the regular (non optimized) pick case that should do newidle
> through put_prev_task_balance() -> balance_fair().
>
> I think placing the pick call in newidle_balance() itself is the
> simplest solution.

Yep, not sure where I went off the rails here -- the pick call clearly
belongs in newidle_balance(). I'll also make sure we handle core sched
correctly as well. Thanks for pointing those out.

2023-06-22 16:13:57

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On 6/21/23 2:03 AM, Aaron Lu wrote:
> On Wed, Jun 21, 2023 at 12:43:52AM -0500, David Vernet wrote:
>> On Wed, Jun 21, 2023 at 12:54:16PM +0800, Aaron Lu wrote:
>>> On Tue, Jun 20, 2023 at 09:43:00PM -0500, David Vernet wrote:
>>>> On Wed, Jun 21, 2023 at 10:35:34AM +0800, Aaron Lu wrote:
>>>>> On Tue, Jun 20, 2023 at 12:36:26PM -0500, David Vernet wrote:

[ ... ]

>>>> I'm not sure what we're hoping to gain by continuing to run various
>>>> netperf workloads with your specific parameters?
>>>
>>> I don't quite follow you.
>>>
>>> I thought we were in the process of figuring out why for the same
>>> workload(netperf/default_mode/nr_client=nr_cpu) on two similar
>>> machines(both are Skylake) you saw no contention while I saw some so I
>>> tried to be exact on how I run the workload.
>>
>> I just reran the workload on a 26 core / 52 thread Cooper Lake using
>> your exact command below and still don't observe any contention
>> whatsoever on the swqueue lock:
>
> Well, it's a puzzle to me.
>
> But as you said below, I guess I'll just move on.

Thanks for bringing this up Aaron. The discussion moved on to different
ways to fix the netperf triggered contention, but I wanted to toss this
out as an easy way to see the same problem:

# swqueue disabled:
# ./schbench -L -m 52 -p 512 -r 10 -t 1
Wakeup Latencies percentiles (usec) runtime 10 (s) (14674354 total samples)
20.0th: 8 (4508866 samples)
50.0th: 11 (2879648 samples)
90.0th: 35 (5865268 samples)
* 99.0th: 70 (1282166 samples)
99.9th: 110 (124040 samples)
min=1, max=9312
avg worker transfer: 28211.91 ops/sec 13.78MB/s

During the swqueue=0 run, the system was ~30% idle

# swqueue enabled:
# ./schbench -L -m 52 -p 512 -r 10 -t 1
Wakeup Latencies percentiles (usec) runtime 10 (s) (6448414 total samples)
20.0th: 30 (1383423 samples)
50.0th: 39 (1986827 samples)
90.0th: 63 (2446965 samples)
* 99.0th: 108 (567275 samples)
99.9th: 158 (57487 samples)
min=1, max=15018
avg worker transfer: 12395.27 ops/sec 6.05MB/s

During the swqueue=1 run, the CPU was at was 97% system time, all stuck
on spinlock contention in the scheduler.

This is a single socket cooperlake with 26 cores/52 threads.

The work is similar to perf pipe test, 52 messenger threads each bouncing
a message back and forth with their own private worker for a 10 second run.

Adding more messenger threads (-m 128) increases the swqueue=0 ops/sec
to about 19MB/s and drags down the swqueue=1 ops/sec to 2MB/s.

-chris


2023-06-23 10:01:52

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:
>
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > {
> > unsigned long flags;
> > struct swqueue *swqueue;
> > - bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > - bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> >
> > /*
> > * Only enqueue the task in the shared wakequeue if:
> > *
> > * - SWQUEUE is enabled
> > - * - The task is on the wakeup path
> > - * - The task wasn't purposefully migrated to the current rq by
> > - * select_task_rq()
> > - * - The task isn't pinned to a specific CPU
> > + * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
> > */
> > - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
>
> You lost the nr_cpus_allowed == 1 case, that actually made sense, except
> he didn't don't have a hook in set_cpus_allowed() to fix it back up.
>

I didn't retain the "nr_cpus_allowed == 1" check because
select_task_rq() calls select_task_rq_fair() only when
p->nr_cpus_allowed > 1.

So if p was affined to a single CPU, it would always have
p->sched_add_to_swq set to 0, thereby not queueing it on the 'queue'.


> > + if (!READ_ONCE(p->sched_add_to_swq))
> > return;
>
> So suppose we fill all CPUs with tasks T_{0..n-n1}, sis finds them a
> nice idle cpu and we don't get queued.

Yes.

>
> Then wake up another n tasks T_{n...2n-1}, none of them find an idle
> CPU, as per the above them all taken. These all do get queued.

Yes.

>
> However, suppose T>=n does a wakeup-preemption, and we end up with T>=n
> running while T<n get queued on the rq, but not the 'queue' (I so hate
> the swqueue name).

Yes. But note that prior to running T>=n on the CPU, it will be
removed from the 'queue'. So if every T>=n preempts the corresponding
T<n, then, the queue becomes empty. And yes, in this case the queue
served no purpose and does not justify the additional overhead.

>
> Then one CPU has both it's tasks go 'away' and becomes idle.

Yes, and this CPU prior to newidle_balance() will now search the
'queue'.

>
> At this point the 'queue' contains only running tasks that are not
> migratable and all the tasks that could be migrated are not on the queue
> -- IOW it is completely useless.

At this point, the CPU will call swqueue_pick_next_task() and we can
have one of the three cases:

* The 'queue' is empty: As mentioned above, the queue didn't do
anything for the tasks and was cmpletely useless.

* The task at the head of 'queue' is not runnable on the CPU which
is going idle. Again, wasted effort of adding this task to the
queue, because in the current implementation, the task is
removed from the 'queue' even when it is not runnable on this
CPU.

* The 'queue' has some task which can be run on the CPU going
idle. The task is successfully migrated to this CPU, which no
longer has to do newidle_balance() and the task has reduced it
waiting period.

>
> Is this the intention?

IIUC, the intention is to reduce the avg post-wakeup wait time of the
task by running it on a newidle CPU, when the task's CPU is still busy
running something else. This is assuming that the task won't win
wakeup-preemption. But the flipside is the additional cost of swqueue
enqueue/dequeue.

I have queued a run across a set of benchmarks, but just to get an
idea how the patch performs, I ran hackbench and this is what the
results look like on a 2 Socket Gen3 EPYC configured in NPS=2 with 64
cores 128 threads per socket:

Test: tip tip_swq_disabled tip_swq_enabled
=============================================================================
1-groups: 3.82 (0.00 pct) 3.84 (-0.52 pct) 3.46 (+9.42 pct)
2-groups: 4.40 (0.00 pct) 4.42 (-0.45 pct) 3.66 (+16.81 pct)
4-groups: 4.84 (0.00 pct) 4.82 (+0.41 pct) 4.50 (+7.02 pct)
8-groups: 5.45 (0.00 pct) 5.43 (+0.36 pct) 5.85 (-7.33 pct)
16-groups: 6.94 (0.00 pct) 6.90 (+0.57 pct) 8.41 (-21.18 pct)

We can see that patchset does quite well with 1,2,4 groups, but with 8
and 16 groups as the system becomes overloaded, we can see the
performance degrade. In the overloaded case, I could observe
native_queued_spin_lock_slowpath() looming large in the perf profiles
when the swqueue feature was enabled. So perhaps the cost of swqueue
enqueue/dequeue is not justifiable, especially if the tasks are anyway
going to run on their target CPUs.

I will post more results later.
--
Thanks and Regards
gautham.

2023-06-26 06:14:13

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

Hello Peter, David,

On Fri, Jun 23, 2023 at 03:20:15PM +0530, Gautham R. Shenoy wrote:
> On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:

>
> I will post more results later.

I was able to get some numbers for hackbench, schbench (old), and
tbench over the weekend on a 2 Socket Zen3 box with 64 cores 128
threads per socket configured in NPS1 mode.

The legend is as follows:

tip : tip/sched/core with HEAD being commit e2a1f85bf9f5 ("sched/psi:
Avoid resetting the min update period when it is unnecessary")


david : This patchset

david-ego-1 : David's patchset + my modification to allow SIS signal
that a task should be queued on the shared-wakequeue when SIS cannot
find an idle CPU to wake up the task.

david-ego-2 : David's patchset + david-ego-1 + my modification to
remove the first task from the shared-wakequeue whose
cpus_allowed contains this CPU. Currently we don't do
this check and always remove the first task.


david-ego-1 and david-ego-2 are attached with this mail.

hackbench (Measure: time taken to complete, in seconds)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test: tip david david-ego-1 david-ego-2
1-groups: 3.92 (0.00 pct) 3.35 (14.54 pct) 3.53 (9.94 pct) 3.30 (15.81 pct)
2-groups: 4.58 (0.00 pct) 3.89 (15.06 pct) 3.95 (13.75 pct) 3.79 (17.24 pct)
4-groups: 4.99 (0.00 pct) 4.42 (11.42 pct) 4.76 (4.60 pct) 4.77 (4.40 pct)
8-groups: 5.67 (0.00 pct) 5.08 (10.40 pct) 6.16 (-8.64 pct) 6.33 (-11.64 pct)
16-groups: 7.88 (0.00 pct) 7.32 (7.10 pct) 8.57 (-8.75 pct) 9.77 (-23.98 pct)


Observation: We see that David's patchset does very well across all
the groups. Expanding the scope of the shared-wakequeue with
david-ego-1 doesn't give us much and in fact hurts at higher
utilization. Same is the case with david-ego-2 which only pulls
allowed tasks from the shared-wakequeue. In david-ego-2 we see a
greater amount of spin-lock contention for 8 and 16 groups, as the
code holds the spinlock and iterates through the list members while
checking cpu-affinity.

So, David's original patchset wins this one.




schbench (Measure : 99th Percentile latency, in us)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#workers: tip david david-ego-1 david-ego-2
1: 26.00 (0.00 pct) 21.00 (19.23 pct) 28.00 (-7.69 pct) 22.00 (15.38 pct)
2: 27.00 (0.00 pct) 29.00 (-7.40 pct) 28.00 (-3.70 pct) 30.00 (-11.11 pct)
4: 31.00 (0.00 pct) 31.00 (0.00 pct) 31.00 (0.00 pct) 28.00 (9.67 pct)
8: 36.00 (0.00 pct) 37.00 (-2.77 pct) 34.00 (5.55 pct) 39.00 (-8.33 pct)
16: 49.00 (0.00 pct) 49.00 (0.00 pct) 48.00 (2.04 pct) 50.00 (-2.04 pct)
32: 80.00 (0.00 pct) 80.00 (0.00 pct) 88.00 (-10.00 pct) 79.00 (1.25 pct)
64: 169.00 (0.00 pct) 180.00 (-6.50 pct) 174.00 (-2.95 pct) 168.00 (0.59 pct)
128: 343.00 (0.00 pct) 355.00 (-3.49 pct) 356.00 (-3.79 pct) 344.00 (-0.29 pct)
256: 42048.00 (0.00 pct) 46528.00 (-10.65 pct) 51904.00 (-23.43 pct) 48064.00 (-14.30 pct)
512: 95104.00 (0.00 pct) 95872.00 (-0.80 pct) 95360.00 (-0.26 pct) 97152.00 (-2.15 pct)


Observations: There are run-to-run variations with this benchmark. I
will try with the newer schbench later this week.

tbench (Measure: Throughput, records/s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients: tip sis-node david david-ego-1 ego-david-2
1 452.49 (0.00 pct) 457.94 (1.20 pct) 448.52 (-0.87 pct) 447.11 (-1.18 pct) 458.45 (1.31 pct)
2 862.44 (0.00 pct) 879.99 (2.03 pct) 860.14 (-0.26 pct) 873.27 (1.25 pct) 891.72 (3.39 pct)
4 1604.27 (0.00 pct) 1618.87 (0.91 pct) 1610.95 (0.41 pct) 1628.45 (1.50 pct) 1657.26 (3.30 pct)
8 2966.77 (0.00 pct) 3040.90 (2.49 pct) 2991.07 (0.81 pct) 3063.31 (3.25 pct) 3106.50 (4.70 pct)
16 5176.70 (0.00 pct) 5292.29 (2.23 pct) 5478.32 (5.82 pct) 5462.05 (5.51 pct) 5537.15 (6.96 pct)
32 8205.24 (0.00 pct) 8949.12 (9.06 pct) 9039.63 (10.16 pct) 9466.07 (15.36 pct) 9365.06 (14.13 pct)
64 13956.71 (0.00 pct) 14461.42 (3.61 pct) 16337.65 (17.05 pct) 16941.63 (21.38 pct) 15697.47 (12.47 pct)
128 24005.50 (0.00 pct) 26052.75 (8.52 pct) 25605.24 (6.66 pct) 27243.19 (13.48 pct) 24854.60 (3.53 pct)
256 32457.61 (0.00 pct) 21999.41 (-32.22 pct) 36953.22 (13.85 pct) 32299.31 (-0.48 pct) 33037.03 (1.78 pct)
512 34345.24 (0.00 pct) 41166.39 (19.86 pct) 40845.23 (18.92 pct) 40797.97 (18.78 pct) 38150.17 (11.07 pct)
1024 33432.92 (0.00 pct) 40900.84 (22.33 pct) 39749.35 (18.89 pct) 41133.82 (23.03 pct) 38464.26 (15.04 pct)


Observations: tbench really likes all variants of shared-wakeueue. I
have also included sis-node numbers since we saw that tbench liked
sis-node.

Also, it can be noted that except for the 256 clients case (number of
clients == number of threads in the system), in all other cases, we
see a benefit with david-ego-1 which extends the usage of
shared-wakequeue to the waker's target when the waker's LLC is busy.

Will try and get the netperf, postgresql, SPECjbb and Deathstarbench
numbers this week.

--
Thanks and Regards
gautham.







Attachments:
(No filename) (5.54 kB)
david-ego-1.patch (2.74 kB)
david-ego-2.patch (2.45 kB)
Download all attachments

2023-06-27 04:23:58

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On Mon, Jun 26, 2023 at 11:34:41AM +0530, Gautham R. Shenoy wrote:
> Hello Peter, David,

Hello Gautham,

Thanks a lot for running these experiments and experimenting with some
variations.

>
> On Fri, Jun 23, 2023 at 03:20:15PM +0530, Gautham R. Shenoy wrote:
> > On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:
>
> >
> > I will post more results later.
>
> I was able to get some numbers for hackbench, schbench (old), and
> tbench over the weekend on a 2 Socket Zen3 box with 64 cores 128
> threads per socket configured in NPS1 mode.
>
> The legend is as follows:
>
> tip : tip/sched/core with HEAD being commit e2a1f85bf9f5 ("sched/psi:
> Avoid resetting the min update period when it is unnecessary")
>
>
> david : This patchset
>
> david-ego-1 : David's patchset + my modification to allow SIS signal
> that a task should be queued on the shared-wakequeue when SIS cannot
> find an idle CPU to wake up the task.
>
> david-ego-2 : David's patchset + david-ego-1 + my modification to
> remove the first task from the shared-wakequeue whose
> cpus_allowed contains this CPU. Currently we don't do
> this check and always remove the first task.
>
>
> david-ego-1 and david-ego-2 are attached with this mail.
>
> hackbench (Measure: time taken to complete, in seconds)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Test: tip david david-ego-1 david-ego-2
> 1-groups: 3.92 (0.00 pct) 3.35 (14.54 pct) 3.53 (9.94 pct) 3.30 (15.81 pct)
> 2-groups: 4.58 (0.00 pct) 3.89 (15.06 pct) 3.95 (13.75 pct) 3.79 (17.24 pct)
> 4-groups: 4.99 (0.00 pct) 4.42 (11.42 pct) 4.76 (4.60 pct) 4.77 (4.40 pct)
> 8-groups: 5.67 (0.00 pct) 5.08 (10.40 pct) 6.16 (-8.64 pct) 6.33 (-11.64 pct)
> 16-groups: 7.88 (0.00 pct) 7.32 (7.10 pct) 8.57 (-8.75 pct) 9.77 (-23.98 pct)

Nice, these are pretty significant improvements.

> Observation: We see that David's patchset does very well across all
> the groups. Expanding the scope of the shared-wakequeue with
> david-ego-1 doesn't give us much and in fact hurts at higher
> utilization. Same is the case with david-ego-2 which only pulls
> allowed tasks from the shared-wakequeue. In david-ego-2 we see a
> greater amount of spin-lock contention for 8 and 16 groups, as the
> code holds the spinlock and iterates through the list members while
> checking cpu-affinity.

Peter's idea to shard may help with the contention here. I also wonder
how we'd do with david-ego-2 minus david-ego-1. Perhaps the slightly
less aggressive enqueing would somewhat lower the contention?

> so, david's original patchset wins this one.
>
>
>
>
> schbench (measure : 99th percentile latency, in us)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> #workers: tip david david-ego-1 david-ego-2
> 1: 26.00 (0.00 pct) 21.00 (19.23 pct) 28.00 (-7.69 pct) 22.00 (15.38 pct)
> 2: 27.00 (0.00 pct) 29.00 (-7.40 pct) 28.00 (-3.70 pct) 30.00 (-11.11 pct)
> 4: 31.00 (0.00 pct) 31.00 (0.00 pct) 31.00 (0.00 pct) 28.00 (9.67 pct)
> 8: 36.00 (0.00 pct) 37.00 (-2.77 pct) 34.00 (5.55 pct) 39.00 (-8.33 pct)
> 16: 49.00 (0.00 pct) 49.00 (0.00 pct) 48.00 (2.04 pct) 50.00 (-2.04 pct)
> 32: 80.00 (0.00 pct) 80.00 (0.00 pct) 88.00 (-10.00 pct) 79.00 (1.25 pct)
> 64: 169.00 (0.00 pct) 180.00 (-6.50 pct) 174.00 (-2.95 pct) 168.00 (0.59 pct)
> 128: 343.00 (0.00 pct) 355.00 (-3.49 pct) 356.00 (-3.79 pct) 344.00 (-0.29 pct)
> 256: 42048.00 (0.00 pct) 46528.00 (-10.65 pct) 51904.00 (-23.43 pct) 48064.00 (-14.30 pct)
> 512: 95104.00 (0.00 pct) 95872.00 (-0.80 pct) 95360.00 (-0.26 pct) 97152.00 (-2.15 pct)
>
>
> observations: there are run-to-run variations with this benchmark. i
> will try with the newer schbench later this week.

+cc Chris. He showed how you can use schbench to cause painful
contention with swqueue in [0]. Chris -- any other schbench incantations
that you'd recommend we try with Gautham's patches?

[0]: https://lore.kernel.org/lkml/[email protected]/

> tbench (measure: throughput, records/s)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> clients: tip sis-node david david-ego-1 ego-david-2
> 1 452.49 (0.00 pct) 457.94 (1.20 pct) 448.52 (-0.87 pct) 447.11 (-1.18 pct) 458.45 (1.31 pct)
> 2 862.44 (0.00 pct) 879.99 (2.03 pct) 860.14 (-0.26 pct) 873.27 (1.25 pct) 891.72 (3.39 pct)
> 4 1604.27 (0.00 pct) 1618.87 (0.91 pct) 1610.95 (0.41 pct) 1628.45 (1.50 pct) 1657.26 (3.30 pct)
> 8 2966.77 (0.00 pct) 3040.90 (2.49 pct) 2991.07 (0.81 pct) 3063.31 (3.25 pct) 3106.50 (4.70 pct)
> 16 5176.70 (0.00 pct) 5292.29 (2.23 pct) 5478.32 (5.82 pct) 5462.05 (5.51 pct) 5537.15 (6.96 pct)
> 32 8205.24 (0.00 pct) 8949.12 (9.06 pct) 9039.63 (10.16 pct) 9466.07 (15.36 pct) 9365.06 (14.13 pct)
> 64 13956.71 (0.00 pct) 14461.42 (3.61 pct) 16337.65 (17.05 pct) 16941.63 (21.38 pct) 15697.47 (12.47 pct)
> 128 24005.50 (0.00 pct) 26052.75 (8.52 pct) 25605.24 (6.66 pct) 27243.19 (13.48 pct) 24854.60 (3.53 pct)
> 256 32457.61 (0.00 pct) 21999.41 (-32.22 pct) 36953.22 (13.85 pct) 32299.31 (-0.48 pct) 33037.03 (1.78 pct)
> 512 34345.24 (0.00 pct) 41166.39 (19.86 pct) 40845.23 (18.92 pct) 40797.97 (18.78 pct) 38150.17 (11.07 pct)
> 1024 33432.92 (0.00 pct) 40900.84 (22.33 pct) 39749.35 (18.89 pct) 41133.82 (23.03 pct) 38464.26 (15.04 pct)
>
>
> observations: tbench really likes all variants of shared-wakeueue. i

It's encouraging to see how well swqueue is handling these benchmarks;
especially on CPUs with larger LLCs and with a lot of clients. I'm keen
to see how we do with the sharding approach and your patches.

> have also included sis-node numbers since we saw that tbench liked
> sis-node.
>
> also, it can be noted that except for the 256 clients case (number of
> clients == number of threads in the system), in all other cases, we
> see a benefit with david-ego-1 which extends the usage of
> shared-wakequeue to the waker's target when the waker's llc is busy.

Also confused why the 256 client case would perform worse for SIS_NODE
and david-ego-1. Was it consistently negative if you do multiple runs?

Overall, my handwavey impression is that david-ego-1 doesn't seem to
significantly improve over david, though I'd very much like to see the
results of the other benchmarks you listed below. Iterating over tasks
until you find a usable cpus->ptr also makes sense, and I'm hoping that
the sharding idea will help with the contention there.

> will try and get the netperf, postgresql, specjbb and deathstarbench
> numbers this week.

Thanks, that would be great. I'll be on vacation this week starting
Wednesday, but will try to put together a v2 that includes Peter's
suggestions by the end of the day tomorrow. Otherwise, I'll likely send
it out next week (along with your patch(es) if they're shown to be an
improvement over the baseline).

Thanks,
David

>
> --
> thanks and regards
> gautham.
>
>
>
>
>
>

> from 05d8efe2f3ae3abafd4bf94a0579d378dba63bb6 mon sep 17 00:00:00 2001
> from: "gautham r. shenoy" <[email protected]>
> date: fri, 23 jun 2023 11:02:03 +0530
> subject: [patch 1/2] swqueue: control if a task should be queued on swq in
> select_idle_sibling()
>
> if select_idle_sibling() fails to find an idle cpu to wakeup the task
> on, then update the newly defined sched_add_to_swq field in its task
> struct.
>
> use the value in this field to later on to determine if the task
> should also be queued on the shared-wakequeue of the llc of the target
> cpu.
>
> this extends the use of shared-wakequeue to cases when the target of a
> wakeup is the current cpu instead of the task's previous cpu.
>
> signed-off-by: gautham r. shenoy <[email protected]>
> ---
> include/linux/sched.h | 6 ++++++
> kernel/sched/fair.c | 15 ++++++++-------
> 2 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b64fec55a381..38005262a7fe 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -910,6 +910,12 @@ struct task_struct {
> */
> unsigned sched_remote_wakeup:1;
>
> + /*
> + * bit used by select_idle_sibling() to signal enqueuing the
> + * task on a shared wakequeue when it failed find an idle cpu.
> + */
> + unsigned sched_add_to_swq:1;
> +
> /* bit to tell lsms we're in execve(): */
> unsigned in_execve:1;
> unsigned in_iowait:1;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e311d1c3b992..fe33f6b13299 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> {
> unsigned long flags;
> struct swqueue *swqueue;
> - bool task_migrated = enq_flags & enqueue_migrated;
> - bool task_wakeup = enq_flags & enqueue_wakeup;
>
> /*
> * only enqueue the task in the shared wakequeue if:
> *
> * - swqueue is enabled
> - * - the task is on the wakeup path
> - * - the task wasn't purposefully migrated to the current rq by
> - * select_task_rq()
> - * - the task isn't pinned to a specific cpu
> + * - select_idle_sibling() didn't find an idle cpu to enqueue this wakee task.
> */
> - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> + if (!p->sched_add_to_swq)
> return;
>
> + p->sched_add_to_swq = 0;
> swqueue = rq_swqueue(rq);
> spin_lock_irqsave(&swqueue->lock, flags);
> list_add_tail(&p->swqueue_node, &swqueue->list);
> @@ -7361,6 +7357,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + /*
> + * no idle sibling was found. ok to queue this task on the
> + * shared wakequeue of the target.
> + */
> + p->sched_add_to_swq = 1;
> return target;
> }
>
> --
> 2.25.1
>

> from 88f52c2df8a2d92423ddd12c92edec949148bf3c mon sep 17 00:00:00 2001
> from: "gautham r. shenoy" <[email protected]>
> date: fri, 23 jun 2023 23:25:04 +0530
> subject: [patch 2/2] swqueue: only pull a task with valid affinity from
> swqueue
>
> currently swqueue_pull_task() dequeues the task at the head of the
> shared-wakequeue and then tries to migrate the task onto the current
> cpu.
>
> this may fail, since the current cpu may not be set in the task's
> affinity mask.
>
> hence in swqueue_pull_task(), pull the first task from the
> shared-wakequeue that can be run on this cpu. with this,
> swqueue_pick_next_task() can return a 0/1 instead of 0/-1/1 as it is
> done now.
>
> singed-off-by: gautham r. shenoy <[email protected]>
> ---
> kernel/sched/fair.c | 22 ++++++++++++----------
> 1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fe33f6b13299..e78b8302b4c8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -195,17 +195,21 @@ static struct swqueue *rq_swqueue(struct rq *rq)
> return rq->cfs.swqueue;
> }
>
> -static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue, int cpu)
> {
> unsigned long flags;
>
> struct task_struct *p;
>
> spin_lock_irqsave(&swqueue->lock, flags);
> - p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> - swqueue_node);
> - if (p)
> - list_del_init(&p->swqueue_node);
> + list_for_each_entry(p, &swqueue->list, swqueue_node) {
> + if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + list_del_init(&p->swqueue_node);
> + goto found;
> + }
> + }
> + p = null;
> +found:
> spin_unlock_irqrestore(&swqueue->lock, flags);
>
> return p;
> @@ -238,11 +242,11 @@ static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> struct task_struct *p = null;
> struct rq *src_rq;
> struct rq_flags src_rf;
> - int ret;
> + int ret = 0;
>
> swqueue = rq_swqueue(rq);
> if (!list_empty(&swqueue->list))
> - p = swqueue_pull_task(swqueue);
> + p = swqueue_pull_task(swqueue, rq->cpu);
>
> if (!p)
> return 0;
> @@ -255,10 +259,8 @@ static int swqueue_pick_next_task(struct rq *rq, struct rq_flags *rf)
> if (task_on_rq_queued(p) && !task_on_cpu(rq, p))
> src_rq = migrate_task_to(src_rq, &src_rf, p, cpu_of(rq));
>
> - if (src_rq->cpu != rq->cpu)
> + if (src_rq->cpu == rq->cpu)
> ret = 1;
> - else
> - ret = -1;
>
> task_rq_unlock(src_rq, p, &src_rf);
>
> --
> 2.25.1
>


2023-06-27 16:58:18

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

On 6/26/23 11:17 PM, David Vernet wrote:
> On Mon, Jun 26, 2023 at 11:34:41AM +0530, Gautham R. Shenoy wrote:

>>
>> observations: there are run-to-run variations with this benchmark. i
>> will try with the newer schbench later this week.
>
> +cc Chris. He showed how you can use schbench to cause painful
> contention with swqueue in [0]. Chris -- any other schbench incantations
> that you'd recommend we try with Gautham's patches?
>
> [0]: https://lore.kernel.org/lkml/[email protected]/

Hopefully my command line from this other email will be consistent with
the new schbench, please let me know if you're still seeing variations.

In terms of other things to try, I was creating one worker per messenger
to maximize runqueue lock contention, but variations on -t may be
interesting.

Basically I did -t 1 -m <increasing> until all the idle time on
unpatched Linus was soaked up, and then I compared Linus and swqueue.

-chris


2023-07-10 12:07:12

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] sched: Implement shared wakequeue in CFS

Hello David,

Sharing the results from testing SWQUEUE with some of the standard
benchmarks below.

tl;dr

- Hackbench, tbench, DeathStarBench and SPECjbb seem to like the
SWQUEUE (except for hackbench in NPS4, with 16 groups)

- Netperf sees regression as system becomes more loaded. I'll go
back and check if there is some lock contention issues here which
is parallely being discussed on the tread.

- Other benchmarks seem to be behave similar to tip.

On 6/13/2023 10:50 AM, David Vernet wrote:
> Overview
> ========
>
> The scheduler must constantly strike a balance between work
> conservation, and avoiding costly migrations which harm performance due
> to e.g. decreased cache locality. The matter is further complicated by
> the topology of the system. Migrating a task between cores on the same
> LLC may be more optimal than keeping a task local to the CPU, whereas
> migrating a task between LLCs or NUMA nodes may tip the balance in the
> other direction.
>
> With that in mind, while CFS is by and large mostly a work conserving
> scheduler, there are certain instances where the scheduler will choose
> to keep a task local to a CPU, when it would have been more optimal to
> migrate it to an idle core.
>
> An example of such a workload is the HHVM / web workload at Meta. HHVM
> is a VM that JITs Hack and PHP code in service of web requests. Like
> other JIT / compilation workloads, it tends to be heavily CPU bound, and
> exhibit generally poor cache locality. To try and address this, we set
> several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
>
> - migration_cost_ns -> 0
> - latency_ns -> 20000000
> - min_granularity_ns -> 10000000
> - wakeup_granularity_ns -> 12000000
>
> These knobs are intended both to encourage the scheduler to be as work
> conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> running for relatively long time slices so as to avoid the overhead of
> context switching (the other knobs). Collectively, these knobs provide a
> substantial performance win; resulting in roughly a 20% improvement in
> throughput. Worth noting, however, is that this improvement is _not_ at
> full machine saturation.
>
> That said, even with these knobs, we noticed that CPUs were still going
> idle even when the host was overcommitted. In response, we wrote the
> "shared wakequeue" (swqueue) feature proposed in this patch set. The
> idea behind swqueue is simple: it enables the scheduler to be
> aggressively work conserving by placing a waking task into a per-LLC
> FIFO queue that can be pulled from by another core in the LLC FIFO queue
> which can then be pulled from before it goes idle.
>
> With this simple change, we were able to achieve a 1 - 1.6% improvement
> in throughput, as well as a small, consistent improvement in p95 and p99
> latencies, in HHVM. These performance improvements were in addition to
> the wins from the debugfs knobs mentioned above.
>
> Design
> ======
>
> The design of swqueue is quite simple. An swqueue is simply a struct
> list_head, and a spinlock:
>
> struct swqueue {
> struct list_head list;
> spinlock_t lock;
> } ____cacheline_aligned;
>
> We create a struct swqueue per LLC, ensuring they're in their own
> cachelines to avoid false sharing between CPUs on different LLCs.
>
> When a task first wakes up, it enqueues itself in the swqueue of its
> current LLC at the end of enqueue_task_fair(). Enqueues only happen if
> the task was not manually migrated to the current core by
> select_task_rq(), and is not pinned to a specific CPU.
>
> A core will pull a task from its LLC's swqueue before calling
> newidle_balance().
>
> Difference between SIS_NODE
> ===========================
>
> In [0] Peter proposed a patch that addresses Tejun's observations that
> when workqueues are targeted towards a specific LLC on his Zen2 machine
> with small CCXs, that there would be significant idle time due to
> select_idle_sibling() not considering anything outside of the current
> LLC.
>
> This patch (SIS_NODE) is essentially the complement to the proposal
> here. SID_NODE causes waking tasks to look for idle cores in neighboring
> LLCs on the same die, whereas swqueue causes cores about to go idle to
> look for enqueued tasks. That said, in its current form, the two
> features at are a different scope as SIS_NODE searches for idle cores
> between LLCs, while swqueue enqueues tasks within a single LLC.
>
> The patch was since removed in [1], but we elect to compare its
> performance to swqueue given that as described above, it's conceptually
> complementary.
>
> [0]: https://lore.kernel.org/all/[email protected]/
> [1]: https://lore.kernel.org/all/[email protected]/
>
> I observed that while SIS_NODE works quite well for hosts with small
> CCXs, it can result in degraded performance on machines either with a
> large number of total cores in a CCD, or for which the cache miss
> penalty of migrating between CCXs is high, even on the same die.
>
> For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
> through a single link to the IO die, and thus have similar cache miss
> latencies as cores in remote CCDs.
>
> Such subtleties could be taken into account with SIS_NODE, but
> regardless, both features are conceptually complementary sides of the
> same coin. SIS_NODE searches for idle cores for waking threads, whereas
> swqueue searches for available work before a core goes idle.
>
> Results
> =======
>
> Note that the motivation for the shared wakequeue feature was originally
> arrived at using experiments in the sched_ext framework that's currently
> being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
> is similarly visible using work-conserving sched_ext schedulers (even
> very simple ones like global FIFO).
>
> In both single and multi socket / CCX hosts, this can measurably improve
> performance. In addition to the performance gains observed on our
> internal web workloads, we also observed an improvement in common
> workloads such as kernel compile when running shared wakequeue. Here are
> the results of running make -j$(nproc) built-in.a on several different
> types of hosts configured with make allyesconfig on commit a27648c74210
> ("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
> tree (boost was disabled on all of these hosts when the experiments were
> performed):
>
> Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
>
> CPU max MHz: 5879.8818
> CPU min MHz: 3000.0000
> o____________o_______o
> | mean | CPU |
> o------------o-------o
> NO_SWQUEUE + NO_SIS_NODE: | 590.52s | 3103% |
> NO_SWQUEUE + SIS_NODE: | 590.80s | 3102% |
> SWQUEUE + NO_SIS_NODE: | 589.65s | 3116% |
> SWQUEUE + SIS_NODE: | 589.99s | 3115% |
> o------------o-------o
>
> Takeaway: swqueue doesn't seem to provide a statistically significant
> improvement for kernel compile on my 7950X. SIS_NODE similarly does not
> have a noticeable effect on performance.
>
> -------------------------------------------------------------------------------
>
> Single-socket | 72-core | 6-CCX | AMD Milan Zen3
>
> CPU max MHz: 3245.0190
> CPU min MHz: 700.0000
> o_____________o_______o
> | mean | CPU |
> o-------------o-------o
> NO_SWQUEUE + NO_SIS_NODE: | 1608.69s | 6488% |
> NO_SWQUEUE + SIS_NODE: | 1610.24s | 6473% |
> SWQUEUE + NO_SIS_NODE: | 1605.80s | 6504% |
> SWQUEUE + SIS_NODE: | 1606.96s | 6488% |
> o-------------o-------o
>
> Takeaway: swqueue does provide a small statistically significant
> improvement on Milan, but the compile times in general were quite long
> relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock
> frequency. Milan also has larger CCXs than Bergamo, so it stands to
> reason that select_idle_sibling() will have an easier time finding idle
> cores inside the current CCX.

o System Details

Dual Socket 3rd Generation EPYC System (2 x 64C/128T)

o NPS Modes

NPS Modes are used to logically divide single socket into
multiple NUMA region.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 223-231
Node 7: 112-127, 232-255

o Kernel Versions

- tip - tip:sched/core at commit e2a1f85bf9f5 "sched/psi:
Avoid resetting the min update period when it is
unnecessary")

- SWQUEUE - tip:sched/core + this patch
(SWQUEUE feature enabled from debugfs)

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip SWQUEUE
1-groups: 3.92 (0.00 pct) 3.41 (13.01 pct)
2-groups: 4.58 (0.00 pct) 3.80 (17.03 pct)
4-groups: 4.99 (0.00 pct) 4.48 (10.22 pct)
8-groups: 5.67 (0.00 pct) 5.11 (9.87 pct)
16-groups: 7.88 (0.00 pct) 7.29 (7.48 pct)

o NPS2

Test: tip SWQUEUE
1-groups: 3.82 (0.00 pct) 3.45 (9.68 pct)
2-groups: 4.40 (0.00 pct) 3.72 (15.45 pct)
4-groups: 4.84 (0.00 pct) 4.27 (11.77 pct)
8-groups: 5.45 (0.00 pct) 4.98 (8.62 pct)
16-groups: 6.94 (0.00 pct) 6.82 (1.72 pct)

o NPS4

Test: tip SWQUEUE
1-groups: 3.82 (0.00 pct) 3.50 (8.37 pct)
2-groups: 4.44 (0.00 pct) 3.77 (15.09 pct)
4-groups: 4.86 (0.00 pct) 4.44 (8.64 pct)
8-groups: 5.42 (0.00 pct) 5.08 (6.27 pct)
16-groups: 6.68 (0.00 pct) 7.82 (-17.06 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers: tip SWQUEUE
1: 26.00 (0.00 pct) 28.00 (-7.69 pct)
2: 27.00 (0.00 pct) 23.00 (14.81 pct)
4: 31.00 (0.00 pct) 29.00 (6.45 pct)
8: 36.00 (0.00 pct) 38.00 (-5.55 pct)
16: 49.00 (0.00 pct) 48.00 (2.04 pct)
32: 80.00 (0.00 pct) 78.00 (2.50 pct)
64: 169.00 (0.00 pct) 168.00 (0.59 pct)
128: 343.00 (0.00 pct) 383.00 (-11.66 pct)
256: 42048.00 (0.00 pct) 50368.00 (-19.78 pct)
512: 95104.00 (0.00 pct) 93312.00 (1.88 pct)

o NPS2

#workers: tip SWQUEUE
1: 23.00 (0.00 pct) 21.00 (8.69 pct)
2: 24.00 (0.00 pct) 25.00 (-4.16 pct)
4: 31.00 (0.00 pct) 29.00 (6.45 pct)
8: 41.00 (0.00 pct) 43.00 (-4.87 pct)
16: 48.00 (0.00 pct) 50.00 (-4.16 pct)
32: 81.00 (0.00 pct) 80.00 (1.23 pct)
64: 157.00 (0.00 pct) 161.00 (-2.54 pct)
128: 386.00 (0.00 pct) 413.00 (-6.99 pct)
256: 48832.00 (0.00 pct) 49472.00 (-1.31 pct)
512: 92032.00 (0.00 pct) 93568.00 (-1.66 pct)

o NPS4

#workers: tip SWQUEUE
1: 21.00 (0.00 pct) 22.00 (-4.76 pct)
2: 28.00 (0.00 pct) 27.00 (3.57 pct)
4: 32.00 (0.00 pct) 29.00 (9.37 pct)
8: 46.00 (0.00 pct) 44.00 (4.34 pct)
16: 51.00 (0.00 pct) 59.00 (-15.68 pct)
32: 82.00 (0.00 pct) 86.00 (-4.87 pct)
64: 173.00 (0.00 pct) 164.00 (5.20 pct)
128: 396.00 (0.00 pct) 400.00 (-1.01 pct)
256: 48832.00 (0.00 pct) 45504.00 (6.81 pct)
512: 95104.00 (0.00 pct) 96640.00 (-1.61 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip SWQUEUE
1 452.49 (0.00 pct) 438.60 (-3.06 pct)
2 862.44 (0.00 pct) 859.99 (-0.28 pct)
4 1604.27 (0.00 pct) 1593.44 (-0.67 pct)
8 2966.77 (0.00 pct) 3005.65 (1.31 pct)
16 5176.70 (0.00 pct) 5263.86 (1.68 pct)
32 8205.24 (0.00 pct) 8890.77 (8.35 pct)
64 13956.71 (0.00 pct) 14600.43 (4.61 pct)
128 24005.50 (0.00 pct) 25695.32 (7.03 pct)
256 32457.61 (0.00 pct) 35580.53 (9.62 pct)
512 34345.24 (0.00 pct) 39206.89 (14.15 pct)
1024 33432.92 (0.00 pct) 38751.11 (15.90 pct)

o NPS2

Clients: tip SWQUEUE
1 453.73 (0.00 pct) 447.49 (-1.37 pct)
2 861.71 (0.00 pct) 849.51 (-1.41 pct)
4 1599.14 (0.00 pct) 1604.38 (0.32 pct)
8 2951.03 (0.00 pct) 2992.67 (1.41 pct)
16 5080.32 (0.00 pct) 5318.28 (4.68 pct)
32 7900.41 (0.00 pct) 7845.42 (-0.69 pct)
64 14629.65 (0.00 pct) 14621.95 (-0.05 pct)
128 23155.88 (0.00 pct) 24286.26 (4.88 pct)
256 33449.57 (0.00 pct) 33606.58 (0.46 pct)
512 33757.47 (0.00 pct) 37592.42 (11.36 pct)
1024 34823.14 (0.00 pct) 38943.21 (11.83 pct)

o NPS4

Clients: tip SWQUEUE
1 450.14 (0.00 pct) 444.69 (-1.21 pct)
2 863.26 (0.00 pct) 859.65 (-0.41 pct)
4 1618.71 (0.00 pct) 1604.36 (-0.88 pct)
8 2929.35 (0.00 pct) 2996.47 (2.29 pct)
16 5114.04 (0.00 pct) 5243.98 (2.54 pct)
32 7912.18 (0.00 pct) 8200.71 (3.64 pct)
64 14424.72 (0.00 pct) 14149.10 (-1.91 pct)
128 23614.97 (0.00 pct) 24149.05 (2.26 pct)
256 34365.13 (0.00 pct) 34244.24 (-0.35 pct)
512 34215.50 (0.00 pct) 39799.28 (16.31 pct)
1024 35421.90 (0.00 pct) 40460.77 (14.22 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test: tip SWQUEUE
Copy: 271317.35 (0.00 pct) 281858.05 (3.88 pct)
Scale: 205533.77 (0.00 pct) 203003.11 (-1.23 pct)
Add: 221624.62 (0.00 pct) 226854.06 (2.35 pct)
Triad: 228500.68 (0.00 pct) 224038.89 (-1.95 pct)

- 100 Runs:

Test: tip SWQUEUE
Copy: 317381.65 (0.00 pct) 319745.64 (0.74 pct)
Scale: 214145.00 (0.00 pct) 210485.52 (-1.70 pct)
Add: 239243.29 (0.00 pct) 233651.11 (-2.33 pct)
Triad: 249477.76 (0.00 pct) 241242.87 (-3.30 pct)

o NPS2

- 10 Runs:

Test: tip SWQUEUE
Copy: 277761.29 (0.00 pct) 292487.03 (5.30 pct)
Scale: 215193.83 (0.00 pct) 209904.82 (-2.45 pct)
Add: 242725.75 (0.00 pct) 239361.95 (-1.38 pct)
Triad: 237253.44 (0.00 pct) 249638.69 (5.22 pct)

- 100 Runs:

Test: tip SWQUEUE
Copy: 318082.10 (0.00 pct) 318466.54 (0.12 pct)
Scale: 219338.56 (0.00 pct) 221856.65 (1.14 pct)
Add: 248118.20 (0.00 pct) 252907.96 (1.93 pct)
Triad: 247088.55 (0.00 pct) 248171.73 (0.43 pct)

o NPS4

- 10 Runs:

Test: tip SWQUEUE
Copy: 273307.14 (0.00 pct) 294527.85 (7.76 pct)
Scale: 235715.23 (0.00 pct) 232442.47 (-1.38 pct)
Add: 244500.40 (0.00 pct) 243812.43 (-0.28 pct)
Triad: 250600.04 (0.00 pct) 249006.70 (-0.63 pct)

- 100 Runs:

Test: tip SWQUEUE
Copy: 345396.19 (0.00 pct) 323605.07 (-6.30 pct)
Scale: 241521.63 (0.00 pct) 218688.02 (-9.45 pct)
Add: 261157.86 (0.00 pct) 234725.46 (-10.12 pct)
Triad: 267804.99 (0.00 pct) 253017.71 (-5.52 pct)

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

Test: tip SWQUEUE
1-clients: 102839.97 (0.00 pct) 98690.53 (-4.03 pct)
2-clients: 98428.08 (0.00 pct) 97027.85 (-1.42 pct)
4-clients: 92298.45 (0.00 pct) 93217.27 (0.99 pct)
8-clients: 85618.41 (0.00 pct) 87393.58 (2.07 pct)
16-clients: 78722.18 (0.00 pct) 78607.48 (-0.14 pct)
32-clients: 73610.75 (0.00 pct) 70636.67 (-4.04 pct)
64-clients: 55285.07 (0.00 pct) 52613.62 (-4.83 pct)
128-clients: 31176.92 (0.00 pct) 35212.35 (12.94 pct)
256-clients: 20011.44 (0.00 pct) 18817.54 (-5.96 pct)

o NPS2

Test: tip SWQUEUE
1-clients: 103105.55 (0.00 pct) 101318.80 (-1.73 pct)
2-clients: 98720.29 (0.00 pct) 97809.70 (-0.92 pct)
4-clients: 92289.39 (0.00 pct) 93771.16 (1.60 pct)
8-clients: 84998.63 (0.00 pct) 85093.68 (0.11 pct)
16-clients: 76395.81 (0.00 pct) 78181.45 (2.33 pct)
32-clients: 71110.89 (0.00 pct) 70312.53 (-1.12 pct)
64-clients: 49526.21 (0.00 pct) 49030.94 (-1.00 pct)
128-clients: 27917.51 (0.00 pct) 31428.64 (12.57 pct)
256-clients: 20067.17 (0.00 pct) 18273.28 (-8.93 pct)

o NPS4

Test: tip SWQUEUE
1-clients: 102139.49 (0.00 pct) 100629.37 (-1.47 pct)
2-clients: 98259.53 (0.00 pct) 96639.21 (-1.64 pct)
4-clients: 91576.79 (0.00 pct) 92032.50 (0.49 pct)
8-clients: 84742.30 (0.00 pct) 85588.07 (0.99 pct)
16-clients: 79540.75 (0.00 pct) 78033.63 (-1.89 pct)
32-clients: 71166.14 (0.00 pct) 69080.28 (-2.93 pct)
64-clients: 51763.24 (0.00 pct) 44243.04 (-14.52 pct)
128-clients: 27829.29 (0.00 pct) 32188.25 (15.66 pct)
256-clients: 24185.37 (0.00 pct) 17669.40 (-26.94 pct)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

tip SWQUEUE
Hmean unixbench-dhry2reg-1 41322625.19 ( 0.00%) 40874894.71 * -1.08%*
Hmean unixbench-dhry2reg-512 6252491108.60 ( 0.00%) 6253691168.43 ( 0.02%)
Amean unixbench-syscall-1 2501398.27 ( 0.00%) 2524108.97 * -0.91%*
Amean unixbench-syscall-512 8120524.00 ( 0.00%) 7596446.63 * 6.45%*
Hmean unixbench-pipe-1 2359346.02 ( 0.00%) 2429532.75 * 2.97%*
Hmean unixbench-pipe-512 338790322.61 ( 0.00%) 343170761.60 * 1.29%*
Hmean unixbench-spawn-1 4261.52 ( 0.00%) 4554.10 * 6.87%*
Hmean unixbench-spawn-512 64328.93 ( 0.00%) 62585.87 * -2.71%*
Hmean unixbench-execl-1 3677.73 ( 0.00%) 3655.40 * -0.61%*
Hmean unixbench-execl-512 11984.83 ( 0.00%) 13248.56 ( 10.54%)

o NPS2

tip SWQUEUE
Hmean unixbench-dhry2reg-1 41311787.29 ( 0.00%) 41100045.17 ( -0.51%)
Hmean unixbench-dhry2reg-512 6243873272.76 ( 0.00%) 6257834310.98 ( 0.22%)
Amean unixbench-syscall-1 2503190.70 ( 0.00%) 2520187.83 * -0.68%*
Amean unixbench-syscall-512 8012388.13 ( 0.00%) 7545705.57 * 5.82%*
Hmean unixbench-pipe-1 2340486.25 ( 0.00%) 2436643.73 * 4.11%*
Hmean unixbench-pipe-512 338965319.79 ( 0.00%) 342560610.22 * 1.06%*
Hmean unixbench-spawn-1 5241.83 ( 0.00%) 5174.11 ( -1.29%)
Hmean unixbench-spawn-512 65799.86 ( 0.00%) 63879.27 * -2.92%*
Hmean unixbench-execl-1 3670.65 ( 0.00%) 3634.26 * -0.99%*
Hmean unixbench-execl-512 13682.00 ( 0.00%) 14002.70 ( 2.34%)

o NPS4

tip SWQUEUE
Hmean unixbench-dhry2reg-1 41025577.99 ( 0.00%) 41588045.76 ( 1.37%)
Hmean unixbench-dhry2reg-512 6255568261.91 ( 0.00%) 6257479983.24 ( 0.03%)
Amean unixbench-syscall-1 2507165.37 ( 0.00%) 2519780.17 * -0.50%*
Amean unixbench-syscall-512 7458476.50 ( 0.00%) 8140849.63 * -9.15%*
Hmean unixbench-pipe-1 2369301.21 ( 0.00%) 2415933.77 * 1.97%*
Hmean unixbench-pipe-512 340299405.72 ( 0.00%) 343206741.58 * 0.85%*
Hmean unixbench-spawn-1 5571.78 ( 0.00%) 5421.31 ( -2.70%)
Hmean unixbench-spawn-512 63999.96 ( 0.00%) 63968.08 ( -0.05%)
Hmean unixbench-execl-1 3587.15 ( 0.00%) 3619.10 * 0.89%*
Hmean unixbench-execl-512 14184.17 ( 0.00%) 13006.80 * -8.30%*

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~


o NPS1:

base: 298681.00 (var: 2.31%)
SWQUEUE 295218.00 (var: 0.15%) (-1.15%)

o NPS2:

base: 296570.00 (var: 1.01%)
SWQUEUE 293648.67 (var: 0.89%) (-0.98%)

o NPS4:

base 297181.67 (var: 0.46%)
SWQUEUE 290700.67 (var: 2.09%) (-2.18%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1:

- 1 CCD

base: 1.00 (var: 0.27%)
SWQUEUE: 0.99 (var: 0.40%) (-0.83%)

- 2 CCD

base: 1.00 (var: 0.42%)
SWQUEUE: 1.02 (var: 0.40%) (2.52%)

- 4 CCD

base: 1.00 (var: 0.46%)
SWQUEUE: 1.04 (var: 1.08%) (4.63%)

- 8 CCD

base: 1.00 (var: 0.63%)
SWQUEUE: 0.99 (var: 7.18%) (-0.83%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb2015 - multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

tip SWQUEUE
max-jOPS 1.00 1.02 (+2.47%)
critical-jOPS 1.00 1.02 (+1.72%)

--

I'll go back and check why netperf is unhappy using perf and IBS.
Meanwhile if there is any specific benchmark you would like me to run
on the test system, please do let me know.

>
> It also seems logical that SIS_NODE would hurt performance a bit here,
> as all cores / CCXs are in the same NUMA node, so select_idle_sibling()
> has to iterate over 72 cores; delaying task wakeup. That said, I'm not
> sure that's a viable theory if total CPU% is lower with SIS_NODE.
>
> [..snip..]
>

--
Thanks and Regards,
Prateek

2023-07-11 05:11:57

by David Vernet

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] sched: Implement shared wakequeue in CFS

On Mon, Jul 10, 2023 at 05:27:15PM +0530, K Prateek Nayak wrote:
> Hello David,
>
> Sharing the results from testing SWQUEUE with some of the standard
> benchmarks below.

Hello Prateek,

Thank you for taking the time to run these benchmarks. Very much
appreciated. As I expect you saw given that I cc'd you, I sent v2 of
this series out this morning in [0].

[0]: https://lore.kernel.org/all/[email protected]/

There were quite a few changes, so if you have the bandwidth it may be
worth running these benchmarks again on v2.

> tl;dr
>
> - Hackbench, tbench, DeathStarBench and SPECjbb seem to like the
> SWQUEUE (except for hackbench in NPS4, with 16 groups)

Excellent, glad to hear that. I observed a significant improvement in
hackbench as well on v2, glad to hear that other benchmarks seem to like
this too.

>
> - Netperf sees regression as system becomes more loaded. I'll go
> back and check if there is some lock contention issues here which
> is parallely being discussed on the tread.

Lock contention is indeed the issue, as Aaron pointed out in [1].

[1]: https://lore.kernel.org/lkml/20230614043529.GA1942@ziqianlu-dell/

Sharding the per-LLC shared runqueues helps quite a bit here, but it
doesn't fully address the issue. See my write-up on the v2 cover letter
for more info.

>
> - Other benchmarks seem to be behave similar to tip.
>
> On 6/13/2023 10:50 AM, David Vernet wrote:
> > Overview
> > ========
> >
> > The scheduler must constantly strike a balance between work
> > conservation, and avoiding costly migrations which harm performance due
> > to e.g. decreased cache locality. The matter is further complicated by
> > the topology of the system. Migrating a task between cores on the same
> > LLC may be more optimal than keeping a task local to the CPU, whereas
> > migrating a task between LLCs or NUMA nodes may tip the balance in the
> > other direction.
> >
> > With that in mind, while CFS is by and large mostly a work conserving
> > scheduler, there are certain instances where the scheduler will choose
> > to keep a task local to a CPU, when it would have been more optimal to
> > migrate it to an idle core.
> >
> > An example of such a workload is the HHVM / web workload at Meta. HHVM
> > is a VM that JITs Hack and PHP code in service of web requests. Like
> > other JIT / compilation workloads, it tends to be heavily CPU bound, and
> > exhibit generally poor cache locality. To try and address this, we set
> > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
> >
> > - migration_cost_ns -> 0
> > - latency_ns -> 20000000
> > - min_granularity_ns -> 10000000
> > - wakeup_granularity_ns -> 12000000
> >
> > These knobs are intended both to encourage the scheduler to be as work
> > conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> > running for relatively long time slices so as to avoid the overhead of
> > context switching (the other knobs). Collectively, these knobs provide a
> > substantial performance win; resulting in roughly a 20% improvement in
> > throughput. Worth noting, however, is that this improvement is _not_ at
> > full machine saturation.
> >
> > That said, even with these knobs, we noticed that CPUs were still going
> > idle even when the host was overcommitted. In response, we wrote the
> > "shared wakequeue" (swqueue) feature proposed in this patch set. The
> > idea behind swqueue is simple: it enables the scheduler to be
> > aggressively work conserving by placing a waking task into a per-LLC
> > FIFO queue that can be pulled from by another core in the LLC FIFO queue
> > which can then be pulled from before it goes idle.
> >
> > With this simple change, we were able to achieve a 1 - 1.6% improvement
> > in throughput, as well as a small, consistent improvement in p95 and p99
> > latencies, in HHVM. These performance improvements were in addition to
> > the wins from the debugfs knobs mentioned above.
> >
> > Design
> > ======
> >
> > The design of swqueue is quite simple. An swqueue is simply a struct
> > list_head, and a spinlock:
> >
> > struct swqueue {
> > struct list_head list;
> > spinlock_t lock;
> > } ____cacheline_aligned;
> >
> > We create a struct swqueue per LLC, ensuring they're in their own
> > cachelines to avoid false sharing between CPUs on different LLCs.
> >
> > When a task first wakes up, it enqueues itself in the swqueue of its
> > current LLC at the end of enqueue_task_fair(). Enqueues only happen if
> > the task was not manually migrated to the current core by
> > select_task_rq(), and is not pinned to a specific CPU.
> >
> > A core will pull a task from its LLC's swqueue before calling
> > newidle_balance().
> >
> > Difference between SIS_NODE
> > ===========================
> >
> > In [0] Peter proposed a patch that addresses Tejun's observations that
> > when workqueues are targeted towards a specific LLC on his Zen2 machine
> > with small CCXs, that there would be significant idle time due to
> > select_idle_sibling() not considering anything outside of the current
> > LLC.
> >
> > This patch (SIS_NODE) is essentially the complement to the proposal
> > here. SID_NODE causes waking tasks to look for idle cores in neighboring
> > LLCs on the same die, whereas swqueue causes cores about to go idle to
> > look for enqueued tasks. That said, in its current form, the two
> > features at are a different scope as SIS_NODE searches for idle cores
> > between LLCs, while swqueue enqueues tasks within a single LLC.
> >
> > The patch was since removed in [1], but we elect to compare its
> > performance to swqueue given that as described above, it's conceptually
> > complementary.
> >
> > [0]: https://lore.kernel.org/all/[email protected]/
> > [1]: https://lore.kernel.org/all/[email protected]/
> >
> > I observed that while SIS_NODE works quite well for hosts with small
> > CCXs, it can result in degraded performance on machines either with a
> > large number of total cores in a CCD, or for which the cache miss
> > penalty of migrating between CCXs is high, even on the same die.
> >
> > For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
> > through a single link to the IO die, and thus have similar cache miss
> > latencies as cores in remote CCDs.
> >
> > Such subtleties could be taken into account with SIS_NODE, but
> > regardless, both features are conceptually complementary sides of the
> > same coin. SIS_NODE searches for idle cores for waking threads, whereas
> > swqueue searches for available work before a core goes idle.
> >
> > Results
> > =======
> >
> > Note that the motivation for the shared wakequeue feature was originally
> > arrived at using experiments in the sched_ext framework that's currently
> > being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
> > is similarly visible using work-conserving sched_ext schedulers (even
> > very simple ones like global FIFO).
> >
> > In both single and multi socket / CCX hosts, this can measurably improve
> > performance. In addition to the performance gains observed on our
> > internal web workloads, we also observed an improvement in common
> > workloads such as kernel compile when running shared wakequeue. Here are
> > the results of running make -j$(nproc) built-in.a on several different
> > types of hosts configured with make allyesconfig on commit a27648c74210
> > ("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
> > tree (boost was disabled on all of these hosts when the experiments were
> > performed):
> >
> > Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
> >
> > CPU max MHz: 5879.8818
> > CPU min MHz: 3000.0000
> > o____________o_______o
> > | mean | CPU |
> > o------------o-------o
> > NO_SWQUEUE + NO_SIS_NODE: | 590.52s | 3103% |
> > NO_SWQUEUE + SIS_NODE: | 590.80s | 3102% |
> > SWQUEUE + NO_SIS_NODE: | 589.65s | 3116% |
> > SWQUEUE + SIS_NODE: | 589.99s | 3115% |
> > o------------o-------o
> >
> > Takeaway: swqueue doesn't seem to provide a statistically significant
> > improvement for kernel compile on my 7950X. SIS_NODE similarly does not
> > have a noticeable effect on performance.
> >
> > -------------------------------------------------------------------------------
> >
> > Single-socket | 72-core | 6-CCX | AMD Milan Zen3
> >
> > CPU max MHz: 3245.0190
> > CPU min MHz: 700.0000
> > o_____________o_______o
> > | mean | CPU |
> > o-------------o-------o
> > NO_SWQUEUE + NO_SIS_NODE: | 1608.69s | 6488% |
> > NO_SWQUEUE + SIS_NODE: | 1610.24s | 6473% |
> > SWQUEUE + NO_SIS_NODE: | 1605.80s | 6504% |
> > SWQUEUE + SIS_NODE: | 1606.96s | 6488% |
> > o-------------o-------o
> >
> > Takeaway: swqueue does provide a small statistically significant
> > improvement on Milan, but the compile times in general were quite long
> > relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock
> > frequency. Milan also has larger CCXs than Bergamo, so it stands to
> > reason that select_idle_sibling() will have an easier time finding idle
> > cores inside the current CCX.
>
> o System Details
>
> Dual Socket 3rd Generation EPYC System (2 x 64C/128T)

Oh yeah, this would definitely run into contention on netperf. We were
seeing it on smaller LLCs as well. Very curious to hear about how the
sharded approach works for machines of this size.

> o NPS Modes
>
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 socket.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 223-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 socket.

Do these logical NUMA regions all share the same LLC? If so, I think the
sharded approach is probably preferable to this, though I guess it
depends on how bad the overhead is for newidle_balance() to walk all the
shards to find an idle task.

>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 223-231
> Node 7: 112-127, 232-255
>
> o Kernel Versions
>
> - tip - tip:sched/core at commit e2a1f85bf9f5 "sched/psi:
> Avoid resetting the min update period when it is
> unnecessary")
>
> - SWQUEUE - tip:sched/core + this patch
> (SWQUEUE feature enabled from debugfs)
>
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> Test: tip SWQUEUE
> 1-groups: 3.92 (0.00 pct) 3.41 (13.01 pct)
> 2-groups: 4.58 (0.00 pct) 3.80 (17.03 pct)
> 4-groups: 4.99 (0.00 pct) 4.48 (10.22 pct)
> 8-groups: 5.67 (0.00 pct) 5.11 (9.87 pct)
> 16-groups: 7.88 (0.00 pct) 7.29 (7.48 pct)
>
> o NPS2
>
> Test: tip SWQUEUE
> 1-groups: 3.82 (0.00 pct) 3.45 (9.68 pct)
> 2-groups: 4.40 (0.00 pct) 3.72 (15.45 pct)
> 4-groups: 4.84 (0.00 pct) 4.27 (11.77 pct)
> 8-groups: 5.45 (0.00 pct) 4.98 (8.62 pct)
> 16-groups: 6.94 (0.00 pct) 6.82 (1.72 pct)
>
> o NPS4
>
> Test: tip SWQUEUE
> 1-groups: 3.82 (0.00 pct) 3.50 (8.37 pct)
> 2-groups: 4.44 (0.00 pct) 3.77 (15.09 pct)
> 4-groups: 4.86 (0.00 pct) 4.44 (8.64 pct)
> 8-groups: 5.42 (0.00 pct) 5.08 (6.27 pct)
> 16-groups: 6.68 (0.00 pct) 7.82 (-17.06 pct)
>
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
>
> o NPS1
>
> #workers: tip SWQUEUE
> 1: 26.00 (0.00 pct) 28.00 (-7.69 pct)
> 2: 27.00 (0.00 pct) 23.00 (14.81 pct)
> 4: 31.00 (0.00 pct) 29.00 (6.45 pct)
> 8: 36.00 (0.00 pct) 38.00 (-5.55 pct)
> 16: 49.00 (0.00 pct) 48.00 (2.04 pct)
> 32: 80.00 (0.00 pct) 78.00 (2.50 pct)
> 64: 169.00 (0.00 pct) 168.00 (0.59 pct)
> 128: 343.00 (0.00 pct) 383.00 (-11.66 pct)
> 256: 42048.00 (0.00 pct) 50368.00 (-19.78 pct)
> 512: 95104.00 (0.00 pct) 93312.00 (1.88 pct)
>
> o NPS2
>
> #workers: tip SWQUEUE
> 1: 23.00 (0.00 pct) 21.00 (8.69 pct)
> 2: 24.00 (0.00 pct) 25.00 (-4.16 pct)
> 4: 31.00 (0.00 pct) 29.00 (6.45 pct)
> 8: 41.00 (0.00 pct) 43.00 (-4.87 pct)
> 16: 48.00 (0.00 pct) 50.00 (-4.16 pct)
> 32: 81.00 (0.00 pct) 80.00 (1.23 pct)
> 64: 157.00 (0.00 pct) 161.00 (-2.54 pct)
> 128: 386.00 (0.00 pct) 413.00 (-6.99 pct)
> 256: 48832.00 (0.00 pct) 49472.00 (-1.31 pct)
> 512: 92032.00 (0.00 pct) 93568.00 (-1.66 pct)
>
> o NPS4
>
> #workers: tip SWQUEUE
> 1: 21.00 (0.00 pct) 22.00 (-4.76 pct)
> 2: 28.00 (0.00 pct) 27.00 (3.57 pct)
> 4: 32.00 (0.00 pct) 29.00 (9.37 pct)
> 8: 46.00 (0.00 pct) 44.00 (4.34 pct)
> 16: 51.00 (0.00 pct) 59.00 (-15.68 pct)
> 32: 82.00 (0.00 pct) 86.00 (-4.87 pct)
> 64: 173.00 (0.00 pct) 164.00 (5.20 pct)
> 128: 396.00 (0.00 pct) 400.00 (-1.01 pct)
> 256: 48832.00 (0.00 pct) 45504.00 (6.81 pct)
> 512: 95104.00 (0.00 pct) 96640.00 (-1.61 pct)
>
>
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
>
> o NPS1
>
> Clients: tip SWQUEUE
> 1 452.49 (0.00 pct) 438.60 (-3.06 pct)
> 2 862.44 (0.00 pct) 859.99 (-0.28 pct)
> 4 1604.27 (0.00 pct) 1593.44 (-0.67 pct)
> 8 2966.77 (0.00 pct) 3005.65 (1.31 pct)
> 16 5176.70 (0.00 pct) 5263.86 (1.68 pct)
> 32 8205.24 (0.00 pct) 8890.77 (8.35 pct)
> 64 13956.71 (0.00 pct) 14600.43 (4.61 pct)
> 128 24005.50 (0.00 pct) 25695.32 (7.03 pct)
> 256 32457.61 (0.00 pct) 35580.53 (9.62 pct)
> 512 34345.24 (0.00 pct) 39206.89 (14.15 pct)
> 1024 33432.92 (0.00 pct) 38751.11 (15.90 pct)
>
> o NPS2
>
> Clients: tip SWQUEUE
> 1 453.73 (0.00 pct) 447.49 (-1.37 pct)
> 2 861.71 (0.00 pct) 849.51 (-1.41 pct)
> 4 1599.14 (0.00 pct) 1604.38 (0.32 pct)
> 8 2951.03 (0.00 pct) 2992.67 (1.41 pct)
> 16 5080.32 (0.00 pct) 5318.28 (4.68 pct)
> 32 7900.41 (0.00 pct) 7845.42 (-0.69 pct)
> 64 14629.65 (0.00 pct) 14621.95 (-0.05 pct)
> 128 23155.88 (0.00 pct) 24286.26 (4.88 pct)
> 256 33449.57 (0.00 pct) 33606.58 (0.46 pct)
> 512 33757.47 (0.00 pct) 37592.42 (11.36 pct)
> 1024 34823.14 (0.00 pct) 38943.21 (11.83 pct)
>
> o NPS4
>
> Clients: tip SWQUEUE
> 1 450.14 (0.00 pct) 444.69 (-1.21 pct)
> 2 863.26 (0.00 pct) 859.65 (-0.41 pct)
> 4 1618.71 (0.00 pct) 1604.36 (-0.88 pct)
> 8 2929.35 (0.00 pct) 2996.47 (2.29 pct)
> 16 5114.04 (0.00 pct) 5243.98 (2.54 pct)
> 32 7912.18 (0.00 pct) 8200.71 (3.64 pct)
> 64 14424.72 (0.00 pct) 14149.10 (-1.91 pct)
> 128 23614.97 (0.00 pct) 24149.05 (2.26 pct)
> 256 34365.13 (0.00 pct) 34244.24 (-0.35 pct)
> 512 34215.50 (0.00 pct) 39799.28 (16.31 pct)
> 1024 35421.90 (0.00 pct) 40460.77 (14.22 pct)
>
>
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
>
> o NPS1
>
> - 10 Runs:
>
> Test: tip SWQUEUE
> Copy: 271317.35 (0.00 pct) 281858.05 (3.88 pct)
> Scale: 205533.77 (0.00 pct) 203003.11 (-1.23 pct)
> Add: 221624.62 (0.00 pct) 226854.06 (2.35 pct)
> Triad: 228500.68 (0.00 pct) 224038.89 (-1.95 pct)
>
> - 100 Runs:
>
> Test: tip SWQUEUE
> Copy: 317381.65 (0.00 pct) 319745.64 (0.74 pct)
> Scale: 214145.00 (0.00 pct) 210485.52 (-1.70 pct)
> Add: 239243.29 (0.00 pct) 233651.11 (-2.33 pct)
> Triad: 249477.76 (0.00 pct) 241242.87 (-3.30 pct)
>
> o NPS2
>
> - 10 Runs:
>
> Test: tip SWQUEUE
> Copy: 277761.29 (0.00 pct) 292487.03 (5.30 pct)
> Scale: 215193.83 (0.00 pct) 209904.82 (-2.45 pct)
> Add: 242725.75 (0.00 pct) 239361.95 (-1.38 pct)
> Triad: 237253.44 (0.00 pct) 249638.69 (5.22 pct)
>
> - 100 Runs:
>
> Test: tip SWQUEUE
> Copy: 318082.10 (0.00 pct) 318466.54 (0.12 pct)
> Scale: 219338.56 (0.00 pct) 221856.65 (1.14 pct)
> Add: 248118.20 (0.00 pct) 252907.96 (1.93 pct)
> Triad: 247088.55 (0.00 pct) 248171.73 (0.43 pct)
>
> o NPS4
>
> - 10 Runs:
>
> Test: tip SWQUEUE
> Copy: 273307.14 (0.00 pct) 294527.85 (7.76 pct)
> Scale: 235715.23 (0.00 pct) 232442.47 (-1.38 pct)
> Add: 244500.40 (0.00 pct) 243812.43 (-0.28 pct)
> Triad: 250600.04 (0.00 pct) 249006.70 (-0.63 pct)
>
> - 100 Runs:
>
> Test: tip SWQUEUE
> Copy: 345396.19 (0.00 pct) 323605.07 (-6.30 pct)
> Scale: 241521.63 (0.00 pct) 218688.02 (-9.45 pct)
> Add: 261157.86 (0.00 pct) 234725.46 (-10.12 pct)
> Triad: 267804.99 (0.00 pct) 253017.71 (-5.52 pct)
>
> ~~~~~~~~~~~
> ~ netperf ~
> ~~~~~~~~~~~
>
> o NPS1
>
> Test: tip SWQUEUE
> 1-clients: 102839.97 (0.00 pct) 98690.53 (-4.03 pct)
> 2-clients: 98428.08 (0.00 pct) 97027.85 (-1.42 pct)
> 4-clients: 92298.45 (0.00 pct) 93217.27 (0.99 pct)
> 8-clients: 85618.41 (0.00 pct) 87393.58 (2.07 pct)
> 16-clients: 78722.18 (0.00 pct) 78607.48 (-0.14 pct)
> 32-clients: 73610.75 (0.00 pct) 70636.67 (-4.04 pct)
> 64-clients: 55285.07 (0.00 pct) 52613.62 (-4.83 pct)
> 128-clients: 31176.92 (0.00 pct) 35212.35 (12.94 pct)
> 256-clients: 20011.44 (0.00 pct) 18817.54 (-5.96 pct)
>
> o NPS2
>
> Test: tip SWQUEUE
> 1-clients: 103105.55 (0.00 pct) 101318.80 (-1.73 pct)
> 2-clients: 98720.29 (0.00 pct) 97809.70 (-0.92 pct)
> 4-clients: 92289.39 (0.00 pct) 93771.16 (1.60 pct)
> 8-clients: 84998.63 (0.00 pct) 85093.68 (0.11 pct)
> 16-clients: 76395.81 (0.00 pct) 78181.45 (2.33 pct)
> 32-clients: 71110.89 (0.00 pct) 70312.53 (-1.12 pct)
> 64-clients: 49526.21 (0.00 pct) 49030.94 (-1.00 pct)
> 128-clients: 27917.51 (0.00 pct) 31428.64 (12.57 pct)
> 256-clients: 20067.17 (0.00 pct) 18273.28 (-8.93 pct)
>
> o NPS4
>
> Test: tip SWQUEUE
> 1-clients: 102139.49 (0.00 pct) 100629.37 (-1.47 pct)
> 2-clients: 98259.53 (0.00 pct) 96639.21 (-1.64 pct)
> 4-clients: 91576.79 (0.00 pct) 92032.50 (0.49 pct)
> 8-clients: 84742.30 (0.00 pct) 85588.07 (0.99 pct)
> 16-clients: 79540.75 (0.00 pct) 78033.63 (-1.89 pct)
> 32-clients: 71166.14 (0.00 pct) 69080.28 (-2.93 pct)
> 64-clients: 51763.24 (0.00 pct) 44243.04 (-14.52 pct)
> 128-clients: 27829.29 (0.00 pct) 32188.25 (15.66 pct)
> 256-clients: 24185.37 (0.00 pct) 17669.40 (-26.94 pct)
>
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
>
> o NPS1
>
> tip SWQUEUE
> Hmean unixbench-dhry2reg-1 41322625.19 ( 0.00%) 40874894.71 * -1.08%*
> Hmean unixbench-dhry2reg-512 6252491108.60 ( 0.00%) 6253691168.43 ( 0.02%)
> Amean unixbench-syscall-1 2501398.27 ( 0.00%) 2524108.97 * -0.91%*
> Amean unixbench-syscall-512 8120524.00 ( 0.00%) 7596446.63 * 6.45%*
> Hmean unixbench-pipe-1 2359346.02 ( 0.00%) 2429532.75 * 2.97%*
> Hmean unixbench-pipe-512 338790322.61 ( 0.00%) 343170761.60 * 1.29%*
> Hmean unixbench-spawn-1 4261.52 ( 0.00%) 4554.10 * 6.87%*
> Hmean unixbench-spawn-512 64328.93 ( 0.00%) 62585.87 * -2.71%*
> Hmean unixbench-execl-1 3677.73 ( 0.00%) 3655.40 * -0.61%*
> Hmean unixbench-execl-512 11984.83 ( 0.00%) 13248.56 ( 10.54%)
>
> o NPS2
>
> tip SWQUEUE
> Hmean unixbench-dhry2reg-1 41311787.29 ( 0.00%) 41100045.17 ( -0.51%)
> Hmean unixbench-dhry2reg-512 6243873272.76 ( 0.00%) 6257834310.98 ( 0.22%)
> Amean unixbench-syscall-1 2503190.70 ( 0.00%) 2520187.83 * -0.68%*
> Amean unixbench-syscall-512 8012388.13 ( 0.00%) 7545705.57 * 5.82%*
> Hmean unixbench-pipe-1 2340486.25 ( 0.00%) 2436643.73 * 4.11%*
> Hmean unixbench-pipe-512 338965319.79 ( 0.00%) 342560610.22 * 1.06%*
> Hmean unixbench-spawn-1 5241.83 ( 0.00%) 5174.11 ( -1.29%)
> Hmean unixbench-spawn-512 65799.86 ( 0.00%) 63879.27 * -2.92%*
> Hmean unixbench-execl-1 3670.65 ( 0.00%) 3634.26 * -0.99%*
> Hmean unixbench-execl-512 13682.00 ( 0.00%) 14002.70 ( 2.34%)
>
> o NPS4
>
> tip SWQUEUE
> Hmean unixbench-dhry2reg-1 41025577.99 ( 0.00%) 41588045.76 ( 1.37%)
> Hmean unixbench-dhry2reg-512 6255568261.91 ( 0.00%) 6257479983.24 ( 0.03%)
> Amean unixbench-syscall-1 2507165.37 ( 0.00%) 2519780.17 * -0.50%*
> Amean unixbench-syscall-512 7458476.50 ( 0.00%) 8140849.63 * -9.15%*
> Hmean unixbench-pipe-1 2369301.21 ( 0.00%) 2415933.77 * 1.97%*
> Hmean unixbench-pipe-512 340299405.72 ( 0.00%) 343206741.58 * 0.85%*
> Hmean unixbench-spawn-1 5571.78 ( 0.00%) 5421.31 ( -2.70%)
> Hmean unixbench-spawn-512 63999.96 ( 0.00%) 63968.08 ( -0.05%)
> Hmean unixbench-execl-1 3587.15 ( 0.00%) 3619.10 * 0.89%*
> Hmean unixbench-execl-512 14184.17 ( 0.00%) 13006.80 * -8.30%*
>
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
>
>
> o NPS1:
>
> base: 298681.00 (var: 2.31%)
> SWQUEUE 295218.00 (var: 0.15%) (-1.15%)
>
> o NPS2:
>
> base: 296570.00 (var: 1.01%)
> SWQUEUE 293648.67 (var: 0.89%) (-0.98%)
>
> o NPS4:
>
> base 297181.67 (var: 0.46%)
> SWQUEUE 290700.67 (var: 2.09%) (-2.18%)
>
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> o NPS1:
>
> - 1 CCD
>
> base: 1.00 (var: 0.27%)
> SWQUEUE: 0.99 (var: 0.40%) (-0.83%)
>
> - 2 CCD
>
> base: 1.00 (var: 0.42%)
> SWQUEUE: 1.02 (var: 0.40%) (2.52%)
>
> - 4 CCD
>
> base: 1.00 (var: 0.46%)
> SWQUEUE: 1.04 (var: 1.08%) (4.63%)
>
> - 8 CCD
>
> base: 1.00 (var: 0.63%)
> SWQUEUE: 0.99 (var: 7.18%) (-0.83%)
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb2015 - multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> tip SWQUEUE
> max-jOPS 1.00 1.02 (+2.47%)
> critical-jOPS 1.00 1.02 (+1.72%)
>
> --
>
> I'll go back and check why netperf is unhappy using perf and IBS.
> Meanwhile if there is any specific benchmark you would like me to run
> on the test system, please do let me know.

Thanks very much for the time you're spending on this. I don't have any
other specific benchmarks in mind. My only suggestion would be that your
time is probably better spent experimenting with the v2 version of the
patch set.

The only thing to bear in mind is that the series hard-codes the shard
size at 6. We may want to make that configurable in a separate debugfs
node. In the meantime, you just need to adjust SHARED_RUNQ_SHARD_SZ in
[2].

[2]: https://lore.kernel.org/all/[email protected]/

Thanks,
David

2023-07-11 05:48:00

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] sched: Implement shared wakequeue in CFS

Hello David,

On 7/11/2023 10:13 AM, David Vernet wrote:
> On Mon, Jul 10, 2023 at 05:27:15PM +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> Sharing the results from testing SWQUEUE with some of the standard
>> benchmarks below.
>
> Hello Prateek,
>
> Thank you for taking the time to run these benchmarks. Very much
> appreciated. As I expect you saw given that I cc'd you, I sent v2 of
> this series out this morning in [0].
>
> [0]: https://lore.kernel.org/all/[email protected]/
>
> There were quite a few changes, so if you have the bandwidth it may be
> worth running these benchmarks again on v2.

Sure. I'll continue further testing and analysis with v2.

>
>> tl;dr
>>
>> - Hackbench, tbench, DeathStarBench and SPECjbb seem to like the
>> SWQUEUE (except for hackbench in NPS4, with 16 groups)
>
> Excellent, glad to hear that. I observed a significant improvement in
> hackbench as well on v2, glad to hear that other benchmarks seem to like
> this too.
>
>>
>> - Netperf sees regression as system becomes more loaded. I'll go
>> back and check if there is some lock contention issues here which
>> is parallely being discussed on the tread.
>
> Lock contention is indeed the issue, as Aaron pointed out in [1].
>
> [1]: https://lore.kernel.org/lkml/20230614043529.GA1942@ziqianlu-dell/
>
> Sharding the per-LLC shared runqueues helps quite a bit here, but it
> doesn't fully address the issue. See my write-up on the v2 cover letter
> for more info.

Will do.

>
>>
>> - Other benchmarks seem to be behave similar to tip.
>>
>> On 6/13/2023 10:50 AM, David Vernet wrote:
>>> Overview
>>> ========
>>>
>>> The scheduler must constantly strike a balance between work
>>> conservation, and avoiding costly migrations which harm performance due
>>> to e.g. decreased cache locality. The matter is further complicated by
>>> the topology of the system. Migrating a task between cores on the same
>>> LLC may be more optimal than keeping a task local to the CPU, whereas
>>> migrating a task between LLCs or NUMA nodes may tip the balance in the
>>> other direction.
>>>
>>> With that in mind, while CFS is by and large mostly a work conserving
>>> scheduler, there are certain instances where the scheduler will choose
>>> to keep a task local to a CPU, when it would have been more optimal to
>>> migrate it to an idle core.
>>>
>>> An example of such a workload is the HHVM / web workload at Meta. HHVM
>>> is a VM that JITs Hack and PHP code in service of web requests. Like
>>> other JIT / compilation workloads, it tends to be heavily CPU bound, and
>>> exhibit generally poor cache locality. To try and address this, we set
>>> several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
>>>
>>> - migration_cost_ns -> 0
>>> - latency_ns -> 20000000
>>> - min_granularity_ns -> 10000000
>>> - wakeup_granularity_ns -> 12000000
>>>
>>> These knobs are intended both to encourage the scheduler to be as work
>>> conserving as possible (migration_cost_ns -> 0), and also to keep tasks
>>> running for relatively long time slices so as to avoid the overhead of
>>> context switching (the other knobs). Collectively, these knobs provide a
>>> substantial performance win; resulting in roughly a 20% improvement in
>>> throughput. Worth noting, however, is that this improvement is _not_ at
>>> full machine saturation.
>>>
>>> That said, even with these knobs, we noticed that CPUs were still going
>>> idle even when the host was overcommitted. In response, we wrote the
>>> "shared wakequeue" (swqueue) feature proposed in this patch set. The
>>> idea behind swqueue is simple: it enables the scheduler to be
>>> aggressively work conserving by placing a waking task into a per-LLC
>>> FIFO queue that can be pulled from by another core in the LLC FIFO queue
>>> which can then be pulled from before it goes idle.
>>>
>>> With this simple change, we were able to achieve a 1 - 1.6% improvement
>>> in throughput, as well as a small, consistent improvement in p95 and p99
>>> latencies, in HHVM. These performance improvements were in addition to
>>> the wins from the debugfs knobs mentioned above.
>>>
>>> Design
>>> ======
>>>
>>> The design of swqueue is quite simple. An swqueue is simply a struct
>>> list_head, and a spinlock:
>>>
>>> struct swqueue {
>>> struct list_head list;
>>> spinlock_t lock;
>>> } ____cacheline_aligned;
>>>
>>> We create a struct swqueue per LLC, ensuring they're in their own
>>> cachelines to avoid false sharing between CPUs on different LLCs.
>>>
>>> When a task first wakes up, it enqueues itself in the swqueue of its
>>> current LLC at the end of enqueue_task_fair(). Enqueues only happen if
>>> the task was not manually migrated to the current core by
>>> select_task_rq(), and is not pinned to a specific CPU.
>>>
>>> A core will pull a task from its LLC's swqueue before calling
>>> newidle_balance().
>>>
>>> Difference between SIS_NODE
>>> ===========================
>>>
>>> In [0] Peter proposed a patch that addresses Tejun's observations that
>>> when workqueues are targeted towards a specific LLC on his Zen2 machine
>>> with small CCXs, that there would be significant idle time due to
>>> select_idle_sibling() not considering anything outside of the current
>>> LLC.
>>>
>>> This patch (SIS_NODE) is essentially the complement to the proposal
>>> here. SID_NODE causes waking tasks to look for idle cores in neighboring
>>> LLCs on the same die, whereas swqueue causes cores about to go idle to
>>> look for enqueued tasks. That said, in its current form, the two
>>> features at are a different scope as SIS_NODE searches for idle cores
>>> between LLCs, while swqueue enqueues tasks within a single LLC.
>>>
>>> The patch was since removed in [1], but we elect to compare its
>>> performance to swqueue given that as described above, it's conceptually
>>> complementary.
>>>
>>> [0]: https://lore.kernel.org/all/[email protected]/
>>> [1]: https://lore.kernel.org/all/[email protected]/
>>>
>>> I observed that while SIS_NODE works quite well for hosts with small
>>> CCXs, it can result in degraded performance on machines either with a
>>> large number of total cores in a CCD, or for which the cache miss
>>> penalty of migrating between CCXs is high, even on the same die.
>>>
>>> For example, on Zen 4c hosts (Bergamo), CCXs within a CCD are muxed
>>> through a single link to the IO die, and thus have similar cache miss
>>> latencies as cores in remote CCDs.
>>>
>>> Such subtleties could be taken into account with SIS_NODE, but
>>> regardless, both features are conceptually complementary sides of the
>>> same coin. SIS_NODE searches for idle cores for waking threads, whereas
>>> swqueue searches for available work before a core goes idle.
>>>
>>> Results
>>> =======
>>>
>>> Note that the motivation for the shared wakequeue feature was originally
>>> arrived at using experiments in the sched_ext framework that's currently
>>> being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
>>> is similarly visible using work-conserving sched_ext schedulers (even
>>> very simple ones like global FIFO).
>>>
>>> In both single and multi socket / CCX hosts, this can measurably improve
>>> performance. In addition to the performance gains observed on our
>>> internal web workloads, we also observed an improvement in common
>>> workloads such as kernel compile when running shared wakequeue. Here are
>>> the results of running make -j$(nproc) built-in.a on several different
>>> types of hosts configured with make allyesconfig on commit a27648c74210
>>> ("afs: Fix setting of mtime when creating a file/dir/symlink") on Linus'
>>> tree (boost was disabled on all of these hosts when the experiments were
>>> performed):
>>>
>>> Single-socket | 32-core | 2-CCX | AMD 7950X Zen4
>>>
>>> CPU max MHz: 5879.8818
>>> CPU min MHz: 3000.0000
>>> o____________o_______o
>>> | mean | CPU |
>>> o------------o-------o
>>> NO_SWQUEUE + NO_SIS_NODE: | 590.52s | 3103% |
>>> NO_SWQUEUE + SIS_NODE: | 590.80s | 3102% |
>>> SWQUEUE + NO_SIS_NODE: | 589.65s | 3116% |
>>> SWQUEUE + SIS_NODE: | 589.99s | 3115% |
>>> o------------o-------o
>>>
>>> Takeaway: swqueue doesn't seem to provide a statistically significant
>>> improvement for kernel compile on my 7950X. SIS_NODE similarly does not
>>> have a noticeable effect on performance.
>>>
>>> -------------------------------------------------------------------------------
>>>
>>> Single-socket | 72-core | 6-CCX | AMD Milan Zen3
>>>
>>> CPU max MHz: 3245.0190
>>> CPU min MHz: 700.0000
>>> o_____________o_______o
>>> | mean | CPU |
>>> o-------------o-------o
>>> NO_SWQUEUE + NO_SIS_NODE: | 1608.69s | 6488% |
>>> NO_SWQUEUE + SIS_NODE: | 1610.24s | 6473% |
>>> SWQUEUE + NO_SIS_NODE: | 1605.80s | 6504% |
>>> SWQUEUE + SIS_NODE: | 1606.96s | 6488% |
>>> o-------------o-------o
>>>
>>> Takeaway: swqueue does provide a small statistically significant
>>> improvement on Milan, but the compile times in general were quite long
>>> relative to the 7950X Zen4, and the Bergamo Zen4c due to the lower clock
>>> frequency. Milan also has larger CCXs than Bergamo, so it stands to
>>> reason that select_idle_sibling() will have an easier time finding idle
>>> cores inside the current CCX.
>>
>> o System Details
>>
>> Dual Socket 3rd Generation EPYC System (2 x 64C/128T)
>
> Oh yeah, this would definitely run into contention on netperf. We were
> seeing it on smaller LLCs as well. Very curious to hear about how the
> sharded approach works for machines of this size.

I'll share the results on v2 when the tests finish.

>
>> o NPS Modes
>>
>> NPS Modes are used to logically divide single socket into
>> multiple NUMA region.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>> Total 2 NUMA nodes in the dual socket machine.
>>
>> Node 0: 0-63, 128-191
>> Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>> Total 4 NUMA nodes exist over 2 socket.
>>
>> Node 0: 0-31, 128-159
>> Node 1: 32-63, 160-191
>> Node 2: 64-95, 192-223
>> Node 3: 96-127, 223-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>> Total 8 NUMA nodes exist over 2 socket.
>
> Do these logical NUMA regions all share the same LLC?

The NUMA regions will further contain multiple MCs which marks the LLC.
Sharding should have similar effect across the NPS modes since LLC size
remains the same.

> If so, I think the
> sharded approach is probably preferable to this, though I guess it
> depends on how bad the overhead is for newidle_balance() to walk all the
> shards to find an idle task.
>
> [..snip..]
>>
>> I'll go back and check why netperf is unhappy using perf and IBS.
>> Meanwhile if there is any specific benchmark you would like me to run
>> on the test system, please do let me know.
>
> Thanks very much for the time you're spending on this. I don't have any
> other specific benchmarks in mind. My only suggestion would be that your
> time is probably better spent experimenting with the v2 version of the
> patch set.
>
> The only thing to bear in mind is that the series hard-codes the shard
> size at 6. We may want to make that configurable in a separate debugfs
> node. In the meantime, you just need to adjust SHARED_RUNQ_SHARD_SZ in
> [2].

I'll play around with a couple of values of SHARED_RUNQ_SHARD_SZ and pick
the benchmarks that show noticeable variation. I'll test them further
with more shared sizes. Let me know if I should take a different approach
at testing shard sizes.

>
> [2]: https://lore.kernel.org/all/[email protected]/
>
> Thanks,
> David

--
Thanks and Regards,
Prateek