Neither sched/tip nor Mel's v5 patchset [1] provides an optimal
new-task wakeup strategy when the tasks are affined to a subset of
CPUs. This can result in tasks piling up on the same set of CPUs in
one NUMA group despite there being other CPUs in a different NUMA
group where the tasks could have run. Good initial placement matters
especially for short-lived tasks, where the delay before the load
balancer kicks in can degrade performance.
The benchmark is a pinned run of STREAM, parallelized with OpenMP,
on a Zen3 machine. STREAM is configured to run 8 threads with CPU
affinity set to CPUs 0,16,32,48,64,80,96,112.
This ensures an even distribution of the allowed CPUs across the NUMA
groups in NPS1, NPS2 and NPS4 modes.
The script launching STREAM is itself pinned to CPUs 8-15 to maintain
consistency across runs and to ensure the script runs on an LLC that
is not part of the STREAM cpulist, so that it does not interfere with
the benchmark.
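For reference, such a pinned run can be launched along the following
lines (the script name is a placeholder and using taskset for the
launcher is just one way to achieve the pinning described above; the
numactl invocation matches the example quoted later in this thread):

  # Launcher script pinned to CPUs 8-15, an LLC outside the STREAM cpulist
  taskset -c 8-15 ./run_stream.sh
  # Inside the script, STREAM itself is bound to the chosen cpulist
  numactl -C 0,16,32,48,64,80,96,112 ./stream8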
Changes are based on top of v5 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs" [1]
Following are the results:
               5.17.0-rc1            5.17.0-rc1            5.17.0-rc1
             tip sched/core        tip sched/core        tip sched/core
                                       + mel-v5              + mel-v5
                                                            + this-fix
NPS Mode - NPS1
Copy: 92133.36 (0.00 pct) 117595.57 (27.63 pct) 150655.69 (63.51 pct)
Scale: 90847.82 (0.00 pct) 114525.59 (26.06 pct) 148939.47 (63.94 pct)
Add: 103585.04 (0.00 pct) 137548.40 (32.78 pct) 186323.75 (79.87 pct)
Triad: 100060.42 (0.00 pct) 133695.34 (33.61 pct) 184203.97 (84.09 pct)
NPS Mode - NPS2
Copy: 52969.12 (0.00 pct) 76969.90 (45.31 pct) 165892.91 (213.18 pct)
Scale: 49209.91 (0.00 pct) 69200.05 (40.62 pct) 152210.69 (209.30 pct)
Add: 60106.69 (0.00 pct) 92049.47 (53.14 pct) 195135.02 (224.64 pct)
Triad: 60052.66 (0.00 pct) 88323.03 (47.07 pct) 193672.59 (222.50 pct)
NPS Mode - NPS4
Copy: 44122.00 (0.00 pct) 157154.70 (256.18 pct) 169755.52 (284.74 pct)
Scale: 41730.68 (0.00 pct) 172303.88 (312.89 pct) 170247.06 (307.96 pct)
Add: 51666.98 (0.00 pct) 214293.71 (314.75 pct) 213560.61 (313.34 pct)
Triad: 50489.87 (0.00 pct) 212242.49 (320.36 pct) 210844.58 (317.59 pct)
The following sched_wakeup_new tracepoint output shows the initial
placement of tasks with mel-v5 in NPS2 mode:
stream-4578 [016] d..2. 81.970702: sched_wakeup_new: comm=stream pid=4580 prio=120 target_cpu=000
stream-4578 [016] d..2. 81.970760: sched_wakeup_new: comm=stream pid=4581 prio=120 target_cpu=016
stream-4578 [016] d..2. 81.970823: sched_wakeup_new: comm=stream pid=4582 prio=120 target_cpu=048
stream-4578 [016] d..2. 81.970875: sched_wakeup_new: comm=stream pid=4583 prio=120 target_cpu=032
stream-4578 [016] d..2. 81.970920: sched_wakeup_new: comm=stream pid=4584 prio=120 target_cpu=016
stream-4578 [016] d..2. 81.970961: sched_wakeup_new: comm=stream pid=4585 prio=120 target_cpu=016
stream-4578 [016] d..2. 81.971039: sched_wakeup_new: comm=stream pid=4586 prio=120 target_cpu=112
Three stream threads pile up on CPU 16 initially, and in the case of
short runs, where the load balancer doesn't have enough time to kick
in and migrate the tasks, performance can suffer. This pattern is
observed consistently: tasks pile up on one CPU of the group that the
runner script is pinned to.
The following sched_wakeup_new tracepoint output shows the initial
placement of tasks with this fix in NPS2 mode:
stream-4639 [032] d..2. 102.903581: sched_wakeup_new: comm=stream pid=4641 prio=120 target_cpu=016
stream-4639 [032] d..2. 102.903698: sched_wakeup_new: comm=stream pid=4642 prio=120 target_cpu=048
stream-4639 [032] d..2. 102.903762: sched_wakeup_new: comm=stream pid=4643 prio=120 target_cpu=080
stream-4639 [032] d..2. 102.903823: sched_wakeup_new: comm=stream pid=4644 prio=120 target_cpu=112
stream-4639 [032] d..2. 102.903879: sched_wakeup_new: comm=stream pid=4645 prio=120 target_cpu=096
stream-4639 [032] d..2. 102.903938: sched_wakeup_new: comm=stream pid=4646 prio=120 target_cpu=000
stream-4639 [032] d..2. 102.903991: sched_wakeup_new: comm=stream pid=4647 prio=120 target_cpu=064
The tasks have been distributed evenly across all the allowed CPUs
and no pile-up is seen.
Aggressive NUMA balancing is only done when needed. We select the
minimum of the number of allowed CPUs in the sched group and the
calculated sd->imb_numa_nr as our imbalance threshold, and the default
behavior of mel-v5 is only modified when the former is smaller than
the latter.
This can help in the case of embarrassingly parallel programs with a
tight CPU affinity set.
[1] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: K Prateek Nayak <[email protected]>
---
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 7 ++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d863d7f6ad7..9a92ac42bb24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9294,6 +9294,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
+DECLARE_PER_CPU(cpumask_var_t, find_idlest_group_mask);
void __init sched_init(void)
{
@@ -9344,6 +9345,8 @@ void __init sched_init(void)
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+ per_cpu(find_idlest_group_mask, i) = (cpumask_var_t)kzalloc_node(
+ cpumask_size(), GFP_KERNEL, cpu_to_node(i));
}
#endif /* CONFIG_CPUMASK_OFFSTACK */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index babf3b65db38..ffced741b244 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5751,6 +5751,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Working cpumask for: load_balance, load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+DEFINE_PER_CPU(cpumask_var_t, find_idlest_group_mask);
#ifdef CONFIG_NO_HZ_COMMON
@@ -9022,6 +9023,7 @@ static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(find_idlest_group_mask);
struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
struct sg_lb_stats local_sgs, tmp_sgs;
struct sg_lb_stats *sgs;
@@ -9130,6 +9132,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
case group_has_spare:
if (sd->flags & SD_NUMA) {
+ int imb;
#ifdef CONFIG_NUMA_BALANCING
int idlest_cpu;
/*
@@ -9150,7 +9153,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* allowed. If there is a real need of migration,
* periodic load balance will take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+ cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+ imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+ if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
return NULL;
}
--
2.25.1
Hello Mel,
Thank you for taking a look at the patch.
On 2/8/2022 4:21 PM, Mel Gorman wrote:
> On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
>> Neither sched/tip nor Mel's v5 patchset [1] provides an optimal
>> new-task wakeup strategy when the tasks are affined to a subset of
>> CPUs. This can result in tasks piling up on the same set of CPUs in
>> one NUMA group despite there being other CPUs in a different NUMA
>> group where the tasks could have run. Good initial placement matters
>> especially for short-lived tasks, where the delay before the load
>> balancer kicks in can degrade performance.
>>
> Thanks.
>
> V6 was posted based on previous feedback. While this patch is building
> on top of it, please add Acked-by or Tested-by if the imbalance series
> helps the general problem of handling imbalances when there are multiple
> last level caches.
Yes, the imbalance series does a good job handling the general
imbalance problem in the case of systems with multiple LLCs. This
patch builds on top of your effort to balance more aggressively in
certain scenarios that arise when tasks are pinned to a subset of CPUs.
I'll run benchmarks against v6 and ack the imbalance patchset with the
results.
>> <SNIP>
>>
>> Aggressive NUMA balancing is only done when needed. We select the
>> minimum of the number of allowed CPUs in the sched group and the
>> calculated sd->imb_numa_nr as our imbalance threshold, and the default
>> behavior of mel-v5 is only modified when the former is smaller than
>> the latter.
>>
> In this context, it should be safe to reuse select_idle_mask like this
> build tested patch
Thank you for pointing this out. I'll make the changes in the follow up.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 538756bd8e7f..1e759c21371b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9128,6 +9128,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> case group_has_spare:
> if (sd->flags & SD_NUMA) {
> + struct cpumask *cpus;
> + int imb;
> #ifdef CONFIG_NUMA_BALANCING
> int idlest_cpu;
> /*
> @@ -9145,10 +9147,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> * Otherwise, keep the task close to the wakeup source
> * and improve locality if the number of running tasks
> * would remain below threshold where an imbalance is
> - * allowed. If there is a real need of migration,
> - * periodic load balance will take care of it.
> + * allowed while accounting for the possibility the
> + * task is pinned to a subset of CPUs. If there is a
> + * real need of migration, periodic load balance will
> + * take care of it.
> */
> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> + cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> + imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
> return NULL;
> }
>
Thank you for the feedback and suggestions.
Thanks and Regards
Prateek
On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
> Neither sched/tip nor Mel's v5 patchset [1] provides an optimal
> new-task wakeup strategy when the tasks are affined to a subset of
> CPUs. This can result in tasks piling up on the same set of CPUs in
> one NUMA group despite there being other CPUs in a different NUMA
> group where the tasks could have run. Good initial placement matters
> especially for short-lived tasks, where the delay before the load
> balancer kicks in can degrade performance.
>
Thanks.
V6 was posted based on previous feedback. While this patch is building
on top of it, please add Acked-by or Tested-by if the imbalance series
helps the general problem of handling imbalances when there are multiple
last level caches.
> <SNIP>
>
> Aggressive NUMA balancing is only done when needed. We select the
> minimum of the number of allowed CPUs in the sched group and the
> calculated sd->imb_numa_nr as our imbalance threshold, and the default
> behavior of mel-v5 is only modified when the former is smaller than
> the latter.
>
In this context, it should be safe to reuse select_idle_mask like this
build tested patch
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 538756bd8e7f..1e759c21371b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9128,6 +9128,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
case group_has_spare:
if (sd->flags & SD_NUMA) {
+ struct cpumask *cpus;
+ int imb;
#ifdef CONFIG_NUMA_BALANCING
int idlest_cpu;
/*
@@ -9145,10 +9147,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* Otherwise, keep the task close to the wakeup source
* and improve locality if the number of running tasks
* would remain below threshold where an imbalance is
- * allowed. If there is a real need of migration,
- * periodic load balance will take care of it.
+ * allowed while accounting for the possibility the
+ * task is pinned to a subset of CPUs. If there is a
+ * real need of migration, periodic load balance will
+ * take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+ cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+ imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+ if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
return NULL;
}
On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
> Neither sched/tip nor Mel's v5 patchset [1] provides an optimal
> new-task wakeup strategy when the tasks are affined to a subset of
> CPUs. This can result in tasks piling up on the same set of CPUs in
> one NUMA group despite there being other CPUs in a different NUMA
> group where the tasks could have run.
Where does this affinity come from?
On Wed, Feb 09, 2022 at 03:14:32PM +0000, Nayak, K Prateek wrote:
> [AMD Official Use Only]
I think you need to invest in a new mail setup.
> > Where does this affinity come from?
>
> The affinity comes from limiting the process to a certain subset of
> available CPUs by modifying the cpus_ptr member of task_struct
> via taskset or numactl.
That's obviously not an answer. Why is that done?
Hello Peter,
On 2/9/2022 4:16 PM, Peter Zijlstra wrote:
> On Mon, Feb 07, 2022 at 09:29:21PM +0530, K Prateek Nayak wrote:
>> Neither sched/tip nor Mel's v5 patchset [1] provides an optimal
>> new-task wakeup strategy when the tasks are affined to a subset of
>> CPUs. This can result in tasks piling up on the same set of CPUs in
>> one NUMA group despite there being other CPUs in a different NUMA
>> group where the tasks could have run.
> Where does this affinity come from?
The affinity comes from limiting the process to run on a certain
subset of the available CPUs by modifying the cpus_ptr member of the
process's task_struct via taskset or numactl.
---
Thanks and Regards
Prateek
Hello Peter,
On 2/9/2022 10:50 PM, Peter Zijlstra wrote:
>>> Where does this affinity come from?
>> The affinity comes from limiting the process to a certain subset of
>> available CPUs by modifying the cpus_ptr member of task_struct
>> via taskset or numactl.
> That's obviously not an answer. Why is that done?
Sorry, I should have been more clear in my previous reply.
Currently, the scheduler is conservative while spreading tasks across
groups of a NUMA domain. It does so only when the number of runnable
tasks in the local sched-group is less than the allowed imbalance
threshold. The imbalance threshold is 25% of the domain's span weight
in sched/tip/core. Mel's recent patchset
"Adjust NUMA imbalance for multiple LLCs"
(https://lore.kernel.org/lkml/[email protected]/)
makes it dependent on the number of LLCs in the NUMA domain.
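Roughly, the two thresholds compare as in the sketch below (the helper
names are made up purely for illustration; the real checks live around
allow_numa_imbalance() and differ in detail):

	/* Illustrative only -- not the exact kernel code. */
	static inline int numa_imb_threshold_tip(int span_weight)
	{
		/* sched/tip/core: up to 25% of the domain's span weight */
		return span_weight / 4;
	}

	static inline int numa_imb_threshold_llc(int nr_llcs)
	{
		/* Mel's series: roughly one task per LLC before spreading */
		return nr_llcs;
	}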
In the case of AMD Zen like systems with multiple LLCs per socket,
users want to spread bandwidth hungry applications across multiple
LLCs. Stream is one such representative workload, where the best
performance is obtained by limiting it to one stream thread per LLC.
To ensure this, users are known to pin the tasks to a subset of the
CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.
Suppose we kickstart a multi-threaded task like stream with 8 threads,
using taskset or numactl to run it on a subset of CPUs of a 2-socket
Zen3 server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in the other).
E.g.: numactl -C 0,16,32,48,64,80,96,112 ./stream8
Here each CPU in the list is from a different LLC; 4 of those LLCs
are on one socket, while the other 4 are on the other socket.
Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.
Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs from the local group (0,16,32,48) will
already have been taken. We can detect this by checking whether the
number of allowed CPUs in the local group is smaller than the number
of tasks running in the local group, and use this information to
spread the 5th task out into the next socket (after all, the goal in
this slowpath is to find the idlest group and the idlest CPU during
the initial placement!).
However, the current scheduler code simply checks whether the number
of tasks in the local group is below the allowed numa-imbalance
threshold. This threshold is 25% of the NUMA domain span
in sched/tip/core (in this case threshold = 32) and is equal to the
number of LLCs in the domain with Mel's recent v6 patchset
(https://lore.kernel.org/lkml/[email protected]/)
(in this case threshold = 8). For this example, the number of tasks
will always be within the threshold and thus all 8 stream threads
will be woken up on the first socket, resulting in
sub-optimal performance.
Following are the results from running 8 Stream threads with and
without pinning:
           tip sched/core        tip sched/core        tip sched/core
                                     + mel-v6              + mel-v6
                                                          + this-patch
            (no pinning)             + pinning             + pinning
Copy: 111309.82 (0.00 pct) 111133.84 (-0.15 pct) 151249.35 (35.88 pct)
Scale: 107391.64 (0.00 pct) 105933.51 (-1.35 pct) 144272.14 (34.34 pct)
Add: 126090.18 (0.00 pct) 127533.88 (1.14 pct) 177142.50 (40.48 pct)
Triad: 124517.67 (0.00 pct) 126944.83 (1.94 pct) 175712.64 (41.11 pct)
The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in tip/sched/core:
stream-4300 [032] d..2. 115.753321: sched_wakeup_new: comm=stream pid=4302 prio=120 target_cpu=048
stream-4300 [032] d..2. 115.753389: sched_wakeup_new: comm=stream pid=4303 prio=120 target_cpu=000
stream-4300 [032] d..2. 115.753443: sched_wakeup_new: comm=stream pid=4304 prio=120 target_cpu=016
stream-4300 [032] d..2. 115.753487: sched_wakeup_new: comm=stream pid=4305 prio=120 target_cpu=032
stream-4300 [032] d..2. 115.753539: sched_wakeup_new: comm=stream pid=4306 prio=120 target_cpu=032
stream-4300 [032] d..2. 115.753576: sched_wakeup_new: comm=stream pid=4307 prio=120 target_cpu=032
stream-4300 [032] d..2. 115.753611: sched_wakeup_new: comm=stream pid=4308 prio=120 target_cpu=032
Output from V6 of Mel's patchset is also similar.
Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads start piling up on these same CPUs
when there are clearly CPUs on the second socket that can be used.
The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix:
stream-4733 [032] d..2. 116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
stream-4733 [032] d..2. 116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
stream-4733 [032] d..2. 116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
stream-4733 [032] d..2. 116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
stream-4733 [032] d..2. 116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
stream-4733 [032] d..2. 116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
stream-4733 [032] d..2. 116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
We see that threads are using all of the allowed CPUs
and there is no pileup.
Output of tracepoint sched_migrate_task for sched-tip is as follows:
(output has been slightly altered for readability)
115.765048: sched_migrate_task: comm=stream pid=4305 prio=120 orig_cpu=32 dest_cpu=16 START - {8}{0}
115.767042: sched_migrate_task: comm=stream pid=4306 prio=120 orig_cpu=32 dest_cpu=0
115.767089: sched_migrate_task: comm=stream pid=4307 prio=120 orig_cpu=32 dest_cpu=48
115.996255: sched_migrate_task: comm=stream pid=4306 prio=120 orig_cpu=0 dest_cpu=64 * {7}{1}
116.039173: sched_migrate_task: comm=stream pid=4304 prio=120 orig_cpu=16 dest_cpu=64 * {6}{2}
... 19 migrations
116.367329: sched_migrate_task: comm=stream pid=4303 prio=120 orig_cpu=0 dest_cpu=64 * {5}{3}
... 17 migrations
116.647607: sched_migrate_task: comm=stream pid=4306 prio=120 orig_cpu=64 dest_cpu=0 * {6}{2}
... 3 migrations
116.705935: sched_migrate_task: comm=stream pid=4308 prio=120 orig_cpu=48 dest_cpu=80 * {5}{3}
... 15 migrations
116.921504: sched_migrate_task: comm=stream pid=4300 prio=120 orig_cpu=48 dest_cpu=64 * {4}{4}
116.941469: sched_migrate_task: comm=stream pid=4300 prio=120 orig_cpu=64 dest_cpu=32 * {5}{3}
... 20 migrations
117.426116: sched_migrate_task: comm=stream pid=4305 prio=120 orig_cpu=32 dest_cpu=64 * {4}{4}
... 4 migrations
117.634768: sched_migrate_task: comm=stream pid=4303 prio=120 orig_cpu=64 dest_cpu=16 * {5}{3}
... 5 migrations
117.775718: sched_migrate_task: comm=stream pid=4303 prio=120 orig_cpu=48 dest_cpu=64 * {4}{4}
... 3 migrations
117.901872: sched_migrate_task: comm=stream pid=4303 prio=120 orig_cpu=96 dest_cpu=112 END - {4}{4}
* denotes a cross-NUMA migration of a task, followed by the number of
stream threads running in each NUMA domain.
No output is generated for the sched_migrate_task tracepoint with this
patch because the initial placement is already optimal, which removes
the need for balancing later on - both across NUMA boundaries and
within NUMA boundaries - for stream.
Based on the results above, a bad placement can lead to a lot of
unnecessary migrations before reaching the optimal placement, which is
furthermore found to be unstable - in the sched_migrate_task traces
above we see a lot of ping-ponging between the optimal and a nearly
optimal case ({5}{3} -> {4}{4} -> {5}{3}).
Thus there is an opportunity to detect the situation when the current
NUMA group is full and spread the remaining tasks to CPUs in the
other NUMA group.
If the task is not pinned, we fall back to the default behavior, since
the threshold we use is the minimum of the two quantities:
min(number_of_allowed_cpus, calculated_imbalance_metric)
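In code terms, the heart of the check looks as follows, simplified from
the patch earlier in this thread (surrounding context in
find_idlest_group() omitted):

	/*
	 * Clamp the imbalance threshold to the number of allowed CPUs
	 * in the local group before deciding whether to stay local.
	 */
	cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
	imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
	if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
		return NULL;	/* keep the task local */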
Mel suggested reusing select_idle_mask, which is now done in my V2
(https://lore.kernel.org/lkml/[email protected]/).
The V2 changes are rebased on top of V6 of Mel's patchset and
contain the latest numbers for this fix.
--
Thanks and Regards,
Prateek