2022-04-07 20:22:54

by K Prateek Nayak

Subject: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

In the case of systems containing multiple LLCs per socket, like
AMD Zen systems, users want to spread bandwidth hungry applications
across multiple LLCs. Stream is one such representative workload where
the best performance is obtained by limiting one stream thread per LLC.
To ensure this, users are known to pin the tasks to a subset
of the CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.

Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs from the local group (0,16,32,48)
would have already been taken.

However, the current scheduler code simply checks if the number of
tasks in the local group is within the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case threshold = 32), but after v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs" was merged in sched-tip as
commit e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs"), it is now equal to the number of
LLCs in the NUMA domain for processors with multiple LLCs
(in this case threshold = 8).

For this example, the number of tasks will always be within the
threshold and thus all 8 stream threads will be woken up on the first
socket, resulting in sub-optimal performance.
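
For reference, the check in question boils down to something like the
following (a minimal sketch; the exact form of allow_numa_imbalance()
in the tree may differ slightly):

  /*
   * Sketch of the threshold check described above: the local group is
   * kept as long as the number of running tasks there (counting the
   * task being placed) stays within sd->imb_numa_nr, which is now the
   * number of LLCs in the NUMA domain on multi-LLC processors.
   */
  static inline bool allow_numa_imbalance(int nr_running, int imb_numa_nr)
  {
          return nr_running <= imb_numa_nr;
  }

  /* Existing call site in find_idlest_group(), modified by the diff below: */
  if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
          return NULL;    /* keep the task in the local group */

With threshold = 8 on this machine, placing the 5th stream thread gives
a running count of 5, so the check keeps the task on the first socket.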

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:

stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016

Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads start piling on these same CPUs
when clearly there are CPUs on the second socket that can be used.

Following the initial pile up on a small number of CPUs, the
load-balancer eventually kicks in, but it takes a while to get to a
{4}{4} split, and even {4}{4} isn't stable: we observe ping-ponging
between {4}{4} and {5}{3} before a stable state (1 stream thread per
allowed CPU) is reached much later, after which no more migration is
required.

We can detect this piling and avoid it by checking whether the number
of allowed CPUs in the local group is fewer than the number of tasks
running in the local group, and using this information to spread the
5th task out into the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:

stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064

We see that threads are using all of the allowed CPUs and there is
no pileup.

No output is generated for tracepoint sched_migrate_task with this
patch due to a perfect initial placement which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries for stream.

Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

During the testing of this patch, the tip sched/core was at
commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace
header printout"

Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8

            5.18.0-rc1              5.18.0-rc1              5.18.0-rc1
          tip sched/core          tip sched/core          tip sched/core
           (no pinning)             + pinning              + this-patch
                                                            + pinning

 Copy:   109364.74 (0.00 pct)    94220.50 (-13.84 pct)   158301.28 (44.74 pct)
 Scale:  109670.26 (0.00 pct)    90210.59 (-17.74 pct)   149525.64 (36.34 pct)
 Add:    129029.01 (0.00 pct)   101906.00 (-21.02 pct)   186658.17 (44.66 pct)
 Triad:  127260.05 (0.00 pct)   106051.36 (-16.66 pct)   184327.30 (44.84 pct)

Pinning currently hurts performance compared to the unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and
without pinning on a dual socket IceLake Machine (2 x 32C/64T):

NUMA Topology of the Intel Icelake machine:
Node 1: 0,2,4,6 ... 126 (Even numbers)
Node 2: 1,3,5,7 ... 127 (Odd numbers)

Pinning is done using: numactl -C 0-15 ./stream16

            5.18.0-rc1              5.18.0-rc1              5.18.0-rc1
          tip sched/core          tip sched/core          tip sched/core
           (no pinning)             + pinning              + this-patch
                                                            + pinning

 Copy:    85815.31 (0.00 pct)   149819.21 (74.58 pct)    156807.48 (82.72 pct)
 Scale:   64795.60 (0.00 pct)    97595.07 (50.61 pct)     99871.96 (54.13 pct)
 Add:     71340.68 (0.00 pct)   111549.10 (56.36 pct)    114598.33 (60.63 pct)
 Triad:   68890.97 (0.00 pct)   111635.16 (62.04 pct)    114589.24 (66.33 pct)

In case of the Icelake machine, with a single LLC per socket, pinning
across the two sockets reduces cache contention, showing a large
improvement in the pinned case, which is further improved by this patch.

Signed-off-by: K Prateek Nayak <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
---
Changelog v6-->v7:
- Rebased the changes on the latest sched-tip.
- Updated commit log with numbers comparing patch with the latest
sched-tip on AMD Zen3 and Intel Icelake based server offerings.
- Collected tags from v6.
Changelog v5-->v6:
- Move the cpumask variable declaration to the block it is used in.
- Collect tags from v5.
---
kernel/sched/fair.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..520593bf0de6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9215,6 +9215,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)

case group_has_spare:
if (sd->flags & SD_NUMA) {
+ int imb;
#ifdef CONFIG_NUMA_BALANCING
int idlest_cpu;
/*
@@ -9232,10 +9233,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* Otherwise, keep the task close to the wakeup source
* and improve locality if the number of running tasks
* would remain below threshold where an imbalance is
- * allowed. If there is a real need of migration,
- * periodic load balance will take care of it.
+ * allowed while accounting for the possibility the
+ * task is pinned to a subset of CPUs. If there is a
+ * real need of migration, periodic load balance will
+ * take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+ imb = sd->imb_numa_nr;
+ if (p->nr_cpus_allowed != num_online_cpus()) {
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+
+ cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+ imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+ }
+ if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
return NULL;
}

--
2.25.1


2022-05-12 16:15:17

by K Prateek Nayak

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Ping :)

On 4/7/2022 4:42 PM, K Prateek Nayak wrote:
> In the case of systems containing multiple LLCs per socket, like
> AMD Zen systems, users want to spread bandwidth hungry applications
> across multiple LLCs. Stream is one such representative workload where
> the best performance is obtained by limiting one stream thread per LLC.
> To ensure this, users are known to pin the tasks to a subset
> of the CPUs consisting of one CPU per LLC while running such bandwidth
> hungry tasks.
>
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> Here each CPU in the list is from a different LLC and 4 of those LLCs
> are on one socket, while the other 4 are on another socket.
>
> Ideally we would prefer that each stream thread runs on a different
> CPU from the allowed list of CPUs. However, the current heuristics in
> find_idlest_group() do not allow this during the initial placement.
>
> Suppose the first socket (0-63,128-191) is our local group from which
> we are kickstarting the stream tasks. The first four stream threads
> will be placed in this socket. When it comes to placing the 5th
> thread, all the allowed CPUs from the local group (0,16,32,48)
> would have already been taken.
>
> However, the current scheduler code simply checks if the number of
> tasks in the local group is fewer than the allowed numa-imbalance
> threshold. This threshold was previously 25% of the NUMA domain span
> (in this case threshold = 32) but after the v6 of Mel's patchset
> "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
> Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
> when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
> in the NUMA domain, for processors with multiple LLCs.
> (in this case threshold = 8).
>
> For this example, the number of tasks will always be within threshold
> and thus all the 8 stream threads will be woken up on the first socket
> thereby resulting in sub-optimal performance.
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks in the current tip/sched/core on the Zen3 machine:
>
> stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
> stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
> stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
> stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016
>
> Once the first four threads are distributed among the allowed CPUs of
> socket one, the rest of the threads start piling on these same CPUs
> when clearly there are CPUs on the second socket that can be used.
>
> Following the initial pile up on a small number of CPUs, though the
> load-balancer eventually kicks in, it takes a while to get to {4}{4}
> and even {4}{4} isn't stable as we observe a bunch of ping ponging
> between {4}{4} to {5}{3} and back before a stable state is reached
> much later (1 Stream thread per allowed CPU) and no more migration is
> required.
>
> We can detect this piling and avoid it by checking if the number of
> allowed CPUs in the local group are fewer than the number of tasks
> running in the local group and use this information to spread the
> 5th task out into the next socket (after all, the goal in this
> slowpath is to find the idlest group and the idlest CPU during the
> initial placement!).
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks after adding this fix on the Zen3 machine:
>
> stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
> stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
> stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
> stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
> stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
> stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
> stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064
>
> We see that threads are using all of the allowed CPUs and there is
> no pileup.
>
> No output is generated for tracepoint sched_migrate_task with this
> patch due to a perfect initial placement which removes the need
> for balancing later on - both across NUMA boundaries and within
> NUMA boundaries for stream.
>
> Following are the results from running 8 Stream threads with and
> without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
>
> During the testing of this patch, the tip sched/core was at
> commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace
> header printout"
>
> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) + pinning + this-patch
> + pinning
>
> Copy: 109364.74 (0.00 pct) 94220.50 (-13.84 pct) 158301.28 (44.74 pct)
> Scale: 109670.26 (0.00 pct) 90210.59 (-17.74 pct) 149525.64 (36.34 pct)
> Add: 129029.01 (0.00 pct) 101906.00 (-21.02 pct) 186658.17 (44.66 pct)
> Triad: 127260.05 (0.00 pct) 106051.36 (-16.66 pct) 184327.30 (44.84 pct)
>
> Pinning currently hurts the performance compared to unbound case on
> tip/sched/core. With the addition of this patch, we are able to
> outperform tip/sched/core by a good margin with pinning.
>
> Following are the results from running 16 Stream threads with and
> without pinning on a dual socket IceLake Machine (2 x 32C/64T):
>
> NUMA Topology of the Intel Icelake machine:
> Node 1: 0,2,4,6 ... 126 (Even numbers)
> Node 2: 1,3,5,7 ... 127 (Odd numbers)
>
> Pinning is done using: numactl -C 0-15 ./stream16
>
> 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) +pinning + this-patch
> + pinning
>
> Copy: 85815.31 (0.00 pct) 149819.21 (74.58 pct) 156807.48 (82.72 pct)
> Scale: 64795.60 (0.00 pct) 97595.07 (50.61 pct) 99871.96 (54.13 pct)
> Add: 71340.68 (0.00 pct) 111549.10 (56.36 pct) 114598.33 (60.63 pct)
> Triad: 68890.97 (0.00 pct) 111635.16 (62.04 pct) 114589.24 (66.33 pct)
>
> In case of Icelake machine, with single LLC per socket, pinning across
> the two sockets reduces cache contention, thus showing great
> improvement in pinned case which is further benefited by this patch.
>
> Signed-off-by: K Prateek Nayak <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Reviewed-by: Vincent Guittot <[email protected]>
> Reviewed-by: Srikar Dronamraju <[email protected]>
> ---
> Changelog v6-->v7:
> - Rebased the changes on the latest sched-tip.
> - Updated commit log with numbers comparing patch with the latest
> sched-tip on AMD Zen3 and Intel Icelake based server offerings.
> - Collected tags from v6.
> Changelog v5-->v6:
> - Move the cpumask variable declaration to the block it is used in.
> - Collect tags from v5.
> ---
> kernel/sched/fair.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..520593bf0de6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9215,6 +9215,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> case group_has_spare:
> if (sd->flags & SD_NUMA) {
> + int imb;
> #ifdef CONFIG_NUMA_BALANCING
> int idlest_cpu;
> /*
> @@ -9232,10 +9233,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> * Otherwise, keep the task close to the wakeup source
> * and improve locality if the number of running tasks
> * would remain below threshold where an imbalance is
> - * allowed. If there is a real need of migration,
> - * periodic load balance will take care of it.
> + * allowed while accounting for the possibility the
> + * task is pinned to a subset of CPUs. If there is a
> + * real need of migration, periodic load balance will
> + * take care of it.
> */
> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> + imb = sd->imb_numa_nr;
> + if (p->nr_cpus_allowed != num_online_cpus()) {
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +
> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> + imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> + }
> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
> return NULL;
> }
>

--
Thanks and Regards,
Prateek


2022-06-09 12:43:34

by Peter Zijlstra

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On Thu, Apr 07, 2022 at 04:42:22PM +0530, K Prateek Nayak wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..520593bf0de6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9215,6 +9215,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> case group_has_spare:
> if (sd->flags & SD_NUMA) {
> + int imb;
> #ifdef CONFIG_NUMA_BALANCING
> int idlest_cpu;
> /*
> @@ -9232,10 +9233,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> * Otherwise, keep the task close to the wakeup source
> * and improve locality if the number of running tasks
> * would remain below threshold where an imbalance is
> - * allowed. If there is a real need of migration,
> - * periodic load balance will take care of it.
> + * allowed while accounting for the possibility the
> + * task is pinned to a subset of CPUs. If there is a
> + * real need of migration, periodic load balance will
> + * take care of it.
> */
> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> + imb = sd->imb_numa_nr;
> + if (p->nr_cpus_allowed != num_online_cpus()) {
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +
> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> + imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> + }
> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
> return NULL;
> }

OK, so I've gone collecting patches, and this conflicts with the NUMA
patches from Mel.

Now, I can (and have) fixed up the conflict, but it did make me look at
this in a little more detail; and the thing I noticed is that your:

'p->nr_cpus_allowed != num_online_cpus()'

test makes no sense. That's basically 'true'. The thing is,
nr_cpus_allowed is initialized to NR_CPUS, and unless someone somewhere
did set_cpus_allowed() on it, it'll still be NR_CPUS.

Also, CPU hotplug doesn't change nr_cpus_allowed, so num_online_cpus()
is just plain wrong.

Now, something that might work is:

'p->nr_cpus_allowed < num_online_cpus()'

And even that is no guarantee. You can construct a situation where this
is still false even though you actually have a constrained set.
Consider a machine with 8 CPUs. Then set the mask to: 0x55, which has 4
CPUs set. Then offline the last 4 so that the online mask becomes 0x0f.

Then the effective mask is 0x05, and the number we're looking for above
is 2, but the suggested test would still be false, because
nr_cpus_allowed would be 4, as would num_online_cpus().
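
To make the arithmetic concrete, here is a small userspace sketch with
plain bitmasks and __builtin_popcount() standing in for cpumasks and
cpumask_weight() (purely illustrative, not kernel code):

  #include <stdio.h>

  int main(void)
  {
          unsigned int allowed = 0x55;    /* task affinity: CPUs 0,2,4,6 */
          unsigned int online  = 0x0f;    /* CPUs 4-7 have been offlined */

          int nr_cpus_allowed = __builtin_popcount(allowed);          /* 4 */
          int num_online      = __builtin_popcount(online);           /* 4 */
          int effective       = __builtin_popcount(allowed & online); /* 2 */

          /* The suggested test stays false although the set is constrained. */
          printf("nr_cpus_allowed < num_online_cpus(): %d\n",
                 nr_cpus_allowed < num_online);
          printf("effective allowed CPUs: %d\n", effective);
          return 0;
  }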

Find below what I've made of it (on top of Mel's patches), but I'm not
sure this is what we want. For now I'll leave it commented out.

---
Subject: sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()
From: K Prateek Nayak <[email protected]>
Date: Thu, 7 Apr 2022 16:42:22 +0530

From: K Prateek Nayak <[email protected]>

In the case of systems containing multiple LLCs per socket, like
AMD Zen systems, users want to spread bandwidth hungry applications
across multiple LLCs. Stream is one such representative workload where
the best performance is obtained by limiting one stream thread per LLC.
To ensure this, users are known to pin the tasks to a subset
of the CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.

Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs from the local group (0,16,32,48)
would have already been taken.

However, the current scheduler code simply checks if the number of
tasks in the local group is fewer than the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case threshold = 32) but after the v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
in the NUMA domain, for processors with multiple LLCs.
(in this case threshold = 8).

For this example, the number of tasks will always be within threshold
and thus all the 8 stream threads will be woken up on the first socket
thereby resulting in sub-optimal performance.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:

stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016

Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads start piling on these same CPUs
when clearly there are CPUs on the second socket that can be used.

Following the initial pile up on a small number of CPUs, though the
load-balancer eventually kicks in, it takes a while to get to {4}{4}
and even {4}{4} isn't stable as we observe a bunch of ping ponging
between {4}{4} to {5}{3} and back before a stable state is reached
much later (1 Stream thread per allowed CPU) and no more migration is
required.

We can detect this piling and avoid it by checking if the number of
allowed CPUs in the local group are fewer than the number of tasks
running in the local group and use this information to spread the
5th task out into the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:

stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064

We see that threads are using all of the allowed CPUs and there is
no pileup.

No output is generated for tracepoint sched_migrate_task with this
patch due to a perfect initial placement which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries for stream.

Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

During the testing of this patch, the tip sched/core was at
commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace
header printout"

Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8

5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
tip sched/core tip sched/core tip sched/core
(no pinning) + pinning + this-patch
+ pinning

Copy: 109364.74 (0.00 pct) 94220.50 (-13.84 pct) 158301.28 (44.74 pct)
Scale: 109670.26 (0.00 pct) 90210.59 (-17.74 pct) 149525.64 (36.34 pct)
Add: 129029.01 (0.00 pct) 101906.00 (-21.02 pct) 186658.17 (44.66 pct)
Triad: 127260.05 (0.00 pct) 106051.36 (-16.66 pct) 184327.30 (44.84 pct)

Pinning currently hurts the performance compared to unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and
without pinning on a dual socket IceLake Machine (2 x 32C/64T):

NUMA Topology of the Intel Icelake machine:
Node 1: 0,2,4,6 ... 126 (Even numbers)
Node 2: 1,3,5,7 ... 127 (Odd numbers)

Pinning is done using: numactl -C 0-15 ./stream16

5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
tip sched/core tip sched/core tip sched/core
(no pinning) +pinning + this-patch
+ pinning

Copy: 85815.31 (0.00 pct) 149819.21 (74.58 pct) 156807.48 (82.72 pct)
Scale: 64795.60 (0.00 pct) 97595.07 (50.61 pct) 99871.96 (54.13 pct)
Add: 71340.68 (0.00 pct) 111549.10 (56.36 pct) 114598.33 (60.63 pct)
Triad: 68890.97 (0.00 pct) 111635.16 (62.04 pct) 114589.24 (66.33 pct)

In case of Icelake machine, with single LLC per socket, pinning across
the two sockets reduces cache contention, thus showing great
improvement in pinned case which is further benefited by this patch.

Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9210,6 +9210,7 @@ find_idlest_group(struct sched_domain *s
case group_has_spare:
#ifdef CONFIG_NUMA
if (sd->flags & SD_NUMA) {
+ int imb_numa_nr = sd->imb_numa_nr;
#ifdef CONFIG_NUMA_BALANCING
int idlest_cpu;
/*
@@ -9227,13 +9228,22 @@ find_idlest_group(struct sched_domain *s
* Otherwise, keep the task close to the wakeup source
* and improve locality if the number of running tasks
* would remain below threshold where an imbalance is
- * allowed. If there is a real need of migration,
- * periodic load balance will take care of it.
+ * allowed while accounting for the possibility the
+ * task is pinned to a subset of CPUs. If there is a
+ * real need of migration, periodic load balance will
+ * take care of it.
*/
+ if (p->nr_cpus_allowed < num_online_cpus()) {
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+
+ cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+ imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr);
+ }
+
imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus);
if (!adjust_numa_imbalance(imbalance,
local_sgs.sum_nr_running + 1,
- sd->imb_numa_nr)) {
+ imb_numa_nr)) {
return NULL;
}
}

2022-06-10 11:11:43

by Valentin Schneider

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On 09/06/22 13:54, Peter Zijlstra wrote:
> Now, I can (and have) fixed up the conflict, but it did make me look at
> this in a little more detail; and the thing I noticed is that your:
>
> 'p->nr_cpus_allowed != num_online_cpus()'
>
> test makes no sense. That's basically 'true'. The thing is,
> nr_cpus_allowed is initialized to NR_CPUS, and unless someone somewhere
> did set_cpus_allowed() on it, it'll still be NR_CPUS.
>
> Also, CPU hotplug doesn't change nr_cpus_allowed, so num_online_cpus()
> is just plain wrong.
>
> Now, something that might work is:
>
> 'p->nr_cpus_allowed < num_online_cpus()'
>
> And even that is no guarantee. You can construct a situation where this
> is still false even though you actually have a constrained set.
> Consider a machine with 8 CPUs. Then set the mask to: 0x55, which has 4
> CPUs set. Then offline the last 4 so that the online mask becomes 0x0f.
>
> Then the effective mask is 0x05, and the number we're looking for above
> is 2, but the suggested test would still be false, because
> nr_cpus_allowed would be 4, as would num_online_cpus().
>

IIUC we want to pay special attention when the task isn't allowed to run on
all online CPUs, wouldn't the below do that?

!cpumask_subset(cpu_online_mask, p->cpus_ptr)

The task affinity can be a superset of the online mask, obvious case is
init_task's CPU_MASK_ALL, and the above test is still false if both masks
are equal.
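
For illustration, the same kind of plain-bitmask stand-in applied to the
suggested test (purely illustrative; cpumask_subset(a, b) is true when
every CPU set in a is also set in b):

  #include <stdio.h>

  /* Stand-in for cpumask_subset(): every bit of 'sub' also set in 'super'? */
  static int subset(unsigned int sub, unsigned int super)
  {
          return (sub & ~super) == 0;
  }

  int main(void)
  {
          unsigned int online = 0x0f;

          /* init_task-style CPU_MASK_ALL: affinity is a superset of online. */
          printf("CPU_MASK_ALL:     %d\n", !subset(online, 0xff)); /* 0 */
          /* Affinity equal to the online mask: nothing to special-case. */
          printf("equal masks:      %d\n", !subset(online, 0x0f)); /* 0 */
          /* The earlier 0x55 example: some online CPUs are not allowed. */
          printf("constrained 0x55: %d\n", !subset(online, 0x55)); /* 1 */
          return 0;
  }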

(Additionally we could add a step in sched_init() to "properly" initialize
the init_task mask and remove the NR_CPUS faff).

2022-06-10 14:33:11

by Peter Zijlstra

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On Fri, Jun 10, 2022 at 11:55:37AM +0100, Valentin Schneider wrote:
> On 09/06/22 13:54, Peter Zijlstra wrote:
> > Now, I can (and have) fixed up the conflict, but it did make me look at
> > this in a little more detail; and the thing I noticed is that your:
> >
> > 'p->nr_cpus_allowed != num_online_cpus()'
> >
> > test makes no sense. That's basically 'true'. The thing is,
> > nr_cpus_allowed is initialized to NR_CPUS, and unless someone somewhere
> > did set_cpus_allowed() on it, it'll still be NR_CPUS.
> >
> > Also, CPU hotplug doesn't change nr_cpus_allowed, so num_online_cpus()
> > is just plain wrong.
> >
> > Now, something that might work is:
> >
> > 'p->nr_cpus_allowed < num_online_cpus()'
> >
> > And even that is no guarantee. You can construct a situation where this
> > is still false even though you actually have a constrained set.
> > Consider a machine with 8 CPUs. Then set the mask to: 0x55, which has 4
> > CPUs set. Then offline the last 4 so that the online mask becomes 0x0f.
> >
> > Then the effective mask is 0x05, and the number we're looking for above
> > is 2, but the suggested test would still be false, because
> > nr_cpus_allowed would be 4, as would num_online_cpus().
> >
>
> IIUC we want to pay special attention when the task isn't allowed to run on
> all online CPUs, wouldn't the below do that?
>
> !cpumask_subset(cpu_online_mask, p->cpus_ptr)

At that point we might just as well do the whole cpumask_and() thing,
no? There's not much cost difference between these two operations.

> The task affinity can be a superset of the online mask, obvious case is
> init_task's CPU_MASK_ALL, and the above test is still false if both masks
> are equal.
>
> (Additionally we could add a step in sched_init() to "properly" initialize
> the init_task mask and remove the NR_CPUS faff).

I'm confused, NR_CPUS is the right value for CPU_MASK_ALL.

2022-06-10 14:53:31

by Vincent Guittot

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On Fri, 10 Jun 2022 at 16:15, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jun 10, 2022 at 11:55:37AM +0100, Valentin Schneider wrote:
> > On 09/06/22 13:54, Peter Zijlstra wrote:
> > > Now, I can (and have) fixed up the conflict, but it did make me look at
> > > this in a little more detail; and the thing I noticed is that your:
> > >
> > > 'p->nr_cpus_allowed != num_online_cpus()'
> > >
> > > test makes no sense. That's basically 'true'. The thing is,
> > > nr_cpus_allowed is initialized to NR_CPUS, and unless someone somewhere
> > > did set_cpus_allowed() on it, it'll still be NR_CPUS.
> > >
> > > Also, CPU hotplug doesn't change nr_cpus_allowed, so num_online_cpus()
> > > is just plain wrong.
> > >
> > > Now, something that might work is:
> > >
> > > 'p->nr_cpus_allowed < num_online_cpus()'
> > >
> > > And even that is no guarantee. You can construct a situation where this
> > > is still false even though you actually have a constrained set.
> > > Consider a machine with 8 CPUs. Then set the mask to: 0x55, which has 4
> > > CPUs set. Then offline the last 4 so that the online mask becomes 0x0f.
> > >
> > > Then the effective mask is 0x05, and the number we're looking for above
> > > is 2, but the suggested test would still be false, because
> > > nr_cpus_allowed would be 4, as would num_online_cpus().
> > >
> >
> > IIUC we want to pay special attention when the task isn't allowed to run on
> > all online CPUs, wouldn't the below do that?
> >
> > !cpumask_subset(cpu_online_mask, p->cpus_ptr)
>
> At that point we might just as well do the whole cpumask_and() thing,
> no? There's not much cost difference between these two operations.

The test was there to avoid doing the computation with cpumask_and()
if the task's affinity has not been modified, so maybe it would be
enough to test (p->nr_cpus_allowed != NR_CPUS) to check whether the
task's affinity has been modified, and only then do the cpumask_and()
and cpumask_weight().

>
> > The task affinity can be a superset of the online mask, obvious case is
> > init_task's CPU_MASK_ALL, and the above test is still false if both masks
> > are equal.
> >
> > (Additionally we could add a step in sched_init() to "properly" initialize
> > the init_task mask and remove the NR_CPUS faff).
>
> I'm confused, NR_CPUS is the right value for CPU_MASK_ALL.

2022-06-10 15:12:16

by Peter Zijlstra

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On Fri, Jun 10, 2022 at 04:42:05PM +0200, Vincent Guittot wrote:

> The test was there to not do the computation with cpumask_and() if the
> task's affinity has not been modified so maybe it would be enough to
> test (p->nr_cpus_allowed != NR_CPUS) to check if the task's affinity
> has been modified and we have we do the cpumask_and() and
> cpumask_weight()

Yeah, I suppose that's the best we can do. I was thinking that perhaps
some cgroup/cpuset thing is common these days, in which case pretty much
all tasks will have their affinity set through the cpuset muck, but
alas, nothing really to be done about that.


2022-06-10 16:51:44

by Valentin Schneider

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On 10/06/22 16:15, Peter Zijlstra wrote:
> On Fri, Jun 10, 2022 at 11:55:37AM +0100, Valentin Schneider wrote:
>>
>> IIUC we want to pay special attention when the task isn't allowed to run on
>> all online CPUs, wouldn't the below do that?
>>
>> !cpumask_subset(cpu_online_mask, p->cpus_ptr)
>
> At that point we might just as well do the whole cpumask_and() thing,
> no? There's not much cost difference between these two operations.
>

Yeah, you got a point there.

>> The task affinity can be a superset of the online mask, obvious case is
>> init_task's CPU_MASK_ALL, and the above test is still false if both masks
>> are equal.
>>
>> (Additionally we could add a step in sched_init() to "properly" initialize
>> the init_task mask and remove the NR_CPUS faff).
>
> I'm confused, NR_CPUS is the right value for CPU_MASK_ALL.

Right, I meant to make the mask match cpu_online_mask from the get go, but
now that I allocate a few more neurons thinking about it, it looks like a
can of worms; we'd have to do that after smp_init() to see which CPUs we
actually onlined, and by then we have already forked around.

Subject: [tip: sched/core] sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: f5b2eeb49991047f8f64785e7a7857d6f219d574
Gitweb: https://git.kernel.org/tip/f5b2eeb49991047f8f64785e7a7857d6f219d574
Author: K Prateek Nayak <[email protected]>
AuthorDate: Thu, 07 Apr 2022 16:42:22 +05:30
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 13 Jun 2022 10:30:00 +02:00

sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()

In the case of systems containing multiple LLCs per socket, like
AMD Zen systems, users want to spread bandwidth hungry applications
across multiple LLCs. Stream is one such representative workload where
the best performance is obtained by limiting one stream thread per LLC.
To ensure this, users are known to pin the tasks to a subset
of the CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.

Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs from the local group (0,16,32,48)
would have already been taken.

However, the current scheduler code simply checks if the number of
tasks in the local group is fewer than the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case threshold = 32) but after the v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
in the NUMA domain, for processors with multiple LLCs.
(in this case threshold = 8).

For this example, the number of tasks will always be within threshold
and thus all the 8 stream threads will be woken up on the first socket
thereby resulting in sub-optimal performance.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:

stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016

Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads start piling on these same CPUs
when clearly there are CPUs on the second socket that can be used.

Following the initial pile up on a small number of CPUs, though the
load-balancer eventually kicks in, it takes a while to get to {4}{4}
and even {4}{4} isn't stable as we observe a bunch of ping ponging
between {4}{4} to {5}{3} and back before a stable state is reached
much later (1 Stream thread per allowed CPU) and no more migration is
required.

We can detect this piling and avoid it by checking if the number of
allowed CPUs in the local group are fewer than the number of tasks
running in the local group and use this information to spread the
5th task out into the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:

stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064

We see that threads are using all of the allowed CPUs and there is
no pileup.

No output is generated for tracepoint sched_migrate_task with this
patch due to a perfect initial placement which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries for stream.

Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

During the testing of this patch, the tip sched/core was at
commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace
header printout"

Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8

5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
tip sched/core tip sched/core tip sched/core
(no pinning) + pinning + this-patch
+ pinning

Copy: 109364.74 (0.00 pct) 94220.50 (-13.84 pct) 158301.28 (44.74 pct)
Scale: 109670.26 (0.00 pct) 90210.59 (-17.74 pct) 149525.64 (36.34 pct)
Add: 129029.01 (0.00 pct) 101906.00 (-21.02 pct) 186658.17 (44.66 pct)
Triad: 127260.05 (0.00 pct) 106051.36 (-16.66 pct) 184327.30 (44.84 pct)

Pinning currently hurts the performance compared to unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and
without pinning on a dual socket IceLake Machine (2 x 32C/64T):

NUMA Topology of the Intel Icelake machine:
Node 1: 0,2,4,6 ... 126 (Even numbers)
Node 2: 1,3,5,7 ... 127 (Odd numbers)

Pinning is done using: numactl -C 0-15 ./stream16

5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
tip sched/core tip sched/core tip sched/core
(no pinning) +pinning + this-patch
+ pinning

Copy: 85815.31 (0.00 pct) 149819.21 (74.58 pct) 156807.48 (82.72 pct)
Scale: 64795.60 (0.00 pct) 97595.07 (50.61 pct) 99871.96 (54.13 pct)
Add: 71340.68 (0.00 pct) 111549.10 (56.36 pct) 114598.33 (60.63 pct)
Triad: 68890.97 (0.00 pct) 111635.16 (62.04 pct) 114589.24 (66.33 pct)

In case of Icelake machine, with single LLC per socket, pinning across
the two sockets reduces cache contention, thus showing great
improvement in pinned case which is further benefited by this patch.

Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 166f5f9..1b2cac7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9210,6 +9210,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
case group_has_spare:
#ifdef CONFIG_NUMA
if (sd->flags & SD_NUMA) {
+ int imb_numa_nr = sd->imb_numa_nr;
#ifdef CONFIG_NUMA_BALANCING
int idlest_cpu;
/*
@@ -9227,13 +9228,22 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* Otherwise, keep the task close to the wakeup source
* and improve locality if the number of running tasks
* would remain below threshold where an imbalance is
- * allowed. If there is a real need of migration,
- * periodic load balance will take care of it.
+ * allowed while accounting for the possibility the
+ * task is pinned to a subset of CPUs. If there is a
+ * real need of migration, periodic load balance will
+ * take care of it.
*/
+ if (p->nr_cpus_allowed != NR_CPUS) {
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+
+ cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+ imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr);
+ }
+
imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus);
if (!adjust_numa_imbalance(imbalance,
local_sgs.sum_nr_running + 1,
- sd->imb_numa_nr)) {
+ imb_numa_nr)) {
return NULL;
}
}

2022-06-13 16:39:05

by K Prateek Nayak

Subject: Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Peter,

Thank you for reviewing and picking up this patch.
I've shared some observations w.r.t. p->nr_cpus_allowed,
num_online_cpus() and NR_CPUS based on the discussion in
the thread.

tl;dr

p->nr_cpus_allowed seems to be equal to num_online_cpus()
at the time of fork, as opposed to NR_CPUS, based on my testing.

On 6/9/2022 5:24 PM, Peter Zijlstra wrote:
> On Thu, Apr 07, 2022 at 04:42:22PM +0530, K Prateek Nayak wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..520593bf0de6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9215,6 +9215,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>
>> case group_has_spare:
>> if (sd->flags & SD_NUMA) {
>> + int imb;
>> #ifdef CONFIG_NUMA_BALANCING
>> int idlest_cpu;
>> /*
>> @@ -9232,10 +9233,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> * Otherwise, keep the task close to the wakeup source
>> * and improve locality if the number of running tasks
>> * would remain below threshold where an imbalance is
>> - * allowed. If there is a real need of migration,
>> - * periodic load balance will take care of it.
>> + * allowed while accounting for the possibility the
>> + * task is pinned to a subset of CPUs. If there is a
>> + * real need of migration, periodic load balance will
>> + * take care of it.
>> */
>> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>> + imb = sd->imb_numa_nr;
>> + if (p->nr_cpus_allowed != num_online_cpus()) {
>> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> +
>> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
>> + imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
>> + }
>> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
>> return NULL;
>> }
>
> OK, so I've gone collecting patches, and this conflicts with the NUMA
> patches from Mel.
>
> Now, I can (and have) fixed up the conflict,

Thank you for resolving the conflicts.

> but it did make me look at
> this in a little more detail; and the thing I noticed is that your:
>
> 'p->nr_cpus_allowed != num_online_cpus()'
>
> test makes no sense. That's basically 'true'. The thing is,
> nr_cpus_allowed is initialized to NR_CPUS, and unless someone somewhere
> did set_cpus_allowed() on it, it'll still be NR_CPUS.
>
> Also, CPU hotplug doesn't change nr_cpus_allowed, so num_online_cpus()
> is just plain wrong.

I agree this is true. If we offline CPUs, the p->nr_cpus_allowed of
already running tasks will remain the same, but the wakeup path through
find_idlest_group() is only traversed during initial placement and, from
what I understand, nr_cpus_allowed during fork will be equal to, or
less than, num_online_cpus() unless there is a race with CPU hotplug.

To verify the same, I added the below debug patch to
5.19.0-rc2 tip/sched/core at
commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"
to check the values of p->nr_cpus_allowed, num_online_cpus() and NR_CPUS,
when we are in find_idlest_group() method:

--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8bed75757e65..596d45d148b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9169,6 +9169,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
break;

case group_has_spare:
+ trace_printk("%s: %u %u %u\n", p->comm, p->nr_cpus_allowed, num_online_cpus(), NR_CPUS);
#ifdef CONFIG_NUMA
if (sd->flags & SD_NUMA) {
int imb_numa_nr = sd->imb_numa_nr;
--

Testing was done on a single socket 64C/128T Zen3 machine.
The configuration consists of 4 sched domains (SMT, MC, NUMA1, NUMA2)
and hence the same log appears 4 times as we traverse down the hierarchy
in find_idlest_cpu(), which calls find_idlest_group() at each level.
Following are the commands run along with the relevant logs:

# _-----------> p->nr_cpus_allowed
# / _-------> num_online_cpus()
# | / _---> NR_CPUS
# | | /
# | | |
ls-2750 [006] d..1. 97.486667: find_idlest_group: bash: 128 128 8192
ls-2750 [006] d..1. 97.486669: find_idlest_group: bash: 128 128 8192
ls-2750 [006] d..1. 97.486670: find_idlest_group: bash: 128 128 8192
ls-2750 [006] d..1. 97.486671: find_idlest_group: bash: 128 128 8192
# echo 0 > /sys/devices/system/cpu/cpu127/online
# ls
ls-2753 [070] d..1. 103.661735: find_idlest_group: bash: 127 127 8192
ls-2753 [070] d..1. 103.661737: find_idlest_group: bash: 127 127 8192
ls-2753 [070] d..1. 103.661739: find_idlest_group: bash: 127 127 8192
ls-2753 [070] d..1. 103.661740: find_idlest_group: bash: 127 127 8192
# echo 0 > /sys/devices/system/cpu/cpu126/online
# ls
ls-2757 [070] d..1. 109.111868: find_idlest_group: bash: 126 126 8192
ls-2757 [070] d..1. 109.111871: find_idlest_group: bash: 126 126 8192
ls-2757 [070] d..1. 109.111872: find_idlest_group: bash: 126 126 8192
ls-2757 [070] d..1. 109.111873: find_idlest_group: bash: 126 126 8192
# echo 0 > /sys/devices/system/cpu/cpu125/online
# ls
ls-2760 [070] d..1. 113.816679: find_idlest_group: bash: 125 125 8192
ls-2760 [070] d..1. 113.816682: find_idlest_group: bash: 125 125 8192
ls-2760 [070] d..1. 113.816683: find_idlest_group: bash: 125 125 8192
ls-2760 [070] d..1. 113.816684: find_idlest_group: bash: 125 125 8192
# echo 0 > /sys/devices/system/cpu/cpu124/online
# ls
ls-2763 [066] d..1. 118.288481: find_idlest_group: bash: 124 124 8192
ls-2763 [066] d..1. 118.288483: find_idlest_group: bash: 124 124 8192
ls-2763 [066] d..1. 118.288485: find_idlest_group: bash: 124 124 8192
ls-2763 [066] d..1. 118.288485: find_idlest_group: bash: 124 124 8192
# taskset -c 0-127 ls
ls-2765 [066] d..1. 133.659570: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.659572: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.659573: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.659574: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.660128: find_idlest_group: taskset: 124 124 8192
ls-2765 [066] d..1. 133.660130: find_idlest_group: taskset: 124 124 8192
ls-2765 [066] d..1. 133.660132: find_idlest_group: taskset: 124 124 8192
ls-2765 [066] d..1. 133.660132: find_idlest_group: taskset: 124 124 8192
# taskset -c 0-120 ls
ls-2766 [066] d..1. 139.525659: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.525661: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.525662: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.525663: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.526283: find_idlest_group: taskset: 121 124 8192
ls-2766 [066] d..1. 139.526285: find_idlest_group: taskset: 121 124 8192
ls-2766 [066] d..1. 139.526287: find_idlest_group: taskset: 121 124 8192
ls-2766 [066] d..1. 139.526288: find_idlest_group: taskset: 121 124 8192

From the above logs, we can see:
o p->nr_cpus_allowed is equal to the value of num_online_cpus() at fork.
o p->nr_cpus_allowed of new task and num_online_cpus() change with CPU
hotplug activity.
o taskset with mask containing offlined CPUs will only count the online
CPUs towards p->nr_cpus_allowed.
o NR_CPUS seems to remain the same at 8192 despite the hotplug activity.

Following are the relevant lines from my .config for NR_CPUS:

CONFIG_NR_CPUS_RANGE_BEGIN=8192
CONFIG_NR_CPUS_RANGE_END=8192
CONFIG_NR_CPUS_DEFAULT=8192
CONFIG_NR_CPUS=8192

When there is a race with CPU hotplug activity, there might be a
difference between p->nr_cpus_allowed and num_online_cpus(), as shown
in the logs below:

# _-----------> p->nr_cpus_allowed
# / _-------> num_online_cpus()
# | / _---> NR_CPUS
# | | /
# | | |
<...>-2863 [071] d..1. 1172.474901: find_idlest_group: bash: 124 124 8192
<...>-2863 [071] d..1. 1172.474904: find_idlest_group: bash: 124 124 8192
<...>-2863 [071] d..1. 1172.474906: find_idlest_group: bash: 124 124 8192
<...>-2863 [071] d..1. 1172.474907: find_idlest_group: bash: 124 124 8192
# systemd-udevd is waking up
# echo 0 > /sys/devices/system/cpu/cpu123/online
# systemd-udevd reaches find_idlest_group()
<...>-1612 [059] d..1. 1172.515209: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515210: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515212: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515213: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515449: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515450: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515452: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515452: find_idlest_group: systemd-udevd: 124 123 8192
<...>-2863 [071] d..1. 1172.884914: find_idlest_group: bash: 123 123 8192
<...>-2863 [071] d..1. 1172.884916: find_idlest_group: bash: 123 123 8192
<...>-2863 [071] d..1. 1172.884918: find_idlest_group: bash: 123 123 8192
<...>-2863 [071] d..1. 1172.884919: find_idlest_group: bash: 123 123 8192

Here, the systemd-udevd process has p->nr_cpus_allowed set to 124
during fork, but by the time it reaches find_idlest_group(),
CPU 123 has gone offline and hence num_online_cpus() reports 123.

>
> Now, something that might work is:
>
> 'p->nr_cpus_allowed < num_online_cpus()'
>

This condition can help the case above, where the initial task
placement races with CPU offlining. However, if a CPU comes online
as part of hotplug, the condition will be true and we will still
end up performing the cpumask operation. Hence, as it stands,
"p->nr_cpus_allowed != num_online_cpus()" should be a good
enough check for most cases.
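
In other words, the hunk in the diff quoted below would read the same
but with the comparison relaxed to "!=" (a sketch only, reusing the
identifiers from that diff):

    /*
     * Only fall back to the costlier cpumask_and() + cpumask_weight()
     * when the task is likely pinned to a subset of the online CPUs.
     */
    if (p->nr_cpus_allowed != num_online_cpus()) {
            struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

            cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
            imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr);
    }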

> And even that is no guarantee. You can construct a situation where this
> is still false even though you actually have a constrained set.
> Consider a machine with 8 CPUs. Then set the mask to: 0x55, which has 4
> CPUs set. Then offline the last 4 so that the online mask becomes 0x0f.
>
> Then the effective mask is 0x05, and the number we're looking for above
> is 2, but the suggested test would still be false, because
> nr_cpus_allowed would be 4, as would num_online_cpus().

When we use taskset to set a new task's affinity to 0x55 while an
offline operation that will take CPUs 4-7 offline is concurrently in
progress, and the new task's creation gets delayed, we can end up in a
situation where p->nr_cpus_allowed == 4 and num_online_cpus() == 4,
but p->cpus_ptr is not a subset of cpu_online_mask.
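
A quick userspace illustration of that corner case (illustrative only,
using plain bitmasks in place of cpumasks):

    #include <stdio.h>

    int main(void)
    {
            unsigned int allowed = 0x55;    /* task affinity: CPUs 0,2,4,6 */
            unsigned int online  = 0x0f;    /* CPUs 4-7 have gone offline  */

            /* Both counts read 4, so a check based only on the counts
             * cannot tell that the task is effectively constrained ... */
            printf("p->nr_cpus_allowed : %d\n", __builtin_popcount(allowed));
            printf("num_online_cpus()  : %d\n", __builtin_popcount(online));

            /* ... while the effective mask (0x05) has only 2 usable CPUs. */
            printf("effective CPUs     : %d\n",
                   __builtin_popcount(allowed & online));

            return 0;
    }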

However, given that CPU offline operations are rare in practice, and
that the path through find_idlest_group() will most likely be traversed
only during the initial placement, such a scenario is extremely unlikely.

>
> Find below what I've made of it (on top of Mel's patches), but I'm not
> sure this is what we want. For now I'll leave it commented out.
>
> ---
> Subject: sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()
> From: K Prateek Nayak <[email protected]>
> Date: Thu, 7 Apr 2022 16:42:22 +0530
>
> From: K Prateek Nayak <[email protected]>
>
> In the case of systems containing multiple LLCs per socket, like
> AMD Zen systems, users want to spread bandwidth hungry applications
> across multiple LLCs. Stream is one such representative workload where
> the best performance is obtained by limiting one stream thread per LLC.
> To ensure this, users are known to pin the tasks to a specify a subset
> of the CPUs consisting of one CPU per LLC while running such bandwidth
> hungry tasks.
>
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> Here each CPU in the list is from a different LLC and 4 of those LLCs
> are on one socket, while the other 4 are on another socket.
>
> Ideally we would prefer that each stream thread runs on a different
> CPU from the allowed list of CPUs. However, the current heuristics in
> find_idlest_group() do not allow this during the initial placement.
>
> Suppose the first socket (0-63,128-191) is our local group from which
> we are kickstarting the stream tasks. The first four stream threads
> will be placed in this socket. When it comes to placing the 5th
> thread, all the allowed CPUs are from the local group (0,16,32,48)
> would have been taken.
>
> However, the current scheduler code simply checks if the number of
> tasks in the local group is fewer than the allowed numa-imbalance
> threshold. This threshold was previously 25% of the NUMA domain span
> (in this case threshold = 32) but after the v6 of Mel's patchset
> "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
> Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
> when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
> in the NUMA domain, for processors with multiple LLCs.
> (in this case threshold = 8).
>
> For this example, the number of tasks will always be within threshold
> and thus all the 8 stream threads will be woken up on the first socket
> thereby resulting in sub-optimal performance.
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks in the current tip/sched/core on the Zen3 machine:
>
> stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
> stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
> stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
> stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016
>
> Once the first four threads are distributed among the allowed CPUs of
> socket one, the rest of the treads start piling on these same CPUs
> when clearly there are CPUs on the second socket that can be used.
>
> Following the initial pile up on a small number of CPUs, though the
> load-balancer eventually kicks in, it takes a while to get to {4}{4}
> and even {4}{4} isn't stable as we observe a bunch of ping ponging
> between {4}{4} to {5}{3} and back before a stable state is reached
> much later (1 Stream thread per allowed CPU) and no more migration is
> required.
>
> We can detect this piling and avoid it by checking if the number of
> allowed CPUs in the local group are fewer than the number of tasks
> running in the local group and use this information to spread the
> 5th task out into the next socket (after all, the goal in this
> slowpath is to find the idlest group and the idlest CPU during the
> initial placement!).
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks after adding this fix on the Zen3 machine:
>
> stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
> stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
> stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
> stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
> stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
> stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
> stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064
>
> We see that threads are using all of the allowed CPUs and there is
> no pileup.
>
> No output is generated for tracepoint sched_migrate_task with this
> patch due to a perfect initial placement which removes the need
> for balancing later on - both across NUMA boundaries and within
> NUMA boundaries for stream.
>
> Following are the results from running 8 Stream threads with and
> without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
>
> During the testing of this patch, the tip sched/core was at
> commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace
> header printout"
>
> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) + pinning + this-patch
> + pinning
>
> Copy: 109364.74 (0.00 pct) 94220.50 (-13.84 pct) 158301.28 (44.74 pct)
> Scale: 109670.26 (0.00 pct) 90210.59 (-17.74 pct) 149525.64 (36.34 pct)
> Add: 129029.01 (0.00 pct) 101906.00 (-21.02 pct) 186658.17 (44.66 pct)
> Triad: 127260.05 (0.00 pct) 106051.36 (-16.66 pct) 184327.30 (44.84 pct)
>
> Pinning currently hurts the performance compared to unbound case on
> tip/sched/core. With the addition of this patch, we are able to
> outperform tip/sched/core by a good margin with pinning.
>
> Following are the results from running 16 Stream threads with and
> without pinning on a dual socket IceLake Machine (2 x 32C/64T):
>
> NUMA Topology of Intel Skylake machine:
> Node 1: 0,2,4,6 ... 126 (Even numbers)
> Node 2: 1,3,5,7 ... 127 (Odd numbers)
>
> Pinning is done using: numactl -C 0-15 ./stream16
>
> 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) +pinning + this-patch
> + pinning
>
> Copy: 85815.31 (0.00 pct) 149819.21 (74.58 pct) 156807.48 (82.72 pct)
> Scale: 64795.60 (0.00 pct) 97595.07 (50.61 pct) 99871.96 (54.13 pct)
> Add: 71340.68 (0.00 pct) 111549.10 (56.36 pct) 114598.33 (60.63 pct)
> Triad: 68890.97 (0.00 pct) 111635.16 (62.04 pct) 114589.24 (66.33 pct)
>
> In case of Icelake machine, with single LLC per socket, pinning across
> the two sockets reduces cache contention, thus showing great
> improvement in pinned case which is further benefited by this patch.
>
> Signed-off-by: K Prateek Nayak <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Reviewed-by: Vincent Guittot <[email protected]>
> Reviewed-by: Srikar Dronamraju <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> ---
> kernel/sched/fair.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9210,6 +9210,7 @@ find_idlest_group(struct sched_domain *s
> case group_has_spare:
> #ifdef CONFIG_NUMA
> if (sd->flags & SD_NUMA) {
> + int imb_numa_nr = sd->imb_numa_nr;
> #ifdef CONFIG_NUMA_BALANCING
> int idlest_cpu;
> /*
> @@ -9227,13 +9228,22 @@ find_idlest_group(struct sched_domain *s
> * Otherwise, keep the task close to the wakeup source
> * and improve locality if the number of running tasks
> * would remain below threshold where an imbalance is
> - * allowed. If there is a real need of migration,
> - * periodic load balance will take care of it.
> + * allowed while accounting for the possibility the
> + * task is pinned to a subset of CPUs. If there is a
> + * real need of migration, periodic load balance will
> + * take care of it.
> */
> + if (p->nr_cpus_allowed < num_online_cpus()) {
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +
> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> + imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr);
> + }
> +
> imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus);
> if (!adjust_numa_imbalance(imbalance,
> local_sgs.sum_nr_running + 1,
> - sd->imb_numa_nr)) {
> + imb_numa_nr)) {
> return NULL;
> }
> }

Thank you again for resolving the conflict and picking up
this patch.
--
Thanks and Regards,
Prateek