2022-03-08 23:31:32

by K Prateek Nayak

Subject: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

In the case of systems containing multiple LLCs per socket, like
AMD Zen systems, users want to spread bandwidth hungry applications
across multiple LLCs. Stream is one such representative workload where
the best performance is obtained by limiting one stream thread per LLC.
To ensure this, users are known to pin the tasks to a specified subset
of the CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.

Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs in the local group (0,16,32,48)
would have already been taken.

However, the current scheduler code simply checks if the number of
tasks in the local group is fewer than the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case threshold = 32), but after v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs" was merged into sched-tip as
commit e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs"), it is now equal to the number of
LLCs in the NUMA domain for processors with multiple LLCs
(in this case threshold = 8).
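
For reference, after that commit the check itself is just a comparison
against the threshold. A minimal sketch of the helper (paraphrased,
not a verbatim copy of kernel/sched/fair.c):

static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	/* Allow the imbalance while the running count stays within the threshold */
	return running <= imb_numa_nr;
}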

For this example, the number of tasks will always be within the
threshold and thus all 8 stream threads will be woken up on the first
socket, thereby resulting in sub-optimal performance.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:

stream-5045 [032] d..2. 167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
stream-5045 [032] d..2. 167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
stream-5045 [032] d..2. 167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
stream-5045 [032] d..2. 167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
stream-5045 [032] d..2. 167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
stream-5045 [032] d..2. 167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
stream-5045 [032] d..2. 167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032

Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads start piling up on these same CPUs
when clearly there are CPUs on the second socket that can be used.

Following the initial pile-up on a small number of CPUs, the
load balancer does eventually kick in, but it takes a while to get to
{4}{4}, and even {4}{4} isn't stable as we observe a bunch of
ping-ponging between {4}{4} and {5}{3} before a stable state is
reached much later (1 stream thread per allowed CPU) and no more
migration is required.

We can detect this piling and avoid it by checking whether the number
of allowed CPUs in the local group is fewer than the number of tasks
running in the local group, and using this information to spread the
5th task out onto the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:

stream-4733 [032] d..2. 116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
stream-4733 [032] d..2. 116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
stream-4733 [032] d..2. 116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
stream-4733 [032] d..2. 116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
stream-4733 [032] d..2. 116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
stream-4733 [032] d..2. 116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
stream-4733 [032] d..2. 116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080

We see that threads are using all of the allowed CPUs and there is
no pileup.

No output is generated for the sched_migrate_task tracepoint with this
patch due to a perfect initial placement, which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries - for stream.

Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8

             5.17.0-rc1              5.17.0-rc1               5.17.0-rc1
           tip sched/core          tip sched/core           tip sched/core
            (no pinning)             + pinning               + this-patch
                                                              + pinning

Copy:    97699.28 (0.00 pct)    95933.60 (-1.80 pct)   156578.91 (60.26 pct)
Scale:  107754.15 (0.00 pct)    91869.88 (-14.74 pct)  149783.25 (39.00 pct)
Add:    126383.29 (0.00 pct)   105730.86 (-16.34 pct)  186493.09 (47.56 pct)
Triad:  124896.78 (0.00 pct)   106394.38 (-14.81 pct)  184733.48 (47.90 pct)

Pinning currently hurts performance compared to the unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and
without pinning on a dual socket Skylake Machine (2 x 24C/48T):

Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16

             5.17.0-rc1              5.17.0-rc1               5.17.0-rc1
           tip sched/core          tip sched/core           tip sched/core
            (no pinning)             + pinning               + this-patch
                                                              + pinning

Copy:   126620.67 (0.00 pct)   141062.10 (11.40 pct)    147615.44 (16.58 pct)
Scale:   91313.51 (0.00 pct)   112879.61 (23.61 pct)    122591.28 (34.25 pct)
Add:    102035.43 (0.00 pct)   125889.98 (23.37 pct)    138179.01 (35.42 pct)
Triad:  102281.91 (0.00 pct)   123743.48 (20.98 pct)    138940.41 (35.84 pct)

In the case of the Skylake machine, with a single LLC per socket, we
see a good improvement brought about by pinning, which is further
improved by this patch.

Signed-off-by: K Prateek Nayak <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
Changelog v5-->v6:
- Move the cpumask variable declaration to the block it is
used in.
- Collect tags from v5.
Changelog v4-->v5:
- Only perform cpumask operations if nr_cpus_allowed is not
equal to num_online_cpus based on Mel's suggestion.
---
kernel/sched/fair.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16874e112fe6..6cc90d76250f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9183,6 +9183,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)

case group_has_spare:
if (sd->flags & SD_NUMA) {
+ int imb;
#ifdef CONFIG_NUMA_BALANCING
int idlest_cpu;
/*
@@ -9200,10 +9201,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
* Otherwise, keep the task close to the wakeup source
* and improve locality if the number of running tasks
* would remain below threshold where an imbalance is
- * allowed. If there is a real need of migration,
- * periodic load balance will take care of it.
+ * allowed while accounting for the possibility the
+ * task is pinned to a subset of CPUs. If there is a
+ * real need of migration, periodic load balance will
+ * take care of it.
*/
- if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+ imb = sd->imb_numa_nr;
+ if (p->nr_cpus_allowed != num_online_cpus()) {
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+
+ cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+ imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+ }
+ if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
return NULL;
}

--
2.25.1


2022-03-08 23:33:04

by Srikar Dronamraju

Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

* K Prateek Nayak <[email protected]> [2022-03-08 12:07:49]:

Hi Prateek,

> In the case of systems containing multiple LLCs per socket, like
> AMD Zen systems, users want to spread bandwidth hungry applications
> across multiple LLCs. Stream is one such representative workload where
> the best performance is obtained by limiting one stream thread per LLC.
> To ensure this, users are known to pin the tasks to a specified subset
> of the CPUs consisting of one CPU per LLC while running such bandwidth
> hungry tasks.
>
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> Here each CPU in the list is from a different LLC and 4 of those LLCs
> are on one socket, while the other 4 are on another socket.
>
> Ideally we would prefer that each stream thread runs on a different
> CPU from the allowed list of CPUs. However, the current heuristics in
> find_idlest_group() do not allow this during the initial placement.
>
> Suppose the first socket (0-63,128-191) is our local group from which
> we are kickstarting the stream tasks. The first four stream threads
> will be placed in this socket. When it comes to placing the 5th
> thread, all the allowed CPUs in the local group (0,16,32,48)
> would have already been taken.
>
> However, the current scheduler code simply checks if the number of
> tasks in the local group is fewer than the allowed numa-imbalance
> threshold. This threshold was previously 25% of the NUMA domain span
> (in this case threshold = 32) but after the v6 of Mel's patchset
> "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
> Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
> when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
> in the NUMA domain, for processors with multiple LLCs.
> (in this case threshold = 8).
>
> For this example, the number of tasks will always be within threshold
> and thus all the 8 stream threads will be woken up on the first socket
> thereby resulting in sub-optimal performance.
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks in the current tip/sched/core on the Zen3 machine:
>
> stream-5045 [032] d..2. 167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
> stream-5045 [032] d..2. 167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
> stream-5045 [032] d..2. 167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
> stream-5045 [032] d..2. 167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032
>
> Once the first four threads are distributed among the allowed CPUs of
> socket one, the rest of the threads start piling up on these same CPUs
> when clearly there are CPUs on the second socket that can be used.
>
> Following the initial pile up on a small number of CPUs, though the
> load-balancer eventually kicks in, it takes a while to get to {4}{4}
> and even {4}{4} isn't stable as we observe a bunch of ping ponging
> between {4}{4} to {5}{3} and back before a stable state is reached
> much later (1 Stream thread per allowed CPU) and no more migration is
> required.
>
> We can detect this piling and avoid it by checking if the number of
> allowed CPUs in the local group are fewer than the number of tasks
> running in the local group and use this information to spread the
> 5th task out into the next socket (after all, the goal in this
> slowpath is to find the idlest group and the idlest CPU during the
> initial placement!).
>

If I recollect correctly, each stream thread has its own independent
data set. However, if the threads were all to contend for the same
resource (memory), or had waker/wakee relationships, would we not end
up spreading the waker/wakee apart?


> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks after adding this fix on the Zen3 machine:
>
> stream-4733 [032] d..2. 116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
> stream-4733 [032] d..2. 116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
> stream-4733 [032] d..2. 116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
> stream-4733 [032] d..2. 116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
> stream-4733 [032] d..2. 116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
> stream-4733 [032] d..2. 116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
> stream-4733 [032] d..2. 116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
>
> We see that threads are using all of the allowed CPUs and there is
> no pileup.
>
> No output is generated for tracepoint sched_migrate_task with this
> patch due to a perfect initial placement which removes the need
> for balancing later on - both across NUMA boundaries and within
> NUMA boundaries for stream.
>
> Following are the results from running 8 Stream threads with and
> without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
>
> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> 5.17.0-rc1 5.17.0-rc1 5.17.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) +pinning + this-patch
> + pinning
>
> Copy: 97699.28 (0.00 pct) 95933.60 (-1.80 pct) 156578.91 (60.26 pct)
> Scale: 107754.15 (0.00 pct) 91869.88 (-14.74 pct) 149783.25 (39.00 pct)
> Add: 126383.29 (0.00 pct) 105730.86 (-16.34 pct) 186493.09 (47.56 pct)
> Triad: 124896.78 (0.00 pct) 106394.38 (-14.81 pct) 184733.48 (47.90 pct)
>

Do we have numbers for the with-your-patch, non-pinned case?
I would assume they would be the same as the 1st column since your
change affects only the pinned case. But I am wondering if this
problem happens in the unpinned case or not?

Also, Stream on powerpc seems to have some variation in results; did we
take a mean of runs, or are these the results of just one run?

> Pinning currently hurts the performance compared to unbound case on
> tip/sched/core. With the addition of this patch, we are able to
> outperform tip/sched/core by a good margin with pinning.
>
> Following are the results from running 16 Stream threads with and
> without pinning on a dual socket Skylake Machine (2 x 24C/48T):
>
> Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16
>
> 5.17.0-rc1 5.17.0-rc1 5.17.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) +pinning + this-patch
> + pinning
>
> Copy: 126620.67 (0.00 pct) 141062.10 (11.40 pct) 147615.44 (16.58 pct)
> Scale: 91313.51 (0.00 pct) 112879.61 (23.61 pct) 122591.28 (34.25 pct)
> Add: 102035.43 (0.00 pct) 125889.98 (23.37 pct) 138179.01 (35.42 pct)
> Triad: 102281.91 (0.00 pct) 123743.48 (20.98 pct) 138940.41 (35.84 pct)
>
> In case of Skylake machine, with single LLC per socket, we see good
> improvement brought about by pinning which is further benefited by
> this patch.
>
> Signed-off-by: K Prateek Nayak <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> ---
> Changelog v5-->v6:
> - Move the cpumask variable declaration to the block it is
> used in.
> - Collect tags from v5.
> Changelog v4-->v5:
> - Only perform cpumask operations if nr_cpus_allowed is not
> equal to num_online_cpus based on Mel's suggestion.
> ---
> kernel/sched/fair.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 16874e112fe6..6cc90d76250f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9183,6 +9183,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> case group_has_spare:
> if (sd->flags & SD_NUMA) {
> + int imb;
> #ifdef CONFIG_NUMA_BALANCING
> int idlest_cpu;
> /*
> @@ -9200,10 +9201,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> * Otherwise, keep the task close to the wakeup source
> * and improve locality if the number of running tasks
> * would remain below threshold where an imbalance is
> - * allowed. If there is a real need of migration,
> - * periodic load balance will take care of it.
> + * allowed while accounting for the possibility the
> + * task is pinned to a subset of CPUs. If there is a
> + * real need of migration, periodic load balance will
> + * take care of it.
> */
> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> + imb = sd->imb_numa_nr;
> + if (p->nr_cpus_allowed != num_online_cpus()) {

Again, repeating, is the problem only happening in the pinned case?

> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +
> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> + imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> + }
> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
> return NULL;
> }
>
> --
> 2.25.1
>

--
Thanks and Regards
Srikar Dronamraju

2022-03-09 02:09:38

by K Prateek Nayak

Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Srikar,

On 3/8/2022 2:59 PM, Srikar Dronamraju wrote:
> [..snip..]
> If I recollect correctly, each stream thread, has its independent data set.
> However if the threads were all to contend for the same resource (memory) or
> a waker/wakee relationships, would we not end up spreading the waker/wakee
> apart?

We only spread tasks based on the number of allowed cpus in the domain.
For Stream with 8 threads, where it is beneficial for the threads to
spread across LLCs, the user might pin the tasks as follows:

numactl -C 0,16,32,48,64,80,96,112 ./stream8

For communicating processes, binding as shown above would
indeed be bad for performance. Instead, the user may prefer to pin
the process to a different subset of cpus, such as:

numactl -C 0-7,128-135 ./communicating # Bind to single LLC

This will ensure the communicating threads are on the same LLC.

>> [..snip..]
>>
>> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>>
>> 5.17.0-rc1 5.17.0-rc1 5.17.0-rc1
>> tip sched/core tip sched/core tip sched/core
>> (no pinning) +pinning + this-patch
>> + pinning
>>
>> Copy: 97699.28 (0.00 pct) 95933.60 (-1.80 pct) 156578.91 (60.26 pct)
>> Scale: 107754.15 (0.00 pct) 91869.88 (-14.74 pct) 149783.25 (39.00 pct)
>> Add: 126383.29 (0.00 pct) 105730.86 (-16.34 pct) 186493.09 (47.56 pct)
>> Triad: 124896.78 (0.00 pct) 106394.38 (-14.81 pct) 184733.48 (47.90 pct)
>>
> Do we have numbers for the with-your patch - non-pinned case?
Yes, we do.

             5.17.0-rc1             5.17.0-rc1
           tip sched/core        tip sched/core
            (no pinning)          + this-patch
                                  (no pinning)

Copy:    97699.28 (0.00 pct)   106114.59  (8.61 pct)
Scale:  107754.15 (0.00 pct)   106814.48 (-0.87 pct)
Add:    126383.29 (0.00 pct)   125323.78 (-0.83 pct)
Triad:  124896.78 (0.00 pct)   123622.62 (-1.02 pct)

These are the results on the same dual socket 2 x 64C/128T machine.

Following is the output of the tracepoint sched_wakeup_new for stream with
8 threads without any pinning:

- 5.17-rc1 tip/sched/core (no pinning)

stream-4570    [025] d..2.   115.786096: sched_wakeup_new: comm=stream pid=4572 prio=120 target_cpu=048  (LLC: 6)
stream-4570    [025] d..2.   115.786160: sched_wakeup_new: comm=stream pid=4573 prio=120 target_cpu=175  (LLC: 5)
stream-4570    [025] d..2.   115.786221: sched_wakeup_new: comm=stream pid=4574 prio=120 target_cpu=000  (LLC: 0)
stream-4570    [025] d..2.   115.786271: sched_wakeup_new: comm=stream pid=4575 prio=120 target_cpu=145  (LLC: 2)
stream-4570    [025] d..2.   115.786329: sched_wakeup_new: comm=stream pid=4576 prio=120 target_cpu=138  (LLC: 1)
stream-4570    [025] d..2.   115.786375: sched_wakeup_new: comm=stream pid=4577 prio=120 target_cpu=059  (LLC: 7)
stream-4570    [025] d..2.   115.786415: sched_wakeup_new: comm=stream pid=4578 prio=120 target_cpu=036  (LLC: 4)

- 5.17-rc1 tip/sched/core + this-patch (no pinning)

stream-4537    [191] d..2.   115.926113: sched_wakeup_new: comm=stream pid=4539 prio=120 target_cpu=162  (LLC: 4)
stream-4537    [191] d..2.   115.926181: sched_wakeup_new: comm=stream pid=4540 prio=120 target_cpu=000  (LLC: 0)
stream-4537    [191] d..2.   115.926235: sched_wakeup_new: comm=stream pid=4541 prio=120 target_cpu=144  (LLC: 2)
stream-4537    [191] d..2.   115.926284: sched_wakeup_new: comm=stream pid=4542 prio=120 target_cpu=025  (LLC: 3)
stream-4537    [191] d..2.   115.926332: sched_wakeup_new: comm=stream pid=4543 prio=120 target_cpu=048  (LLC: 6)
stream-4537    [191] d..2.   115.926376: sched_wakeup_new: comm=stream pid=4544 prio=120 target_cpu=137  (LLC: 1)
stream-4537    [191] d..2.   115.926436: sched_wakeup_new: comm=stream pid=4545 prio=120 target_cpu=041  (LLC: 5)

All the threads remain on the same socket, as imb_numa_nr in NPS1 mode
is 8, but each thread gets a separate LLC.
> I would assume they would be the same as 1st column since your change
> affects only pinned case. But I am wondering if this problem happens in the
> unpinned case or not?

We see nearly identical results in the unpinned case - with and
without this patch.

> Also Stream on powerpc seems to have some variation in results, did we take
> a mean of runs, or is it just results of just one run?
The results posted are the mean of 10 runs.
>> Pinning currently hurts the performance compared to unbound case on
>> tip/sched/core. With the addition of this patch, we are able to
>> outperform tip/sched/core by a good margin with pinning.
>>
>> Following are the results from running 16 Stream threads with and
>> without pinning on a dual socket Skylake Machine (2 x 24C/48T):
>>
>> Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16
>>
>> 5.17.0-rc1 5.17.0-rc1 5.17.0-rc1
>> tip sched/core tip sched/core tip sched/core
>> (no pinning) +pinning + this-patch
>> + pinning
>>
>> Copy: 126620.67 (0.00 pct) 141062.10 (11.40 pct) 147615.44 (16.58 pct)
>> Scale: 91313.51 (0.00 pct) 112879.61 (23.61 pct) 122591.28 (34.25 pct)
>> Add: 102035.43 (0.00 pct) 125889.98 (23.37 pct) 138179.01 (35.42 pct)
>> Triad: 102281.91 (0.00 pct) 123743.48 (20.98 pct) 138940.41 (35.84 pct)
>>
>> In case of Skylake machine, with single LLC per socket, we see good
>> improvement brought about by pinning which is further benefited by
>> this patch.
>>
>> Signed-off-by: K Prateek Nayak <[email protected]>
>> Acked-by: Mel Gorman <[email protected]>
>> ---
>> Changelog v5-->v6:
>> - Move the cpumask variable declaration to the block it is
>> used in.
>> - Collect tags from v5.
>> Changelog v4-->v5:
>> - Only perform cpumask operations if nr_cpus_allowed is not
>> equal to num_online_cpus based on Mel's suggestion.
>> ---
>> kernel/sched/fair.c | 16 +++++++++++++---
>> 1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 16874e112fe6..6cc90d76250f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9183,6 +9183,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>
>> case group_has_spare:
>> if (sd->flags & SD_NUMA) {
>> + int imb;
>> #ifdef CONFIG_NUMA_BALANCING
>> int idlest_cpu;
>> /*
>> @@ -9200,10 +9201,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> * Otherwise, keep the task close to the wakeup source
>> * and improve locality if the number of running tasks
>> * would remain below threshold where an imbalance is
>> - * allowed. If there is a real need of migration,
>> - * periodic load balance will take care of it.
>> + * allowed while accounting for the possibility the
>> + * task is pinned to a subset of CPUs. If there is a
>> + * real need of migration, periodic load balance will
>> + * take care of it.
>> */
>> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>> + imb = sd->imb_numa_nr;
>> + if (p->nr_cpus_allowed != num_online_cpus()) {
> Again, repeating, is the problem only happening in the pinned case?
Yes. We've tested stream with 8 and 16 stream threads on a Zen3 system
with 16 LLCs and in both cases, with unbound runs, we've seen each
Stream thread get a separate LLC and we didn't observe any stacking.
--
Thanks and Regards,
Prateek

2022-03-09 05:33:33

by Srikar Dronamraju

Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

* K Prateek Nayak <[email protected]> [2022-03-08 17:18:16]:

> Hello Srikar,
>
> On 3/8/2022 2:59 PM, Srikar Dronamraju wrote:
> > [..snip..]


> >> @@ -9200,10 +9201,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >> * Otherwise, keep the task close to the wakeup source
> >> * and improve locality if the number of running tasks
> >> * would remain below threshold where an imbalance is
> >> - * allowed. If there is a real need of migration,
> >> - * periodic load balance will take care of it.
> >> + * allowed while accounting for the possibility the
> >> + * task is pinned to a subset of CPUs. If there is a
> >> + * real need of migration, periodic load balance will
> >> + * take care of it.
> >> */
> >> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> >> + imb = sd->imb_numa_nr;
> >> + if (p->nr_cpus_allowed != num_online_cpus()) {

> > Again, repeating, is the problem only happening in the pinned case?
> Yes. We've tested stream with 8 and 16 stream threads on a Zen3 system
> with 16 LLCs and in both cases, with unbound runs, we've seen each
> Stream thread get a separate LLC and we didn't observe any stacking.

If the problem is only happening in the pinned case, then it means that in
the unpinned case the load balancer is able to do the load balancing
correctly and quickly, but for some reason may not be able to do the same
in the pinned case. Without the patch, even in the unpinned case, the
initial CPU range spans more or less the same number of LLCs as the pinned
case. However, it is able to spread better.

I believe the problem could be in can_migrate_task() checking for
!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)

i.e. dst_cpu is doing a load balance on behalf of the entire LLC; however,
it will only pull tasks that can be pulled onto it.

> --
> Thanks and Regards,
> Prateek

--
Thanks and Regards
Srikar Dronamraju

2022-03-09 07:55:51

by K Prateek Nayak

Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Srikar,

On 3/9/2022 10:55 AM, Srikar Dronamraju wrote:
> * K Prateek Nayak <[email protected]> [2022-03-08 17:18:16]:
> [..snip..]
>> Yes. We've tested stream with 8 and 16 stream threads on a Zen3 system
>> with 16 LLCs and in both cases, with unbound runs, we've seen each
>> Stream thread get a separate LLC and we didn't observe any stacking.
> If the problem is only happening in the pinned case, then it means that in
> the unpinned case the load balancer is able to do the load balancing
> correctly and quickly, but for some reason may not be able to do the same
> in the pinned case. Without the patch, even in the unpinned case, the
> initial CPU range spans more or less the same number of LLCs as the pinned
> case. However, it is able to spread better.
The problem this patch is trying to solve is that of the initial placement
of tasks when they are pinned to a subset of cpus in a way that the number
of allowed cpus in a NUMA domain is less than imb_numa_nr.

Consider the same example of Stream running 8 threads with pinning as follows
on the dual socket Zen3 system (0-63,128-191 in one socket, 64-127,192-255 in
another socket) with 8 LLCs per socket:

numactl -C 0,16,32,48,64,80,96,112 ./stream8

The number of cpus available in each socket for this task is 4. However,
each socket consists of 8 LLCs. According to the current scheduler, all
the LLCs are available for the task to use, but in reality it can only use
4. Hence, all the stream threads are put on the same socket and they
initially have to share LLCs and the same set of cpus. This is why we see
stacking initially and this is what we are addressing.

We've observed that the load balancer does kick in and is able to spread
the tasks apart, but this takes a while and quite a few migrations before
a stable state is reached.

Following is the output of the tracepoint sched_migrate_task on 5.17-rc1
tip sched/core for stream running 8 threads with the same pinning pattern:
(Output has been slightly modified for readability)

167.928338: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=32 dest_cpu=48 START - {8}{0}
... 9 migrations
168.595632: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=16 dest_cpu=64 * {7}{1}
168.595634: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=16 dest_cpu=64 * {6}{2}
... 3 migrations
168.625803: sched_migrate_task: comm=stream pid=5052 prio=120 orig_cpu=0 dest_cpu=64 * {5}{3}
168.626146: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=16 dest_cpu=0
168.650832: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=0 dest_cpu=48
168.651009: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=64 dest_cpu=32 * {6}{2}
168.677314: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=32 dest_cpu=80 * {5}{3}
168.677320: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=48 dest_cpu=64 * {4}{4}
168.735707: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=64 dest_cpu=96
168.775510: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=80 dest_cpu=0 * {5}{3}
... 39 migrations
170.232105: sched_migrate_task: comm=stream pid=5049 prio=120 orig_cpu=0 dest_cpu=64 END {4}{4}

As we can see, 63 migrations are recorded during the runtime of the
program, which could be avoided by the correct initial placement.

As you highlight, there may be areas in the load balancer path that
can be optimized too for such a case.
> I believe the problem could be in can_migrate_task() checking for
> !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)
>
> i.e dst_cpu is doing a load balance on behalf of the entire LLC, however it
> only will pull tasks that can be pulled into it.
Please correct me if I'm wrong, but don't we scan the entire sched group
and check if there is a compatible cpu for pinned tasks in can_migrate_task()
when !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)?

If a compatible cpu is found in the group, it is stored in new_dst_cpu
member of lb_env to be used later in the load balancing path if my
understanding is correct.
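
For reference, the flow I have in mind looks roughly like the following
(a paraphrased sketch of the affinity handling in can_migrate_task(),
not a verbatim copy of kernel/sched/fair.c):

	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
		int cpu;

		env->flags |= LBF_SOME_PINNED;

		/*
		 * (Sketch: some early-return conditions, e.g. for
		 * newly-idle and active balancing, are omitted here.)
		 */
		if (env->flags & LBF_DST_PINNED || !env->dst_grpmask)
			return 0;

		/* Remember a CPU in dst's group that the task can run on */
		for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
			if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
				env->new_dst_cpu = cpu;
				env->flags |= LBF_DST_PINNED;
				break;
			}
		}

		return 0;
	}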
--
Thanks and Regards,
Prateek

2022-03-09 10:42:20

by Srikar Dronamraju

Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

* K Prateek Nayak <[email protected]> [2022-03-09 12:42:51]:

Hi Prateek,

> Hello Srikar,
>
> On 3/9/2022 10:55 AM, Srikar Dronamraju wrote:
> > * K Prateek Nayak <[email protected]> [2022-03-08 17:18:16]:
> > [..snip..]
> >> Yes. We've tested stream with 8 and 16 stream threads on a Zen3 system
> >> with 16 LLCs and in both cases, with unbound runs, we've seen each
> >> Stream thread get a separate LLC and we didn't observe any stacking.
> > If the problem is only happening in the pinned case, then it means that in
> > the unpinned case the load balancer is able to do the load balancing
> > correctly and quickly, but for some reason may not be able to do the same
> > in the pinned case. Without the patch, even in the unpinned case, the
> > initial CPU range spans more or less the same number of LLCs as the pinned
> > case. However, it is able to spread better.
> The problem this patch is trying to solve is that of the initial placement
> of task when they are pinned to a subset of cpu in a way that number of
> allowed cpus in a NUMA domain is less than the imb_numa_nr.
>

I completely understand your problem. The only missing piece is why this
initial placement is *not a problem for the unpinned case*. If we are able
to articulate how the current code works well for the unpinned case, I
would be fine.

> Consider the same example of Stream running 8 threads with pinning as follows
> on the dual socket Zen3 system (0-63,128-191 in one socket, 64-127,192-255 in
> another socket) with 8 LLCs per socket:
>
> numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> Number of cpus available in each socket for this task is 4. However, each
> socket consists of 8 LLCs. According to current scheduler, all the LLCs are
> available for task to use but in reality, it can only use 4. Hence, all
> the stream threads are put on same socket and they have to initially share
> LLCs and the same set of cpus. This is why we see stacking initially and
> this is what we are addressing.
>
> We've observed that load balancer does kick in and is able to spread the tasks
> apart but this takes a while and quite a few migrations before a stable state
> is reached.
>
> Following is the output of the tracepoint sched_migrate_task on 5.17-rc1
> tip sched/core for stream running 8 threads with the same pinning pattern:
> (Output has been slightly modified for readability)
>
> 167.928338: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=32 dest_cpu=48 START - {8}{0}
> ... 9 migrations
> 168.595632: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=16 dest_cpu=64 * {7}{1}
> 168.595634: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=16 dest_cpu=64 * {6}{2}
> ... 3 migrations
> 168.625803: sched_migrate_task: comm=stream pid=5052 prio=120 orig_cpu=0 dest_cpu=64 * {5}{3}
> 168.626146: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=16 dest_cpu=0
> 168.650832: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=0 dest_cpu=48
> 168.651009: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=64 dest_cpu=32 * {6}{2}
> 168.677314: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=32 dest_cpu=80 * {5}{3}
> 168.677320: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=48 dest_cpu=64 * {4}{4}
> 168.735707: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=64 dest_cpu=96
> 168.775510: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=80 dest_cpu=0 * {5}{3}
> ... 39 migrations
> 170.232105: sched_migrate_task: comm=stream pid=5049 prio=120 orig_cpu=0 dest_cpu=64 END {4}{4}
>
> As we can see, 63 migrations are recorded during the runtime of the
> program that can be avoided by the correct initial placement.
>
> As you highlight, there may be areas in the load balancer path that
> can be optimized too for such a case.
> > I believe the problem could be in can_migrate_task() checking for
> > !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)
> >
> > i.e dst_cpu is doing a load balance on behalf of the entire LLC, however it
> > only will pull tasks that can be pulled into it.
> Please correct me if I'm wrong but don't we scan the entire sched group
> and check if there is a compatible cpu for pinned tasks in can_migrate_task()
> when !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr) ?
>

> If a compatible cpu is found in the group, it is stored in new_dst_cpu
> member of lb_env to be used later in the load balancing path if my
> understanding is correct.

Yes, this is the LBF_DST_PINNED logic, so I am wondering if that is kicking
in correctly, because this is the only difference I see between the pinned
and unpinned cases.

> --
> Thanks and Regards,
> Prateek

--
Thanks and Regards
Srikar Dronamraju

2022-03-09 12:20:49

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Srikar,

On 3/9/2022 3:13 PM, Srikar Dronamraju wrote:
> [..snip..]
> I completely understand your problem. The only missing piece is why is this
> initial placement *not a problem for the unpinned case*. If we are able to
> articulate how the current code works well for the unpinned case, I would
> be fine.
From what I understand, the initial task placement happens as follows:

When a new task is created via fork or exec, for the initial wakeup
it takes the slow path in select_task_rq_fair() and goes to
find_idlest_cpu(). find_idlest_cpu() will explore the sched domain
hierarchy in a top-down fashion to find the idlest cpu for task to
run on.

During this, it'll call find_idlest_group() to get the idlest group
within a particular domain to search for the target cpu. In our case,
the local group will have spare capacity to accommodate tasks.
We only do a cpumask_test_cpu(this_cpu, sched_group_span(group))
to check is the task can run on a particular group.

Hence we land in the "group_has_spare" case. Here, if the sched
domain we are looking at has the SD_NUMA flag set and if
(local_sgs.sum_nr_running + 1) <= sd->imb_numa_nr
we decide to explore the local group itself in search of the target
cpu to place the task on.
If the number of tasks running in the local group is more than
sd->imb_numa_nr, or if the domain is not a NUMA domain, then the
group with the most idle cpus is chosen.

We traverse down the hierarchy, finding the idlest group of each
domain based on the number of idle cpus in the group, to find the
target cpu to place the task on.

This is just a basic picture I've painted with the details necessary
to explain correct placement in the unbound case.
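
To make that flow concrete, the walk looks roughly like this (a
paraphrased sketch of the loop in find_idlest_cpu(), not a verbatim
copy of kernel/sched/fair.c):

	while (sd) {
		struct sched_group *group;
		struct sched_domain *tmp;
		int weight;

		if (!(sd->flags & sd_flag)) {
			sd = sd->child;
			continue;
		}

		/* A NULL return means "keep the task in the local group" */
		group = find_idlest_group(sd, p, cpu);
		if (!group) {
			sd = sd->child;
			continue;
		}

		new_cpu = find_idlest_group_cpu(group, p, cpu);
		if (new_cpu == cpu) {
			/* Keep searching at lower domain levels of 'cpu' */
			sd = sd->child;
			continue;
		}

		/* Restart the search in the domains below 'new_cpu' */
		cpu = new_cpu;
		weight = sd->span_weight;
		sd = NULL;
		for_each_domain(cpu, tmp) {
			if (weight <= tmp->span_weight)
				break;
			if (tmp->flags & sd_flag)
				sd = tmp;
		}
	}

The interesting part for this discussion is the NULL return from
find_idlest_group(): that is where allow_numa_imbalance() decides
whether the task stays in the local group.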

Consider the unbound case:

- With 8 stream threads
The sched domain hierarchy will be traversed as follows:

Socket Domain ---------------> Die Domain ---------------> MC Domain -> SMT Domain -> Idle CPU
[NUMA](2 socket)               (8LLCs)                     (16 CPUs)    (2 CPUs)      (Target)

  ^                             ^
  |                             |

sd->imb_numa_nr = 8            Each stream thread gets
All 8 stream threads           its own LLC as the
are placed on same             choice of group at this
socket as up to 8              level is purely dependent
tasks can be running           on the number of idle
on one node before             cpus in the group. Once a
exploring across NUMA.         stream thread is scheduled
                               on a LLC, it contains less
                               number of idle cpus
                               compared to other LLCs.

- With 16 stream threads
The sched domain hierarchy will be traversed as follows:

Socket Domain ---------------> Die Domain ---------------> MC Domain -> SMT Domain -> Idle CPU
[NUMA](2 socket)               (8LLCs)                     (16 CPUs)    (2 CPUs)      (Target)

  ^                             ^
  |                             |

sd->imb_numa_nr = 8            Each stream thread gets
First 8 stream threads         its own LLC as the
are placed on same             choice of group at this
socket as up to 8              level is purely dependent
tasks can be running           on the number of idle
on one node before             cpus in the group.
exploring across NUMA.
                              
Starting from the 9th          First 8 thread will go to
thread, the placement          the 8LLCs in the first
will happen based on           socket and next 8 will
the number of idle cpus.       take one LLC each in the
8 CPUs in one socket           second socket.
will be running the
8 stream threads
previously scheduled so
the rest of the threads
will be scheduled on
the second socket.

If we bind 8 threads to 4 cpus on one socket and 4 cpus
on another, there is currently no mechanism that tells the
scheduler that it can explore the other socket after
scheduling the first four threads on one.
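
The hunk in this patch is meant to provide exactly that mechanism by
clamping the threshold to the number of allowed cpus in the local
group. Roughly, for this example (illustration only, based on the hunk
in the patch):

	cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
	imb = min(cpumask_weight(cpus), sd->imb_numa_nr);	/* min(4, 8) = 4 */

so when the 5th thread is placed, local_sgs.sum_nr_running + 1 == 5
exceeds imb, allow_numa_imbalance() no longer holds,
find_idlest_group() stops returning NULL, and the search moves on to
the second socket.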
> [..snip..]
>> If a compatible cpu is found in the group, it is stored in new_dst_cpu
>> member of lb_env to be used later in the load balancing path if my
>> understanding is correct.
> Yes, this is the LBF_DST_PINNED logic, So I am wondering if that is kicking
> in correctly because this is the only difference I see between pinned and
> unpinned.
Ah! I see. But I do believe this problem of initial
placement lies along the wakeup path.
--
Thanks and Regards,
Prateek

2022-03-14 16:39:49

by Srikar Dronamraju

Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

* K Prateek Nayak <[email protected]> [2022-03-08 12:07:49]:

> In the case of systems containing multiple LLCs per socket, like
> AMD Zen systems, users want to spread bandwidth hungry applications
> across multiple LLCs. Stream is one such representative workload where
> the best performance is obtained by limiting one stream thread per LLC.
> To ensure this, users are known to pin the tasks to a specified subset
> of the CPUs consisting of one CPU per LLC while running such bandwidth
> hungry tasks.
>
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> Here each CPU in the list is from a different LLC and 4 of those LLCs
> are on one socket, while the other 4 are on another socket.
>
> Ideally we would prefer that each stream thread runs on a different
> CPU from the allowed list of CPUs. However, the current heuristics in
> find_idlest_group() do not allow this during the initial placement.
>
> Suppose the first socket (0-63,128-191) is our local group from which
> we are kickstarting the stream tasks. The first four stream threads
> will be placed in this socket. When it comes to placing the 5th
> thread, all the allowed CPUs in the local group (0,16,32,48)
> would have already been taken.
>
> However, the current scheduler code simply checks if the number of
> tasks in the local group is fewer than the allowed numa-imbalance
> threshold. This threshold was previously 25% of the NUMA domain span
> (in this case threshold = 32) but after the v6 of Mel's patchset
> "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
> Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
> when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
> in the NUMA domain, for processors with multiple LLCs.
> (in this case threshold = 8).
>
> For this example, the number of tasks will always be within threshold
> and thus all the 8 stream threads will be woken up on the first socket
> thereby resulting in sub-optimal performance.
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks in the current tip/sched/core on the Zen3 machine:
>
> stream-5045 [032] d..2. 167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
> stream-5045 [032] d..2. 167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
> stream-5045 [032] d..2. 167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
> stream-5045 [032] d..2. 167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032
>
> Once the first four threads are distributed among the allowed CPUs of
> socket one, the rest of the threads start piling up on these same CPUs
> when clearly there are CPUs on the second socket that can be used.
>
> Following the initial pile up on a small number of CPUs, though the
> load-balancer eventually kicks in, it takes a while to get to {4}{4}
> and even {4}{4} isn't stable as we observe a bunch of ping ponging
> between {4}{4} to {5}{3} and back before a stable state is reached
> much later (1 Stream thread per allowed CPU) and no more migration is
> required.
>
> We can detect this piling and avoid it by checking if the number of
> allowed CPUs in the local group are fewer than the number of tasks
> running in the local group and use this information to spread the
> 5th task out into the next socket (after all, the goal in this
> slowpath is to find the idlest group and the idlest CPU during the
> initial placement!).
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks after adding this fix on the Zen3 machine:
>
> stream-4733 [032] d..2. 116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
> stream-4733 [032] d..2. 116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
> stream-4733 [032] d..2. 116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
> stream-4733 [032] d..2. 116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
> stream-4733 [032] d..2. 116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
> stream-4733 [032] d..2. 116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
> stream-4733 [032] d..2. 116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
>
> We see that threads are using all of the allowed CPUs and there is
> no pileup.
>
> No output is generated for tracepoint sched_migrate_task with this
> patch due to a perfect initial placement which removes the need
> for balancing later on - both across NUMA boundaries and within
> NUMA boundaries for stream.
>
> Following are the results from running 8 Stream threads with and
> without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
>
> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> 5.17.0-rc1 5.17.0-rc1 5.17.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) +pinning + this-patch
> + pinning
>
> Copy: 97699.28 (0.00 pct) 95933.60 (-1.80 pct) 156578.91 (60.26 pct)
> Scale: 107754.15 (0.00 pct) 91869.88 (-14.74 pct) 149783.25 (39.00 pct)
> Add: 126383.29 (0.00 pct) 105730.86 (-16.34 pct) 186493.09 (47.56 pct)
> Triad: 124896.78 (0.00 pct) 106394.38 (-14.81 pct) 184733.48 (47.90 pct)
>
> Pinning currently hurts the performance compared to unbound case on
> tip/sched/core. With the addition of this patch, we are able to
> outperform tip/sched/core by a good margin with pinning.
>
> Following are the results from running 16 Stream threads with and
> without pinning on a dual socket Skylake Machine (2 x 24C/48T):
>
> Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16
>
> 5.17.0-rc1 5.17.0-rc1 5.17.0-rc1
> tip sched/core tip sched/core tip sched/core
> (no pinning) +pinning + this-patch
> + pinning
>
> Copy: 126620.67 (0.00 pct) 141062.10 (11.40 pct) 147615.44 (16.58 pct)
> Scale: 91313.51 (0.00 pct) 112879.61 (23.61 pct) 122591.28 (34.25 pct)
> Add: 102035.43 (0.00 pct) 125889.98 (23.37 pct) 138179.01 (35.42 pct)
> Triad: 102281.91 (0.00 pct) 123743.48 (20.98 pct) 138940.41 (35.84 pct)
>
> In case of Skylake machine, with single LLC per socket, we see good
> improvement brought about by pinning which is further benefited by
> this patch.
>
> Signed-off-by: K Prateek Nayak <[email protected]>
> Acked-by: Mel Gorman <[email protected]>


Looks good to me.

Reviewed-by: Srikar Dronamraju <[email protected]>

> ---
> Changelog v5-->v6:
> - Move the cpumask variable declaration to the block it is
> used in.
> - Collect tags from v5.
> Changelog v4-->v5:
> - Only perform cpumask operations if nr_cpus_allowed is not
> equal to num_online_cpus based on Mel's suggestion.
> ---
> kernel/sched/fair.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 16874e112fe6..6cc90d76250f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9183,6 +9183,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> case group_has_spare:
> if (sd->flags & SD_NUMA) {
> + int imb;
> #ifdef CONFIG_NUMA_BALANCING
> int idlest_cpu;
> /*
> @@ -9200,10 +9201,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> * Otherwise, keep the task close to the wakeup source
> * and improve locality if the number of running tasks
> * would remain below threshold where an imbalance is
> - * allowed. If there is a real need of migration,
> - * periodic load balance will take care of it.
> + * allowed while accounting for the possibility the
> + * task is pinned to a subset of CPUs. If there is a
> + * real need of migration, periodic load balance will
> + * take care of it.
> */
> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> + imb = sd->imb_numa_nr;
> + if (p->nr_cpus_allowed != num_online_cpus()) {
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +
> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> + imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> + }
> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
> return NULL;
> }
>
> --
> 2.25.1
>

--
Thanks and Regards
Srikar Dronamraju

2022-03-17 06:20:37

by Srikar Dronamraju

Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

* K Prateek Nayak <[email protected]> [2022-03-09 17:00:33]:

> Hello Srikar,
>
> On 3/9/2022 3:13 PM, Srikar Dronamraju wrote:
> > [..snip..]
> > I completely understand your problem. The only missing piece is why is this
> > initial placement *not a problem for the unpinned case*. If we are able to
> > articulate how the current code works well for the unpinned case, I would
> > be fine.
> From what I understand, the initial task placement happens as follows:
>
> When a new task is created via fork or exec, for the initial wakeup
> it takes the slow path in select_task_rq_fair() and goes to
> find_idlest_cpu(). find_idlest_cpu() will explore the sched domain
> hierarchy in a top-down fashion to find the idlest cpu for task to
> run on.
>
> During this, it'll call find_idlest_group() to get the idlest group
> within a particular domain to search for the target cpu. In our case,
> the local group will have spare capacity to accommodate tasks.
> We only do a cpumask_test_cpu(this_cpu, sched_group_span(group))
> to check is the task can run on a particular group.
>

[snip]

Ok, Prateek, I do understand the intent here.
Thanks for spending the time to explain the same.

> Ah! I see. But I do believe this problem of initial
> placement lies along the wakeup path.
> --
> Thanks and Regards,
> Prateek

--
Thanks and Regards
Srikar Dronamraju