2022-02-17 07:51:46

by K Prateek Nayak

Subject: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

In AMD Zen-like systems that contain multiple LLCs per socket, users
want to spread bandwidth-hungry applications across multiple LLCs.
Stream is one such representative workload where the best performance
is obtained by limiting one stream thread per LLC. To ensure this,
users are known to pin such bandwidth-hungry tasks to a subset of the
CPUs consisting of one CPU per LLC.

Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. By the time the 5th thread is to be
placed, all the allowed CPUs in the local group (0,16,32,48) will
already have been taken.

However, the current scheduler code simply checks whether the number
of tasks in the local group is within the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case, threshold = 32), but after v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs" was merged into sched-tip as
commit e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs"), it is now equal to the number of
LLCs in the NUMA domain for processors with multiple LLCs (in this
case, threshold = 8).
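
For reference, the helper that gates this imbalance after that commit
is roughly the following (a paraphrase of the tip/sched/core code at
the time, shown only for context and not part of this patch's diff):

  /*
   * Paraphrased for context: the NUMA imbalance is allowed as long as
   * the running-task count stays within the per-domain threshold,
   * which the commit above sets to the number of LLCs in the domain.
   */
  static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
  {
          return running <= imb_numa_nr;
  }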

For this example, the number of tasks will always be within the
threshold, and thus all 8 stream threads will be woken up on the
first socket, resulting in sub-optimal performance.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:

stream-5045 [032] d..2. 167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
stream-5045 [032] d..2. 167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
stream-5045 [032] d..2. 167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
stream-5045 [032] d..2. 167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
stream-5045 [032] d..2. 167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
stream-5045 [032] d..2. 167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
stream-5045 [032] d..2. 167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032

Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads start piling onto these same CPUs
even though there are clearly CPUs on the second socket that can be used.

Following the initial pile-up on a small number of CPUs, the
load-balancer does eventually kick in, but it takes a while to get to
{4}{4}, and even {4}{4} isn't stable: we observe a bunch of
ping-ponging between {4}{4} and {5}{3} before a stable state
(1 stream thread per allowed CPU) is reached much later and no more
migrations are required.

We can detect this piling and avoid it by checking whether the number
of allowed CPUs in the local group is smaller than the number of tasks
that would be running in the local group, and use this information to
spread the 5th task out onto the next socket (after all, the goal in
this slowpath is to find the idlest group and the idlest CPU during
the initial placement!).
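
To make the arithmetic concrete for the example above (the numbers
follow from the topology described earlier and the diff below): the
local group spans CPUs 0-63,128-191, and intersecting that span with
the allowed mask 0,16,32,48,64,80,96,112 leaves 4 CPUs, so
imb = min(4, sd->imb_numa_nr = 8) = 4. When the 5th thread is placed,
local_sgs.sum_nr_running + 1 = 5 exceeds imb, the imbalance is no
longer allowed, and find_idlest_group() goes on to pick the idlest
group on the second socket instead of returning NULL (staying local).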

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:

stream-4733 [032] d..2. 116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
stream-4733 [032] d..2. 116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
stream-4733 [032] d..2. 116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
stream-4733 [032] d..2. 116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
stream-4733 [032] d..2. 116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
stream-4733 [032] d..2. 116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
stream-4733 [032] d..2. 116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080

We see that threads are using all of the allowed CPUs and there is
no pileup.

No output is generated for tracepoint sched_migrate_task with this
patch due to a perfect initial placement which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries for stream.

Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8

                5.17.0-rc1               5.17.0-rc1                5.17.0-rc1
              tip sched/core           tip sched/core            tip sched/core
               (no pinning)              + pinning                + this-patch
                                                                   + pinning

Copy:    97699.28 (0.00 pct)     95933.60 (-1.80 pct)     156578.91 (60.26 pct)
Scale:  107754.15 (0.00 pct)     91869.88 (-14.74 pct)    149783.25 (39.00 pct)
Add:    126383.29 (0.00 pct)    105730.86 (-16.34 pct)    186493.09 (47.56 pct)
Triad:  124896.78 (0.00 pct)    106394.38 (-14.81 pct)    184733.48 (47.90 pct)

Pinning currently hurts performance compared to the unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and
without pinning on a dual socket Skylake Machine (2 x 24C/48T):

Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16

                5.17.0-rc1               5.17.0-rc1                5.17.0-rc1
              tip sched/core           tip sched/core            tip sched/core
               (no pinning)              + pinning                + this-patch
                                                                   + pinning

Copy:   126620.67 (0.00 pct)    141062.10 (11.40 pct)    147615.44 (16.58 pct)
Scale:   91313.51 (0.00 pct)    112879.61 (23.61 pct)    122591.28 (34.25 pct)
Add:    102035.43 (0.00 pct)    125889.98 (23.37 pct)    138179.01 (35.42 pct)
Triad:  102281.91 (0.00 pct)    123743.48 (20.98 pct)    138940.41 (35.84 pct)

In the case of the Skylake machine, with a single LLC per socket, we
see a good improvement brought about by pinning, which is further
improved by this patch.

Signed-off-by: K Prateek Nayak <[email protected]>
---
Changelog v3-->v4:
- Update the patch diff to use Mel's suggestion on v1
which was left out in v3.
Changelog v2-->v3:
- More detailed commit log highlighting the problem.
- Include numbers for dual socket Intel Skylake machine.
Changelog v1-->v2:
- Rebase changes on top of v6 of Mel's
"Adjust NUMA imbalance for multiple LLCs" patchset.
- Reuse select_idle_mask ptr to store result of cpumask_and
based on Mel's suggestion.
---
kernel/sched/fair.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4bfffe8c2c..6e875f1f34e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9130,6 +9130,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)

         case group_has_spare:
                 if (sd->flags & SD_NUMA) {
+                        struct cpumask *cpus;
+                        int imb;
 #ifdef CONFIG_NUMA_BALANCING
                         int idlest_cpu;
                         /*
@@ -9147,10 +9149,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
                          * Otherwise, keep the task close to the wakeup source
                          * and improve locality if the number of running tasks
                          * would remain below threshold where an imbalance is
-                         * allowed. If there is a real need of migration,
-                         * periodic load balance will take care of it.
+                         * allowed while accounting for the possibility the
+                         * task is pinned to a subset of CPUs. If there is a
+                         * real need of migration, periodic load balance will
+                         * take care of it.
                          */
-                        if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
+                        cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+                        cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
+                        imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
+                        if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
                                 return NULL;
                 }

--
2.25.1


2022-02-17 12:30:39

by K Prateek Nayak

Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Mel,


Thank you for looking into the patch.

On 2/17/2022 3:35 PM, Mel Gorman wrote:
> Thanks Prateek,
>
> On Thu, Feb 17, 2022 at 11:24:08AM +0530, K Prateek Nayak wrote:
>> [..snip..]
>>
>> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>>
> In this case the stream threads can use any CPU of the subset, presumably
> this is parallelised with OpenMP without specifying spread or bind
> directives.
Yes it is parallelized using OpenMP without specifying any directive.
> [..snip..]
> One concern I have is that we incur a cpumask setup and cpumask_weight
> cost on every clone whether a restricted CPU mask is used or not. Peter,
> is it acceptable to avoid the cpumask check if there are no restrictions
> on allowed cpus like this?
>
> imb = sd->imb_numa_nr;
> if (p->nr_cpus_allowed != num_online_cpus()) {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>
>         cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
>         imb = min(cpumask_weight(cpus), imb);
> }
Can we optimize this further as:

imb = sd->imb_numa_nr;
if (unlikely(p->nr_cpus_allowed != num_online_cpus())) {
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

        cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
        imb = min(cpumask_weight(cpus), imb);
}

For the most part, p->nr_cpus_allowed will be equal to num_online_cpus()
unless the user has specifically pinned the task.

--
Thanks and Regards,
Prateek

2022-02-17 12:52:23

by Mel Gorman

Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Thanks Prateek,

On Thu, Feb 17, 2022 at 11:24:08AM +0530, K Prateek Nayak wrote:
> In AMD Zen-like systems that contain multiple LLCs per socket, users
> want to spread bandwidth-hungry applications across multiple LLCs.
> Stream is one such representative workload where the best performance
> is obtained by limiting one stream thread per LLC. To ensure this,
> users are known to pin such bandwidth-hungry tasks to a subset of the
> CPUs consisting of one CPU per LLC.
>
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>

In this case the stream threads can use any CPU of the subset, presumably
this is parallelised with OpenMP without specifying spread or bind
directives.

> stream-5045 [032] d..2. 167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
> stream-5045 [032] d..2. 167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
> stream-5045 [032] d..2. 167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
> stream-5045 [032] d..2. 167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
> stream-5045 [032] d..2. 167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032
>

Resulting in some stacking with the baseline

> stream-4733 [032] d..2. 116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
> stream-4733 [032] d..2. 116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
> stream-4733 [032] d..2. 116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
> stream-4733 [032] d..2. 116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
> stream-4733 [032] d..2. 116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
> stream-4733 [032] d..2. 116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
> stream-4733 [032] d..2. 116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
>

And no stacking with your patch. So far so good.

> <SNIP>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c4bfffe8c2c..6e875f1f34e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9130,6 +9130,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
>          case group_has_spare:
>                  if (sd->flags & SD_NUMA) {
> +                        struct cpumask *cpus;
> +                        int imb;
>  #ifdef CONFIG_NUMA_BALANCING
>                          int idlest_cpu;
>                          /*
> @@ -9147,10 +9149,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                           * Otherwise, keep the task close to the wakeup source
>                           * and improve locality if the number of running tasks
>                           * would remain below threshold where an imbalance is
> -                         * allowed. If there is a real need of migration,
> -                         * periodic load balance will take care of it.
> +                         * allowed while accounting for the possibility the
> +                         * task is pinned to a subset of CPUs. If there is a
> +                         * real need of migration, periodic load balance will
> +                         * take care of it.
>                           */
> -                        if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> +                        cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +                        cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> +                        imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> +                        if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
>                                  return NULL;

One concern I have is that we incur a cpumask setup and cpumask_weight
cost on every clone whether a restricted CPU mask is used or not. Peter,
is it acceptable to avoid the cpumask check if there are no restrictions
on allowed cpus like this?

imb = sd->imb_numa_nr;
if (p->nr_cpus_allowed != num_online_cpus()) {
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

        cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
        imb = min(cpumask_weight(cpus), imb);
}

It's not perfect as a hotplug event could occur but that would be a fairly
harmless race with a limited impact (race with hotplug during clone may
stack for a short interval before LB intervenes).

--
Mel Gorman
SUSE Labs

2022-02-17 16:06:40

by Mel Gorman

Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On Thu, Feb 17, 2022 at 04:53:51PM +0530, K Prateek Nayak wrote:
> Hello Mel,
>
>
> Thank you for looking into the patch.
>
> On 2/17/2022 3:35 PM, Mel Gorman wrote:
> > Thanks Prateek,
> >
> > On Thu, Feb 17, 2022 at 11:24:08AM +0530, K Prateek Nayak wrote:
> >> [..snip..]
> >>
> >> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
> >>
> > In this case the stream threads can use any CPU of the subset, presumably
> > this is parallelised with OpenMP without specifying spread or bind
> > directives.
> Yes it is parallelized using OpenMP without specifying any directive.
> > [..snip..]
> > One concern I have is that we incur a cpumask setup and cpumask_weight
> > cost on every clone whether a restricted CPU mask is used or not. Peter,
> > is it acceptable to avoid the cpumask check if there are no restrictions
> > on allowed cpus like this?
> >
> > imb = sd->imb_numa_nr;
> > if (p->nr_cpus_allowed != num_online_cpus()) {
> >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> >
> >         cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> >         imb = min(cpumask_weight(cpus), imb);
> > }
> Can we optimize this further as:
>
> imb = sd->imb_numa_nr;
> if (unlikely(p->nr_cpus_allowed != num_online_cpus())) {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>
>         cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
>         imb = min(cpumask_weight(cpus), imb);
> }
>
> For the most part, p->nr_cpus_allowed will be equal to num_online_cpus()
> unless the user has specifically pinned the task.
>

I'm a little wary due to https://lwn.net/Articles/420019/ raising concerns
from people that feel more strongly about likely/unlikely use. Whether that
branch is likely true or not is specific to the deployment. On my desktop
and most tests I run, the branch is very unlikely because most workloads
I run are usually not CPU-constrained and not fork-intensive. Even those
that are CPU constrained are generally not fork intensive. For a setup with
lots of containers, virtual machines, locality-aware applications etc,
the path is potentially very likely and harder to detect in the future.

I don't object to the change but I would wonder if it's measurable for
anything other than a fork-intensive microbenchmark given it's one branch
in a relatively heavy operation.

I think a relatively harmless micro-optimisation would be

- imb = min(cpumask_weight(cpus), imb);
+ imb = cpumask_weight(cpus);

It assumes that the constrained cpus_allowed would have a lower imb
than one calculated based on all cpus allowed which sounds like a safe
assumption other than racing with hot-onlining a bunch of CPUs.
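
As a concrete check against the stream example in the changelog: there
the allowed mask intersected with the local group span has a weight of
4 while sd->imb_numa_nr is 8, so min(4, 8) and a plain
cpumask_weight(cpus) give the same imb = 4.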

I think both micro-optimisations are negligible in comparison to avoiding
an unnecessary cpumask_and and cpumask_weight call. FWIW, I looked at my own
use of likely/unlikely recently and it's

c49c2c47dab6b8d45022b3fabf0642a0e62e3109 unlikely that memory hotplug operation is in progress
3b12e7e97938424de2bb1b95ba0bd6a49bad39f9 hotplug active or machine booting
df1acc856923c0a65c28b588585449106c316b71 memory isolated for hotplug or CMA attempt in progress
56f0e661ea8c0178e80048df7166653a51ef2c3d memory isolated for hotplug or CMA attempt in progress
b3b64ebd38225d8032b5db42938d969b602040c2 bulk allocation request with an array that already has pages

Of those, the last one is the most marginal because it really depends
on whether network core or NFS is the heavy user of the interface and
I made a guess that high-speed networks are more common critical paths
than NFS servers.

--
Mel Gorman
SUSE Labs

2022-02-18 04:00:10

by K Prateek Nayak

Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Mel,

Thank you for the feedback.

On 2/17/2022 6:45 PM, Mel Gorman wrote:
> On Thu, Feb 17, 2022 at 04:53:51PM +0530, K Prateek Nayak wrote:
>> [..snip..]
>> Can we optimize this further as:
>>
>> imb = sd->imb_numa_nr;
>> if (unlikely(p->nr_cpus_allowed != num_online_cpus())) {
>>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>>
>>         cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
>>         imb = min(cpumask_weight(cpus), imb);
>> }
>>
>> For the most part, p->nr_cpus_allowed will be equal to num_online_cpus()
>> unless the user has specifically pinned the task.
>>
> I'm a little wary due to https://lwn.net/Articles/420019/ raising concerns
> from people that feel more strongly about likely/unlikely use.
I wasn't aware of this. Thank you for pointing this out to me.
> Whether that
> branch is likely true or not is specific to the deployment. On my desktop
> and most tests I run, the branch is very unlikely because most workloads
> I run are usually not CPU-constrained and not fork-intensive. Even those
> that are CPU constrained are generally not fork intensive. For a setup with
> lots of containers, virtual machines, locality-aware applications etc,
> the path is potentially very likely and harder to detect in the future.
Yes, you make a good point.
> I don't object to the change but I would wonder if it's measurable for
> anything other than a fork-intensive microbenchmark given it's one branch
> in a relatively heavy operation.
>
> I think a relatively harmless micro-optimisation would be
>
> - imb = min(cpumask_weight(cpus), imb);
> + imb = cpumask_weight(cpus);
>
> It assumes that the constrained cpus_allowed would have a lower imb
> than one calculated based on all cpus allowed which sounds like a safe
> assumption other than racing with hot-onlining a bunch of CPUs.
This is a good micro-optimization as long as the assumption holds
true.
> I think both micro-optimisations are negligible in comparison to avoiding
> an unnecessary cpumask_and and cpumask_weight call.
I agree. Checking for p->nr_cpus_allowed != num_online_cpus() will
avoid the relatively expensive cpumask operations.
> FWIW, I looked at my own
> use of likely/unlikely recently and it's
>
> c49c2c47dab6b8d45022b3fabf0642a0e62e3109 unlikely that memory hotplug operation is in progress
> 3b12e7e97938424de2bb1b95ba0bd6a49bad39f9 hotplug active or machine booting
> df1acc856923c0a65c28b588585449106c316b71 memory isolated for hotplug or CMA attempt in progress
> 56f0e661ea8c0178e80048df7166653a51ef2c3d memory isolated for hotplug or CMA attempt in progress
> b3b64ebd38225d8032b5db42938d969b602040c2 bulk allocation request with an array that already has pages
>
> Of those, the last one is the most marginal because it really depends
> on whether network core or NFS is the heavy user of the interface and
> I made a guess that high-speed networks are more common critical paths
> than NFS servers.
Thank you for giving these examples. I considered the case of the
branch being taken to be very unlikely based on my workloads but, as
you pointed out, there may be other cases where its outcome might not
be so predictable all the time.

--
Thanks and Regards,
Prateek

2022-02-22 06:15:03

by K Prateek Nayak

Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Mel,

On 2/17/2022 6:45 PM, Mel Gorman wrote:
> I don't object to the change but I would wonder if it's measurable for
> anything other than a fork-intensive microbenchmark given it's one branch
> in a relatively heavy operation.

I used stress-ng to see if we get any measurable difference from the
optimizations in a fork intensive scenario. I'm measuring time in ns
between the events sched_process_fork and sched_wakeup_new for the
stress-ng processes.

Following are the results from testing:

- Un-affined runs:
Command: stress-ng -t 30s --exec <Worker>

Kernel versions:
- balance-wake - This patch
- branch - This patch + Mel's suggested branch
- branch-unlikely - This patch + Mel's suggested branch + unlikely

Result format: Amean in ns [Co-eff of Var] (% Improvement)

Workers balance-wake branch branch-unlikely
1 18613.20 [0.01] (0.00 pct) 18348.00 [0.04] (1.42 pct) 18299.20 [0.02] (1.69 pct)
2 18634.40 [0.03] (0.00 pct) 18163.80 [0.04] (2.53 pct) 19037.80 [0.05] (-2.16 pct)
4 20997.40 [0.02] (0.00 pct) 20980.80 [0.02] (0.08 pct) 21527.40 [0.02] (-2.52 pct)
8 20890.20 [0.01] (0.00 pct) 19714.60 [0.07] (5.63 pct) 20021.40 [0.05] (4.16 pct)
16 21200.20 [0.02] (0.00 pct) 20564.40 [0.00] (3.00 pct) 20676.00 [0.01] (2.47 pct)
32 21301.80 [0.02] (0.00 pct) 20767.40 [0.02] (2.51 pct) 20945.00 [0.01] (1.67 pct)
64 22772.40 [0.01] (0.00 pct) 22505.00 [0.01] (1.17 pct) 22629.40 [0.00] (0.63 pct)
128 25843.00 [0.01] (0.00 pct) 25124.80 [0.00] (2.78 pct) 25377.40 [0.00] (1.80 pct)
256 18691.00 [0.02] (0.00 pct) 19086.40 [0.05] (-2.12 pct) 18013.00 [0.04] (3.63 pct)
512 19658.40 [0.03] (0.00 pct) 19568.80 [0.01] (0.46 pct) 18972.00 [0.02] (3.49 pct)
1024 19126.80 [0.04] (0.00 pct) 18762.80 [0.02] (1.90 pct) 18878.20 [0.04] (1.30 pct)

- Affined runs:
Command: taskset -c 0-254 stress-ng -t 30s --exec <Worker>

Kernel versions:
- balance-wake-affine - This patch + affined run
- branch-affine - This patch + Mel's suggested branch + affined run
- branch-unlikely-affine - This patch + Mel's suggested branch + unlikely + affined run

Result format: Amean in ns [Co-eff of Var] (% Improvement)

Workers balance-wake-affine branch-affine branch-unlikely-affine
1 18515.00 [0.01] (0.00 pct) 18538.00 [0.02] (-0.12 pct) 18568.40 [0.01] (-0.29 pct)
2 17882.80 [0.01] (0.00 pct) 19627.80 [0.09] (-9.76 pct) 18790.40 [0.01] (-5.08 pct)
4 21204.20 [0.01] (0.00 pct) 21410.60 [0.04] (-0.97 pct) 21715.20 [0.03] (-2.41 pct)
8 20840.20 [0.01] (0.00 pct) 19684.60 [0.07] (5.55 pct) 21074.20 [0.02] (-1.12 pct)
16 21115.20 [0.02] (0.00 pct) 20823.00 [0.01] (1.38 pct) 20719.80 [0.00] (1.87 pct)
32 21159.00 [0.02] (0.00 pct) 21371.20 [0.01] (-1.00 pct) 21253.20 [0.01] (-0.45 pct)
64 22768.20 [0.01] (0.00 pct) 22816.80 [0.00] (-0.21 pct) 22662.00 [0.00] (0.47 pct)
128 25671.80 [0.00] (0.00 pct) 25528.20 [0.00] (0.56 pct) 25404.00 [0.00] (1.04 pct)
256 27209.00 [0.01] (0.00 pct) 26751.00 [0.01] (1.68 pct) 26733.20 [0.00] (1.75 pct)
512 20241.00 [0.03] (0.00 pct) 19378.60 [0.03] (4.26 pct) 19671.40 [0.00] (2.81 pct)
1024 19380.80 [0.05] (0.00 pct) 18940.40 [0.02] (2.27 pct) 19071.80 [0.00] (1.59 pct)


With or without the unlikely, adding the check before doing the
cpumask operation benefits most cases of un-affined tasks.

--
Thanks and Regards,
Prateek

2022-02-22 09:54:18

by Mel Gorman

Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

On Tue, Feb 22, 2022 at 11:30:48AM +0530, K Prateek Nayak wrote:
> Hello Mel,
>
> On 2/17/2022 6:45 PM, Mel Gorman wrote:
> > I don't object to the change but I would wonder if it's measurable for
> > anything other than a fork-intensive microbenchmark given it's one branch
> > in a relatively heavy operation.
>
> I used stress-ng to see if we get any measurable difference from the
> optimizations in a fork intensive scenario. I'm measuring time in ns
> between the events sched_process_fork and sched_wakeup_new for the
> stress-ng processes.
>
> Following are the results from testing:
>
> - Un-affined runs:
> Command: stress-ng -t 30s --exec <Worker>
>
> Kernel versions:
> - balance-wake - This patch
> - branch - This patch + Mel's suggested branch
> - branch-unlikely - This patch + Mel's suggested branch + unlikely
>
> Result format: Amean in ns [Co-eff of Var] (% Improvement)
>
> Workers balance-wake branch branch-unlikely
> 1 18613.20 [0.01] (0.00 pct) 18348.00 [0.04] (1.42 pct) 18299.20 [0.02] (1.69 pct)
> 2 18634.40 [0.03] (0.00 pct) 18163.80 [0.04] (2.53 pct) 19037.80 [0.05] (-2.16 pct)
> 4 20997.40 [0.02] (0.00 pct) 20980.80 [0.02] (0.08 pct) 21527.40 [0.02] (-2.52 pct)
> 8 20890.20 [0.01] (0.00 pct) 19714.60 [0.07] (5.63 pct) 20021.40 [0.05] (4.16 pct)
> 16 21200.20 [0.02] (0.00 pct) 20564.40 [0.00] (3.00 pct) 20676.00 [0.01] (2.47 pct)
> 32 21301.80 [0.02] (0.00 pct) 20767.40 [0.02] (2.51 pct) 20945.00 [0.01] (1.67 pct)
> 64 22772.40 [0.01] (0.00 pct) 22505.00 [0.01] (1.17 pct) 22629.40 [0.00] (0.63 pct)
> 128 25843.00 [0.01] (0.00 pct) 25124.80 [0.00] (2.78 pct) 25377.40 [0.00] (1.80 pct)
> 256 18691.00 [0.02] (0.00 pct) 19086.40 [0.05] (-2.12 pct) 18013.00 [0.04] (3.63 pct)
> 512 19658.40 [0.03] (0.00 pct) 19568.80 [0.01] (0.46 pct) 18972.00 [0.02] (3.49 pct)
> 1024 19126.80 [0.04] (0.00 pct) 18762.80 [0.02] (1.90 pct) 18878.20 [0.04] (1.30 pct)
>

Co-eff of variance looks low but for the lower counts before the machine
is saturated (>=256?) it does not look like it helps and if anything,
it hurts. A branch mispredict profile might reveal more but I doubt
it's worth the effort at this point.

> - Affined runs:
> Command: taskset -c 0-254 stress-ng -t 30s --exec <Worker>
>
> Kernel versions:
> - balance-wake-affine - This patch + affined run
> - branch-affine - This patch + Mel's suggested branch + affined run
> - branch-unlikely-affine - This patch + Mel's suggested branch + unlikely + affined run
>
> Result format: Amean in ns [Co-eff of Var] (% Improvement)
>
> Workers balance-wake-affine branch-affine branch-unlikely-affine
> 1 18515.00 [0.01] (0.00 pct) 18538.00 [0.02] (-0.12 pct) 18568.40 [0.01] (-0.29 pct)
> 2 17882.80 [0.01] (0.00 pct) 19627.80 [0.09] (-9.76 pct) 18790.40 [0.01] (-5.08 pct)
> 4 21204.20 [0.01] (0.00 pct) 21410.60 [0.04] (-0.97 pct) 21715.20 [0.03] (-2.41 pct)
> 8 20840.20 [0.01] (0.00 pct) 19684.60 [0.07] (5.55 pct) 21074.20 [0.02] (-1.12 pct)
> 16 21115.20 [0.02] (0.00 pct) 20823.00 [0.01] (1.38 pct) 20719.80 [0.00] (1.87 pct)
> 32 21159.00 [0.02] (0.00 pct) 21371.20 [0.01] (-1.00 pct) 21253.20 [0.01] (-0.45 pct)
> 64 22768.20 [0.01] (0.00 pct) 22816.80 [0.00] (-0.21 pct) 22662.00 [0.00] (0.47 pct)
> 128 25671.80 [0.00] (0.00 pct) 25528.20 [0.00] (0.56 pct) 25404.00 [0.00] (1.04 pct)
> 256 27209.00 [0.01] (0.00 pct) 26751.00 [0.01] (1.68 pct) 26733.20 [0.00] (1.75 pct)
> 512 20241.00 [0.03] (0.00 pct) 19378.60 [0.03] (4.26 pct) 19671.40 [0.00] (2.81 pct)
> 1024 19380.80 [0.05] (0.00 pct) 18940.40 [0.02] (2.27 pct) 19071.80 [0.00] (1.59 pct)

Same here, the cpumask check obviously hurts but it does not look like
the unlikely helps.

> With or without the unlikely, adding the check before doing the
> cpumask operation benefits most cases of un-affined tasks.
>

I think you should repost the patch with the num_online_cpus check
added in. Yes, it hurts a bit for the pure fork case when the cpus_ptr
is constrained by a scheduler policy, but at least it makes sense.

--
Mel Gorman
SUSE Labs

2022-02-22 10:50:04

by K Prateek Nayak

Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

Hello Mel,

On 2/22/2022 2:15 PM, Mel Gorman wrote:
> [..snip..]
>> Following are the results from testing:
>>
>> - Un-affined runs:
>> Command: stress-ng -t 30s --exec <Worker>
>>
>> Kernel versions:
>> - balance-wake - This patch
>> - branch - This patch + Mel's suggested branch
>> - branch-unlikely - This patch + Mel's suggested branch + unlikely
>>
>> Result format: Amean in ns [Co-eff of Var] (% Improvement)
>>
>> Workers balance-wake branch branch-unlikely
>> 1 18613.20 [0.01] (0.00 pct) 18348.00 [0.04] (1.42 pct) 18299.20 [0.02] (1.69 pct)
>> 2 18634.40 [0.03] (0.00 pct) 18163.80 [0.04] (2.53 pct) 19037.80 [0.05] (-2.16 pct)
>> 4 20997.40 [0.02] (0.00 pct) 20980.80 [0.02] (0.08 pct) 21527.40 [0.02] (-2.52 pct)
>> 8 20890.20 [0.01] (0.00 pct) 19714.60 [0.07] (5.63 pct) 20021.40 [0.05] (4.16 pct)
>> 16 21200.20 [0.02] (0.00 pct) 20564.40 [0.00] (3.00 pct) 20676.00 [0.01] (2.47 pct)
>> 32 21301.80 [0.02] (0.00 pct) 20767.40 [0.02] (2.51 pct) 20945.00 [0.01] (1.67 pct)
>> 64 22772.40 [0.01] (0.00 pct) 22505.00 [0.01] (1.17 pct) 22629.40 [0.00] (0.63 pct)
>> 128 25843.00 [0.01] (0.00 pct) 25124.80 [0.00] (2.78 pct) 25377.40 [0.00] (1.80 pct)
>> 256 18691.00 [0.02] (0.00 pct) 19086.40 [0.05] (-2.12 pct) 18013.00 [0.04] (3.63 pct)
>> 512 19658.40 [0.03] (0.00 pct) 19568.80 [0.01] (0.46 pct) 18972.00 [0.02] (3.49 pct)
>> 1024 19126.80 [0.04] (0.00 pct) 18762.80 [0.02] (1.90 pct) 18878.20 [0.04] (1.30 pct)
>>
> Co-eff of variance looks low but for the lower counts before the machine
> is saturated (>=256?) it does not look like it helps and if anything,
> it hurts. A branch mispredict profile might reveal more but I doubt
> it's worth the effort at this point.
The positive percentages here represent an improvement, i.e., the time
between the events sched_process_fork and sched_wakeup_new has come
down in most cases after adding the branch.
The same applies to the results below.
>> - Affined runs:
>> Command: taskset -c 0-254 stress-ng -t 30s --exec <Worker>
>>
>> Kernel versions:
>> - balance-wake-affine - This patch + affined run
>> - branch-affine - This patch + Mel's suggested branch + affined run
>> - branch-unlikely-affine - This patch + Mel's suggested branch + unlikely + affined run
>>
>> Result format: Amean in ns [Co-eff of Var] (% Improvement)
>>
>> Workers balance-wake-affine branch-affine branch-unlikely-affine
>> 1 18515.00 [0.01] (0.00 pct) 18538.00 [0.02] (-0.12 pct) 18568.40 [0.01] (-0.29 pct)
>> 2 17882.80 [0.01] (0.00 pct) 19627.80 [0.09] (-9.76 pct) 18790.40 [0.01] (-5.08 pct)
>> 4 21204.20 [0.01] (0.00 pct) 21410.60 [0.04] (-0.97 pct) 21715.20 [0.03] (-2.41 pct)
>> 8 20840.20 [0.01] (0.00 pct) 19684.60 [0.07] (5.55 pct) 21074.20 [0.02] (-1.12 pct)
>> 16 21115.20 [0.02] (0.00 pct) 20823.00 [0.01] (1.38 pct) 20719.80 [0.00] (1.87 pct)
>> 32 21159.00 [0.02] (0.00 pct) 21371.20 [0.01] (-1.00 pct) 21253.20 [0.01] (-0.45 pct)
>> 64 22768.20 [0.01] (0.00 pct) 22816.80 [0.00] (-0.21 pct) 22662.00 [0.00] (0.47 pct)
>> 128 25671.80 [0.00] (0.00 pct) 25528.20 [0.00] (0.56 pct) 25404.00 [0.00] (1.04 pct)
>> 256 27209.00 [0.01] (0.00 pct) 26751.00 [0.01] (1.68 pct) 26733.20 [0.00] (1.75 pct)
>> 512 20241.00 [0.03] (0.00 pct) 19378.60 [0.03] (4.26 pct) 19671.40 [0.00] (2.81 pct)
>> 1024 19380.80 [0.05] (0.00 pct) 18940.40 [0.02] (2.27 pct) 19071.80 [0.00] (1.59 pct)
> Same here, the cpumask check obviously hurts but it does not look like
> the unlikely helps.
I agree. unlikely doesn't show consistent results.
>> With or without the unlikely, adding the check before doing the
>> cpumask operation benefits most cases of un-affined tasks.
>>
> I think you should repost the patch with the num_online_cpus check
> added in. Yes, it hurts a bit for the pure fork case when the cpus_ptr
> is constrained by a scheduler policy, but at least it makes sense.
I'll post the V5 soon with the check as you suggested.
--
Thanks and Regards,
Prateek