2022-06-15 14:22:22

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

Hello Yicong,

I replied to the v3 of the series by mistake!
(https://lore.kernel.org/lkml/[email protected]/)
But rest assured all the analysis discussed there was done with
the v4 patch series. I'll add the same observations below so we
can continue discussion on v4.

We are observing some serious regression with tbench with this patch
series applied. The issue doesn't seem to be related to the actual
functionality of the patch but how the patch changes the per CPU
variable layout.

Discussed below are the results from running tbench on a dual
socket Zen3 (2 x 64C/128T) configured in different NPS modes.

NPS Modes are used to logically divide single socket into
multiple NUMA region.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 223-231
Node 7: 112-127, 232-255

Benchmark Results:

Kernel versions:
- tip: 5.19.0-rc2 tip sched/core
- cluster: 5.19.0-rc2 tip sched/core + both the patches of the series

When we started testing, the tip was at:
commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"

* - Data points of concern

~~~~~~
tbench
~~~~~~

NPS1

Clients: tip cluster
1 444.41 (0.00 pct) 439.27 (-1.15 pct)
2 879.23 (0.00 pct) 831.49 (-5.42 pct) *
4 1648.83 (0.00 pct) 1608.07 (-2.47 pct)
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) *
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) *
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) *
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) *
128 30795.27 (0.00 pct) 30861.34 (0.21 pct)
256 25138.43 (0.00 pct) 24711.90 (-1.69 pct)
512 51287.93 (0.00 pct) 51855.55 (1.10 pct)
1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct)

NPS2

Clients: tip cluster
1 445.45 (0.00 pct) 441.75 (-0.83 pct)
2 869.24 (0.00 pct) 845.61 (-2.71 pct)
4 1644.28 (0.00 pct) 1586.49 (-3.51 pct)
8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) *
16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) *
32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) *
64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) *
128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) *
256 24641.52 (0.00 pct) 27225.00 (10.48 pct)
512 50806.96 (0.00 pct) 51377.50 (1.12 pct)
1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct)

NPS4

Clients: tip cluster
1 442.10 (0.00 pct) 435.06 (-1.59 pct)
2 870.94 (0.00 pct) 858.64 (-1.41 pct)
4 1615.30 (0.00 pct) 1607.27 (-0.49 pct)
8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) *
16 5937.41 (0.00 pct) 5719.87 (-3.66 pct)
32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) *
64 20844.71 (0.00 pct) 20432.79 (-1.97 pct)
128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) *
256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Know to have run to run variance]
512 52276.72 (0.00 pct) 51659.16 (-1.18 pct)
1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct)

Note: tbench results for 256 workers are known to have
run to run variation on the test machine. Any regression
seen for the data point can be safely ignored.

The behavior is consistent for both tip and patched kernel
across multiple runs of tbench.

~~~~~~~~~~~~~~~~~~~~
Analysis done so far
~~~~~~~~~~~~~~~~~~~~

To root cause this issue quicker, we have focused on 8 to 64 clients
data points with the machine running in NPS1 mode.

- Even on disabling HW prefetcher, the behavior remains consistent
signifying HW prefetcher is not the cause of the problem.

- Bisecting:

When we ran the tests with only Patch 1 of the series, the
regression was visible and the numbers were worse.

Clients: tip cluster Patch 1 Only
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct)
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct)
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct)
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct)

We further bisected the hunks to narrow down the cause to the per CPU
variable declarations.


On 6/9/2022 5:36 PM, Yicong Yang wrote:
>
> [..snip..]
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 01259611beb9..b9bcfcf8d14d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> +DECLARE_PER_CPU(int, sd_share_id);
> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);

The main reason for the regression seems to be the above declarations.
The regression seem to go away if we do one of the following:

- Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY()
instead of DECLARE_PER_CPU() and change the corresponding definition
below to DEFINE_PER_CPU_READ_MOSTLY().

Clients: tip Patch 1 Patch 1 (READ_MOSTLY)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct)

- Convert sd_share_id and sd_cluster to static arrays.

Clients: tip Patch 1 Patch 1 (Static Array)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct)

- Move the declarations of sd_share_id and sd_cluster to the top

Clients: tip Patch 1 Patch 1 (Declarion on Top)
8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct)
16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct)
32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct)
64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct)

Unfortunately, none of these are complete solutions. For example, using
DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression
but doesn't eliminate it entirely:

Clients: tip cluster cluster (READ_MOSTLY)
1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct)
2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct)
4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct)
8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) *
16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) *
32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) *
64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct)
128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct)
256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct)
512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct)
1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct)

We are still trying to root cause the underlying issue that
brought about such drastic regression in tbench performance.

> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 05b6c2ad90b9..0595827d481d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_share_id);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>
> [..snip..]
>

We would like some time to investigate this issue and root cause
the reason for this regression.
--
Thanks and Regards,
Prateek


2022-06-15 15:46:59

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

On Wed, Jun 15, 2022 at 07:49:22PM +0530, K Prateek Nayak wrote:

[..snip..]

>
> - Bisecting:
>
> When we ran the tests with only Patch 1 of the series, the
> regression was visible and the numbers were worse.
>
> Clients: tip cluster Patch 1 Only
> 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct)
> 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct)
> 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct)
> 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct)
>
> We further bisected the hunks to narrow down the cause to the per CPU
> variable declarations.
>
>
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 01259611beb9..b9bcfcf8d14d 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DECLARE_PER_CPU(int, sd_llc_size);
> > DECLARE_PER_CPU(int, sd_llc_id);
> > +DECLARE_PER_CPU(int, sd_share_id);
> > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>
> The main reason for the regression seems to be the above declarations.

I think you meant that the regressions are due to the DEFINE_PER_CPU()
instances from the following hunk:

> > @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DEFINE_PER_CPU(int, sd_llc_size);
> > DEFINE_PER_CPU(int, sd_llc_id);
> > +DEFINE_PER_CPU(int, sd_share_id);
> > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> >


The System.map diff for these variables between tip vs tip +
cluster-sched-v4 on your test system looks as follows:

0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
-0000000000020530 D sd_llc_shared
-0000000000020538 D sd_llc_id
-000000000002053c D sd_llc_size
-0000000000020540 D sd_llc
+0000000000020530 D sd_cluster
+0000000000020538 D sd_llc_shared
+0000000000020540 D sd_share_id
+0000000000020544 D sd_llc_id
+0000000000020548 D sd_llc_size
+0000000000020550 D sd_llc

The allocations are in the reverse-order of the definitions.

That perhaps explains why you no longer see the regression when you
define the sd_share_id and sd_cluster per-cpu definitions at the
beginning as indicated by the following

> - Move the declarations of sd_share_id and sd_cluster to the top
>
> Clients: tip Patch 1 Patch 1 (Declarion on Top)
> 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct)
> 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct)
> 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct)
> 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct)


--
Thanks and Regards
gautham.