2022-06-17 12:53:00

by K Prateek Nayak

Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

Hello Yicong,

On 6/16/2022 1:25 PM, Yicong Yang wrote:
> Hi Prateek,
>
> It seems my previous reply was in the wrong format, so the server rejected it.. just reposting..
>
> Thanks a lot for the test!

Thank you for looking into it and suggesting further steps
for debugging the issue.

>> [..snip..]
>>
>> We are still trying to root cause the underlying issue that
>> brought about such a drastic regression in tbench performance.
>>
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 05b6c2ad90b9..0595827d481d 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DEFINE_PER_CPU(int, sd_llc_size);
>>> DEFINE_PER_CPU(int, sd_llc_id);
>>> +DEFINE_PER_CPU(int, sd_share_id);
>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>
>>> [..snip..]
>>>
>>
>> We would like some time to investigate this issue and root cause
>> this regression.
>
> One significant difference I can see for now is that Kunpeng 920 is a non-SMT machine while Zen3
> is an SMT machine. So in the select_idle_sibling() path we won't use sd_llc_shared. Since sd_share_id
> and sd_cluster are inserted between sd_llc_id and sd_llc_shared, which are both used in that path in
> your test, I guess the change in layout may affect something like the cache behaviour.
>
> Can you also test with SMT disabled?
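
For context, the sd_llc_id comparison Yicong refers to sits directly on
the wakeup path. A simplified sketch of the cpus_share_cache() helper
from kernel/sched/core.c (abridged from memory, shown only to illustrate
which per-CPU variables the wakeup path reads):

--
bool cpus_share_cache(int this_cpu, int that_cpu)
{
	if (this_cpu == that_cpu)
		return true;

	/* Two CPUs share an LLC iff their sd_llc_id values match */
	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}
--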

Following are the results on a 2 x 64C Zen3 machine with SMT disabled in the BIOS:

Clients:    tip                  tip + Patch 1 only      tip + both patches      tip + both patches
                                                                                 (Move declaration to top of block)
    1      456.35 (0.00 pct)     455.26 (-0.23 pct)      449.19 (-1.56 pct)      444.95 (-2.49 pct)
    2      887.44 (0.00 pct)     881.23 (-0.69 pct)      854.28 (-3.73 pct)      878.32 (-1.02 pct)
    4     1693.37 (0.00 pct)    1616.95 (-4.51 pct)     1635.48 (-3.41 pct)     1631.13 (-3.67 pct)
    8     3212.89 (0.00 pct)    3033.79 (-5.57 pct) *   3135.47 (-2.40 pct)     3234.27 (0.66 pct)
   16     6234.92 (0.00 pct)    5201.92 (-16.56 pct) *  5530.89 (-11.29 pct) *  5912.84 (-5.16 pct) *
   32    12237.45 (0.00 pct)    8156.19 (-33.35 pct) * 10959.70 (-10.44 pct) * 11810.84 (-3.48 pct)
   64    24020.17 (0.00 pct)   13164.27 (-45.19 pct) * 14578.70 (-39.30 pct) * 21266.73 (-11.46 pct) *
  128    34633.95 (0.00 pct)   32629.42 (-5.78 pct) *  31811.74 (-8.14 pct) *  28707.32 (-17.11 pct) *
  256    34310.04 (0.00 pct)   37271.34 (8.63 pct)     34086.25 (-0.65 pct)    29466.42 (-14.11 pct) *
  512    35216.68 (0.00 pct)   36763.77 (4.39 pct)     35632.86 (1.18 pct)     31823.27 (-9.63 pct) *

As you can see, the regression still exists even with only the first
patch of the series applied on top of the tip. Moving the declarations
to the top helps in some cases, but we are having trouble at the
saturation point with the move.

Following is the diff for "Move declaration to top of block":

--
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e9f3dc6dcbf4..97a3895416ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
return sd;
}

+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
+DECLARE_PER_CPU(int, sd_share_id);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
-DECLARE_PER_CPU(int, sd_share_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
-DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
--

The System-map of each kernel is as follows:

- On "tip"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_llc_shared
0000000000020538 D sd_llc_id
000000000002053c D sd_llc_size
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_llc

- On "tip + Patch 1 only" and "tip + both patches"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_cluster <-----
0000000000020538 D sd_llc_shared
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_share_id <-----
0000000000020544 D sd_llc_id
0000000000020548 D sd_llc_size
0000000000020550 D sd_llc


- On "tip + both patches (Move declaration to top)"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_llc_shared
0000000000020538 D sd_llc_id
000000000002053c D sd_llc_size
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_llc
0000000000020548 D sd_share_id <-----
0000000000020550 D sd_cluster <-----
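
The 64B boundaries marked above follow directly from the symbol
addresses: two variables land on the same line iff their addresses are
equal after masking off the low 6 bits. A trivial helper (hypothetical,
just to make the arithmetic explicit):

--
/* 64B cache lines: same line iff the addresses match with bits [5:0] masked */
static inline bool same_cacheline(unsigned long a, unsigned long b)
{
	return (a & ~63UL) == (b & ~63UL);
}
--

For example, 0x20538 and 0x20540 mask to 0x20500 and 0x20540
respectively, which is why sd_llc_shared and sd_share_id end up on
different lines in the "tip + Patch 1 only" layout.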

> Or change the layout a bit to see if there's any difference,
> like:
>
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(int, sd_share_id);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>
> I need to further look into it and run some tests on an SMT machine. Would you mind sharing
> the kernel config as well? I'd like to compare the configs too.

I've attached the kernel config used to build the test kernel
to this mail.

>
> Thanks,
> Yicong

We are trying to debug the issue using perf and find an optimal
arrangement of the per-CPU declarations to get the relevant data
used in the wakeup path on the same 64B cache line.

We'll keep you posted on our findings. Let me know if you need
anything else in the meantime.
--
Thanks and Regards,
Prateek


Attachments:
config (157.48 kB)

2022-06-17 18:13:17

by Tim Chen

Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote:
>
>
> --
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e9f3dc6dcbf4..97a3895416ab 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
> return sd;
> }
>
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> +DECLARE_PER_CPU(int, sd_share_id);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> -DECLARE_PER_CPU(int, sd_share_id);
> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> --
>
> The System-map of each kernel is as follows:
>
> - On "tip"
>
> 0000000000020518 D sd_asym_cpucapacity
> 0000000000020520 D sd_asym_packing
> 0000000000020528 D sd_numa
> 0000000000020530 D sd_llc_shared
> 0000000000020538 D sd_llc_id
> 000000000002053c D sd_llc_size
> -------------------------------------------- 64B Cacheline Boundary
> 0000000000020540 D sd_llc
>
> - On "tip + Patch 1 only" and "tip + both patches"
>
> 0000000000020518 D sd_asym_cpucapacity
> 0000000000020520 D sd_asym_packing
> 0000000000020528 D sd_numa
> 0000000000020530 D sd_cluster <-----
> 0000000000020538 D sd_llc_shared
> -------------------------------------------- 64B Cacheline Boundary
> 0000000000020540 D sd_share_id <-----
> 0000000000020544 D sd_llc_id
> 0000000000020548 D sd_llc_size
> 0000000000020550 D sd_llc
>
>
> - On "tip + both patches (Move declaration to top)"
>
> 0000000000020518 D sd_asym_cpucapacity
> 0000000000020520 D sd_asym_packing
> 0000000000020528 D sd_numa
> 0000000000020530 D sd_llc_shared
> 0000000000020538 D sd_llc_id
> 000000000002053c D sd_llc_size
> -------------------------------------------- 64B Cacheline Boundary
> 0000000000020540 D sd_llc

Wonder if it will help to try keeping sd_llc and sd_llc_size in the same
cache line. They are both used in the wakeup path.
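
One way to guarantee that (a sketch only, not something I have tested)
would be to group the hot fields in a single cache-aligned per-CPU
struct, so link order can no longer split them:

--
/* Hypothetical sketch: the struct and field names here are made up. */
struct sd_llc_info {
	struct sched_domain __rcu	*llc;
	int				llc_size;
	int				llc_id;
} ____cacheline_aligned;

DEFINE_PER_CPU(struct sd_llc_info, sd_llc_info);
--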


> 0000000000020548 D sd_share_id <-----
> 0000000000020550 D sd_cluster <-----
>
> > Or change the layout a bit to see if there's any difference,
> > like:
> >
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DEFINE_PER_CPU(int, sd_llc_size);
> > DEFINE_PER_CPU(int, sd_llc_id);
> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DEFINE_PER_CPU(int, sd_share_id);
> > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> >
>> I need to further look into it and run some tests on an SMT machine. Would you mind sharing
>> the kernel config as well? I'd like to compare the configs too.
>
> I've attached the kernel config used to build the test kernel
> to this mail.
>
> > Thanks,
> > Yicong
>
> We are trying to debug the issue using perf and find an optimal
> arrangement of the per-CPU declarations to get the relevant data
> used in the wakeup path on the same 64B cache line.

A check of the perf c2c profile difference between tip and the 'move new declarations
to the top' case could be useful. It may give some additional clues about possible
false sharing issues.

Tim

>
> We'll keep you posted on our findings. Let me know if you need
> anything else in the meantime.
> --
> Thanks and Regards,
> Prateek

2022-06-20 11:44:25

by K Prateek Nayak

Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

Hello Tim,

Thank you for looking into this.

On 6/17/2022 10:20 PM, Tim Chen wrote:
> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote:
>>
>>
>> --
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e9f3dc6dcbf4..97a3895416ab 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>> return sd;
>> }
>>
>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>> +DECLARE_PER_CPU(int, sd_share_id);
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>> DECLARE_PER_CPU(int, sd_llc_size);
>> DECLARE_PER_CPU(int, sd_llc_id);
>> -DECLARE_PER_CPU(int, sd_share_id);
>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>> --
>>
>> The System-map of each kernel is as follows:
>>
>> - On "tip"
>>
>> 0000000000020518 D sd_asym_cpucapacity
>> 0000000000020520 D sd_asym_packing
>> 0000000000020528 D sd_numa
>> 0000000000020530 D sd_llc_shared
>> 0000000000020538 D sd_llc_id
>> 000000000002053c D sd_llc_size
>> -------------------------------------------- 64B Cacheline Boundary
>> 0000000000020540 D sd_llc
>>
>> - On "tip + Patch 1 only" and "tip + both patches"
>>
>> 0000000000020518 D sd_asym_cpucapacity
>> 0000000000020520 D sd_asym_packing
>> 0000000000020528 D sd_numa
>> 0000000000020530 D sd_cluster <-----
>> 0000000000020538 D sd_llc_shared
>> -------------------------------------------- 64B Cacheline Boundary
>> 0000000000020540 D sd_share_id <-----
>> 0000000000020544 D sd_llc_id
>> 0000000000020548 D sd_llc_size
>> 0000000000020550 D sd_llc
>>
>>
>> - On "tip + both patches (Move declaration to top)"
>>
>> 0000000000020518 D sd_asym_cpucapacity
>> 0000000000020520 D sd_asym_packing
>> 0000000000020528 D sd_numa
>> 0000000000020530 D sd_llc_shared
>> 0000000000020538 D sd_llc_id
>> 000000000002053c D sd_llc_size
>> -------------------------------------------- 64B Cacheline Boundary
>> 0000000000020540 D sd_llc
>
> Wonder if it will help to try keeping sd_llc and sd_llc_size in the same
> cache line. They are both used in the wakeup path.

We are still evaluating keeping which set of variables on the same
cache line will provide the best results.

I would have expected the two kernel variants - "tip" and
"tip + both patches (Move declaration to top)" - to give similar results,
as their System maps for all the old variables remain the same and the
added "sd_share_id" and "sd_cluster" fit in the gap after "sd_llc".
However, we now see a regression for higher numbers of clients.

This probably hints that the access to the "sd_cluster" variable in
Patch 2, which brings in an extra cache line, could be responsible for
the regression we see with "tip + both patches (Move declaration to top)".
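
To make that suspicion concrete: a cluster-aware idle scan along the
lines of Patch 2 (hypothetical shape below, not the actual patch code)
has to dereference the per-CPU sd_cluster pointer on every wakeup,
touching whichever cache line that variable happens to land on:

--
/* Hypothetical illustration of a cluster-first idle-CPU scan */
struct sched_domain *sd = rcu_dereference(per_cpu(sd_cluster, target));

if (sd) {
	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
		if (available_idle_cpu(cpu))
			return cpu;
	}
}
--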

>
>
>> 0000000000020548 D sd_share_id <-----
>> 0000000000020550 D sd_cluster <-----
>>
>>> Or change the layout a bit to see if there's any difference,
>>> like:
>>>
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DEFINE_PER_CPU(int, sd_llc_size);
>>> DEFINE_PER_CPU(int, sd_llc_id);
>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> +DEFINE_PER_CPU(int, sd_share_id);
>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>
>>> I need to further look into it and run some tests on an SMT machine. Would you mind sharing
>>> the kernel config as well? I'd like to compare the configs too.
>>
>> I've attached the kernel config used to build the test kernel
>> to this mail.
>>
>>> Thanks,
>>> Yicong
>>
>> We are trying to debug the issue using perf and find an optimal
>> arrangement of the per-CPU declarations to get the relevant data
>> used in the wakeup path on the same 64B cache line.
>
> A check of the perf c2c profile difference between tip and the 'move new declarations
> to the top' case could be useful. It may give some additional clues about possible
> false sharing issues.

Thank you for the suggestion. We are currently looking at perf counter
data to see how the cache efficiency has changed between the two kernels.
We suspect that the wakeup path now needing the data from the other
cache line as well is resulting in higher cache misses in the levels
closer to the core.

I don't think it is a false sharing problem, as these per-CPU data are
set when the system first boots up and will only be changed again during
a CPU hotplug operation. However, it might be beneficial to see the c2c
profile if the perf counters don't indicate anything out of the ordinary.
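
For reference, these per-CPU variables are only written from the
topology code when the sched domain hierarchy is (re)built, along the
lines of update_top_cache_domain() in kernel/sched/topology.c (abridged
from memory):

--
static void update_top_cache_domain(int cpu)
{
	struct sched_domain_shared *sds = NULL;
	struct sched_domain *sd;
	int id = cpu, size = 1;

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		sds = sd->shared;
	}

	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_size, cpu) = size;
	per_cpu(sd_llc_id, cpu) = id;
	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

	/* ... sd_numa and the sd_asym_* pointers are updated similarly ... */
}
--

So from the scheduler's point of view this data is effectively
read-mostly.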

>
> Tim
>
>>
>> We'll keep you posted on our findings. Let me know if you need
>> anything else in the meantime.
>> --
>> Thanks and Regards,
>> Prateek
>
--
Thanks and Regards,
Prateek

2022-06-20 14:58:18

by Abel Wu

Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API


On 6/20/22 7:20 PM, K Prateek Nayak Wrote:
> Hello Tim,
>
> Thank you for looking into this.
>
> On 6/17/2022 10:20 PM, Tim Chen wrote:
>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote:
>>>
>>>
>>> --
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index e9f3dc6dcbf4..97a3895416ab 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>>> return sd;
>>> }
>>>
>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>> +DECLARE_PER_CPU(int, sd_share_id);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DECLARE_PER_CPU(int, sd_llc_size);
>>> DECLARE_PER_CPU(int, sd_llc_id);
>>> -DECLARE_PER_CPU(int, sd_share_id);
>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> --
>>>
>>> The System-map of each kernel is as follows:
>>>
>>> - On "tip"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_llc_shared
>>> 0000000000020538 D sd_llc_id
>>> 000000000002053c D sd_llc_size
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_llc
>>>
>>> - On "tip + Patch 1 only" and "tip + both patches"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_cluster <-----
>>> 0000000000020538 D sd_llc_shared
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_share_id <-----
>>> 0000000000020544 D sd_llc_id
>>> 0000000000020548 D sd_llc_size
>>> 0000000000020550 D sd_llc
>>>
>>>
>>> - On "tip + both patches (Move declaration to top)"
>>>
>>> 0000000000020518 D sd_asym_cpucapacity
>>> 0000000000020520 D sd_asym_packing
>>> 0000000000020528 D sd_numa
>>> 0000000000020530 D sd_llc_shared
>>> 0000000000020538 D sd_llc_id
>>> 000000000002053c D sd_llc_size
>>> -------------------------------------------- 64B Cacheline Boundary
>>> 0000000000020540 D sd_llc
>>
>> Wonder if it will help to try keeping sd_llc and sd_llc_size in the same
>> cache line. They are both used in the wakeup path.
>
> We are still evaluating keeping which set of variables on the same
> cache line will provide the best results.
>
> I would have expected the two kernel variants - "tip" and
> "tip + both patches (Move declaration to top)" - to give similar results,
> as their System maps for all the old variables remain the same and the
> added "sd_share_id" and "sd_cluster" fit in the gap after "sd_llc".
> However, we now see a regression for higher numbers of clients.
>
> This probably hints that the access to the "sd_cluster" variable in
> Patch 2, which brings in an extra cache line, could be responsible for
> the regression we see with "tip + both patches (Move declaration to top)".
>
>>
>>
>>> 0000000000020548 D sd_share_id <-----
>>> 0000000000020550 D sd_cluster <-----
>>>
>>>> Or change the layout a bit to see if there's any difference,
>>>> like:
>>>>
>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>> DEFINE_PER_CPU(int, sd_llc_size);
>>>> DEFINE_PER_CPU(int, sd_llc_id);
>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>> +DEFINE_PER_CPU(int, sd_share_id);
>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>>
>>>> I need to further look into it and run some tests on an SMT machine. Would you mind sharing
>>>> the kernel config as well? I'd like to compare the configs too.
>>>
>>> I've attached the kernel config used to build the test kernel
>>> to this mail.
>>>
>>>> Thanks,
>>>> Yicong
>>>
>>> We are trying to debug the issue using perf and find an optimal
>>> arrangement of the per-CPU declarations to get the relevant data
>>> used in the wakeup path on the same 64B cache line.
>>
>> A check of the perf c2c profile difference between tip and the 'move new declarations
>> to the top' case could be useful. It may give some additional clues about possible
>> false sharing issues.
>
> Thank you for the suggestion. We are currently looking at perf counter
> data to see how the cache efficiency has changed between the two kernels.
> We suspect that the wakeup path now needing the data from the other
> cache line as well is resulting in higher cache misses in the levels
> closer to the core.
>
> I don't think it is a false sharing problem, as these per-CPU data are
> set when the system first boots up and will only be changed again during
> a CPU hotplug operation. However, it might be beneficial to see the c2c
> profile if the perf counters don't indicate anything out of the ordinary.
>

Could it be that some other frequently-written variables share the
same cacheline with these per-cpu data, causing false sharing? What
about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY?
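
For clarity, that would mean switching the definitions (and the
matching DECLARE_PER_CPU_READ_MOSTLY() declarations in sched.h) to the
read-mostly per-CPU flavour, e.g. (sketch):

--
/* READ_MOSTLY groups these with other rarely-written per-CPU data,
 * keeping them away from frequently-written neighbours.
 */
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_size);
DEFINE_PER_CPU_READ_MOSTLY(int, sd_llc_id);
DEFINE_PER_CPU_READ_MOSTLY(struct sched_domain_shared __rcu *, sd_llc_shared);
--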

2022-06-28 12:16:01

by K Prateek Nayak

Subject: Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

Hello Abel,

Thank you for looking into this issue.

tl;dr

We have found two variables that need to be
co-located on the same cache line to restore tbench
performance.

As this issue is unrelated to the functionality of
the patch, it should not keep this patch series from
landing, given the performance improvements seen on
systems with CPU clusters.

The results of our analysis are discussed in
detail below.

On 6/20/2022 7:07 PM, Abel Wu wrote:
>
> On 6/20/22 7:20 PM, K Prateek Nayak Wrote:
>> Hello Tim,
>>
>> Thank you for looking into this.
>>
>> On 6/17/2022 10:20 PM, Tim Chen wrote:
>>> On Fri, 2022-06-17 at 17:50 +0530, K Prateek Nayak wrote:
>>>>
>>>>
>>>> --
>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>> index e9f3dc6dcbf4..97a3895416ab 100644
>>>> --- a/kernel/sched/sched.h
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>>>>       return sd;
>>>>   }
>>>>   +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>> +DECLARE_PER_CPU(int, sd_share_id);
>>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>>   DECLARE_PER_CPU(int, sd_llc_size);
>>>>   DECLARE_PER_CPU(int, sd_llc_id);
>>>> -DECLARE_PER_CPU(int, sd_share_id);
>>>>   DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>> -DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>>> --
>>>>
>>>> The System-map of each kernel is as follows:
>>>>
>>>> - On "tip"
>>>>
>>>> 0000000000020518 D sd_asym_cpucapacity
>>>> 0000000000020520 D sd_asym_packing
>>>> 0000000000020528 D sd_numa
>>>> 0000000000020530 D sd_llc_shared
>>>> 0000000000020538 D sd_llc_id
>>>> 000000000002053c D sd_llc_size
>>>> -------------------------------------------- 64B Cacheline Boundary
>>>> 0000000000020540 D sd_llc
>>>>
>>>> - On "tip + Patch 1 only" and "tip + both patches"
>>>>
>>>> 0000000000020518 D sd_asym_cpucapacity
>>>> 0000000000020520 D sd_asym_packing
>>>> 0000000000020528 D sd_numa
>>>> 0000000000020530 D sd_cluster     <-----
>>>> 0000000000020538 D sd_llc_shared
>>>> -------------------------------------------- 64B Cacheline Boundary
>>>> 0000000000020540 D sd_share_id    <-----
>>>> 0000000000020544 D sd_llc_id
>>>> 0000000000020548 D sd_llc_size
>>>> 0000000000020550 D sd_llc
>>>>
>>>>
>>>> - On "tip + both patches (Move declaration to top)"
>>>>
>>>> 0000000000020518 D sd_asym_cpucapacity
>>>> 0000000000020520 D sd_asym_packing
>>>> 0000000000020528 D sd_numa
>>>> 0000000000020530 D sd_llc_shared
>>>> 0000000000020538 D sd_llc_id
>>>> 000000000002053c D sd_llc_size
>>>> -------------------------------------------- 64B Cacheline Boundary
>>>> 0000000000020540 D sd_llc
>>>
>>> Wonder if it will help to try keeping sd_llc and sd_llc_size in the same
>>> cache line. They are both used in the wakeup path.
>>
>> We are still evaluating keeping which set of variables on the same
>> cache line will provide the best results.
>>
>> I would have expected the two kernel variants - "tip" and
>> "tip + both patches (Move declaration to top)" - to give similar results,
>> as their System maps for all the old variables remain the same and the
>> added "sd_share_id" and "sd_cluster" fit in the gap after "sd_llc".
>> However, we now see a regression for higher numbers of clients.
>>
>> This probably hints that the access to the "sd_cluster" variable in
>> Patch 2, which brings in an extra cache line, could be responsible for
>> the regression we see with "tip + both patches (Move declaration to top)".
>>
>>>
>>>
>>>> 0000000000020548 D sd_share_id    <-----
>>>> 0000000000020550 D sd_cluster     <-----
>>>>
>>>>> Or change the layout a bit to see if there's any difference,
>>>>> like:
>>>>>
>>>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>>>   DEFINE_PER_CPU(int, sd_llc_size);
>>>>>   DEFINE_PER_CPU(int, sd_llc_id);
>>>>>   DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>>> +DEFINE_PER_CPU(int, sd_share_id);
>>>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>>>
>>>>> I need to further look into it and run some tests on an SMT machine. Would you mind sharing
>>>>> the kernel config as well? I'd like to compare the configs too.
>>>>
>>>> I've attached the kernel config used to build the test kernel
>>>> to this mail.
>>>>
>>>>> Thanks,
>>>>> Yicong
>>>>
>>>> We are trying to debug the issue using perf and find an optimal
>>>> arrangement of the per-CPU declarations to get the relevant data
>>>> used in the wakeup path on the same 64B cache line.
>>>
>>> A check of the perf c2c profile difference between tip and the 'move new declarations
>>> to the top' case could be useful. It may give some additional clues about possible
>>> false sharing issues.
>>
>> Thank you for the suggestion. We are currently looking at perf counter
>> data to see how the cache efficiency has changed between the two kernels.
>> We suspect that the wakeup path now needing the data from the other
>> cache line as well is resulting in higher cache misses in the levels
>> closer to the core.
>>
>> I don't think it is a false sharing problem, as these per-CPU data are
>> set when the system first boots up and will only be changed again during
>> a CPU hotplug operation. However, it might be beneficial to see the c2c
>> profile if the perf counters don't indicate anything out of the ordinary.
>>
>
> Could it be that some other frequently-written variables share the
> same cacheline with these per-cpu data, causing false sharing?

This indeed seems to be the case. I'll include the
detailed analysis below.

> What
> about making all these sd_* data DEFINE_PER_CPU_READ_MOSTLY?
>

Making all the sd_* variables DEFINE_PER_CPU_READ_MOSTLY, or placing
them all on the same cache line, doesn't help the regression. In fact,
it makes it worse.

Following are the results on different test kernels:
tip - 5.19.0-rc2 tip
(commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle")
cluster - tip + both the patches of the series
Patch1 - tip + only Patch 1
align_first (Patch 1) - tip + only Patch 1 + all sd_* variables in the same cache line
per_cpu_aligned_struct (Patch 1) - tip + only Patch 1 + all sd_* variables part of a per-CPU struct which is cacheline aligned

Clients:    tip                  cluster                Patch1                  align_first             per_cpu_aligned_struct
                                                                                (Patch 1)               (Patch 1)
    8     3263.81 (0.00 pct)    3086.81 (-5.42 pct)    3018.63 (-7.51 pct)     2993.65 (-8.27 pct)     1728.89 (-47.02 pct)
   16     6011.19 (0.00 pct)    5360.28 (-10.82 pct)   4869.26 (-18.99 pct)    4820.18 (-19.81 pct)    3840.64 (-36.10 pct)
   32    12058.31 (0.00 pct)    8769.08 (-27.27 pct)   8159.60 (-32.33 pct)    7868.82 (-34.74 pct)    7130.19 (-40.86 pct)
   64    21258.21 (0.00 pct)   19021.09 (-10.52 pct)  13161.92 (-38.08 pct)   12327.86 (-42.00 pct)   12572.70 (-40.85 pct)

Following is the system map for the kernel variant "align_first (Patch 1)":

--
00000000000204c0 d sugov_cpu
------------------------------------------------ 20500 (Cache Line Start)
0000000000020508 d root_cpuacct_cpuusage
0000000000020510 D cpufreq_update_util_data
------------------------------------------------ 20540 (Cache Line Start)
0000000000020540 D sd_asym_cpucapacity
0000000000020548 D sd_asym_packing
0000000000020550 D sd_numa
0000000000020558 D sd_cluster
0000000000020560 D sd_llc_shared
0000000000020568 D sd_share_id
000000000002056c D sd_llc_id
0000000000020570 D sd_llc_size
0000000000020578 D sd_llc
------------------------------------------------ 20580 (Cache Line Start)
0000000000020580 d wake_up_klogd_work
00000000000205a0 d printk_pending
00000000000205a4 d printk_count_nmi
00000000000205a5 d printk_count
00000000000205a8 d printk_context
------------------------------------------------ 205c0 (Cache Line Start)
00000000000205c0 d rcu_tasks_trace__percpu
--

At this point it was clear that one or more sd_* variables need to be
co-located with the per-CPU variables in the cache line starting
at 20540. We began moving variables out of the cache line one by one
to see which variable makes the difference, and found that as long
as root_cpuacct_cpuusage and sd_llc_id are on the same cache line,
the results were equivalent to what we saw on the tip. As both
variables seem to be accessed very frequently, access to one will
prime the cache line containing the other as well, leading to
better cache efficiency.

Placing root_cpuacct_cpuusage, sd_llc_id, sd_share_id, sd_llc_shared
and sd_cluster on the same cache line, the results are as follows:

Kernel versions:
tip - 5.19.0-rc2 tip
cluster - tip + both the patches of the series
cluster (Custom Layout) - tip + both the patches of the series + reworked system map
cluster (Custom Layout) + SIS_UTIL - cluster (Custom Layout) + v4 of SIS_UTIL patchset by Chenyu
(https://lore.kernel.org/lkml/[email protected]/)

Clients:    tip                  cluster                cluster                 cluster (Custom Layout)
                                                        (Custom Layout)         + SIS_UTIL
    1      444.41 (0.00 pct)     439.27 (-1.15 pct)     438.06 (-1.42 pct)      447.75 (0.75 pct)
    2      879.23 (0.00 pct)     831.49 (-5.42 pct)     846.98 (-3.66 pct)      871.64 (-0.86 pct)
    4     1648.83 (0.00 pct)    1608.07 (-2.47 pct)    1621.38 (-1.66 pct)     1656.34 (0.45 pct)
    8     3263.81 (0.00 pct)    3086.81 (-5.42 pct)    3103.40 (-4.91 pct) *   3227.88 (-1.10 pct)
   16     6011.19 (0.00 pct)    5360.28 (-10.82 pct)   5838.04 (-2.88 pct)     6232.92 (3.68 pct)
   32    12058.31 (0.00 pct)    8769.08 (-27.27 pct)  11577.73 (-3.98 pct)    11774.10 (-2.35 pct)
   64    21258.21 (0.00 pct)   19021.09 (-10.52 pct)  19563.57 (-7.97 pct) *  22044.93 (3.70 pct)
  128    30795.27 (0.00 pct)   30861.34 (0.21 pct)    31705.47 (2.95 pct)     28986.14 (-5.87 pct) *
  256    25138.43 (0.00 pct)   24711.90 (-1.69 pct)   23929.42 (-4.80 pct) *  43984.52 (74.96 pct) [Known to be unstable without SIS_UTIL]
  512    51287.93 (0.00 pct)   51855.55 (1.10 pct)    52278.33 (1.93 pct)     51511.51 (0.43 pct)
 1024    53176.97 (0.00 pct)   52554.55 (-1.17 pct)   52995.27 (-0.34 pct)    52807.04 (-0.69 pct)

Chenyu's SIS_UTIL patch was merged into the tip today
(commit: 70fb5ccf2ebb "sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg")
and seems to bring back performance to the same level
as seen on the tip used during our testing.

The system map of the tested configuration labelled "Custom Layout"
is as follows:

--
00000000000204c0 d sugov_cpu
------------------------------------------------ 20500 (Cache Line Start)
0000000000020508 d root_cpuacct_cpuusage
0000000000020510 D cpufreq_update_util_data
0000000000020518 D sd_llc_id
000000000002051c D sd_share_id
0000000000020520 D sd_llc_shared
0000000000020528 D sd_cluster
0000000000020530 D sd_asym_cpucapacity
0000000000020538 D sd_asym_packing
------------------------------------------------ 20540 (Cache Line Start)
0000000000020540 D sd_numa
0000000000020548 D sd_llc_size
0000000000020550 D sd_llc
0000000000020560 d wake_up_klogd_work
------------------------------------------------ 20580 (Cache Line Start)
0000000000020580 d printk_pending
0000000000020584 d printk_count_nmi
0000000000020585 d printk_count
0000000000020588 d printk_context
------------------------------------------------ 205c0 (Cache Line Start)
00000000000205c0 d rcu_tasks_trace__percpu
--

The System-map is, however, dependent on multiple
factors, such as the config options enabled, which can
change from build to build.
Finding a permanent solution to the issue might take
some more time.

Meanwhile, as this is an issue unrelated to the
functionality of this patch, it should not block
the landing of this patch series.
Thank you, everyone, for your pointers and patience
during this debugging effort.

Tested-by: K Prateek Nayak <[email protected]>
--
Thanks and Regards,
Prateek