On 13/05/2022 13:03, Sudeep Holla wrote:
> On Fri, May 13, 2022 at 12:42:00PM +0200, Dietmar Eggemann wrote:
>> On 13/05/2022 11:03, Sudeep Holla wrote:
>>> On Fri, May 13, 2022 at 10:34:00AM +0200, Dietmar Eggemann wrote:
>>
>> [...]
>>
>>>> @@ -527,7 +528,8 @@ static int __init parse_core(struct device_node *core, int package_id,
>>>> return -EINVAL;
>>>> }
>>>>
>>>> - cpu_topology[cpu].package_id = package_id;
>>>> + cpu_topology[cpu].package_id = 0;
>>>
>>> While the above looks good and matches with what I am attempting to do
>>> as well ...
>>>
>>>> + cpu_topology[cpu].llc_id = llc_id;
>>>
>>> This looks wrong for simple reason that this is derived incorrectly from
>>> the cpu-map while there is no guarantee that it matches the last level
>>> cache ID on the system as we didn't parse the cache topology for this.
>>> So I disagree with this change as it might conflict with the actual and
>>> correct llc_id.
>>
>> It might not match the LLC, that's true. Something we have already today
>> in Android for DynamIQ clusters with big/Little. People using 1. level
>> clusters to group CPUs according to uArch.
>
> Not sure if that is the correct representation of h/w cluster on those
> platforms, but if they want to misguide OS with the f/w(DT in this case)
> well that's their choice.
>
> The main point is we need to get the exact h/w topology information and
> then we can decide how to present the below masks as required by the
> scheduler for its sched domains.
>
>> My point is we manage to get:
>>
>> SMT - cpu_smt_mask()
>> CLS - cpu_clustergroup_mask()
>> MC - cpu_coregroup_mask()
>> DIE - cpu_cpu_mask()
>>
>> covered in ACPI with the cpu_topology[] structure and if we want CLS on
>> DT we have to save cluster_id for the 2. level (DT) cluster.
>>
>
> I am not sure on the above point. Even with ACPI PPTT we are just setting
> cluster_id based on first or leaf level of the clusters ignoring the

Not sure about this. cluster_id was introduced last year into ACPI PPTT
commit c5e22feffdd7 ("topology: Represent clusters of CPUs within a
die") to cover L3-tag (4 CPUs) within L3 (24 CPUs) on Kunpeng920 for
instance.

cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
CLS
MC
... I skip the NUMA levels

# cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -5
cpu0 0
domain0 00000000,00000000,0000000f <-- 4 CPUs <-- cluster_id
domain1 00000000,00000000,00ffffff <-- 24 CPUs

If you use cluster_id for 1. level cluster then you cover MC's 24 CPUs
> nesting ATM. And that's exactly what I am trying to get with this series[1]
>
>
>> And that's why I proposed to (ab)use llc_id to form the MC mask.
>>
>
> Sure, it is already supported IIUC by cpu_coregroup_mask in arch_topology.c
> We just need to make sure llc_id is set correctly in case of DT. Now if
> you are saying llc_sibling is not what you need but something else, then
> we may need to add that new mask and update cpu_coregroup_mask to choose
> that based on certain condition which I believe is bit complicated.
>
>> I'm not currently aware of another solution to get CLS somehow elegantly
>> into a DT system.
>
> Will grouping of CPUs into cluster they belong not good enough for CLS ?

No, IMHO then you'll cover MC and it's cpu_coregroup_mask() and you get
MC. ^^^^

> I thought that should suffice based on what we have in cpu_clustergroup_mask
> (i.e. cluster sibling mask)

For one level (MC) yes, but not for 2 (MC and CLS). And cluster_id was
introduced for the 2. level.

cpu_clustergroup_mask() is 0f, cpu_coregroup_mask() is 00ffffff.
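
For reference, the mask-to-level mapping discussed above comes from
default_topology[] in kernel/sched/topology.c, which looks roughly like
this (trimmed, around v5.18):

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
        { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

So on Kunpeng920 CLS gets the 0f mask and MC the 00ffffff mask shown above.
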
On Fri, May 13, 2022 at 02:04:29PM +0200, Dietmar Eggemann wrote:
> On 13/05/2022 13:03, Sudeep Holla wrote:
> > On Fri, May 13, 2022 at 12:42:00PM +0200, Dietmar Eggemann wrote:
> >> On 13/05/2022 11:03, Sudeep Holla wrote:
> >>> On Fri, May 13, 2022 at 10:34:00AM +0200, Dietmar Eggemann wrote:
> >>
> >> [...]
> >>
> >>>> @@ -527,7 +528,8 @@ static int __init parse_core(struct device_node *core, int package_id,
> >>>> return -EINVAL;
> >>>> }
> >>>>
> >>>> - cpu_topology[cpu].package_id = package_id;
> >>>> + cpu_topology[cpu].package_id = 0;
> >>>
> >>> While the above looks good and matches with what I am attempting to do
> >>> as well ...
> >>>
> >>>> + cpu_topology[cpu].llc_id = llc_id;
> >>>
> >>> This looks wrong for simple reason that this is derived incorrectly from
> >>> the cpu-map while there is no guarantee that it matches the last level
> >>> cache ID on the system as we didn't parse the cache topology for this.
> >>> So I disagree with this change as it might conflict with the actual and
> >>> correct llc_id.
> >>
> >> It might not match the LLC, that's true. Something we have already today
> >> in Android for DynamIQ clusters with big/Little. People using 1. level
> >> clusters to group CPUs according to uArch.
> >
> > Not sure if that is the correct representation of h/w cluster on those
> > platforms, but if they want to misguide OS with the f/w(DT in this case)
> > well that's their choice.
> >
> > The main point is we need to get the exact h/w topology information and
> > then we can decide how to present the below masks as required by the
> > scheduler for its sched domains.
> >
> >> My point is we manage to get:
> >>
> >> SMT - cpu_smt_mask()
> >> CLS - cpu_clustergroup_mask()
> >> MC - cpu_coregroup_mask()
> >> DIE - cpu_cpu_mask()
> >>
> >> covered in ACPI with the cpu_topology[] structure and if we want CLS on
> >> DT we have to save cluster_id for the 2. level (DT) cluster.
> >>
> >
> > I am not sure on the above point. Even with ACPI PPTT we are just setting
> > cluster_id based on first or leaf level of the clusters ignoring the
>
> Not sure about this. cluster_id was introduced last year into ACPI PPTT
> commit c5e22feffdd7 ("topology: Represent clusters of CPUs within a
> die") to cover L3-tag (4 CPUs) within L3 (24 CPUs) on Kunpeng920 for
> instance.
>
> cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> CLS
> MC
> ... I skip the NUMA levels
>
> # cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -5
> cpu0 0
> domain0 00000000,00000000,0000000f <-- 4 CPUs <-- cluster_id
> domain1 00000000,00000000,00ffffff <-- 24 CPUs
>
> If you use cluster_id for 1. level cluster then you cover MC's 24 CPUs
OK, the way I am looking at this from the CPU topology perspective seems to
be different from what you are expecting here w.r.t. sched_domains.

IMO, these cpumasks on their own must represent the real CPU topology, and
how they are used via cpu_*_mask by the scheduler domains is different.
> > nesting ATM. And that's exactly what I am trying to get with this series[1]
> >
> >
> >> And that's why I proposed to (ab)use llc_id to form the MC mask.
> >>
> >
> > Sure, it is already supported IIUC by cpu_coregroup_mask in arch_topology.c
> > We just need to make sure llc_id is set correctly in case of DT. Now if
> > you are saying llc_sibling is not what you need but something else, then
> > we may need to add that new mask and update cpu_coregroup_mask to choose
> > that based on certain condition which I believe is bit complicated.
> >
> >> I'm not currently aware of another solution to get CLS somehow elegantly
> >> into a DT system.
> >
> > Will grouping of CPUs into cluster they belong not good enough for CLS ?
>
> No, IMHO then you'll cover MC and it's cpu_coregroup_mask() and you get
> MC. ^^^^
>
On Juno with SCHED_CLUSTER and the cluster masks set, I see CLS and MC
domains.

> > I thought that should suffice based on what we have in cpu_clustergroup_mask
> > (i.e. cluster sibling mask)
>
> For one level (MC) yes, but not for 2 (MC and CLS). And cluster_id was
> introduces for the 2. level.
>
That sounds wrong and not what the ACPI PPTT code says. My series just aligns
with what is done with ACPI PPTT IIUC. I need to check again whether my
understanding differs from what is being done. But the example of Kunpeng920
aligns with my understanding.
--
Regards,
Sudeep
On 16/05/2022 12:35, Sudeep Holla wrote:
> On Fri, May 13, 2022 at 02:04:29PM +0200, Dietmar Eggemann wrote:
>> On 13/05/2022 13:03, Sudeep Holla wrote:
>>> On Fri, May 13, 2022 at 12:42:00PM +0200, Dietmar Eggemann wrote:
>>>> On 13/05/2022 11:03, Sudeep Holla wrote:
>>>>> On Fri, May 13, 2022 at 10:34:00AM +0200, Dietmar Eggemann wrote:
[...]
>> cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>> CLS
>> MC
>> ... I skip the NUMA levels
>>
>> # cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -5
>> cpu0 0
>> domain0 00000000,00000000,0000000f <-- 4 CPUs <-- cluster_id
>> domain1 00000000,00000000,00ffffff <-- 24 CPUs
>>
>> If you use cluster_id for 1. level cluster then you cover MC's 24 CPUs
>
> OK, the way I am looking at this from CPU topology perspective seem to be
> different from what you are expecting here w.r.t sched_domains.
>
> IMO, these cpumasks on its own must represent real CPU topology and how it
> is used via cpu_*_mask by the scheduler domains is different.

I'm afraid that if we want to change the mapping of scheduler (topology)
flags (like SD_SHARE_PKG_RESOURCES) via `*_flags()` of default_topology[]
we would have to consider all ACPI corner cases (see cpu_coregroup_mask())
as well.

See (1) further down.
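
As a rough sketch of what those per-level flag callbacks return (from
include/linux/sched/topology.h, around v5.18):

#ifdef CONFIG_SCHED_CLUSTER
static inline int cpu_cluster_flags(void)
{
        return SD_SHARE_PKG_RESOURCES;
}
#endif

#ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void)
{
        return SD_SHARE_PKG_RESOURCES;
}
#endif

i.e. both CLS and MC currently imply SD_SHARE_PKG_RESOURCES, which is why
the highest level carrying that flag has to match the real LLC span.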
[...]
> I see on Juno with SCHED_CLUSTER and cluster masks set, I see CLS and MC
> domains.

Right but that's not really correct from the scheduler's POV.

Juno has LLC on cpumasks [0,3-5] and [1-2], not on [0-5]. So the most
important information is the highest Sched Domain with the
SD_SHARE_PKG_RESOURCES flag. This is the MC layer (cpu_core_flags() in
default_topology[]). So the scheduler would think that [0-5] are sharing
LLC.

You have LLC at:

cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
                                       ^^^^^^
0,3-5

but the scheduler sees the highest SD_SHARE_PKG_RESOURCES on MC:

cat /sys/kernel/debug/sched/domains/cpu0/domain1/flags
                                         ^^^^^^^
... SD_SHARE_PKG_RESOURCES ...
[...]
>> For one level (MC) yes, but not for 2 (MC and CLS). And cluster_id was
>> introduces for the 2. level.
>>
>
> That sounds wrong and not what ACPI PPTT code says. My series just aligns
> with what is done with ACPI PPTT IIUC. I need to check that again if my
> understand differs from what is being done. But the example of Kunpeng920
> aligns with my understanding.

(1) IMHO, as long as we don't consider cache (llc) information in DT we
can't have the same coverage as ACPI.

Let's take an ACPI !CONFIG_NUMA Kunpeng920 as an example.

# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
CLS
MC
DIE

# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
^^^^^^^^^^^^^^^^^^^^^^
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
^^^^^^^^^^^^^^^^^^^^^^
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_PREFER_SIBLING

cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -4
cpu0 0
domain0 00000000,00000000,0000000f
domain1 00000000,00000000,00ffffff <-- (2)
domain2 ffffffff,ffffffff,ffffffff

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0
cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
0
cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
0-3
cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
0-47
cat /sys/devices/system/cpu/cpu0/topology/package_cpus_list
0-47

The MC mask 00ffffff is not derived from any topology mask but from the
llc (index3) mask:

cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
                                       ^^^^^^
0-23 <-- (2)

Coming back to the original request (the support of Armv9 L2 complexes
in Android) from Qing on systems like QC SM8450:

      .---------------.
CPU   |0 1 2 3 4 5 6 7|
      +---------------+
uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
      +---------------+
L2    |   |   | | | | |
      +---------------+
L3    |<--         -->|
      +---------------+
      |<-- cluster -->|
      +---------------+
      |<--   DSU   -->|
      '---------------'

This still wouldn't be possible. We know that Phantom Domain (grouping
after uarch) is not supported in mainline but heavily used in Android
(legacy deps).

If we say we only support L2 sharing (asymmetric here since only CPU0-3
have it !!!) and we don't support Phantom then your approach will work
for such systems.
Hi Dietmar,
Thanks for the detailed response.
On Tue, May 17, 2022 at 11:14:44AM +0200, Dietmar Eggemann wrote:
> On 16/05/2022 12:35, Sudeep Holla wrote:
> > On Fri, May 13, 2022 at 02:04:29PM +0200, Dietmar Eggemann wrote:
> >> On 13/05/2022 13:03, Sudeep Holla wrote:
> >>> On Fri, May 13, 2022 at 12:42:00PM +0200, Dietmar Eggemann wrote:
> >>>> On 13/05/2022 11:03, Sudeep Holla wrote:
> >>>>> On Fri, May 13, 2022 at 10:34:00AM +0200, Dietmar Eggemann wrote:
>
> [...]
>
> >> cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> >> CLS
> >> MC
> >> ... I skip the NUMA levels
> >>
> >> # cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -5
> >> cpu0 0
> >> domain0 00000000,00000000,0000000f <-- 4 CPUs <-- cluster_id
> >> domain1 00000000,00000000,00ffffff <-- 24 CPUs
> >>
> >> If you use cluster_id for 1. level cluster then you cover MC's 24 CPUs
> >
> > OK, the way I am looking at this from CPU topology perspective seem to be
> > different from what you are expecting here w.r.t sched_domains.
> >
> > IMO, these cpumasks on its own must represent real CPU topology and how it
> > is used via cpu_*_mask by the scheduler domains is different.
>
> I'm afraid that in case we want to change the mapping of scheduler
> (topology) flags (like SD_SHARE_PKG_RESOURCES) via `*_flags()` of
> default_topology[] we would have to consider all ACPI corner cases (see
> cpu_coregroup_mask()) as well.
>
> See (1) further down.
>
Understood.
> [...]
>
> > I see on Juno with SCHED_CLUSTER and cluster masks set, I see CLS and MC
> > domains.
>
> Right but that's not really correct from the scheduler's POV.
>
OK
> Juno has LLC on cpumasks [0,3-5] and [1-2], not on [0-5]. So the most
> important information is the highest Sched Domain with the
> SD_SHARE_PKG_RESOURCES flag. This is the MC layer (cpu_core_flags() in
> default_topology[]). So the scheduler would think that [0-5] are sharing
> LLC.
>
Ah OK, but if LLC sibling masks are updated, cpu_coregroup_mask() takes
care of it IIUC, right ?
> You have LLC at:
>
> cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
> ^^^^^^
> 0,3-5
>
> but the scheduler sees the highest SD_SHARE_PKG_RESOURCES on MC:
>
> cat /sys/kernel/debug/sched/domains/cpu0/domain1/flags
> ^^^^^^^
> ... SD_SHARE_PKG_RESOURCES ...
>
> [...]
>
> >> For one level (MC) yes, but not for 2 (MC and CLS). And cluster_id was
> >> introduces for the 2. level.
> >>
> >
> > That sounds wrong and not what ACPI PPTT code says. My series just aligns
> > with what is done with ACPI PPTT IIUC. I need to check that again if my
> > understand differs from what is being done. But the example of Kunpeng920
> > aligns with my understanding.
>
> (1) IMHO, as long as we don't consider cache (llc) information in DT we
> can't have the same coverage as ACPI.
>
Agreed. But we are not changing any topology masks to suit sched domain
requirements, since they get exposed to userspace as is.
> Let's take an ACPI !CONFIG_NUMA Kunpeng920 as an example.
>
> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> CLS
> MC
> DIE
>
> # cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
> SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
> ^^^^^^^^^^^^^^^^^^^^^^
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
> SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
> ^^^^^^^^^^^^^^^^^^^^^^
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
> SD_PREFER_SIBLING
>
> cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -4
> cpu0 0
> domain0 00000000,00000000,0000000f
> domain1 00000000,00000000,00ffffff <-- (2)
> domain2 ffffffff,ffffffff,ffffffff
>
> cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
> 0
> cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
> 0
> cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
> 0-3
> cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
> 0-47
> cat /sys/devices/system/cpu/cpu0/topology/package_cpus_list
> 0-47
>
> The MC mask 00ffffff is not derived from any topology mask but from the
> llc (index3) mask:
>
> cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
> ^^^^^^
> 0-23 <-- (2)
>
Understood and on Juno if we get llc_siblings right, the sched domains
must be sorted correctly ?
>
> Coming back to the original request (the support of Armv9 L2 complexes
> in Android) from Qing on systems like QC SM8450:
>
> .---------------.
> CPU |0 1 2 3 4 5 6 7|
> +---------------+
> uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
> +---------------+
> L2 | | | | | | |
> +---------------+
> L3 |<-- -->|
> +---------------+
> |<-- cluster -->|
> +---------------+
> |<-- DSU -->|
> '---------------'
>
> This still wouldn't be possible. We know that Phantom Domain (grouping
> after uarch) is not supported in mainline but heavily used in Android
> (legacy deps).
>
Correct, unless you convince people to get a suitable notion of *whatever*
phantom domains represent into the DT bindings, they don't exist.
If someone wants to support this config, they must first represent that
in the firmware so that the OS can infer the information from it.
> If we say we only support L2 sharing (asymmetric here since only CPU0-3
> have it !!!) and we don't support Phantom then your approach will work
> for such systems.
Thanks. As I said, what is a *Phantom* domain ;) ? Can you point me to the
DT or ACPI binding for the same ? Just kidding, I know they don't exist.

Anyway, I understand your concern that llc_sibling must go with my set
of topology changes, which I agree with. Is that the only concern ?
--
Regards,
Sudeep
On 17/05/2022 12:57, Sudeep Holla wrote:
> Hi Dietmar,
>
> Thanks for the detailed response.
>
> On Tue, May 17, 2022 at 11:14:44AM +0200, Dietmar Eggemann wrote:
>> On 16/05/2022 12:35, Sudeep Holla wrote:
>>> On Fri, May 13, 2022 at 02:04:29PM +0200, Dietmar Eggemann wrote:
>>>> On 13/05/2022 13:03, Sudeep Holla wrote:
>>>>> On Fri, May 13, 2022 at 12:42:00PM +0200, Dietmar Eggemann wrote:
>>>>>> On 13/05/2022 11:03, Sudeep Holla wrote:
>>>>>>> On Fri, May 13, 2022 at 10:34:00AM +0200, Dietmar Eggemann wrote:
[...]
>>> I see on Juno with SCHED_CLUSTER and cluster masks set, I see CLS and MC
>>> domains.
>>
>> Right but that's not really correct from the scheduler's POV.
>>
>
> OK
>
>> Juno has LLC on cpumasks [0,3-5] and [1-2], not on [0-5]. So the most
>> important information is the highest Sched Domain with the
>> SD_SHARE_PKG_RESOURCES flag. This is the MC layer (cpu_core_flags() in
>> default_topology[]). So the scheduler would think that [0-5] are sharing
>> LLC.
>>
>
> Ah OK, but if LLC sibling masks are updated, cpu_coregroup_mask() takes
> care of it IIUC, right ?
Yes. That's the:

691         if (cpu_topology[cpu].llc_id != -1) {
692                 if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
693                         core_mask = &cpu_topology[cpu].llc_sibling;
694         }

condition in cpu_coregroup_mask().
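
For context, the fields involved live in struct cpu_topology, roughly
(include/linux/arch_topology.h, around v5.18):

struct cpu_topology {
        int thread_id;
        int core_id;
        int cluster_id;
        int package_id;
        int llc_id;
        cpumask_t thread_sibling;
        cpumask_t core_sibling;
        cpumask_t cluster_sibling;
        cpumask_t llc_sibling;
};

llc_id/llc_sibling are only filled in on ACPI systems today (from PPTT),
which is why the condition above never fires on DT without extra parsing.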
>
>> You have LLC at:
>>
>> cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
>> ^^^^^^
>> 0,3-5
>>
>> but the scheduler sees the highest SD_SHARE_PKG_RESOURCES on MC:
>>
>> cat /sys/kernel/debug/sched/domains/cpu0/domain1/flags
>> ^^^^^^^
>> ... SD_SHARE_PKG_RESOURCES ...
>>
>> [...]
>>
>>>> For one level (MC) yes, but not for 2 (MC and CLS). And cluster_id was
>>>> introduces for the 2. level.
>>>>
>>>
>>> That sounds wrong and not what ACPI PPTT code says. My series just aligns
>>> with what is done with ACPI PPTT IIUC. I need to check that again if my
>>> understand differs from what is being done. But the example of Kunpeng920
>>> aligns with my understanding.
>>
>> (1) IMHO, as long as we don't consider cache (llc) information in DT we
>> can't have the same coverage as ACPI.
>>
>
> Agreed. But we are not changing any topology masks as per sched domain
> requirements as they get exposed to the userspace as is.
I see. Your requirement is that the information under
/sys/devices/system/cpu/cpuX/{topology, cache} is valid for userspace.

I'm not sure that we can get core_siblings_list and package_cpus_list
set up correctly.

With your patch we have now:

root@juno:/sys/devices/system/cpu/cpu0/topology# cat core_siblings_list
0-5
root@juno:/sys/devices/system/cpu/cpu0/topology# cat package_cpus_list
0-5

Before we had [0,3-5] for both.

I'm afraid we can have 2 different masks here because of:

72 define_siblings_read_func(core_siblings, core_cpumask);
                                            ^^^^^^^^^^^^
88 define_siblings_read_func(package_cpus, core_cpumask);
                                           ^^^^^^^^^^^^
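
Both of those sysfs attributes are backed by the same underlying mask;
roughly, from include/linux/arch_topology.h (around v5.18):

#define topology_core_cpumask(cpu)      (&cpu_topology[cpu].core_sibling)
#define topology_cluster_cpumask(cpu)   (&cpu_topology[cpu].cluster_sibling)

i.e. core_siblings and package_cpus both read the core_sibling mask.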
[...]
> Understood and on Juno if we get llc_siblings right, the sched domains
> must be sorted correctly ?
Yes, then it should do exactly what ACPI is leading us to on this !NUMA
Kunpeng920 example.
>> Coming back to the original request (the support of Armv9 L2 complexes
>> in Android) from Qing on systems like QC SM8450:
>>
>> .---------------.
>> CPU |0 1 2 3 4 5 6 7|
>> +---------------+
>> uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
>> +---------------+
>> L2 | | | | | | |
>> +---------------+
>> L3 |<-- -->|
>> +---------------+
>> |<-- cluster -->|
>> +---------------+
>> |<-- DSU -->|
>> '---------------'
>>
>> This still wouldn't be possible. We know that Phantom Domain (grouping
>> after uarch) is not supported in mainline but heavily used in Android
>> (legacy deps).
>>
>
> Correct, unless you convince to get a suitable notion of *whatever*
> phantom domains represent into the DT bindings, they don't exist.
> If someone wants to support this config, they must first represent that
> in the firmware so that OS can infer information from the same.
OK, we don't support Phantom domains via 1. level Cluster in cpu-map
anymore. We already explicitly informed the Android community. But I'm
sure people will only discover this if something breaks on their
platforms and they are able to detect this.
>> If we say we only support L2 sharing (asymmetric here since only CPU0-3
>> have it !!!) and we don't support Phantom then your approach will work
>> for such systems.
>
> Thanks, as I said what is *Phantom* domain ;) ? Can you point me to the
> DT or ACPI binding for the same ? Just kidding, I know they don't exist.
They do ;-) 1. level Clusters ... but they are used for uArch
boundaries, not for LLC boundaries. That's the existing issue in DT:
topology information has 2 sources, (1) cpu-map and (2) cache info.
> Anyways, I understand your concern that llc_sibling must go with my set
> of topology changes which I agree. Is that the only concern ?
Cool. Let me review your v2 first ;-)
On Wed, May 18, 2022 at 12:23:35PM +0200, Dietmar Eggemann wrote:
> On 17/05/2022 12:57, Sudeep Holla wrote:
[...]
> > Ah OK, but if LLC sibling masks are updated, cpu_coregroup_mask() takes
> > care of it IIUC, right ?
>
> Yes. That's the:
>
> 691 if (cpu_topology[cpu].llc_id != -1) {
> 692 if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
> 693 core_mask = &cpu_topology[cpu].llc_sibling;
> 694 }
>
> condition in cpu_coregroup_mask().
>
Correct, I am referring to that.
> >
> > Agreed. But we are not changing any topology masks as per sched domain
> > requirements as they get exposed to the userspace as is.
>
> I see. Your requirement is valid information under
> /sys/devices/system/cpu/cpuX/{topology, cache} for userspace.
>
> I'm not sure that we can get core_siblings_list and package_cpus_list
> correctly setup.
>
> With your patch we have now:
>
> root@juno:/sys/devices/system/cpu/cpu0/topology# cat core_siblings_list
> 0-5
> root@juno:/sys/devices/system/cpu/cpu0/topology# cat package_cpus_list
> 0-5
>
> Before we had [0,3-5] for both.
>
Correct, and that was plain wrong for a single socket system. It must
cover all the CPUs. Tools like lscpu report them as 2 socket systems,
which is wrong. There were reports about that, but no one really chased it
up to get it fixed, so I assumed it is not so important; still, it is
wrong. ACPI platforms cared and wanted it fixed with ACPI PPTT since
the PPTT support. So check the difference without my patches on Juno
in DT and ACPI boot. We should get the same output for both.
>
> I'm afraid we can have 2 different mask here because of:
>
> 72 define_siblings_read_func(core_siblings, core_cpumask);
> ^^^^^^^^^^^^
> 88 define_siblings_read_func(package_cpus, core_cpumask);
> ^^^^^^^^^^^^
>
Yes, even I got confused, and the Documentation revealed one is deprecated
or obsolete (Documentation/ABI/stable/sysfs-devices-system-cpu):

74 What: /sys/devices/system/cpu/cpuX/topology/package_cpus
75 Description: internal kernel map of the CPUs sharing the same physical_package_id.
76 (deprecated name: "core_siblings").
77 Values: hexadecimal bitmask.
78
79 What: /sys/devices/system/cpu/cpuX/topology/package_cpus_list
80 Description: human-readable list of CPUs sharing the same physical_package_id.
81 The format is like 0-3, 8-11, 14,17.
82 (deprecated name: "core_siblings_list")
83 Values: decimal list.
> [...]
>
> > Understood and on Juno if we get llc_siblings right, the sched domains
> > must be sorted correctly ?
>
> Yes, then it should do exactly what ACPI is leading us to on this !NUMA
> Kunpeng920 example.
>
Cool.
> >> Coming back to the original request (the support of Armv9 L2 complexes
> >> in Android) from Qing on systems like QC SM8450:
> >>
> >> .---------------.
> >> CPU |0 1 2 3 4 5 6 7|
> >> +---------------+
> >> uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
> >> +---------------+
> >> L2 | | | | | | |
> >> +---------------+
> >> L3 |<-- -->|
> >> +---------------+
> >> |<-- cluster -->|
> >> +---------------+
> >> |<-- DSU -->|
> >> '---------------'
> >>
> >> This still wouldn't be possible. We know that Phantom Domain (grouping
> >> after uarch) is not supported in mainline but heavily used in Android
> >> (legacy deps).
> >>
> >
> > Correct, unless you convince to get a suitable notion of *whatever*
> > phantom domains represent into the DT bindings, they don't exist.
> > If someone wants to support this config, they must first represent that
> > in the firmware so that OS can infer information from the same.
>
> OK, we don't support Phantom domains via 1. level Cluster in cpu-map
> anymore. We already explicitly informed the Android community. But I'm
> sure people will only discover this if something breaks on their
> platforms and they are able to detect this.
>
Again, if it was working just by cache with their own phantom domains that
are not upstream, then nothing is broken. At least I see that we have now
fixed and aligned DT with ACPI. While I understand your concern, I see
this as the main issue and not the sched domains, which may or may not be
using the topology info incorrectly.
> >> If we say we only support L2 sharing (asymmetric here since only CPU0-3
> >> have it !!!) and we don't support Phantom then your approach will work
> >> for such systems.
> >
> > Thanks, as I said what is *Phantom* domain ;) ? Can you point me to the
> > DT or ACPI binding for the same ? Just kidding, I know they don't exist.
>
> They do ;-) 1. level Clusters ... but they are used for uArch
> boundaries, not for LLC boundaries. That's the existing issue in DT,
> topology information has 2 sources: (1) cpu-map and (2) cache info.
>
Sure, they can fix or add more optimisation on top of what I have sent[1]
> > Anyways, I understand your concern that llc_sibling must go with my set
> > of topology changes which I agree. Is that the only concern ?
>
> Cool. Let me review your v2 first ;-)
Thanks.
--
Regards,
Sudeep
[1] https://lore.kernel.org/lkml/[email protected]
On 18/05/2022 12:43, Sudeep Holla wrote:
> On Wed, May 18, 2022 at 12:23:35PM +0200, Dietmar Eggemann wrote:
>> On 17/05/2022 12:57, Sudeep Holla wrote:
> [...]
[...]
>> I see. Your requirement is valid information under
>> /sys/devices/system/cpu/cpuX/{topology, cache} for userspace.
>>
>> I'm not sure that we can get core_siblings_list and package_cpus_list
>> correctly setup.
>>
>> With your patch we have now:
>>
>> root@juno:/sys/devices/system/cpu/cpu0/topology# cat core_siblings_list
>> 0-5
>> root@juno:/sys/devices/system/cpu/cpu0/topology# cat package_cpus_list
>> 0-5
>>
>> Before we had [0,3-5] for both.
>>
>
> Correct and that was pure wrong for a single socket system. It must
> cover all the CPUs. Tools like lscpu reports them as 2 socket system
> which is wrong. There were reports for that but no one really chased it
> up to get it fixed. So assumed it is not so important but still it is
> wrong. ACPI platforms cared and wanted it fixed with ACPI PPTT since
> the PPTT support. So check the difference without my patches on Juno
> in DT and ACPI boot. We should get same output for both.
>
>>
>> I'm afraid we can have 2 different mask here because of:
Sorry, s/can/can't
>> 72 define_siblings_read_func(core_siblings, core_cpumask);
>> ^^^^^^^^^^^^
>> 88 define_siblings_read_func(package_cpus, core_cpumask);
>> ^^^^^^^^^^^^
>>
>
> Yes even I got confused and the Documentation revealed one is deprecated
> or obsolete(Documentation/ABI/stable/sysfs-devices-system-cpu)
>
> 74 What: /sys/devices/system/cpu/cpuX/topology/package_cpus
> 75 Description: internal kernel map of the CPUs sharing the same physical_package_id.
> 76 (deprecated name: "core_siblings").
> 77 Values: hexadecimal bitmask.
> 78
> 79 What: /sys/devices/system/cpu/cpuX/topology/package_cpus_list
> 80 Description: human-readable list of CPUs sharing the same physical_package_id.
> 81 The format is like 0-3, 8-11, 14,17.
> 82 (deprecated name: "core_siblings_list")
> 83 Values: decimal list.
Ah, that makes it clearer to me! Thanks. So the DIE boundary, i.e.
cpu_cpu_mask() {return cpumask_of_node(cpu_to_node(cpu))}, is not related
to the package cpumask at all. This is why we have the first if condition
in cpu_coregroup_mask():

658         const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
...
661         if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
662                 /* not numa in package, lets use the package siblings */
663                 core_mask = &cpu_topology[cpu].core_sibling;
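
Putting the two excerpts (lines 658-663 above and 691-694 earlier in the
thread) together, the whole helper is roughly this
(drivers/base/arch_topology.c, around v5.18):

const struct cpumask *cpu_coregroup_mask(int cpu)
{
        const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

        /* Find the smaller of NUMA, core or LLC siblings */
        if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
                /* not numa in package, lets use the package siblings */
                core_mask = &cpu_topology[cpu].core_sibling;
        }
        if (cpu_topology[cpu].llc_id != -1) {
                if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
                        core_mask = &cpu_topology[cpu].llc_sibling;
        }

        return core_mask;
}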
[...]
>> OK, we don't support Phantom domains via 1. level Cluster in cpu-map
>> anymore. We already explicitly informed the Android community. But I'm
>> sure people will only discover this if something breaks on their
>> platforms and they are able to detect this.
>>
>
> Again if it was working just by cache with their own phantom domains that
> are not upstream, then nothing is broken. At-least I see that we have now
> fixed and aligned DT with ACPI. While I understand your concern, I see
> this as main issue and not sched domains which may or may not be using
> topology info incorrectly.
OK, let's see how you incorporated llc into the topology code in your v2
first.
>>>> If we say we only support L2 sharing (asymmetric here since only CPU0-3
>>>> have it !!!) and we don't support Phantom then your approach will work
>>>> for such systems.
>>>
>>> Thanks, as I said what is *Phantom* domain ;) ? Can you point me to the
>>> DT or ACPI binding for the same ? Just kidding, I know they don't exist.
>>
>> They do ;-) 1. level Clusters ... but they are used for uArch
>> boundaries, not for LLC boundaries. That's the existing issue in DT,
>> topology information has 2 sources: (1) cpu-map and (2) cache info.
>>
>
> Sure, they can fix or add more optimisation on top of what I have sent[1]
If you add llc parsing in your v2, they should have everything they need
for Armv9 with uArch boundaries and L2 complexes. What I'm interested in
is seeing how we solve the 2 sources (cache and cpu-map) issue here.

Example:

cpu-map:

    socket
      cluster    <-- (1)
        core
          thread

cache:

    llc

How do people know that (1) is L2 and not LLC?
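
Purely as a hypothetical illustration (this helper does not exist in
mainline, and it assumes a kernel context with arch_topology data), the
kernel could only treat the cpu-map cluster as a cache boundary when the
two sources actually agree:

#include <linux/arch_topology.h>
#include <linux/cpumask.h>

/* Hypothetical check: does the DT cluster match the LLC span? */
static bool cluster_matches_llc(int cpu)
{
        return cpumask_equal(&cpu_topology[cpu].cluster_sibling,
                             &cpu_topology[cpu].llc_sibling);
}

Without the cache info parsed, there is nothing to compare against.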
But let us switch the discussion to your v2 on this one. I have to see
your implementation first.
> [1] https://lore.kernel.org/lkml/[email protected]
[...]
On Wed, May 18, 2022 at 05:54:23PM +0200, Dietmar Eggemann wrote:
> On 18/05/2022 12:43, Sudeep Holla wrote:
> > On Wed, May 18, 2022 at 12:23:35PM +0200, Dietmar Eggemann wrote:
> >> On 17/05/2022 12:57, Sudeep Holla wrote:
> > [...]
>
> [...]
>
> >> I see. Your requirement is valid information under
> >> /sys/devices/system/cpu/cpuX/{topology, cache} for userspace.
> >>
> >> I'm not sure that we can get core_siblings_list and package_cpus_list
> >> correctly setup.
> >>
> >> With your patch we have now:
> >>
> >> root@juno:/sys/devices/system/cpu/cpu0/topology# cat core_siblings_list
> >> 0-5
> >> root@juno:/sys/devices/system/cpu/cpu0/topology# cat package_cpus_list
> >> 0-5
> >>
> >> Before we had [0,3-5] for both.
> >>
> >
> > Correct and that was pure wrong for a single socket system. It must
> > cover all the CPUs. Tools like lscpu reports them as 2 socket system
> > which is wrong. There were reports for that but no one really chased it
> > up to get it fixed. So assumed it is not so important but still it is
> > wrong. ACPI platforms cared and wanted it fixed with ACPI PPTT since
> > the PPTT support. So check the difference without my patches on Juno
> > in DT and ACPI boot. We should get same output for both.
> >
> >>
> >> I'm afraid we can have 2 different mask here because of:
>
> Sorry, s/can/can't
>
No worries, I assumed so and read it that way.
> >> 72 define_siblings_read_func(core_siblings, core_cpumask);
> >> ^^^^^^^^^^^^
> >> 88 define_siblings_read_func(package_cpus, core_cpumask);
> >> ^^^^^^^^^^^^
> >>
> >
> > Yes even I got confused and the Documentation revealed one is deprecated
> > or obsolete(Documentation/ABI/stable/sysfs-devices-system-cpu)
> >
> > 74 What: /sys/devices/system/cpu/cpuX/topology/package_cpus
> > 75 Description: internal kernel map of the CPUs sharing the same physical_package_id.
> > 76 (deprecated name: "core_siblings").
> > 77 Values: hexadecimal bitmask.
> > 78
> > 79 What: /sys/devices/system/cpu/cpuX/topology/package_cpus_list
> > 80 Description: human-readable list of CPUs sharing the same physical_package_id.
> > 81 The format is like 0-3, 8-11, 14,17.
> > 82 (deprecated name: "core_siblings_list")
> > 83 Values: decimal list.
>
> Ah, that makes it more clear to me! Thanks. So DIE boundary, i.e.
> cpu_cpu_mask() {return cpumask_of_node(cpu_to_node(cpu))} is not related
> to package cpumask at all. This is why we have the first if condition in
> cpu_coregroup_mask():
>
> 658 const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));
> ...
> 661 if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask)) {
> 662 /* not numa in package, lets use the package siblings */
> 663 core_mask = &cpu_topology[cpu].core_sibling;
>
Correct, either I got confused with the names or it was different from what
I had seen the last time I looked at these, more than a couple of years ago.
> [...]
>
> >> OK, we don't support Phantom domains via 1. level Cluster in cpu-map
> >> anymore. We already explicitly informed the Android community. But I'm
> >> sure people will only discover this if something breaks on their
> >> platforms and they are able to detect this.
> >>
> >
> > Again if it was working just by cache with their own phantom domains that
> > are not upstream, then nothing is broken. At-least I see that we have now
> > fixed and aligned DT with ACPI. While I understand your concern, I see
> > this as main issue and not sched domains which may or may not be using
> > topology info incorrectly.
>
> OK, lets see how you incorporated llc into the topology code in your v2
> first.
>
Sure, thanks for agreeing to review those implicitly :).
> >>>> If we say we only support L2 sharing (asymmetric here since only CPU0-3
> >>>> have it !!!) and we don't support Phantom then your approach will work
> >>>> for such systems.
> >>>
> >>> Thanks, as I said what is *Phantom* domain ;) ? Can you point me to the
> >>> DT or ACPI binding for the same ? Just kidding, I know they don't exist.
> >>
> >> They do ;-) 1. level Clusters ... but they are used for uArch
> >> boundaries, not for LLC boundaries. That's the existing issue in DT,
> >> topology information has 2 sources: (1) cpu-map and (2) cache info.
> >>
> >
> > Sure, they can fix or add more optimisation on top of what I have sent[1]
>
> If you add llc parsing in your v2, they should have everything they need
> for Armv9 with uArch boundaries and L2 complexes. What I'm interested in
> is seeing how we solve the 2 sources (cache and cpu-map) issue here.
Not sure if just llc will resolve what we need on those Armv9 systems unless
L2 = LLC. If L2 != LLC, then my patches won't help that much.
> Example:
>
> cpu-map:
>
> socket
> cluster <-- (1)
> core
> thread
>
> cache:
>
> llc
>
> How do people know that (1) is L2 and not LLC?
>
We can get the cpumask of CPUs sharing each unique cache in the system.
That is not a problem, we have the cacheinfo driver for that. But that is
initialised quite late for a reason, and if we need that info much earlier
then that needs some reworking.

My patches address only the LLC and are hence much simpler.
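
As a rough sketch of what the cacheinfo driver exposes once it has been
populated (late in boot), assuming a kernel context around v5.18:

#include <linux/cacheinfo.h>
#include <linux/cpumask.h>
#include <linux/printk.h>

/* Walk the cache leaves of one CPU and print who shares each level. */
static void print_cache_siblings(unsigned int cpu)
{
        struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
        unsigned int i;

        for (i = 0; i < ci->num_leaves; i++) {
                struct cacheinfo *leaf = &ci->info_list[i];

                pr_info("cpu%u L%u: %*pbl\n", cpu, leaf->level,
                        cpumask_pr_args(&leaf->shared_cpu_map));
        }
}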
> But let us switch the discussion to you v2 on this one. I have to see
> your implementation first.
>
Sure, I replied here just to address any open questions.
Let us discuss any issues in the mail thread below.
> > [1] https://lore.kernel.org/lkml/[email protected]
>
> [...]
--
Regards,
Sudeep