2022-04-27 09:29:58

by 王擎

Subject: [PATCH V2] arm64: add SCHED_CLUSTER's dependency on ACPI


>> On Tue, Apr 26, 2022 at 06:52:34AM +0000, 王擎 wrote:
>>>
>>>>>>>>> From: Wang Qing <[email protected]>
>>>>>>>>>
>>>>>>>>> The cluster sched_domain is configured via cpu_topology[].cluster_sibling,
>>>>>>>>> which is set from cluster_id, and cluster_id can only be obtained from ACPI.
>>>>>>>>>
>>>>>>>>> If the system does not enable ACPI, cluster_id is always -1, so even
>>>>>>>>> enabling SCHED_CLUSTER has no effect, which is misleading.
>>>>>>>>>
>>>>>>>>> So add SCHED_CLUSTER's dependency on ACPI here.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Any reason why this can't be extended to support DT-based systems via
>>>>>>>> cpu-map in the device tree? IMO we have almost everything w.r.t. topology
>>>>>>>> in DT, and no reason to deviate this feature between ACPI and DT.
>>>>>>>>
>>>>>>> That's the problem: we parse out the "cluster" info according to the
>>>>>>> description in cpu-map, but assign it to package_id, which is used to
>>>>>>> configure the MC sched domain, not the cluster sched domain.
>>>>>>>
>>>>>>
>>>>>> Right, we haven't updated the code after updating the bindings to match
>>>>>> ACPI sockets, which are the physical package boundaries. Clusters are not
>>>>>> the physical boundaries, and the current topology code is not 100% aligned
>>>>>> with the bindings after commit 849b384f92bc ("Documentation: DT: arm: add
>>>>>> support for sockets defining package boundaries").
>>>>>
>>>>> I see, but that commit landed a long time ago; why hasn't it been used
>>>>> widely? Maybe I can help with it if you need.
>>>>>
>>>>
>>>> I assume no one cared or had a requirement for the same.
>>>
>>> It took me a while to find the root cause of why enabling SCHED_CLUSTER
>>> didn't work.
>>>
>>> We should add SCHED_CLUSTER's dependency until DT support is
>>> implemented. Otherwise, everyone who doesn't have ACPI but uses
>>> SCHED_CLUSTER will hit this problem.
>>>
>>
>> I am fine with that, or with marking it broken for DT, but ideally I wouldn't
>> want to create an unnecessary dependency on ACPI or DT when both support
>> the feature.
>
>IMHO trying to introduce SCHED_COMPLEX for DT next to the Linux-wide
>available SCHED_CLUSTER (used only for ACPI right now) is the wrong way.
>
>_If_ asymmetric sub-clustering of CPUs underneath the LLC (L3) makes any
>sense on Armv9 single-DSU systems like:
>
>       .---------------.
>CPU    |0 1 2 3 4 5 6 7|
>       +---------------+
>uarch  |l l l l m m m b| (so called tri-gear: little, medium, big)
>       +---------------+
>L2     |   |   | | | | |
>       +---------------+
>L3     |<--         -->|
>       +---------------+
>       |<-- cluster -->|
>       +---------------+
>       |<--   DSU   -->|
>       '---------------'
>
>(and I'm not saying here it does!), then the existing SCHED_CLUSTER
>should be used in DT as well, since it provides exactly the same
>functionality for the task scheduler no matter whether it is set up from
>ACPI or DT.
>
>parse_cluster() -> parse_core() should be changed to be able to decode
>both IDs (package_id and cluster_id) in this case.

Totally agree, but it is not implemented yet. Because cluster_id is currently
used to describe the package/socket, the modification will involve all DTS files.

>
>DT's cpu_map would have to be changed to code 2 level setups.
>
>For a system like the one above it should look like:
>
> cpu-map {
>     cluster0 {
>         foo0 {
>             core0 {
>                 cpu = <&cpu0>;
>             };
>             core1 {
>                 cpu = <&cpu1>;
>             };
>         };
>         foo2 {
>             core2 {
>                 cpu = <&cpu2>;
>             };
>             core3 {
>                 cpu = <&cpu3>;
>             };
>         };
>     };
>     cluster1 {
>         core0 {
>             cpu = <&cpu4>;
>         };
>         core1 {
>             cpu = <&cpu5>;
>         };
>         core2 {
>             cpu = <&cpu6>;
>         };
>     };
>     cluster2 {
>         core0 {
>             cpu = <&cpu7>;
>         };
>     };
> };
>
>I pimped my Hikey 960 to look like one of those Armv9 4-3-1 systems with
>L2 complexes on the LITTLEs and I get:

This system is exactly what I mentioned, but I have a question:
how did you parse out the cluster_id based on foo0/foo2?
Because if ACPI is not used, cluster_id is always -1.

What I want to do is to change foo0/foo2 to complex0/complex2 here,
and then parse it like parse_cluster() -> parse_complex() -> parse_core().

>
># cat /sys/kernel/debug/sched/domains/cpu*/domain*/name
>CLS
>MC
>DIE
>
>CLS
>MC
>DIE
>
>CLS
>MC
>DIE
>
>CLS
>MC
>DIE
>
>MC
>DIE
>
>MC
>DIE
>
>MC
>DIE
>
>DIE
>
>cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd]
>
>cpu0 0
>domain0 03
>domain1 0f
>domain2 ff
>cpu1 0
>domain0 03
>domain1 0f
>domain2 ff
>cpu2 0
>domain0 0c
>domain1 0f
>domain2 ff
>cpu3 0
>domain0 0c
>domain1 0f
>domain2 ff
>cpu4 0
>domain0 70
>domain1 ff
>cpu5 0
>domain0 70
>domain1 ff
>cpu6 0
>domain0 70
>domain1 ff
>cpu7 0
>domain0 ff
>
>Like I mentioned earlier, I'm not sure if this additional complexity
>makes sense on mobile systems running EAS (since only CFS load-balancing
>on little CPUs would be affected).
>
>But my hunch is that this setup is what you want for your system. If we
>could agree on this one, that would already be some progress to see the
>entire story here.

Yes, that's what I want, but I'm still a little confused: why do we use MC to
describe the "cluster" and CLS to describe the "complex"? Can you show some details?

Thanks,
Qing


2022-04-27 16:17:24

by Dietmar Eggemann

Subject: Re: [PATCH V2] arm64: add SCHED_CLUSTER's dependency on ACPI

On 27/04/2022 04:18, 王擎 wrote:
>
>>> On Tue, Apr 26, 2022 at 06:52:34AM +0000, 王擎 wrote:
>>>>
>>>>>>>>>> From: Wang Qing <[email protected]>

[...]

>>        .---------------.
>> CPU    |0 1 2 3 4 5 6 7|
>>        +---------------+
>> uarch  |l l l l m m m b| (so called tri-gear: little, medium, big)
>>        +---------------+
>> L2     |   |   | | | | |
>>        +---------------+
>> L3     |<--         -->|
>>        +---------------+
>>        |<-- cluster -->|
>>        +---------------+
>>        |<--   DSU   -->|
>>        '---------------'
>>
>> (and I'm not saying here it does!) then the existing SCHED_CLUSTER
>> should be used in DT as well. Since it provides exactly the same
>> functionality for the task scheduler no matter whether it's setup from
>> ACPI or DT.
>>
>> parse_cluster() -> parse_core() should be changed to be able to decode
>> both id's (package_id and cluster_id) in this case.
>
> Totally agree, but it is not implemented yet. Because cluster_id is currently
> used to describe the package/socket, the modification will involve all DTS files.

You're talking about the fact that the first-level clusterX (1) in cpu_map is
used to set `cpu_topology[cpu].package_id`, right? That's the
information for the DIE layer.
The first-level cluster[0,1,2] span all 8 CPUs and form 3 groups of
CPUs (0-3, 4-6, 7), the latter being the sched groups of the DIE sched domain.
We don't have any socketN since it is a single-socket system. Think
about a system with 2 DSUs where you would have socket[0,1].

Sub-chapter 4 in `Documentation/devicetree/bindings/cpu/cpu-topology.txt`:

cpu-map {
    socket0 {
        cluster0 {        <--- (1)

Sub-chapter 3 says:

  - cluster node

    ... A system can contain several layers of clustering within a
    single physical socket and cluster nodes can be contained in parent
    cluster nodes.

    A cluster node's child nodes must be:
        one or more cluster nodes

This multi-level cluster thing hasn't been coded yet.

parse_cluster()

    ...
    /*
     * First check for child clusters; we currently ignore any
     * information about the nesting of clusters and present the
     * scheduler with a flat list of them.
     */
    ...

[...]

>> I pimped my Hikey 960 to look like one of those Armv9 4-3-1 systems with
>> L2-complexes on the LITTLES and I get:
>
> This system is exactly what I mentioned, but I have a question:
> how did you parse out the cluster_id based on foo0/foo2?
> Because if ACPI is not used, cluster_id is always -1.

I haven't put in the extension to decode a 2-level clusterX cpu-map in
parse_cluster() -> parse_core(). I just did a mock-up for illustration
purposes inside parse_core() for my H960 with a 4-3-1 cpu-map:

cpu-map
    cluster0
        core0
        core1
        core2
        core3
    cluster1
        core0
        core1
        core2
    cluster2
        core0

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..8af40f13fdb5 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -529,6 +529,11 @@ static int __init parse_core(struct device_node *core, int package_id,
 
 			cpu_topology[cpu].package_id = package_id;
 			cpu_topology[cpu].core_id = core_id;
+
+			/* mock-up CLS SD on 4-3-1 Armv9 DSU cluster w/ L2-complexes */
+			if (cpu <= 3)
+				cpu_topology[cpu].cluster_id = cpu / 2;
+
 		} else if (leaf && cpu != -ENODEV) {
 			pr_err("%pOF: Can't get CPU for leaf core\n", core);
 			return -EINVAL;

And on the scheduler side I only had to enable CONFIG_SCHED_CLUSTER and
everything worked just fine; no need for any arm64-specific topology table
or the like.

IMHO, this is what you have to do. Make a 2-level cluster cpu-map:

cpu-map
    cluster0
        cluster0
            core0
            core1
        cluster1
            core2
            core3
    cluster1
        core0
        core1
        core2
    cluster2
        core0

parse-able and set `cpu_topology[cpu].cluster_id` in parse_core().

> What I want to do is to change foo0/foo2 to complex0/complex2 here,
> and then parse it like parse_cluster() -> parse_complex() -> parse_core().

You should read `Documentation/devicetree/bindings/cpu/cpu-topology.txt`
and implement the multi-level cluster approach instead. The big advantage
would be that there won't be any DT-related changes/extensions needed.

[...]

> Yes, that's what I want, but still a little confused, why we use MC to
> describe "cluster" and use CLS describe "complex", can you show some details?

The DT entity `cluster` has nothing to do with the task scheduler domain
name `SCHED_CLUSTER`. The name is actually meaningless and just there for
debug purposes.

2022-04-28 15:52:07

by 王擎

Subject: [PATCH V2] arm64: add SCHED_CLUSTER's dependency on ACPI


>>
>>>> On Tue, Apr 26, 2022 at 06:52:34AM +0000, 王擎 wrote:
>>>>>
>>>>>>>>>>> From: Wang Qing <[email protected]>
>
>[...]
>
>>>        .---------------.
>>> CPU    |0 1 2 3 4 5 6 7|
>>>        +---------------+
>>> uarch  |l l l l m m m b| (so called tri-gear: little, medium, big)
>>>        +---------------+
>>> L2     |   |   | | | | |
>>>        +---------------+
>>> L3     |<--         -->|
>>>        +---------------+
>>>        |<-- cluster -->|
>>>        +---------------+
>>>        |<--   DSU   -->|
>>>        '---------------'
>>>
>>> (and I'm not saying here it does!) then the existing SCHED_CLUSTER
>>> should be used in DT as well. Since it provides exactly the same
>>> functionality for the task scheduler no matter whether it's setup from
>>> ACPI or DT.
>>>
>>> parse_cluster() -> parse_core() should be changed to be able to decode
>>> both id's (package_id and cluster_id) in this case.
>>
>> Totally agree, but it is not implemented yet. Because cluster_id is currently
>> used to describe the package/socket, the modification will involve all DTS files.
>
>You're talking about the fact that the first-level clusterX (1) in cpu_map is
>used to set `cpu_topology[cpu].package_id`, right? That's the
>information for the DIE layer.
>The first-level cluster[0,1,2] span all 8 CPUs and form 3 groups of
>CPUs (0-3, 4-6, 7), the latter being the sched groups of the DIE sched domain.
>We don't have any socketN since it is a single-socket system. Think
>about a system with 2 DSUs where you would have socket[0,1].
>
>Sub-chapter 4 in `Documentation/devicetree/bindings/cpu/cpu-topology.txt`:
>
>cpu-map {
>    socket0 {
>        cluster0 {        <--- (1)
>
>Sub-chapter 3 says:
>
>  - cluster node
>
>    ... A system can contain several layers of clustering within a
>    single physical socket and cluster nodes can be contained in parent
>    cluster nodes.
>
>    A cluster node's child nodes must be:
>        one or more cluster nodes
>
>This multi-level cluster thing hasn't been coded yet.
>
>parse_cluster()
>
>    ...
>    /*
>     * First check for child clusters; we currently ignore any
>     * information about the nesting of clusters and present the
>     * scheduler with a flat list of them.
>     */
>    ...
>
>[...]
>
>>> I pimped my Hikey 960 to look like one of those Armv9 4-3-1 systems with
>>> L2-complexes on the LITTLES and I get:
>>
>> This system is exactly what I mentioned, but I have a question:
>> how did you parse out the cluster_id based on foo0/foo2?
>> Because if ACPI is not used, cluster_id is always -1.
>
>I haven't put in the extension to decode a 2-level clusterX cpu_map in
>parse_cluster() -> parse_core(). I just did a mock-up for illustration
>purpose inside parse_core() for my H960 with a 4-3-1 cpu-map:
>
>cpu-map
>    cluster0
>        core0
>        core1
>        core2
>        core3
>    cluster1
>        core0
>        core1
>        core2
>    cluster2
>        core0
>
>diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>index 1d6636ebaac5..8af40f13fdb5 100644
>--- a/drivers/base/arch_topology.c
>+++ b/drivers/base/arch_topology.c
>@@ -529,6 +529,11 @@ static int __init parse_core(struct device_node *core, int package_id,
> 
> 			cpu_topology[cpu].package_id = package_id;
> 			cpu_topology[cpu].core_id = core_id;
>+
>+			/* mock-up CLS SD on 4-3-1 Armv9 DSU cluster w/ L2-complexes */
>+			if (cpu <= 3)
>+				cpu_topology[cpu].cluster_id = cpu / 2;
>+
> 		} else if (leaf && cpu != -ENODEV) {
> 			pr_err("%pOF: Can't get CPU for leaf core\n", core);
> 			return -EINVAL;
>
>And on the scheduler-side I only had to enable CONFIG_SCHED_CLUSTER and
>everything worked just fine, no need for any arm64-specific topo table
>and alike.
>
>IMHO, this is what you have to do. Make a 2 level cluster cpumap:
>
>cpu-map
>    cluster0
>        cluster0
>            core0
>            core1
>        cluster1
>            core2
>            core3
>    cluster1
>        core0
>        core1
>        core2
>    cluster2
>        core0
>
>parse-able and set `cpu_topology[cpu].cluster_id` in parse_core().

Oh, there are many ways I could implement it in my own system, but my
purpose here is to modify the mainline code so that all Android
developers can use it.

>
>> What I want to do is to change foo0/foo2 to complex0/complex2 here,
>> and then parse it like parse_cluster() -> parse_complex() -> parse_core().
>
>You should read `Documentation/devicetree/bindings/cpu/cpu-topology.txt`
>and implement the multi-level cluster approach instead. Big advantage
>would be that there won't be any DT related changes/extensions needed.
>

Thanks for the advice, I will consider it.
Qing

>[...]
>
>> Yes, that's what I want, but I'm still a little confused: why do we use MC to
>> describe the "cluster" and CLS to describe the "complex"? Can you show some details?
>
>The DT entity `cluster` has nothing to do with the task scheduler domain
>name `SCHED_CLUSTER`. The name is actually meaningless and just there for
>debug purposes.