2022-06-01 19:17:50

by Krzysztof Kozlowski

Subject: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

Add a device node for the CPU-memory BWMON device (bandwidth monitoring)
on SDM845, measuring bandwidth between the CPU (gladiator_noc) and the
Last Level Cache (memnoc). Using this BWMON allows removing the fixed
bandwidth votes from cpufreq (CPU nodes) and thus achieving high memory
throughput even at lower CPU frequencies.

Performance impact (SDM845-MTP RB3 board, linux next-20220422):
1. No noticeable impact when running with the schedutil or performance
governors.

2. When compared to a customized kernel with synced interconnects and
without bandwidth votes from CPU freq, the sysbench memory tests show
a significant improvement with bwmon for block sizes past the L3 cache.
The results of this superficial comparison:

sysbench memory test, results in MB/s (higher is better)
 bs kB |  type |     V | V+no bw votes | bwmon | benefit %
     1 | W/seq | 14795 |          4816 |  4985 |      3.5%
    64 | W/seq | 41987 |         10334 | 10433 |      1.0%
  4096 | W/seq | 29768 |          8728 | 32007 |    266.7%
 65536 | W/seq | 17711 |          4846 | 18399 |    279.6%
262144 | W/seq | 16112 |          4538 | 17429 |    284.1%
    64 | R/seq | 61202 |         67092 | 66804 |     -0.4%
  4096 | R/seq | 23871 |          5458 | 24307 |    345.4%
 65536 | R/seq | 18554 |          4240 | 18685 |    340.7%
262144 | R/seq | 17524 |          4207 | 17774 |    322.4%
    64 | W/rnd |  2663 |          1098 |  1119 |      1.9%
 65536 | W/rnd |   600 |           316 |   610 |     92.7%
    64 | R/rnd |  4915 |          4784 |  4594 |     -4.0%
 65536 | R/rnd |   664 |           281 |   678 |    140.7%

Legend:
bs kB: block size in KB (a small block size means only the L1-L3 caches
are used)
type: R - read, W - write, seq - sequential, rnd - random
V: vanilla (next-20220422)
V + no bw votes: vanilla without bandwidth votes from CPU freq
bwmon: bwmon without bandwidth votes from CPU freq
benefit %: difference between vanilla without bandwidth votes and bwmon
(higher is better)

Co-developed-by: Thara Gopinath <[email protected]>
Signed-off-by: Thara Gopinath <[email protected]>
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
index 83e8b63f0910..adffb9c70566 100644
--- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
@@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
};

+ pmu@1436400 {
+ compatible = "qcom,sdm845-cpu-bwmon";
+ reg = <0 0x01436400 0 0x600>;
+
+ interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
+
+ interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
+ <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
+ interconnect-names = "ddr", "l3c";
+
+ operating-points-v2 = <&cpu_bwmon_opp_table>;
+
+ cpu_bwmon_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ /*
+ * The interconnect paths bandwidths taken from
+ * cpu4_opp_table bandwidth.
+ * They also match different tables from
+ * msm-4.9 downstream kernel:
+ * - the gladiator_noc-mem_noc from bandwidth
+ * table of qcom,llccbw (property qcom,bw-tbl);
+ * bus width: 4 bytes;
+ * - the OSM L3 from bandwidth table of
+ * qcom,cpu4-l3lat-mon (qcom,core-dev-table);
+ * bus width: 16 bytes;
+ */
+ opp-0 {
+ opp-peak-kBps = <800000 4800000>;
+ };
+ opp-1 {
+ opp-peak-kBps = <1804000 9216000>;
+ };
+ opp-2 {
+ opp-peak-kBps = <2188000 11980800>;
+ };
+ opp-3 {
+ opp-peak-kBps = <3072000 15052800>;
+ };
+ opp-4 {
+ opp-peak-kBps = <4068000 19353600>;
+ };
+ opp-5 {
+ opp-peak-kBps = <5412000 20889600>;
+ };
+ opp-6 {
+ opp-peak-kBps = <6220000 22425600>;
+ };
+ opp-7 {
+ opp-peak-kBps = <7216000 25497600>;
+ };
+ };
+ };
+
pcie0: pci@1c00000 {
compatible = "qcom,pcie-sdm845";
reg = <0 0x01c00000 0 0x2000>,
--
2.34.1



2022-06-06 21:45:26

by Georgi Djakov

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 1.06.22 13:11, Krzysztof Kozlowski wrote:
> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
> Cache (memnoc). Usage of this BWMON allows to remove fixed bandwidth
> votes from cpufreq (CPU nodes) thus achieve high memory throughput even
> with lower CPU frequencies.
>
> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
> 1. No noticeable impact when running with schedutil or performance
> governors.
>
> 2. When comparing to customized kernel with synced interconnects and
> without bandwidth votes from CPU freq, the sysbench memory tests
> show significant improvement with bwmon for blocksizes past the L3
> cache. The results for such superficial comparison:
>
> sysbench memory test, results in MB/s (higher is better)
> bs kB | type | V | V+no bw votes | bwmon | benefit %
> 1 | W/seq | 14795 | 4816 | 4985 | 3.5%
> 64 | W/seq | 41987 | 10334 | 10433 | 1.0%
> 4096 | W/seq | 29768 | 8728 | 32007 | 266.7%
> 65536 | W/seq | 17711 | 4846 | 18399 | 279.6%
> 262144 | W/seq | 16112 | 4538 | 17429 | 284.1%
> 64 | R/seq | 61202 | 67092 | 66804 | -0.4%
> 4096 | R/seq | 23871 | 5458 | 24307 | 345.4%
> 65536 | R/seq | 18554 | 4240 | 18685 | 340.7%
> 262144 | R/seq | 17524 | 4207 | 17774 | 322.4%
> 64 | W/rnd | 2663 | 1098 | 1119 | 1.9%
> 65536 | W/rnd | 600 | 316 | 610 | 92.7%
> 64 | R/rnd | 4915 | 4784 | 4594 | -4.0%
> 65536 | R/rnd | 664 | 281 | 678 | 140.7%
>
> Legend:
> bs kB: block size in KB (small block size means only L1-3 caches are
> used
> type: R - read, W - write, seq - sequential, rnd - random
> V: vanilla (next-20220422)
> V + no bw votes: vanilla without bandwidth votes from CPU freq
> bwmon: bwmon without bandwidth votes from CPU freq
> benefit %: difference between vanilla without bandwidth votes and bwmon
> (higher is better)
>

Ok, now I see! So bwmon shows similar performance compared to the current
cpufreq-based bandwidth scaling. And if you add bwmon on top of vanilla, are
the results close/the same? Is the plan to remove the cpufreq-based bandwidth
scaling and switch to bwmon? It might improve the power consumption in some
scenarios.

Thanks,
Georgi

> Co-developed-by: Thara Gopinath <[email protected]>
> Signed-off-by: Thara Gopinath <[email protected]>
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
> ---
> arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
> 1 file changed, 54 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> index 83e8b63f0910..adffb9c70566 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
> };
>
> + pmu@1436400 {
> + compatible = "qcom,sdm845-cpu-bwmon";
> + reg = <0 0x01436400 0 0x600>;
> +
> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
> +
> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
> + interconnect-names = "ddr", "l3c";
> +
> + operating-points-v2 = <&cpu_bwmon_opp_table>;
> +
> + cpu_bwmon_opp_table: opp-table {
> + compatible = "operating-points-v2";
> +
> + /*
> + * The interconnect paths bandwidths taken from
> + * cpu4_opp_table bandwidth.
> + * They also match different tables from
> + * msm-4.9 downstream kernel:
> + * - the gladiator_noc-mem_noc from bandwidth
> + * table of qcom,llccbw (property qcom,bw-tbl);
> + * bus width: 4 bytes;
> + * - the OSM L3 from bandwidth table of
> + * qcom,cpu4-l3lat-mon (qcom,core-dev-table);
> + * bus width: 16 bytes;
> + */
> + opp-0 {
> + opp-peak-kBps = <800000 4800000>;
> + };
> + opp-1 {
> + opp-peak-kBps = <1804000 9216000>;
> + };
> + opp-2 {
> + opp-peak-kBps = <2188000 11980800>;
> + };
> + opp-3 {
> + opp-peak-kBps = <3072000 15052800>;
> + };
> + opp-4 {
> + opp-peak-kBps = <4068000 19353600>;
> + };
> + opp-5 {
> + opp-peak-kBps = <5412000 20889600>;
> + };
> + opp-6 {
> + opp-peak-kBps = <6220000 22425600>;
> + };
> + opp-7 {
> + opp-peak-kBps = <7216000 25497600>;
> + };
> + };
> + };
> +
> pcie0: pci@1c00000 {
> compatible = "qcom,pcie-sdm845";
> reg = <0 0x01c00000 0 0x2000>,

2022-06-08 04:34:03

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 06/06/2022 22:39, Georgi Djakov wrote:
> On 1.06.22 13:11, Krzysztof Kozlowski wrote:
>> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
>> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
>> Cache (memnoc). Usage of this BWMON allows to remove fixed bandwidth
>> votes from cpufreq (CPU nodes) thus achieve high memory throughput even
>> with lower CPU frequencies.
>>
>> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
>> 1. No noticeable impact when running with schedutil or performance
>> governors.
>>
>> 2. When comparing to customized kernel with synced interconnects and
>> without bandwidth votes from CPU freq, the sysbench memory tests
>> show significant improvement with bwmon for blocksizes past the L3
>> cache. The results for such superficial comparison:
>>
>> sysbench memory test, results in MB/s (higher is better)
>> bs kB | type | V | V+no bw votes | bwmon | benefit %
>> 1 | W/seq | 14795 | 4816 | 4985 | 3.5%
>> 64 | W/seq | 41987 | 10334 | 10433 | 1.0%
>> 4096 | W/seq | 29768 | 8728 | 32007 | 266.7%
>> 65536 | W/seq | 17711 | 4846 | 18399 | 279.6%
>> 262144 | W/seq | 16112 | 4538 | 17429 | 284.1%
>> 64 | R/seq | 61202 | 67092 | 66804 | -0.4%
>> 4096 | R/seq | 23871 | 5458 | 24307 | 345.4%
>> 65536 | R/seq | 18554 | 4240 | 18685 | 340.7%
>> 262144 | R/seq | 17524 | 4207 | 17774 | 322.4%
>> 64 | W/rnd | 2663 | 1098 | 1119 | 1.9%
>> 65536 | W/rnd | 600 | 316 | 610 | 92.7%
>> 64 | R/rnd | 4915 | 4784 | 4594 | -4.0%
>> 65536 | R/rnd | 664 | 281 | 678 | 140.7%
>>
>> Legend:
>> bs kB: block size in KB (small block size means only L1-3 caches are
>> used
>> type: R - read, W - write, seq - sequential, rnd - random
>> V: vanilla (next-20220422)
>> V + no bw votes: vanilla without bandwidth votes from CPU freq
>> bwmon: bwmon without bandwidth votes from CPU freq
>> benefit %: difference between vanilla without bandwidth votes and bwmon
>> (higher is better)
>>
>
> Ok, now i see! So bwmon shows similar performance compared with the current
> cpufreq-based bandwidth scaling. And if you add bwmon on top of vanilla, are
> the results close/same?

Vanilla + bwmon results in almost no difference.

> Is the plan to remove the cpufreq based bandwidth
> scaling and switch to bwmon? It might improve the power consumption in some
> scenarios.

The next plan would be to implement the second bwmon, the one between the
CPU and the caches. With both of them, the cpufreq bandwidth votes can be
removed (I think Android might be interested in this).
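
To make the "removing votes" part concrete: roughly, it would mean dropping
the opp-peak-kBps properties from the CPU OPP tables, something like the
sketch below (the OPP node name and frequency are made up for illustration,
not exact sdm845 values):

	/* today: cpufreq OPP entry carrying fixed DDR + L3 bandwidth votes */
	opp-1056000000 {
		opp-hz = /bits/ 64 <1056000000>;
		opp-peak-kBps = <1804000 9216000>;
	};

	/* with bwmon in place: frequency only, no fixed bandwidth vote */
	opp-1056000000 {
		opp-hz = /bits/ 64 <1056000000>;
	};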


Best regards,
Krzysztof

2022-06-22 12:16:50

by Rajendra Nayak

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON


On 6/1/2022 3:41 PM, Krzysztof Kozlowski wrote:
> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
> Cache (memnoc). Usage of this BWMON allows to remove fixed bandwidth
> votes from cpufreq (CPU nodes) thus achieve high memory throughput even
> with lower CPU frequencies.
>
> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
> 1. No noticeable impact when running with schedutil or performance
> governors.
>
> 2. When comparing to customized kernel with synced interconnects and
> without bandwidth votes from CPU freq, the sysbench memory tests
> show significant improvement with bwmon for blocksizes past the L3
> cache. The results for such superficial comparison:
>
> sysbench memory test, results in MB/s (higher is better)
> bs kB | type | V | V+no bw votes | bwmon | benefit %
> 1 | W/seq | 14795 | 4816 | 4985 | 3.5%
> 64 | W/seq | 41987 | 10334 | 10433 | 1.0%
> 4096 | W/seq | 29768 | 8728 | 32007 | 266.7%
> 65536 | W/seq | 17711 | 4846 | 18399 | 279.6%
> 262144 | W/seq | 16112 | 4538 | 17429 | 284.1%
> 64 | R/seq | 61202 | 67092 | 66804 | -0.4%
> 4096 | R/seq | 23871 | 5458 | 24307 | 345.4%
> 65536 | R/seq | 18554 | 4240 | 18685 | 340.7%
> 262144 | R/seq | 17524 | 4207 | 17774 | 322.4%
> 64 | W/rnd | 2663 | 1098 | 1119 | 1.9%
> 65536 | W/rnd | 600 | 316 | 610 | 92.7%
> 64 | R/rnd | 4915 | 4784 | 4594 | -4.0%
> 65536 | R/rnd | 664 | 281 | 678 | 140.7%
>
> Legend:
> bs kB: block size in KB (small block size means only L1-3 caches are
> used
> type: R - read, W - write, seq - sequential, rnd - random
> V: vanilla (next-20220422)
> V + no bw votes: vanilla without bandwidth votes from CPU freq
> bwmon: bwmon without bandwidth votes from CPU freq
> benefit %: difference between vanilla without bandwidth votes and bwmon
> (higher is better)
>
> Co-developed-by: Thara Gopinath <[email protected]>
> Signed-off-by: Thara Gopinath <[email protected]>
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
> ---
> arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
> 1 file changed, 54 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> index 83e8b63f0910..adffb9c70566 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
> };
>
> + pmu@1436400 {
> + compatible = "qcom,sdm845-cpu-bwmon";
> + reg = <0 0x01436400 0 0x600>;
> +
> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
> +
> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
> + interconnect-names = "ddr", "l3c";

Is this the pmu/bwmon instance between the CPU and the caches, or the one between the caches and DDR?
Depending on which one it is, shouldn't we just be scaling either one and not both of the interconnect paths?

> +
> + operating-points-v2 = <&cpu_bwmon_opp_table>;
> +
> + cpu_bwmon_opp_table: opp-table {
> + compatible = "operating-points-v2";
> +
> + /*
> + * The interconnect paths bandwidths taken from
> + * cpu4_opp_table bandwidth.
> + * They also match different tables from
> + * msm-4.9 downstream kernel:
> + * - the gladiator_noc-mem_noc from bandwidth
> + * table of qcom,llccbw (property qcom,bw-tbl);
> + * bus width: 4 bytes;
> + * - the OSM L3 from bandwidth table of
> + * qcom,cpu4-l3lat-mon (qcom,core-dev-table);
> + * bus width: 16 bytes;
> + */
> + opp-0 {
> + opp-peak-kBps = <800000 4800000>;
> + };
> + opp-1 {
> + opp-peak-kBps = <1804000 9216000>;
> + };
> + opp-2 {
> + opp-peak-kBps = <2188000 11980800>;
> + };
> + opp-3 {
> + opp-peak-kBps = <3072000 15052800>;
> + };
> + opp-4 {
> + opp-peak-kBps = <4068000 19353600>;
> + };
> + opp-5 {
> + opp-peak-kBps = <5412000 20889600>;
> + };
> + opp-6 {
> + opp-peak-kBps = <6220000 22425600>;
> + };
> + opp-7 {
> + opp-peak-kBps = <7216000 25497600>;
> + };
> + };
> + };
> +
> pcie0: pci@1c00000 {
> compatible = "qcom,pcie-sdm845";
> reg = <0 0x01c00000 0 0x2000>,

2022-06-22 14:02:46

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 22/06/2022 13:46, Rajendra Nayak wrote:
>
> On 6/1/2022 3:41 PM, Krzysztof Kozlowski wrote:
>> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
>> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
>> Cache (memnoc). Usage of this BWMON allows to remove fixed bandwidth
>> votes from cpufreq (CPU nodes) thus achieve high memory throughput even
>> with lower CPU frequencies.
>>
>> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
>> 1. No noticeable impact when running with schedutil or performance
>> governors.
>>
>> 2. When comparing to customized kernel with synced interconnects and
>> without bandwidth votes from CPU freq, the sysbench memory tests
>> show significant improvement with bwmon for blocksizes past the L3
>> cache. The results for such superficial comparison:
>>
>> sysbench memory test, results in MB/s (higher is better)
>> bs kB | type | V | V+no bw votes | bwmon | benefit %
>> 1 | W/seq | 14795 | 4816 | 4985 | 3.5%
>> 64 | W/seq | 41987 | 10334 | 10433 | 1.0%
>> 4096 | W/seq | 29768 | 8728 | 32007 | 266.7%
>> 65536 | W/seq | 17711 | 4846 | 18399 | 279.6%
>> 262144 | W/seq | 16112 | 4538 | 17429 | 284.1%
>> 64 | R/seq | 61202 | 67092 | 66804 | -0.4%
>> 4096 | R/seq | 23871 | 5458 | 24307 | 345.4%
>> 65536 | R/seq | 18554 | 4240 | 18685 | 340.7%
>> 262144 | R/seq | 17524 | 4207 | 17774 | 322.4%
>> 64 | W/rnd | 2663 | 1098 | 1119 | 1.9%
>> 65536 | W/rnd | 600 | 316 | 610 | 92.7%
>> 64 | R/rnd | 4915 | 4784 | 4594 | -4.0%
>> 65536 | R/rnd | 664 | 281 | 678 | 140.7%
>>
>> Legend:
>> bs kB: block size in KB (small block size means only L1-3 caches are
>> used
>> type: R - read, W - write, seq - sequential, rnd - random
>> V: vanilla (next-20220422)
>> V + no bw votes: vanilla without bandwidth votes from CPU freq
>> bwmon: bwmon without bandwidth votes from CPU freq
>> benefit %: difference between vanilla without bandwidth votes and bwmon
>> (higher is better)
>>
>> Co-developed-by: Thara Gopinath <[email protected]>
>> Signed-off-by: Thara Gopinath <[email protected]>
>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>> ---
>> arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
>> 1 file changed, 54 insertions(+)
>>
>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> index 83e8b63f0910..adffb9c70566 100644
>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>> };
>>
>> + pmu@1436400 {
>> + compatible = "qcom,sdm845-cpu-bwmon";
>> + reg = <0 0x01436400 0 0x600>;
>> +
>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>> +
>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>> + interconnect-names = "ddr", "l3c";
>
> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?

To my understanding this is the one between CPU and caches.

> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?

The interconnects are the same as the ones used for the CPU nodes, therefore
if we want to scale both when scaling the CPU, then we also want to scale
both when seeing traffic between the CPU and the cache.

Maybe that assumption is not correct, so basically having the two
interconnects in the CPU nodes is also not proper?
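
For reference, this is roughly what the CPU nodes in sdm845.dtsi already
carry (trimmed sketch, other properties omitted):

	CPU4: cpu@400 {
		/* ... */
		operating-points-v2 = <&cpu4_opp_table>;
		interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
				<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
	};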


Best regards,
Krzysztof

2022-06-23 07:05:11

by Rajendra Nayak

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON


On 6/22/2022 7:22 PM, Krzysztof Kozlowski wrote:
> On 22/06/2022 13:46, Rajendra Nayak wrote:
>>
>> On 6/1/2022 3:41 PM, Krzysztof Kozlowski wrote:
>>> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
>>> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
>>> Cache (memnoc). Usage of this BWMON allows to remove fixed bandwidth
>>> votes from cpufreq (CPU nodes) thus achieve high memory throughput even
>>> with lower CPU frequencies.
>>>
>>> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
>>> 1. No noticeable impact when running with schedutil or performance
>>> governors.
>>>
>>> 2. When comparing to customized kernel with synced interconnects and
>>> without bandwidth votes from CPU freq, the sysbench memory tests
>>> show significant improvement with bwmon for blocksizes past the L3
>>> cache. The results for such superficial comparison:
>>>
>>> sysbench memory test, results in MB/s (higher is better)
>>> bs kB | type | V | V+no bw votes | bwmon | benefit %
>>> 1 | W/seq | 14795 | 4816 | 4985 | 3.5%
>>> 64 | W/seq | 41987 | 10334 | 10433 | 1.0%
>>> 4096 | W/seq | 29768 | 8728 | 32007 | 266.7%
>>> 65536 | W/seq | 17711 | 4846 | 18399 | 279.6%
>>> 262144 | W/seq | 16112 | 4538 | 17429 | 284.1%
>>> 64 | R/seq | 61202 | 67092 | 66804 | -0.4%
>>> 4096 | R/seq | 23871 | 5458 | 24307 | 345.4%
>>> 65536 | R/seq | 18554 | 4240 | 18685 | 340.7%
>>> 262144 | R/seq | 17524 | 4207 | 17774 | 322.4%
>>> 64 | W/rnd | 2663 | 1098 | 1119 | 1.9%
>>> 65536 | W/rnd | 600 | 316 | 610 | 92.7%
>>> 64 | R/rnd | 4915 | 4784 | 4594 | -4.0%
>>> 65536 | R/rnd | 664 | 281 | 678 | 140.7%
>>>
>>> Legend:
>>> bs kB: block size in KB (small block size means only L1-3 caches are
>>> used
>>> type: R - read, W - write, seq - sequential, rnd - random
>>> V: vanilla (next-20220422)
>>> V + no bw votes: vanilla without bandwidth votes from CPU freq
>>> bwmon: bwmon without bandwidth votes from CPU freq
>>> benefit %: difference between vanilla without bandwidth votes and bwmon
>>> (higher is better)
>>>
>>> Co-developed-by: Thara Gopinath <[email protected]>
>>> Signed-off-by: Thara Gopinath <[email protected]>
>>> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>>> ---
>>> arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
>>> 1 file changed, 54 insertions(+)
>>>
>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>> index 83e8b63f0910..adffb9c70566 100644
>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>> };
>>>
>>> + pmu@1436400 {
>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>> + reg = <0 0x01436400 0 0x600>;
>>> +
>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>> +
>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>> + interconnect-names = "ddr", "l3c";
>>
>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>
> To my understanding this is the one between CPU and caches.

Ok, but then because the OPP table lists the DDR bandwidth first and the cache bandwidth second, isn't the driver
ending up comparing the bandwidth values reported by the PMU against the DDR bandwidth instead of the cache bandwidth?
At least in my testing on sc7280 I found this to mess things up and I always ended up at
higher OPPs even while the system was completely idle. Comparing the values against the cache bandwidth
fixed it. (sc7280 also has a bwmon4 instance between the CPU and the caches and a bwmon5 between the cache
and DDR.)

>
>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>
> The interconnects are the same as ones used for CPU nodes, therefore if
> we want to scale both when scaling CPU, then we also want to scale both
> when seeing traffic between CPU and cache.

Well, they were both associated with the CPU node because, with no other input to decide on _when_
to scale the caches and DDR, we just put in a mapping table which simply mapped a CPU freq to an L3 _and_ a
DDR freq. So with just one input (the CPU freq) we decided what both the L3 freq and the DDR freq should be.

Now with two PMUs we have two inputs, so we can individually scale the L3 based on the cache PMU
counters and DDR based on the DDR PMU counters, no?

Since you said you have plans to add support for the other PMU as well (bwmon5 between the cache and DDR),
how else would you have the OPP table associated with that PMU instance? Would you again have both the
L3 and DDR scale based on the inputs from that bwmon too?

>
> Maybe the assumption here is not correct, so basically the two
> interconnects in CPU nodes are also not proper?
>
>
> Best regards,
> Krzysztof

2022-06-23 13:08:56

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>> index 83e8b63f0910..adffb9c70566 100644
>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>> };
>>>>
>>>> + pmu@1436400 {
>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>> + reg = <0 0x01436400 0 0x600>;
>>>> +
>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>> +
>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>> + interconnect-names = "ddr", "l3c";
>>>
>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>
>> To my understanding this is the one between CPU and caches.
>
> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?

I double checked now and you're right.

> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
> and DDR)

In my case it exposes a different issue - underperformance. Somehow
bwmon does not report bandwidth high enough to vote for high bandwidth.

After removing the DDR interconnect and its bandwidth OPP values, I get
the following for:
sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
--memory-block-size=4M run

1. Vanilla: 29768 MB/s
2. Vanilla without CPU votes: 8728 MB/s
3. Previous bwmon (voting too high): 32007 MB/s
4. Fixed bwmon: 24911 MB/s
Bwmon does not vote for the maximum L3 speed:
bwmon reports 9408 MB/s (thresholds set: <9216000 15052801>)
osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps

Maybe that's just a problem of a missing governor which would vote for
bandwidth while rounding up or anticipating higher needs.

>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>
>> The interconnects are the same as ones used for CPU nodes, therefore if
>> we want to scale both when scaling CPU, then we also want to scale both
>> when seeing traffic between CPU and cache.
>
> Well, they were both associated with the CPU node because with no other input to decide on _when_
> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>
> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
> counters and DDR based on the DDR PMU counters, no?
>
> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
> how else would you have the OPP table associated with that pmu instance? Would you again have both the
> L3 and DDR scale based on the inputs from that bwmon too?

Good point, thanks for sharing. I think you're right. I'll keep only the
l3c interconnect path.
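
That is, something along these lines (untested sketch, keeping only the
l3c path and the corresponding single-value OPP entries from the patch
above):

	pmu@1436400 {
		compatible = "qcom,sdm845-cpu-bwmon";
		reg = <0 0x01436400 0 0x600>;
		interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;

		interconnects = <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
		interconnect-names = "l3c";

		operating-points-v2 = <&cpu_bwmon_opp_table>;

		cpu_bwmon_opp_table: opp-table {
			compatible = "operating-points-v2";

			/* L3 bandwidth only, same values as in the patch */
			opp-0 {
				opp-peak-kBps = <4800000>;
			};
			/* ... opp-1 to opp-6 trimmed ... */
			opp-7 {
				opp-peak-kBps = <25497600>;
			};
		};
	};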


Best regards,
Krzysztof

2022-06-26 03:43:15

by Bjorn Andersson

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:

> On 23/06/2022 08:48, Rajendra Nayak wrote:
> >>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >>>> index 83e8b63f0910..adffb9c70566 100644
> >>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
> >>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
> >>>> };
> >>>>
> >>>> + pmu@1436400 {
> >>>> + compatible = "qcom,sdm845-cpu-bwmon";
> >>>> + reg = <0 0x01436400 0 0x600>;
> >>>> +
> >>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
> >>>> +
> >>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
> >>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
> >>>> + interconnect-names = "ddr", "l3c";
> >>>
> >>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
> >>
> >> To my understanding this is the one between CPU and caches.
> >
> > Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
> > ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>
> I double checked now and you're right.
>
> > Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
> > higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
> > fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
> > and DDR)
>
> In my case it exposes different issue - under performance. Somehow the
> bwmon does not report bandwidth high enough to vote for high bandwidth.
>
> After removing the DDR interconnect and bandwidth OPP values I have for:
> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
> --memory-block-size=4M run
>
> 1. Vanilla: 29768 MB/s
> 2. Vanilla without CPU votes: 8728 MB/s
> 3. Previous bwmon (voting too high): 32007 MB/s
> 4. Fixed bwmon 24911 MB/s
> Bwmon does not vote for maximum L3 speed:
> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
> )
> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>
> Maybe that's just problem with missing governor which would vote for
> bandwidth rounding up or anticipating higher needs.
>
> >>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
> >>
> >> The interconnects are the same as ones used for CPU nodes, therefore if
> >> we want to scale both when scaling CPU, then we also want to scale both
> >> when seeing traffic between CPU and cache.
> >
> > Well, they were both associated with the CPU node because with no other input to decide on _when_
> > to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
> > DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
> >
> > Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
> > counters and DDR based on the DDR PMU counters, no?
> >
> > Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
> > how else would you have the OPP table associated with that pmu instance? Would you again have both the
> > L3 and DDR scale based on the inputs from that bwmon too?
>
> Good point, thanks for sharing. I think you're right. I'll keep only the
> l3c interconnect path.
>

If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
subsystem. As such, traffic hitting this cache will not show up in either
bwmon instance.

The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
affects the DDR frequency. So the traffic measured by the cpu-bwmon
would be the CPU subsystem's traffic that misses the L1/L2/L3 caches and
hits the memory bus towards DDR.


If this is the case, it seems to make sense to keep the L3 scaling in the
opp-tables for the CPU and make bwmon only scale the DDR path. What do
you think?
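
A minimal sketch of what that would mean for the node from the patch
(untested, just to illustrate; the rest of the node stays as posted):

	interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>;
	interconnect-names = "ddr";

	/* ...and the OPP entries keep only the first (DDR) value, e.g. */
	opp-0 {
		opp-peak-kBps = <800000>;
	};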

Regards,
Bjorn

2022-06-27 12:59:40

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 26/06/2022 05:28, Bjorn Andersson wrote:
> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>
>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>> };
>>>>>>
>>>>>> + pmu@1436400 {
>>>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>>>> + reg = <0 0x01436400 0 0x600>;
>>>>>> +
>>>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>> +
>>>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>> + interconnect-names = "ddr", "l3c";
>>>>>
>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>
>>>> To my understanding this is the one between CPU and caches.
>>>
>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>
>> I double checked now and you're right.
>>
>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>> and DDR)
>>
>> In my case it exposes different issue - under performance. Somehow the
>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>
>> After removing the DDR interconnect and bandwidth OPP values I have for:
>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>> --memory-block-size=4M run
>>
>> 1. Vanilla: 29768 MB/s
>> 2. Vanilla without CPU votes: 8728 MB/s
>> 3. Previous bwmon (voting too high): 32007 MB/s
>> 4. Fixed bwmon 24911 MB/s
>> Bwmon does not vote for maximum L3 speed:
>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>> )
>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>
>> Maybe that's just problem with missing governor which would vote for
>> bandwidth rounding up or anticipating higher needs.
>>
>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>
>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>> when seeing traffic between CPU and cache.
>>>
>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>
>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>> counters and DDR based on the DDR PMU counters, no?
>>>
>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>> L3 and DDR scale based on the inputs from that bwmon too?
>>
>> Good point, thanks for sharing. I think you're right. I'll keep only the
>> l3c interconnect path.
>>
>
> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
> subsystem. As such traffic hitting this cache will not show up in either
> bwmon instance.
>
> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
> affects the DDR frequency. So the traffic measured by the cpu-bwmon
> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
> hits the memory bus towards DDR.
>
>
> If this is the case it seems to make sense to keep the L3 scaling in the
> opp-tables for the CPU and make bwmon only scale the DDR path. What do
> you think?

The data throughput reported by this bwmon instance is beyond the DDR
OPP table bandwidth, e.g. 16-22 GB/s, so it seems it still measures
within the cache controller, not on the memory bus.

Best regards,
Krzysztof

2022-06-28 10:44:38

by Rajendra Nayak

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON


On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
> On 26/06/2022 05:28, Bjorn Andersson wrote:
>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>
>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>> };
>>>>>>>
>>>>>>> + pmu@1436400 {
>>>>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>> + reg = <0 0x01436400 0 0x600>;
>>>>>>> +
>>>>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>> +
>>>>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>> + interconnect-names = "ddr", "l3c";
>>>>>>
>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>
>>>>> To my understanding this is the one between CPU and caches.
>>>>
>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>
>>> I double checked now and you're right.
>>>
>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>> and DDR)
>>>
>>> In my case it exposes different issue - under performance. Somehow the
>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>
>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>> --memory-block-size=4M run
>>>
>>> 1. Vanilla: 29768 MB/s
>>> 2. Vanilla without CPU votes: 8728 MB/s
>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>> 4. Fixed bwmon 24911 MB/s
>>> Bwmon does not vote for maximum L3 speed:
>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>> )
>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>
>>> Maybe that's just problem with missing governor which would vote for
>>> bandwidth rounding up or anticipating higher needs.
>>>
>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>
>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>> when seeing traffic between CPU and cache.
>>>>
>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>
>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>> counters and DDR based on the DDR PMU counters, no?
>>>>
>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>
>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>> l3c interconnect path.
>>>
>>
>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>> subsystem. As such traffic hitting this cache will not show up in either
>> bwmon instance.
>>
>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>> hits the memory bus towards DDR.

That seems right. Looking some more into the downstream code and register definitions,
I see the two bwmon instances actually lie on the path outside the CPU SS towards DDR:
the first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
(bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
scale the CPU-LLCC path (and not L3); on sc7280 that would mean splitting
<&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
<&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs),
and similarly for sdm845 too.

L3 should perhaps still be voted based on the CPU freq, as is done today.
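
For sc7280 that split could look roughly like the sketch below (only the
paths named above; the node addresses, remaining properties and the
LLCC/DDR OPP values are left out here since I don't have them at hand):

	/* bwmon4: CPU <-> LLCC traffic, votes on the CPU-LLCC path */
	pmu@... {
		interconnects = <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3>;
		/* ... */
	};

	/* bwmon5: LLCC <-> DDR traffic, votes on the LLCC-DDR path */
	pmu@... {
		interconnects = <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3>;
		/* ... */
	};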

>> If this is the case it seems to make sense to keep the L3 scaling in the
>> opp-tables for the CPU and make bwmon only scale the DDR path. What do
>> you think?
>
> The reported data throughput by this bwmon instance is beyond the DDR
> OPP table bandwidth, e.g.: 16-22 GB/s, so it seems it measures still
> within cache controller, not the memory bus.
>
> Best regards,
> Krzysztof

2022-06-28 11:04:13

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 28/06/2022 12:36, Rajendra Nayak wrote:
>
> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>
>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>> };
>>>>>>>>
>>>>>>>> + pmu@1436400 {
>>>>>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>> + reg = <0 0x01436400 0 0x600>;
>>>>>>>> +
>>>>>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>> +
>>>>>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>> + interconnect-names = "ddr", "l3c";
>>>>>>>
>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>
>>>>>> To my understanding this is the one between CPU and caches.
>>>>>
>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>
>>>> I double checked now and you're right.
>>>>
>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>> and DDR)
>>>>
>>>> In my case it exposes different issue - under performance. Somehow the
>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>
>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>> --memory-block-size=4M run
>>>>
>>>> 1. Vanilla: 29768 MB/s
>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>> 4. Fixed bwmon 24911 MB/s
>>>> Bwmon does not vote for maximum L3 speed:
>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>> )
>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>
>>>> Maybe that's just problem with missing governor which would vote for
>>>> bandwidth rounding up or anticipating higher needs.
>>>>
>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>
>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>> when seeing traffic between CPU and cache.
>>>>>
>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>
>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>
>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>
>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>> l3c interconnect path.
>>>>
>>>
>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>> subsystem. As such traffic hitting this cache will not show up in either
>>> bwmon instance.
>>>
>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>> hits the memory bus towards DDR.
>
> That seems right, looking some more into the downstream code and register definitions
> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
> and similar for sdm845 too.
>
> L3 should perhaps still be voted based on the cpu freq as done today.

This would mean that the original bandwidth values (800 - 7216 MB/s) were
correct. However, we still have your observation that bwmon kicks in very
fast, and my measurements of the sampled bwmon data show a bandwidth of
~20000 MB/s.


Best regards,
Krzysztof

2022-06-28 13:18:18

by Rajendra Nayak

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON



On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>
>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>
>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> + pmu@1436400 {
>>>>>>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>> + reg = <0 0x01436400 0 0x600>;
>>>>>>>>> +
>>>>>>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>> +
>>>>>>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>> + interconnect-names = "ddr", "l3c";
>>>>>>>>
>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>
>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>
>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>
>>>>> I double checked now and you're right.
>>>>>
>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>> and DDR)
>>>>>
>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>
>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>> --memory-block-size=4M run
>>>>>
>>>>> 1. Vanilla: 29768 MB/s
>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>> 4. Fixed bwmon 24911 MB/s
>>>>> Bwmon does not vote for maximum L3 speed:
>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>> )
>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>
>>>>> Maybe that's just problem with missing governor which would vote for
>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>
>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>
>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>> when seeing traffic between CPU and cache.
>>>>>>
>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>
>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>
>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>
>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>> l3c interconnect path.
>>>>>
>>>>
>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>> bwmon instance.
>>>>
>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>> hits the memory bus towards DDR.
>>
>> That seems right, looking some more into the downstream code and register definitions
>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>> and similar for sdm845 too.
>>
>> L3 should perhaps still be voted based on the cpu freq as done today.
>
> This would mean that original bandwidth values (800 - 7216 MB/s) were
> correct. However we have still your observation that bwmon kicks in very
> fast and my measurements that sampled bwmon data shows bandwidth ~20000
> MB/s.

Right, that's because the bandwidth supported on the cpu<->llcc path is much higher
than what the DDR frequencies provide. For instance on sc7280, I see 2288 - 15258 MB/s for LLCC while
the DDR max is 8532 MB/s.

>
>
> Best regards,
> Krzysztof

2022-06-28 14:32:24

by Krzysztof Kozlowski

Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 28/06/2022 15:15, Rajendra Nayak wrote:
>
>
> On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
>> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>>
>>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>>
>>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> + pmu@1436400 {
>>>>>>>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>>> + reg = <0 0x01436400 0 0x600>;
>>>>>>>>>> +
>>>>>>>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>> +
>>>>>>>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>>> + interconnect-names = "ddr", "l3c";
>>>>>>>>>
>>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>>
>>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>>
>>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>>
>>>>>> I double checked now and you're right.
>>>>>>
>>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>>> and DDR)
>>>>>>
>>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>>
>>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>>> --memory-block-size=4M run
>>>>>>
>>>>>> 1. Vanilla: 29768 MB/s
>>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>>> 4. Fixed bwmon 24911 MB/s
>>>>>> Bwmon does not vote for maximum L3 speed:
>>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>>> )
>>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>>
>>>>>> Maybe that's just problem with missing governor which would vote for
>>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>>
>>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>>
>>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>>> when seeing traffic between CPU and cache.
>>>>>>>
>>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>>
>>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>>
>>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>>
>>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>>> l3c interconnect path.
>>>>>>
>>>>>
>>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>>> bwmon instance.
>>>>>
>>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>>> hits the memory bus towards DDR.
>>>
>>> That seems right, looking some more into the downstream code and register definitions
>>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)

For sdm845, SLAVE_LLCC is in mem_noc, so I guess it would be mc_virt on sc7280?

>>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>>> and similar for sdm845 too.
>>>
>>> L3 should perhaps still be voted based on the cpu freq as done today.
>>
>> This would mean that original bandwidth values (800 - 7216 MB/s) were
>> correct. However we have still your observation that bwmon kicks in very
>> fast and my measurements that sampled bwmon data shows bandwidth ~20000
>> MB/s.
>
> Right, thats because the bandwidth supported between the cpu<->llcc path is much higher
> than the DDR frequencies. For instance on sc7280, I see (2288 - 15258 MB/s) for LLCC while
> the DDR max is 8532 MB/s.

OK, that sounds right.

Another point is that I did not observe any actual scaling of throughput
via that interconnect path:
<&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>

so I cannot test the impact of bwmon that way.
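
For reference, a rough sketch of how the split discussed above might look
on sdm845 - assuming MASTER_LLCC and SLAVE_EBI1 also resolve through the
mem_noc provider and keeping the existing tag value of 3 (both to be
confirmed against the interconnect driver) - would be one path per bwmon
instance:

	/* cpu-bwmon (bwmon4): CPU <-> LLCC traffic */
	interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>;

	/* llcc-bwmon (bwmon5): LLCC <-> DDR traffic */
	interconnects = <&mem_noc MASTER_LLCC 3 &mem_noc SLAVE_EBI1 3>;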

Best regards,
Krzysztof

2022-06-28 15:25:51

by Rajendra Nayak

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON



On 6/28/2022 7:32 PM, Krzysztof Kozlowski wrote:
> On 28/06/2022 15:15, Rajendra Nayak wrote:
>>
>>
>> On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
>>> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>>>
>>>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>>>
>>>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> + pmu@1436400 {
>>>>>>>>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>>>> + reg = <0 0x01436400 0 0x600>;
>>>>>>>>>>> +
>>>>>>>>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>> +
>>>>>>>>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>>>> + interconnect-names = "ddr", "l3c";
>>>>>>>>>>
>>>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>>>
>>>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>>>
>>>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>>>
>>>>>>> I double checked now and you're right.
>>>>>>>
>>>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>>>> and DDR)
>>>>>>>
>>>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>>>
>>>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>>>> --memory-block-size=4M run
>>>>>>>
>>>>>>> 1. Vanilla: 29768 MB/s
>>>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>>>> 4. Fixed bwmon 24911 MB/s
>>>>>>> Bwmon does not vote for maximum L3 speed:
>>>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>>>> )
>>>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>>>
>>>>>>> Maybe that's just problem with missing governor which would vote for
>>>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>>>
>>>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>>>
>>>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>>>> when seeing traffic between CPU and cache.
>>>>>>>>
>>>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>>>
>>>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>>>
>>>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>>>
>>>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>>>> l3c interconnect path.
>>>>>>>
>>>>>>
>>>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>>>> bwmon instance.
>>>>>>
>>>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>>>> hits the memory bus towards DDR.
>>>>
>>>> That seems right, looking some more into the downstream code and register definitions
>>>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>>>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>>>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>>>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>>>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>>>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
>
> For sdm845, SLAVE_LLCC is in mem_noc, so I guess it would be mc_virt on sc7280?

that's correct,

>
>>>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>>>> and similar for sdm845 too.
>>>>
>>>> L3 should perhaps still be voted based on the cpu freq as done today.
>>>
>>> This would mean that original bandwidth values (800 - 7216 MB/s) were
>>> correct. However we have still your observation that bwmon kicks in very
>>> fast and my measurements that sampled bwmon data shows bandwidth ~20000
>>> MB/s.
>>
>> Right, thats because the bandwidth supported between the cpu<->llcc path is much higher
>> than the DDR frequencies. For instance on sc7280, I see (2288 - 15258 MB/s) for LLCC while
>> the DDR max is 8532 MB/s.
>
> OK, that sounds right.
>
> Another point is that I did not observe any actual scaling of throughput
> via that interconnect path:
> <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>

Shouldn't this be <&gladiator_noc MASTER_APPSS_PROC 3 &gladiator_noc SLAVE_LLCC 3> on sdm845?

>
> so I cannot test the impact of bwmon that way.
>
> Best regards,
> Krzysztof

2022-06-28 15:45:54

by Krzysztof Kozlowski

[permalink] [raw]
Subject: Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON

On 28/06/2022 17:20, Rajendra Nayak wrote:
>
>
> On 6/28/2022 7:32 PM, Krzysztof Kozlowski wrote:
>> On 28/06/2022 15:15, Rajendra Nayak wrote:
>>>
>>>
>>> On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
>>>> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>>>>
>>>>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>>>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>>>>
>>>>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>>>>> interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> + pmu@1436400 {
>>>>>>>>>>>> + compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>>>>> + reg = <0 0x01436400 0 0x600>;
>>>>>>>>>>>> +
>>>>>>>>>>>> + interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>>> +
>>>>>>>>>>>> + interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>>>>> + <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>>>>> + interconnect-names = "ddr", "l3c";
>>>>>>>>>>>
>>>>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>>>>
>>>>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>>>>
>>>>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>>>>
>>>>>>>> I double checked now and you're right.
>>>>>>>>
>>>>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>>>>> and DDR)
>>>>>>>>
>>>>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>>>>
>>>>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>>>>> --memory-block-size=4M run
>>>>>>>>
>>>>>>>> 1. Vanilla: 29768 MB/s
>>>>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>>>>> 4. Fixed bwmon 24911 MB/s
>>>>>>>> Bwmon does not vote for maximum L3 speed:
>>>>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>>>>> )
>>>>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>>>>
>>>>>>>> Maybe that's just problem with missing governor which would vote for
>>>>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>>>>
>>>>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>>>>
>>>>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>>>>> when seeing traffic between CPU and cache.
>>>>>>>>>
>>>>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>>>>
>>>>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>>>>
>>>>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>>>>
>>>>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>>>>> l3c interconnect path.
>>>>>>>>
>>>>>>>
>>>>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>>>>> bwmon instance.
>>>>>>>
>>>>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>>>>> hits the memory bus towards DDR.
>>>>>
>>>>> That seems right, looking some more into the downstream code and register definitions
>>>>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>>>>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>>>>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>>>>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>>>>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>>>>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
>>
>> For sdm845, SLAVE_LLCC is in mem_noc, so I guess it would be mc_virt on sc7280?
>
> that's correct,
>
>>
>>>>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>>>>> and similar for sdm845 too.
>>>>>
>>>>> L3 should perhaps still be voted based on the cpu freq as done today.
>>>>
>>>> This would mean that original bandwidth values (800 - 7216 MB/s) were
>>>> correct. However we have still your observation that bwmon kicks in very
>>>> fast and my measurements that sampled bwmon data shows bandwidth ~20000
>>>> MB/s.
>>>
>>> Right, thats because the bandwidth supported between the cpu<->llcc path is much higher
>>> than the DDR frequencies. For instance on sc7280, I see (2288 - 15258 MB/s) for LLCC while
>>> the DDR max is 8532 MB/s.
>>
>> OK, that sounds right.
>>
>> Another point is that I did not observe any actual scaling of throughput
>> via that interconnect path:
>> <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>
>
> Shouldn't this be <&gladiator_noc MASTER_APPSS_PROC 3 &gladiator_noc SLAVE_LLCC 3> on sdm845?

When I tried this, I got an icc xlate error. If I read the code correctly,
SLAVE_LLCC is in mem_noc:
https://elixir.bootlin.com/linux/v5.19-rc4/source/drivers/interconnect/qcom/sdm845.c#L349
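
In other words, assuming the driver indeed registers SLAVE_LLCC under the
mem_noc provider, the variant that should xlate on sdm845 is the one I
listed earlier:

	interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>;

rather than the &gladiator_noc ... &gladiator_noc SLAVE_LLCC form.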

Best regards,
Krzysztof