2021-12-13 22:52:19

by Julia Lawall

Subject: cpufreq: intel_pstate: map utilization into the pstate range

With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
between 0 and the capacity, and then maps everything below min_pstate to
the lowest frequency. On my Intel Xeon Gold 6130 and Intel Xeon Gold
5218, this means that more than the bottom quarter of the utilization
range is mapped to the lowest frequency. Running slowly doesn't
necessarily save energy, because the work takes more time. This patch
instead scales the utilization (target_perf) between the min pstate and
the cap pstate.
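
As a standalone sketch (not the kernel code itself), the two mappings can be
compared like this. DIV_ROUND_UP mirrors the kernel macro; cap_pstate = 37 and
min_pstate = 10 are illustrative values, roughly what a Xeon Gold 6130
(1.0 GHz minimum, 3.7 GHz turbo) might expose, not measurements:

```c
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* v5.15 mapping: scale target_perf over the full [0, cap_pstate] range,
 * so every result below min_pstate is later clamped to the lowest level. */
static int pstate_v515(int cap_pstate, int target_perf, int capacity)
{
	if (target_perf >= capacity)
		return cap_pstate;
	return DIV_ROUND_UP(cap_pstate * target_perf, capacity);
}

/* Patched mapping: scale target_perf over [min_pstate, cap_pstate], so
 * even small utilizations land above the minimum P-state. */
static int pstate_patched(int cap_pstate, int min_pstate, int target_perf,
			  int capacity)
{
	if (target_perf >= capacity)
		return cap_pstate;
	return DIV_ROUND_UP((cap_pstate - min_pstate) * target_perf, capacity)
	       + min_pstate;
}
```

With these values and capacity = 1024, a utilization of 276 (about 27% of the
range) still yields P-state 10, the minimum, under the old formula, but 18
under the patched one.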

On the DaCapo (Java) benchmarks and on a few examples of kernel compilation
(based on make defconfig), on two-socket machines with the above CPUs, the
performance is always the same as or better than Linux v5.15, and the CPU
and RAM energy consumption is likewise always the same or better (one
exception: zxing-eval on the 5218 uses a little more energy).

6130:

Performance (sec):
v5.15 with this patch (improvement)
avrora 77.5773 56.4090 (1.38)
batik-eval 113.1173 112.4135 (1.01)
biojava-eval 196.6533 196.7943 (1.00)
cassandra-eval 62.6638 59.2800 (1.06)
eclipse-eval 218.5988 210.0139 (1.04)
fop 3.5537 3.4281 (1.04)
graphchi-eval 13.8668 10.3411 (1.34)
h2 75.5018 62.2993 (1.21)
jme-eval 94.9531 89.5722 (1.06)
jython 23.5789 23.0603 (1.02)
kafka-eval 60.2784 59.2057 (1.02)
luindex 5.3537 5.1190 (1.05)
lusearch-fix 3.5956 3.3628 (1.07)
lusearch 3.5396 3.5204 (1.01)
pmd 13.3505 10.8795 (1.23)
sunflow 7.5932 7.4899 (1.01)
tomcat-eval 39.6568 31.4844 (1.26)
tradebeans 118.9918 99.3932 (1.20)
tradesoap-eval 56.9113 54.7567 (1.04)
tradesoap 50.7779 44.5169 (1.14)
xalan 5.0711 4.8879 (1.04)
zxing-eval 10.5532 10.2435 (1.03)

make 45.5977 45.3454 (1.01)
make sched 3.4318 3.3450 (1.03)
make fair.o 2.9611 2.8464 (1.04)

CPU energy consumption (J):

avrora 4740.4813 3585.5843 (1.32)
batik-eval 13361.34 13278.74 (1.01)
biojava-eval 21608.70 21652.94 (1.00)
cassandra-eval 3037.6907 2891.8117 (1.05)
eclipse-eval 23528.15 23198.36 (1.01)
fop 455.7363 441.6443 (1.03)
graphchi-eval 999.9220 971.5633 (1.03)
h2 5451.3093 4929.8383 (1.11)
jme-eval 5343.7790 5143.8463 (1.04)
jython 2685.3790 2623.1950 (1.02)
kafka-eval 2715.6047 2548.7220 (1.07)
luindex 597.7587 571.0387 (1.05)
lusearch-fix 714.0340 692.4727 (1.03)
lusearch 718.4863 704.3650 (1.02)
pmd 1627.6377 1497.5437 (1.09)
sunflow 1563.5173 1514.6013 (1.03)
tomcat-eval 4740.1603 4539.1503 (1.04)
tradebeans 8331.2260 7482.3737 (1.11)
tradesoap-eval 6610.1040 6426.7077 (1.03)
tradesoap 5641.9300 5544.3517 (1.02)
xalan 1072.0363 1065.7957 (1.01)
zxing-eval 2200.1883 2174.1137 (1.01)

make 9788.9290 9777.5823 (1.00)
make sched 501.0770 495.0600 (1.01)
make fair.o 363.4570 352.8670 (1.03)

RAM energy consumption (J):

avrora 2508.5553 1844.5977 (1.36)
batik-eval 5627.3327 5603.1820 (1.00)
biojava-eval 9371.1417 9351.1543 (1.00)
cassandra-eval 1398.0567 1289.8317 (1.08)
eclipse-eval 10193.28 9952.3543 (1.02)
fop 189.1927 184.0620 (1.03)
graphchi-eval 539.3947 447.4557 (1.21)
h2 2771.0573 2432.2587 (1.14)
jme-eval 2702.4030 2504.0783 (1.08)
jython 1135.7317 1114.5190 (1.02)
kafka-eval 1320.6840 1220.6867 (1.08)
luindex 246.6597 237.1593 (1.04)
lusearch-fix 294.4317 282.2193 (1.04)
lusearch 295.5400 284.3890 (1.04)
pmd 721.7020 643.1280 (1.12)
sunflow 568.6710 549.3780 (1.04)
tomcat-eval 2305.8857 1995.8843 (1.16)
tradebeans 4323.5243 3749.7033 (1.15)
tradesoap-eval 2862.8047 2783.5733 (1.03)
tradesoap 2717.3900 2519.9567 (1.08)
xalan 430.6100 418.5797 (1.03)
zxing-eval 732.2507 710.9423 (1.03)

make 3362.8837 3356.2587 (1.00)
make sched 191.7917 188.8863 (1.02)
make fair.o 149.6850 145.8273 (1.03)

5218:

Performance (sec):

avrora 62.0511 43.9240 (1.41)
batik-eval 111.6393 110.1999 (1.01)
biojava-eval 241.4400 238.7388 (1.01)
cassandra-eval 62.0185 58.9052 (1.05)
eclipse-eval 240.9488 232.8944 (1.03)
fop 3.8318 3.6408 (1.05)
graphchi-eval 13.3911 10.4670 (1.28)
h2 75.3658 62.8218 (1.20)
jme-eval 95.0131 89.5635 (1.06)
jython 28.1397 27.6802 (1.02)
kafka-eval 60.4817 59.4780 (1.02)
luindex 5.1994 4.9587 (1.05)
lusearch-fix 3.8448 3.6519 (1.05)
lusearch 3.8928 3.7068 (1.05)
pmd 13.0990 10.8008 (1.21)
sunflow 7.7983 7.8569 (0.99)
tomcat-eval 39.2064 31.7629 (1.23)
tradebeans 120.8676 100.9113 (1.20)
tradesoap-eval 65.5552 63.3493 (1.03)
xalan 5.4463 5.3576 (1.02)
zxing-eval 9.8611 9.9692 (0.99)

make 43.1852 43.1285 (1.00)
make sched 3.2181 3.1706 (1.01)
make fair.o 2.7584 2.6615 (1.04)

CPU energy consumption (J):

avrora 3979.5297 3049.3347 (1.31)
batik-eval 12339.59 12413.41 (0.99)
biojava-eval 23935.18 23931.61 (1.00)
cassandra-eval 3552.2753 3380.4860 (1.05)
eclipse-eval 24186.38 24076.57 (1.00)
fop 441.0607 442.9647 (1.00)
graphchi-eval 1021.1323 964.4800 (1.06)
h2 5484.9667 4901.9067 (1.12)
jme-eval 6167.5287 5909.5767 (1.04)
jython 2956.7150 2986.3680 (0.99)
kafka-eval 3229.9333 3197.7743 (1.01)
luindex 537.0007 533.9980 (1.01)
lusearch-fix 720.1830 699.2343 (1.03)
lusearch 708.8190 700.7023 (1.01)
pmd 1539.7463 1398.1850 (1.10)
sunflow 1533.3367 1497.2863 (1.02)
tomcat-eval 4551.9333 4289.2553 (1.06)
tradebeans 8527.2623 7570.2933 (1.13)
tradesoap-eval 6849.3213 6750.9687 (1.01)
xalan 1013.2747 1019.1217 (0.99)
zxing-eval 1852.9077 1943.1753 (0.95)

make 9257.5547 9262.5993 (1.00)
make sched 438.7123 435.9133 (1.01)
make fair.o 315.6550 312.2280 (1.01)

RAM energy consumption (J):

avrora 16309.86 11458.08 (1.42)
batik-eval 30107.11 29891.58 (1.01)
biojava-eval 64290.01 63941.71 (1.01)
cassandra-eval 13240.04 12403.19 (1.07)
eclipse-eval 64188.41 62008.35 (1.04)
fop 1052.2457 996.0907 (1.06)
graphchi-eval 3622.5130 2856.1983 (1.27)
h2 19965.58 16624.08 (1.20)
jme-eval 21777.02 20211.06 (1.08)
jython 7515.3843 7396.6437 (1.02)
kafka-eval 12868.39 12577.32 (1.02)
luindex 1387.7263 1328.8073 (1.04)
lusearch-fix 1313.1220 1238.8813 (1.06)
lusearch 1303.5597 1245.4130 (1.05)
pmd 3650.6697 3049.8567 (1.20)
sunflow 2460.8907 2380.3773 (1.03)
tomcat-eval 11199.61 9232.8367 (1.21)
tradebeans 32385.99 26901.40 (1.20)
tradesoap-eval 17691.01 17006.95 (1.04)
xalan 1783.7290 1735.1937 (1.03)
zxing-eval 2812.9710 2952.2933 (0.95)

make 13247.47 13258.64 (1.00)
make sched 885.7790 877.1667 (1.01)
make fair.o 741.2473 723.6313 (1.02)


Signed-off-by: Julia Lawall <[email protected]>

---

min_pstate is defined in terms of cpu->pstate.min_pstate and
cpu->min_perf_ratio. Maybe one of these values should be used instead.
Likewise, perhaps cap_pstate should be max_pstate?

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 8c176b7dae41..ba6a48959754 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2789,10 +2789,6 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,

/* Optimization: Avoid unnecessary divisions. */

- target_pstate = cap_pstate;
- if (target_perf < capacity)
- target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
-
min_pstate = cap_pstate;
if (min_perf < capacity)
min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
@@ -2807,6 +2803,10 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
if (max_pstate < min_pstate)
max_pstate = min_pstate;

+ target_pstate = cap_pstate;
+ if (target_perf < capacity)
+ target_pstate = DIV_ROUND_UP((cap_pstate - min_pstate) * target_perf, capacity) + min_pstate;
+
target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);

intel_cpufreq_hwp_update(cpu, min_pstate, max_pstate, target_pstate, true);


2021-12-17 18:36:43

by Rafael J. Wysocki

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
>
> With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
> between 0 and the capacity, and then maps everything below min_pstate to
> the lowest frequency.

Well, it is not just intel_pstate with HWP. This is how schedutil
works in general; see get_next_freq() in there.
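
For reference, a simplified standalone sketch of that schedutil mapping,
modeled on get_next_freq()/map_util_freq(): the ~25% headroom added via the
x + (x >> 2) idiom follows the kernel's approach, but this is an
approximation, not a copy of the real code:

```c
/* Pick a frequency proportional to utilization, with ~25% headroom,
 * the way schedutil's get_next_freq() does in broad strokes. */
static unsigned long sketch_next_freq(unsigned long util,
				      unsigned long max_freq,
				      unsigned long max_cap)
{
	util += util >> 2;	/* add ~25% headroom */
	if (util > max_cap)
		util = max_cap;	/* cap at full capacity */
	return max_freq * util / max_cap;
}
```

At half utilization this asks for 62.5% of max_freq; the point is that the
scaling is linear from zero, so low utilizations map below the hardware's
minimum frequency and simply get clamped there.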

> On my Intel Xeon Gold 6130 and Intel Xeon Gold
> 5218, this means that more than the bottom quarter of the utilization
> range is mapped to the lowest frequency. Running slowly doesn't
> necessarily save energy, because the work takes more time.

This is true, but the layout of the available range of performance
values is a property of the processor, not a driver issue.

Moreover, the role of the driver is not to decide how to respond to
the given utilization value, that is the role of the governor. The
driver is expected to do what it is asked for by the governor.

> This patch scales the utilization
> (target_perf) between the min pstate and the cap pstate instead.
>
> On the DaCapo (Java) benchmarks and on a few examples of kernel compilation
> (based on make defconfig), on two-socket machines with the above CPUs, the
> performance is always the same as or better than Linux v5.15, and the CPU
> and RAM energy consumption is likewise always the same or better (one
> exception: zxing-eval on the 5218 uses a little more energy).
>
> [benchmark tables snipped]

So the numbers look better after the change, because it makes the
driver ask the hardware for slightly more performance than it is asked
for by the governor.

>
> Signed-off-by: Julia Lawall <[email protected]>
>
> ---
>
> min_pstate is defined in terms of cpu->pstate.min_pstate and
> cpu->min_perf_ratio. Maybe one of these values should be used instead.
> Likewise, perhaps cap_pstate should be max_pstate?

I'm not sure if I understand this remark. cap_pstate is the max
performance level of the CPU and max_pstate is the current limit
imposed by the framework. They are different things.

>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 8c176b7dae41..ba6a48959754 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2789,10 +2789,6 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
>
> /* Optimization: Avoid unnecessary divisions. */
>
> - target_pstate = cap_pstate;
> - if (target_perf < capacity)
> - target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
> -
> min_pstate = cap_pstate;
> if (min_perf < capacity)
> min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
> @@ -2807,6 +2803,10 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
> if (max_pstate < min_pstate)
> max_pstate = min_pstate;
>
> + target_pstate = cap_pstate;
> + if (target_perf < capacity)
> + target_pstate = DIV_ROUND_UP((cap_pstate - min_pstate) * target_perf, capacity) + min_pstate;

So the driver is asked by the governor to deliver the fraction of the
max performance (cap_pstate) given by the target_perf / capacity ratio
with the floor given by min_perf / capacity. It cannot turn around
and do something else, because it thinks it knows better.

> +
> target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
>
> intel_cpufreq_hwp_update(cpu, min_pstate, max_pstate, target_pstate, true);

2021-12-17 19:32:31

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:

> On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
> >
> > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
> > between 0 and the capacity, and then maps everything below min_pstate to
> > the lowest frequency.
>
> Well, it is not just intel_pstate with HWP. This is how schedutil
> works in general; see get_next_freq() in there.
>
> > On my Intel Xeon Gold 6130 and Intel Xeon Gold
> > 5218, this means that more than the bottom quarter of the utilization
> > range is mapped to the lowest frequency. Running slowly doesn't
> > necessarily save energy, because the work takes more time.
>
> This is true, but the layout of the available range of performance
> values is a property of the processor, not a driver issue.
>
> Moreover, the role of the driver is not to decide how to respond to
> the given utilization value, that is the role of the governor. The
> driver is expected to do what it is asked for by the governor.

OK, but what exactly is the goal of schedutil?

I would have expected that it was to give good performance while saving
energy, but it's not doing either in many of these cases.

Is it the intent of schedutil that the bottom quarter of utilizations
should be mapped to the lowest frequency?

julia


> [remainder of quoted message snipped]

2021-12-17 20:36:53

by Francisco Jerez

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
>
>> On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
>> >
>> > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
>> > between 0 and the capacity, and then maps everything below min_pstate to
>> > the lowest frequency.
>>
>> Well, it is not just intel_pstate with HWP. This is how schedutil
>> works in general; see get_next_freq() in there.
>>
>> > On my Intel Xeon Gold 6130 and Intel Xeon Gold
>> > 5218, this means that more than the bottom quarter of the utilization
>> > range is mapped to the lowest frequency. Running slowly doesn't
>> > necessarily save energy, because the work takes more time.
>>
>> This is true, but the layout of the available range of performance
>> values is a property of the processor, not a driver issue.
>>
>> Moreover, the role of the driver is not to decide how to respond to
>> the given utilization value, that is the role of the governor. The
>> driver is expected to do what it is asked for by the governor.
>
> OK, but what exactly is the goal of schedutil?
>
> I would have expected that it was to give good performance while saving
> energy, but it's not doing either in many of these cases.
>
> Is it the intent of schedutil that the bottom quarter of utilizations
> should be mapped to the lowest frequency?
>

If the lowest frequency provides more performance than needed to handle
the CPU utilization observed by schedutil, why would it want any other
frequency than the (theoretically most efficient) minimum P-state?

Remember that whether running more slowly saves energy depends, among
other things, on whether your system is running beyond the inflection
point of its power curve (AKA the frequency of maximum efficiency).
Within the concave region below this most efficient frequency, yes,
running more slowly will waste energy. However, the optimal behavior in
that region is to fix your clock to the most efficient frequency and
then power-gate the CPU once it runs out of work to do. That is
precisely what the current code can be expected to achieve by clamping
its response to min_pstate, which is meant to approximate the most
efficient P-state of the CPU. Looking at your results, though, makes me
think that this isn't happening for you, possibly because
intel_pstate's notion of the most efficient frequency is fairly
inaccurate in this case.

Your energy usage results below seem to provide some evidence that we're
botching min_pstate in your system: Your energy figures scale pretty
much linearly with the runtime of each testcase, which suggests that
your energy usage is mostly dominated by leakage current, as would be
the case for workloads running far below the most efficient frequency of
the CPU.

Attempting to correct that by introducing an additive bias term into the
P-state calculation as done in this patch will inevitably pessimize
energy usage in the (also fairly common) scenario that the CPU
utilization is high enough to push the CPU frequency into the convex
region of the power curve, and doesn't really fix the underlying problem
that our knowledge about the most efficient P-state may have a
substantial error in your system.

Judging from the performance improvement you're observing with this, I'd
bet that most of the test cases below are fairly latency-bound: They
seem like the kind of workloads where a thread may block on something
for a significant fraction of the time and then run a burst of CPU work
that's not designed to run in parallel with the tasks the same thread
will subsequently block on. That would explain the fact that you're
getting low enough utilization values that your change affects the
P-state calculation significantly. As you've probably realized
yourself, in such a scenario the optimality assumptions of the current
schedutil heuristic break down. However, it doesn't seem like
intel_pstate has enough information to make up for that problem, if
that requires introducing another heuristic which will itself cause us
to deviate further from optimality in a different set of scenarios.

> julia
>

Regards,
Francisco

> [remainder of quoted message snipped]
>> > pmd 721.7020 643.1280 (1.12)
>> > sunflow 568.6710 549.3780 (1.04)
>> > tomcat-eval 2305.8857 1995.8843 (1.16)
>> > tradebeans 4323.5243 3749.7033 (1.15)
>> > tradesoap-eval 2862.8047 2783.5733 (1.03)
>> > tradesoap 2717.3900 2519.9567 (1.08)
>> > xalan 430.6100 418.5797 (1.03)
>> > zxing-eval 732.2507 710.9423 (1.03)
>> >
>> > make 3362.8837 3356.2587 (1.00)
>> > make sched 191.7917 188.8863 (1.02)
>> > make fair.o 149.6850 145.8273 (1.03)
>> >
>> > 5218:
>> >
>> > Performance (sec):
>> >
>> > avrora 62.0511 43.9240 (1.41)
>> > batik-eval 111.6393 110.1999 (1.01)
>> > biojava-eval 241.4400 238.7388 (1.01)
>> > cassandra-eval 62.0185 58.9052 (1.05)
>> > eclipse-eval 240.9488 232.8944 (1.03)
>> > fop 3.8318 3.6408 (1.05)
>> > graphchi-eval 13.3911 10.4670 (1.28)
>> > h2 75.3658 62.8218 (1.20)
>> > jme-eval 95.0131 89.5635 (1.06)
>> > jython 28.1397 27.6802 (1.02)
>> > kafka-eval 60.4817 59.4780 (1.02)
>> > luindex 5.1994 4.9587 (1.05)
>> > lusearch-fix 3.8448 3.6519 (1.05)
>> > lusearch 3.8928 3.7068 (1.05)
>> > pmd 13.0990 10.8008 (1.21)
>> > sunflow 7.7983 7.8569 (0.99)
>> > tomcat-eval 39.2064 31.7629 (1.23)
>> > tradebeans 120.8676 100.9113 (1.20)
>> > tradesoap-eval 65.5552 63.3493 (1.03)
>> > xalan 5.4463 5.3576 (1.02)
>> > zxing-eval 9.8611 9.9692 (0.99)
>> >
>> > make 43.1852 43.1285 (1.00)
>> > make sched 3.2181 3.1706 (1.01)
>> > make fair.o 2.7584 2.6615 (1.04)
>> >
>> > CPU energy consumption (J):
>> >
>> > avrora 3979.5297 3049.3347 (1.31)
>> > batik-eval 12339.59 12413.41 (0.99)
>> > biojava-eval 23935.18 23931.61 (1.00)
>> > cassandra-eval 3552.2753 3380.4860 (1.05)
>> > eclipse-eval 24186.38 24076.57 (1.00)
>> > fop 441.0607 442.9647 (1.00)
>> > graphchi-eval 1021.1323 964.4800 (1.06)
>> > h2 5484.9667 4901.9067 (1.12)
>> > jme-eval 6167.5287 5909.5767 (1.04)
>> > jython 2956.7150 2986.3680 (0.99)
>> > kafka-eval 3229.9333 3197.7743 (1.01)
>> > luindex 537.0007 533.9980 (1.01)
>> > lusearch-fix 720.1830 699.2343 (1.03)
>> > lusearch 708.8190 700.7023 (1.01)
>> > pmd 1539.7463 1398.1850 (1.10)
>> > sunflow 1533.3367 1497.2863 (1.02)
>> > tomcat-eval 4551.9333 4289.2553 (1.06)
>> > tradebeans 8527.2623 7570.2933 (1.13)
>> > tradesoap-eval 6849.3213 6750.9687 (1.01)
>> > xalan 1013.2747 1019.1217 (0.99)
>> > zxing-eval 1852.9077 1943.1753 (0.95)
>> >
>> > make 9257.5547 9262.5993 (1.00)
>> > make sched 438.7123 435.9133 (1.01)
>> > make fair.o 315.6550 312.2280 (1.01)
>> >
>> > RAM energy consumption (J):
>> >
>> > avrora 16309.86 11458.08 (1.42)
>> > batik-eval 30107.11 29891.58 (1.01)
>> > biojava-eval 64290.01 63941.71 (1.01)
>> > cassandra-eval 13240.04 12403.19 (1.07)
>> > eclipse-eval 64188.41 62008.35 (1.04)
>> > fop 1052.2457 996.0907 (1.06)
>> > graphchi-eval 3622.5130 2856.1983 (1.27)
>> > h2 19965.58 16624.08 (1.20)
>> > jme-eval 21777.02 20211.06 (1.08)
>> > jython 7515.3843 7396.6437 (1.02)
>> > kafka-eval 12868.39 12577.32 (1.02)
>> > luindex 1387.7263 1328.8073 (1.04)
>> > lusearch-fix 1313.1220 1238.8813 (1.06)
>> > lusearch 1303.5597 1245.4130 (1.05)
>> > pmd 3650.6697 3049.8567 (1.20)
>> > sunflow 2460.8907 2380.3773 (1.03)
>> > tomcat-eval 11199.61 9232.8367 (1.21)
>> > tradebeans 32385.99 26901.40 (1.20)
>> > tradesoap-eval 17691.01 17006.95 (1.04)
>> > xalan 1783.7290 1735.1937 (1.03)
>> > zxing-eval 2812.9710 2952.2933 (0.95)
>> >
>> > make 13247.47 13258.64 (1.00)
>> > make sched 885.7790 877.1667 (1.01)
>> > make fair.o 741.2473 723.6313 (1.02)
>>
>> So the numbers look better after the change, because it makes the
>> driver ask the hardware for slightly more performance than it is asked
>> for by the governor.
>>
>> >
>> > Signed-off-by: Julia Lawall <[email protected]>
>> >
>> > ---
>> >
>> > min_pstate is defined in terms of cpu->pstate.min_pstate and
>> > cpu->min_perf_ratio. Maybe one of these values should be used instead.
>> > Likewise, perhaps cap_pstate should be max_pstate?
>>
>> I'm not sure if I understand this remark. cap_pstate is the max
>> performance level of the CPU and max_pstate is the current limit
>> imposed by the framework. They are different things.
>>
>> >
>> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> > index 8c176b7dae41..ba6a48959754 100644
>> > --- a/drivers/cpufreq/intel_pstate.c
>> > +++ b/drivers/cpufreq/intel_pstate.c
>> > @@ -2789,10 +2789,6 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
>> >
>> > /* Optimization: Avoid unnecessary divisions. */
>> >
>> > - target_pstate = cap_pstate;
>> > - if (target_perf < capacity)
>> > - target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
>> > -
>> > min_pstate = cap_pstate;
>> > if (min_perf < capacity)
>> > min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
>> > @@ -2807,6 +2803,10 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
>> > if (max_pstate < min_pstate)
>> > max_pstate = min_pstate;
>> >
>> > + target_pstate = cap_pstate;
>> > + if (target_perf < capacity)
>> > + target_pstate = DIV_ROUND_UP((cap_pstate - min_pstate) * target_perf, capacity) + min_pstate;
>>
>> So the driver is asked by the governor to deliver the fraction of the
>> max performance (cap_pstate) given by the target_perf / capacity ratio
>> with the floor given by min_perf / capacity. It cannot turn around
>> and do something else, because it thinks it knows better.
>>
>> > +
>> > target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
>> >
>> > intel_cpufreq_hwp_update(cpu, min_pstate, max_pstate, target_pstate, true);
>>

2021-12-17 22:51:55

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Fri, 17 Dec 2021, Francisco Jerez wrote:

> Julia Lawall <[email protected]> writes:
>
> > On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
> >
> >> On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
> >> >
> >> > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
> >> > between 0 and the capacity, and then maps everything below min_pstate to
> >> > the lowest frequency.
> >>
> >> Well, it is not just intel_pstate with HWP. This is how schedutil
> >> works in general; see get_next_freq() in there.
> >>
> >> > On my Intel Xeon Gold 6130 and Intel Xeon Gold
> >> > 5218, this means that more than the bottom quarter of utilizations are all
> >> > mapped to the lowest frequency. Running slowly doesn't necessarily save
> >> > energy, because it takes more time.
> >>
> >> This is true, but the layout of the available range of performance
> >> values is a property of the processor, not a driver issue.
> >>
> >> Moreover, the role of the driver is not to decide how to respond to
> >> the given utilization value, that is the role of the governor. The
> >> driver is expected to do what it is asked for by the governor.
> >
> > OK, but what exactly is the goal of schedutil?
> >
> > I would have expected that it was to give good performance while saving
> > energy, but it's not doing either in many of these cases.
> >
> > Is it the intent of schedutil that the bottom quarter of utilizations
> > should be mapped to the lowest frequency?
> >
>
> If the lowest frequency provides more performance than needed to handle
> the CPU utilization observed by schedutil, why would it want any other
> frequency than the (theoretically most efficient) minimum P-state?
>
> Remember that whether running more slowly saves energy or not depends
> among other things on whether your system is running beyond the
> inflection point of its power curve (AKA frequency of maximum
> efficiency). Within the region of concavity below this most efficient
> frequency, yes, running more slowly will waste energy, however, the
> optimal behavior within that region is to fix your clock to the most
> efficient frequency and then power-gate the CPU once it's run out of
> work to do -- Which is precisely what the current code can be expected
> to achieve by clamping its response to min_pstate, which is meant to
> approximate the most efficient P-state of the CPU -- Though looking at
> your results makes me think that that's not happening for you, possibly
> because intel_pstate's notion of the most efficient frequency may be
> fairly inaccurate in this case.

I'm not sure I understand the concept of the min_pstate being the most
efficient one. The min_pstate appears to be just the minimum frequency
advertised for the machine. Is that somehow intended to be the most
efficient one?

On the other hand, I noticed that by requesting pstates lower than the
minimum one, one seems to obtain frequencies below what is advertised for
the machine.

> Your energy usage results below seem to provide some evidence that we're
> botching min_pstate in your system: Your energy figures scale pretty
> much linearly with the runtime of each testcase, which suggests that
> your energy usage is mostly dominated by leakage current, as would be
> the case for workloads running far below the most efficient frequency of
> the CPU.

I also tried just always forcing various pstates for a few applications:

avrora pstate10 4804.4830
avrora pstate15 3520.0250
avrora pstate20 2975.5300
avrora pstate25 3605.5110
avrora pstate30 3265.1520
avrora pstate35 3142.0730
avrora pstate37 3149.4060

h2 pstate10 6100.5350
h2 pstate15 4440.2950
h2 pstate20 3731.1560
h2 pstate25 4924.2250
h2 pstate30 4375.3220
h2 pstate35 4227.6440
h2 pstate37 4181.9290

xalan pstate10 1153.3680
xalan pstate15 1027.7840
xalan pstate20 998.0690
xalan pstate25 1094.4020
xalan pstate30 1098.2600
xalan pstate35 1092.1510
xalan pstate37 1098.5350

For these three cases, the best pstate in terms of CPU energy consumption
is always 20. For RAM, faster is always better:

avrora pstate10 2372.9950
avrora pstate15 1706.6990
avrora pstate20 1383.3360
avrora pstate25 1406.3790
avrora pstate30 1235.5450
avrora pstate35 1139.7800
avrora pstate37 1142.9890

h2 pstate10 3239.6100
h2 pstate15 2321.2250
h2 pstate20 1886.2960
h2 pstate25 2030.6580
h2 pstate30 1731.8120
h2 pstate35 1635.3940
h2 pstate37 1607.1940

xalan pstate10 662.1400
xalan pstate15 556.7600
xalan pstate20 479.3040
xalan pstate25 429.1490
xalan pstate30 407.0890
xalan pstate35 405.5320
xalan pstate37 406.9260


>
> Attempting to correct that by introducing an additive bias term into the
> P-state calculation as done in this patch will inevitably pessimize
> energy usage in the (also fairly common) scenario that the CPU
> utilization is high enough to push the CPU frequency into the convex
> region of the power curve, and doesn't really fix the underlying problem
> that our knowledge about the most efficient P-state may have a
> substantial error in your system.
>
> Judging from the performance improvement you're observing with this, I'd
> bet that most of the test cases below are fairly latency-bound: They
> seem like the kind of workloads where a thread may block on something
> for a significant fraction of the time and then run a burst of CPU work
> that's not designed to run in parallel with the tasks the same thread
> will subsequently block on. That would explain the fact that you're
> getting low enough utilization values that your change affects the
> P-state calculation significantly.

The three applications all alternate running and blocking at various fast
rates. Small portions of the traces of each one are attached.

thanks,
julia

> [...]


Attachments:
avrora_R10_C1_dahu-2_5.15.0freq_schedutil_1_from_30_upto_30.01.pdf (15.49 kB)
h2_R10_CN_dahu-2_5.15.0freq_schedutil_1_from_20_upto_20.1.pdf (13.35 kB)
xalan_R10_C100_dahu-2_5.15.0freq_schedutil_1_from_2.9_upto_3.pdf (97.50 kB)

2021-12-18 00:04:15

by Francisco Jerez

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Fri, 17 Dec 2021, Francisco Jerez wrote:
>
>> Julia Lawall <[email protected]> writes:
>>
>> > On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
>> >
>> >> On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
>> >> >
>> >> > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
>> >> > between 0 and the capacity, and then maps everything below min_pstate to
>> >> > the lowest frequency.
>> >>
>> >> Well, it is not just intel_pstate with HWP. This is how schedutil
>> >> works in general; see get_next_freq() in there.
>> >>
>> >> > On my Intel Xeon Gold 6130 and Intel Xeon Gold
>> >> > 5218, this means that more than the bottom quarter of utilizations are all
>> >> > mapped to the lowest frequency. Running slowly doesn't necessarily save
>> >> > energy, because it takes more time.
>> >>
>> >> This is true, but the layout of the available range of performance
>> >> values is a property of the processor, not a driver issue.
>> >>
>> >> Moreover, the role of the driver is not to decide how to respond to
>> >> the given utilization value, that is the role of the governor. The
>> >> driver is expected to do what it is asked for by the governor.
>> >
>> > OK, but what exactly is the goal of schedutil?
>> >
>> > I would have expected that it was to give good performance while saving
>> > energy, but it's not doing either in many of these cases.
>> >
>> > Is it the intent of schedutil that the bottom quarter of utilizations
>> > should be mapped to the lowest frequency?
>> >
>>
>> If the lowest frequency provides more performance than needed to handle
>> the CPU utilization observed by schedutil, why would it want any other
>> frequency than the (theoretically most efficient) minimum P-state?
>>
>> Remember that whether running more slowly saves energy or not depends
>> among other things on whether your system is running beyond the
>> inflection point of its power curve (AKA frequency of maximum
>> efficiency). Within the region of concavity below this most efficient
>> frequency, yes, running more slowly will waste energy, however, the
>> optimal behavior within that region is to fix your clock to the most
>> efficient frequency and then power-gate the CPU once it's run out of
>> work to do -- Which is precisely what the current code can be expected
>> to achieve by clamping its response to min_pstate, which is meant to
>> approximate the most efficient P-state of the CPU -- Though looking at
>> your results makes me think that that's not happening for you, possibly
>> because intel_pstate's notion of the most efficient frequency may be
>> fairly inaccurate in this case.
>
> I'm not sure I understand the concept of the min_pstate being the most
> efficient one. The min_pstate appears to be just the minimum frequency
> advertised for the machine. Is that somehow intended to be the most
> efficient one?
>

Yeah, that's what it should be ideally, since there is hardly any reason
to ever program the CPU clock to run below this most efficient
frequency: the concavity region of the CPU power curve is inherently
inefficient and delivers lower performance than the most efficient
frequency.

As you can see in intel_pstate.c, min_pstate is initialized on core
platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
Ratio (R/O)". However, that seems to deviate massively from the most
efficient ratio on your system, which may indicate a firmware bug, some
sort of clock gating problem, or an issue with the way that
intel_pstate.c processes this information.

> On the other hand, I noticed that by requesting pstates lower than the
> minimum one, one seems to obtain frequencies below what is advertised for
> the machine.
>
>> Your energy usage results below seem to provide some evidence that we're
>> botching min_pstate in your system: Your energy figures scale pretty
>> much linearly with the runtime of each testcase, which suggests that
>> your energy usage is mostly dominated by leakage current, as would be
>> the case for workloads running far below the most efficient frequency of
>> the CPU.
>
> I also tried just always forcing various pstates for a few applications:
>
> avrora pstate10 4804.4830
> avrora pstate15 3520.0250
> avrora pstate20 2975.5300
> avrora pstate25 3605.5110
> avrora pstate30 3265.1520
> avrora pstate35 3142.0730
> avrora pstate37 3149.4060
>
> h2 pstate10 6100.5350
> h2 pstate15 4440.2950
> h2 pstate20 3731.1560
> h2 pstate25 4924.2250
> h2 pstate30 4375.3220
> h2 pstate35 4227.6440
> h2 pstate37 4181.9290
>
> xalan pstate10 1153.3680
> xalan pstate15 1027.7840
> xalan pstate20 998.0690
> xalan pstate25 1094.4020
> xalan pstate30 1098.2600
> xalan pstate35 1092.1510
> xalan pstate37 1098.5350
>

Nice, so this confirms that the most efficient CPU frequency is roughly
2x the one currently assumed by intel_pstate on your system. It would
be trivial to work around this locally on your system by forcing
min_pstate to be ~20 via sysfs. Though of course it would be better to
find the root cause of this deviation.

> For these three cases, the best pstate in terms of CPU energy consumption
> is always 20. For RAM, faster is always better:
>
> avrora pstate10 2372.9950
> avrora pstate15 1706.6990
> avrora pstate20 1383.3360
> avrora pstate25 1406.3790
> avrora pstate30 1235.5450
> avrora pstate35 1139.7800
> avrora pstate37 1142.9890
>
> h2 pstate10 3239.6100
> h2 pstate15 2321.2250
> h2 pstate20 1886.2960
> h2 pstate25 2030.6580
> h2 pstate30 1731.8120
> h2 pstate35 1635.3940
> h2 pstate37 1607.1940
>
> xalan pstate10 662.1400
> xalan pstate15 556.7600
> xalan pstate20 479.3040
> xalan pstate25 429.1490
> xalan pstate30 407.0890
> xalan pstate35 405.5320
> xalan pstate37 406.9260
>

Yeah, the picture becomes more complicated as one tries to take into
account the energy consumption of the various peripherals your CPU is
talking to, which will typically give you a combined power curve with a
different maximum efficiency point. Predicting that doesn't seem
possible without additional information not available to intel_pstate
currently, including the set of devices the application is interacting
with, and their respective power curves.

>
>>
>> Attempting to correct that by introducing an additive bias term into the
>> P-state calculation as done in this patch will inevitably pessimize
>> energy usage in the (also fairly common) scenario that the CPU
>> utilization is high enough to push the CPU frequency into the convex
>> region of the power curve, and doesn't really fix the underlying problem
>> that our knowledge about the most efficient P-state may have a
>> substantial error in your system.
>>
>> Judging from the performance improvement you're observing with this, I'd
>> bet that most of the test cases below are fairly latency-bound: They
>> seem like the kind of workloads where a thread may block on something
>> for a significant fraction of the time and then run a burst of CPU work
>> that's not designed to run in parallel with the tasks the same thread
>> will subsequently block on. That would explain the fact that you're
>> getting low enough utilization values that your change affects the
>> P-state calculation significantly.
>
> The three applications all alternate running and blocking at various fast
> rates. Small portions of the traces of each one are attached.

Yup, thanks for the traces; these seem like the kind of workloads that
greatly underutilize the CPU resources. It's not surprising to see
schedutil give a suboptimal response in these cases, since the limiting
factor for such latency-bound workloads that spend most of their time
waiting is how quickly the CPU can react to some event and complete a
short non-parallelizable computation, rather than the total amount of
computational resources available to it.

Do you get any better results when using HWP as the actual governor
(i.e. when intel_pstate is in active mode) instead of relying on
schedutil? With schedutil you may be able to get better results in
combination with the deadline scheduler, though that would also need
userspace collaboration.

>
> thanks,
> julia
>
>> As you've probably realized
>> yourself, in such a scenario the optimality assumptions of the current
>> schedutil heuristic break down, however it doesn't seem like
>> intel_pstate has enough information to make up for that problem, if that
>> requires introducing another heuristic which will itself cause us to
>> further deviate from optimality in a different set of scenarios.
>>
>> > julia
>> >
>>
>> Regards,
>> Francisco
>>
>> >
>> >>
>> >> > This patch scales the utilization
>> >> > (target_perf) between the min pstate and the cap pstate instead.
>> >> >
>> >> > On the DaCapo (Java) benchmarks and on a few examples of kernel compilation
>> >> > (based on make defconfig), on two-socket machines with the above CPUs, the
>> >> > performance is always the same as or better than Linux v5.15, and the CPU and
>> >> > RAM energy consumption is likewise always the same or better (one
>> >> > exception: zxing-eval on the 5218 uses a little more energy).
>> >> >
>> >> > 6130:
>> >> >
>> >> > Performance (sec):
>> >> > v5.15 with this patch (improvement)
>> >> > avrora 77.5773 56.4090 (1.38)
>> >> > batik-eval 113.1173 112.4135 (1.01)
>> >> > biojava-eval 196.6533 196.7943 (1.00)
>> >> > cassandra-eval 62.6638 59.2800 (1.06)
>> >> > eclipse-eval 218.5988 210.0139 (1.04)
>> >> > fop 3.5537 3.4281 (1.04)
>> >> > graphchi-eval 13.8668 10.3411 (1.34)
>> >> > h2 75.5018 62.2993 (1.21)
>> >> > jme-eval 94.9531 89.5722 (1.06)
>> >> > jython 23.5789 23.0603 (1.02)
>> >> > kafka-eval 60.2784 59.2057 (1.02)
>> >> > luindex 5.3537 5.1190 (1.05)
>> >> > lusearch-fix 3.5956 3.3628 (1.07)
>> >> > lusearch 3.5396 3.5204 (1.01)
>> >> > pmd 13.3505 10.8795 (1.23)
>> >> > sunflow 7.5932 7.4899 (1.01)
>> >> > tomcat-eval 39.6568 31.4844 (1.26)
>> >> > tradebeans 118.9918 99.3932 (1.20)
>> >> > tradesoap-eval 56.9113 54.7567 (1.04)
>> >> > tradesoap 50.7779 44.5169 (1.14)
>> >> > xalan 5.0711 4.8879 (1.04)
>> >> > zxing-eval 10.5532 10.2435 (1.03)
>> >> >
>> >> > make 45.5977 45.3454 (1.01)
>> >> > make sched 3.4318 3.3450 (1.03)
>> >> > make fair.o 2.9611 2.8464 (1.04)
>> >> >
>> >> > CPU energy consumption (J):
>> >> >
>> >> > avrora 4740.4813 3585.5843 (1.32)
>> >> > batik-eval 13361.34 13278.74 (1.01)
>> >> > biojava-eval 21608.70 21652.94 (1.00)
>> >> > cassandra-eval 3037.6907 2891.8117 (1.05)
>> >> > eclipse-eval 23528.15 23198.36 (1.01)
>> >> > fop 455.7363 441.6443 (1.03)
>> >> > graphchi-eval 999.9220 971.5633 (1.03)
>> >> > h2 5451.3093 4929.8383 (1.11)
>> >> > jme-eval 5343.7790 5143.8463 (1.04)
>> >> > jython 2685.3790 2623.1950 (1.02)
>> >> > kafka-eval 2715.6047 2548.7220 (1.07)
>> >> > luindex 597.7587 571.0387 (1.05)
>> >> > lusearch-fix 714.0340 692.4727 (1.03)
>> >> > lusearch 718.4863 704.3650 (1.02)
>> >> > pmd 1627.6377 1497.5437 (1.09)
>> >> > sunflow 1563.5173 1514.6013 (1.03)
>> >> > tomcat-eval 4740.1603 4539.1503 (1.04)
>> >> > tradebeans 8331.2260 7482.3737 (1.11)
>> >> > tradesoap-eval 6610.1040 6426.7077 (1.03)
>> >> > tradesoap 5641.9300 5544.3517 (1.02)
>> >> > xalan 1072.0363 1065.7957 (1.01)
>> >> > zxing-eval 2200.1883 2174.1137 (1.01)
>> >> >
>> >> > make 9788.9290 9777.5823 (1.00)
>> >> > make sched 501.0770 495.0600 (1.01)
>> >> > make fair.o 363.4570 352.8670 (1.03)
>> >> >
>> >> > RAM energy consumption (J):
>> >> >
>> >> > avrora 2508.5553 1844.5977 (1.36)
>> >> > batik-eval 5627.3327 5603.1820 (1.00)
>> >> > biojava-eval 9371.1417 9351.1543 (1.00)
>> >> > cassandra-eval 1398.0567 1289.8317 (1.08)
>> >> > eclipse-eval 10193.28 9952.3543 (1.02)
>> >> > fop 189.1927 184.0620 (1.03)
>> >> > graphchi-eval 539.3947 447.4557 (1.21)
>> >> > h2 2771.0573 2432.2587 (1.14)
>> >> > jme-eval 2702.4030 2504.0783 (1.08)
>> >> > jython 1135.7317 1114.5190 (1.02)
>> >> > kafka-eval 1320.6840 1220.6867 (1.08)
>> >> > luindex 246.6597 237.1593 (1.04)
>> >> > lusearch-fix 294.4317 282.2193 (1.04)
>> >> > lusearch 295.5400 284.3890 (1.04)
>> >> > pmd 721.7020 643.1280 (1.12)
>> >> > sunflow 568.6710 549.3780 (1.04)
>> >> > tomcat-eval 2305.8857 1995.8843 (1.16)
>> >> > tradebeans 4323.5243 3749.7033 (1.15)
>> >> > tradesoap-eval 2862.8047 2783.5733 (1.03)
>> >> > tradesoap 2717.3900 2519.9567 (1.08)
>> >> > xalan 430.6100 418.5797 (1.03)
>> >> > zxing-eval 732.2507 710.9423 (1.03)
>> >> >
>> >> > make 3362.8837 3356.2587 (1.00)
>> >> > make sched 191.7917 188.8863 (1.02)
>> >> > make fair.o 149.6850 145.8273 (1.03)
>> >> >
>> >> > 5218:
>> >> >
>> >> > Performance (sec):
>> >> >
>> >> > avrora 62.0511 43.9240 (1.41)
>> >> > batik-eval 111.6393 110.1999 (1.01)
>> >> > biojava-eval 241.4400 238.7388 (1.01)
>> >> > cassandra-eval 62.0185 58.9052 (1.05)
>> >> > eclipse-eval 240.9488 232.8944 (1.03)
>> >> > fop 3.8318 3.6408 (1.05)
>> >> > graphchi-eval 13.3911 10.4670 (1.28)
>> >> > h2 75.3658 62.8218 (1.20)
>> >> > jme-eval 95.0131 89.5635 (1.06)
>> >> > jython 28.1397 27.6802 (1.02)
>> >> > kafka-eval 60.4817 59.4780 (1.02)
>> >> > luindex 5.1994 4.9587 (1.05)
>> >> > lusearch-fix 3.8448 3.6519 (1.05)
>> >> > lusearch 3.8928 3.7068 (1.05)
>> >> > pmd 13.0990 10.8008 (1.21)
>> >> > sunflow 7.7983 7.8569 (0.99)
>> >> > tomcat-eval 39.2064 31.7629 (1.23)
>> >> > tradebeans 120.8676 100.9113 (1.20)
>> >> > tradesoap-eval 65.5552 63.3493 (1.03)
>> >> > xalan 5.4463 5.3576 (1.02)
>> >> > zxing-eval 9.8611 9.9692 (0.99)
>> >> >
>> >> > make 43.1852 43.1285 (1.00)
>> >> > make sched 3.2181 3.1706 (1.01)
>> >> > make fair.o 2.7584 2.6615 (1.04)
>> >> >
>> >> > CPU energy consumption (J):
>> >> >
>> >> > avrora 3979.5297 3049.3347 (1.31)
>> >> > batik-eval 12339.59 12413.41 (0.99)
>> >> > biojava-eval 23935.18 23931.61 (1.00)
>> >> > cassandra-eval 3552.2753 3380.4860 (1.05)
>> >> > eclipse-eval 24186.38 24076.57 (1.00)
>> >> > fop 441.0607 442.9647 (1.00)
>> >> > graphchi-eval 1021.1323 964.4800 (1.06)
>> >> > h2 5484.9667 4901.9067 (1.12)
>> >> > jme-eval 6167.5287 5909.5767 (1.04)
>> >> > jython 2956.7150 2986.3680 (0.99)
>> >> > kafka-eval 3229.9333 3197.7743 (1.01)
>> >> > luindex 537.0007 533.9980 (1.01)
>> >> > lusearch-fix 720.1830 699.2343 (1.03)
>> >> > lusearch 708.8190 700.7023 (1.01)
>> >> > pmd 1539.7463 1398.1850 (1.10)
>> >> > sunflow 1533.3367 1497.2863 (1.02)
>> >> > tomcat-eval 4551.9333 4289.2553 (1.06)
>> >> > tradebeans 8527.2623 7570.2933 (1.13)
>> >> > tradesoap-eval 6849.3213 6750.9687 (1.01)
>> >> > xalan 1013.2747 1019.1217 (0.99)
>> >> > zxing-eval 1852.9077 1943.1753 (0.95)
>> >> >
>> >> > make 9257.5547 9262.5993 (1.00)
>> >> > make sched 438.7123 435.9133 (1.01)
>> >> > make fair.o 315.6550 312.2280 (1.01)
>> >> >
>> >> > RAM energy consumption (J):
>> >> >
>> >> > avrora 16309.86 11458.08 (1.42)
>> >> > batik-eval 30107.11 29891.58 (1.01)
>> >> > biojava-eval 64290.01 63941.71 (1.01)
>> >> > cassandra-eval 13240.04 12403.19 (1.07)
>> >> > eclipse-eval 64188.41 62008.35 (1.04)
>> >> > fop 1052.2457 996.0907 (1.06)
>> >> > graphchi-eval 3622.5130 2856.1983 (1.27)
>> >> > h2 19965.58 16624.08 (1.20)
>> >> > jme-eval 21777.02 20211.06 (1.08)
>> >> > jython 7515.3843 7396.6437 (1.02)
>> >> > kafka-eval 12868.39 12577.32 (1.02)
>> >> > luindex 1387.7263 1328.8073 (1.04)
>> >> > lusearch-fix 1313.1220 1238.8813 (1.06)
>> >> > lusearch 1303.5597 1245.4130 (1.05)
>> >> > pmd 3650.6697 3049.8567 (1.20)
>> >> > sunflow 2460.8907 2380.3773 (1.03)
>> >> > tomcat-eval 11199.61 9232.8367 (1.21)
>> >> > tradebeans 32385.99 26901.40 (1.20)
>> >> > tradesoap-eval 17691.01 17006.95 (1.04)
>> >> > xalan 1783.7290 1735.1937 (1.03)
>> >> > zxing-eval 2812.9710 2952.2933 (0.95)
>> >> >
>> >> > make 13247.47 13258.64 (1.00)
>> >> > make sched 885.7790 877.1667 (1.01)
>> >> > make fair.o 741.2473 723.6313 (1.02)
>> >>
>> >> So the numbers look better after the change, because it makes the
>> >> driver ask the hardware for slightly more performance than the
>> >> governor asked for.
>> >>
>> >> >
>> >> > Signed-off-by: Julia Lawall <[email protected]>
>> >> >
>> >> > ---
>> >> >
>> >> > min_pstate is defined in terms of cpu->pstate.min_pstate and
>> >> > cpu->min_perf_ratio. Maybe one of these values should be used instead.
>> >> > Likewise, perhaps cap_pstate should be max_pstate?
>> >>
>> >> I'm not sure if I understand this remark. cap_pstate is the max
>> >> performance level of the CPU and max_pstate is the current limit
>> >> imposed by the framework. They are different things.
>> >>
>> >> >
>> >> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> >> > index 8c176b7dae41..ba6a48959754 100644
>> >> > --- a/drivers/cpufreq/intel_pstate.c
>> >> > +++ b/drivers/cpufreq/intel_pstate.c
>> >> > @@ -2789,10 +2789,6 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
>> >> >
>> >> > /* Optimization: Avoid unnecessary divisions. */
>> >> >
>> >> > - target_pstate = cap_pstate;
>> >> > - if (target_perf < capacity)
>> >> > - target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
>> >> > -
>> >> > min_pstate = cap_pstate;
>> >> > if (min_perf < capacity)
>> >> > min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
>> >> > @@ -2807,6 +2803,10 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
>> >> > if (max_pstate < min_pstate)
>> >> > max_pstate = min_pstate;
>> >> >
>> >> > + target_pstate = cap_pstate;
>> >> > + if (target_perf < capacity)
>> >> > + target_pstate = DIV_ROUND_UP((cap_pstate - min_pstate) * target_perf, capacity) + min_pstate;
>> >>
>> >> So the driver is asked by the governor to deliver the fraction of the
>> >> max performance (cap_pstate) given by the target_perf / capacity ratio
>> >> with the floor given by min_perf / capacity. It cannot turn around
>> >> and do something else, because it thinks it knows better.
>> >>
>> >> > +
>> >> > target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
>> >> >
>> >> > intel_cpufreq_hwp_update(cpu, min_pstate, max_pstate, target_pstate, true);
>> >>
>>

2021-12-18 06:12:50

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

> As you can see in intel_pstate.c, min_pstate is initialized on core
> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> Ratio (R/O)". However that seems to deviate massively from the most
> efficient ratio on your system, which may indicate a firmware bug, some
> sort of clock gating problem, or an issue with the way that
> intel_pstate.c processes this information.

I'm not sure I understand the bug part. min_pstate corresponds to the
frequency that I find listed as the minimum frequency in the
specifications of the CPU. Should one expect it to be something
different?

> Yup, thanks for the traces, seems like the kind of workloads that
> greatly underutilize the CPU resources. It's not surprising to see
> schedutil give a suboptimal response in these cases, since the limiting
> factor for such latency-bound workloads that spend most of their time
> waiting is how quickly the CPU can react to some event and complete a
> short non-parallelizable computation, rather than the total amount of
> computational resources available to it.
>
> Do you get any better results while using HWP as actual governor
> (i.e. when intel_pstate is in active mode) instead of relying on
> schedutil? With schedutil you may be able to get better results in
> combination with the deadline scheduler, though that would also need
> userspace collaboration.

I have results for Linux 5.9. At that time, schedutil made suggestions
and the hardware made the decisions, mostly ignoring schedutil's
suggestions. With avrora (mostly 6 concurrent threads, tiny gaps), only
7% of the execution time is below the turbo frequencies. With h2 (more
threads, larger gaps), 15% of the time is below turbo frequencies. With
xalan (a larger number of threads, middle-sized gaps), only 0.2% of the
time is below the turbo frequencies.

julia


[...]

2021-12-18 10:19:45

by Francisco Jerez

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

>> As you can see in intel_pstate.c, min_pstate is initialized on core
>> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
>> Ratio (R/O)". However that seems to deviate massively from the most
>> efficient ratio on your system, which may indicate a firmware bug, some
>> sort of clock gating problem, or an issue with the way that
>> intel_pstate.c processes this information.
>
> I'm not sure I understand the bug part. min_pstate corresponds to the
> frequency that I find listed as the minimum frequency in the
> specifications of the CPU. Should one expect it to be something
> different?
>

I'd expect the minimum frequency in your processor specification to
roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
since there's little reason to claim your processor can be clocked down
to a frequency which is inherently inefficient /and/ slower than the
maximum efficiency ratio. In fact the two seem to match on your system;
they're just nowhere close to the frequency which is actually most
efficient. That smells like a bug: either your processor is
misreporting what the most efficient frequency is, or the real value
deviates from the reported one because your CPU's static power
consumption is greater than it would be under ideal conditions -- e.g.
due to some sort of clock gating issue, possibly caused by a software
bug, or because our scheduling of such workloads with a large number of
lightly loaded threads is unnecessarily inefficient, which could be
preventing most of your CPU cores from ever being clock-gated even
though the processor may be sitting idle for a large fraction of the
runtime.

>> Yup, thanks for the traces, seems like the kind of workloads that
>> greatly underutilize the CPU resources. It's not surprising to see
>> schedutil give a suboptimal response in these cases, since the limiting
>> factor for such latency-bound workloads that spend most of their time
>> waiting is how quickly the CPU can react to some event and complete a
>> short non-parallelizable computation, rather than the total amount of
>> computational resources available to it.
>>
>> Do you get any better results while using HWP as actual governor
>> (i.e. when intel_pstate is in active mode) instead of relying on
>> schedutil? With schedutil you may be able to get better results in
>> combination with the deadline scheduler, though that would also need
>> userspace collaboration.
>
> I have results for Linux 5.9. At that time, schedutil made suggestions
> and the hardware made the decisions, mostly ignoring the suggestions from
> schedutil. With avrora (mostly 6 concurrent threads, tiny gaps), only 7%
> of the execution time is below the turbo frequencies. With h2 (more
> threads, larger gaps), 15% of the time is below turbo frequencies. With
> xalan (larger number of threads, middle sized gaps), only 0.2% of the time
> is below the turbo frequencies.
>
> julia
>
>
>>
>> >
>> > thanks,
>> > julia
>> >
>> >> As you've probably realized
>> >> yourself, in such a scenario the optimality assumptions of the current
>> >> schedutil heuristic break down, however it doesn't seem like
>> >> intel_pstate has enough information to make up for that problem, if that
>> >> requires introducing another heuristic which will itself cause us to
>> >> further deviate from optimality in a different set of scenarios.
>> >>
>> >> > julia
>> >> >
>> >>
>> >> Regards,
>> >> Francisco
>> >>
>> >> >
>> >> >>
>> >> >> > This patch scales the utilization
>> >> >> > (target_perf) between the min pstate and the cap pstate instead.
>> >> >> >
>> >> >> > On the DaCapo (Java) benchmarks and on a few exmples of kernel compilation
>> >> >> > (based on make defconfig), on two-socket machines with the above CPUs, the
>> >> >> > performance is always the same or better as Linux v5.15, and the CPU and
>> >> >> > RAM energy consumption is likewise always the same or better (one
>> >> >> > exception: zxing-eval on the 5128 uses a little more energy).
>> >> >> >
>> >> >> > 6130:
>> >> >> >
>> >> >> > Performance (sec):
>> >> >> > v5.15 with this patch (improvement)
>> >> >> > avrora 77.5773 56.4090 (1.38)
>> >> >> > batik-eval 113.1173 112.4135 (1.01)
>> >> >> > biojava-eval 196.6533 196.7943 (1.00)
>> >> >> > cassandra-eval 62.6638 59.2800 (1.06)
>> >> >> > eclipse-eval 218.5988 210.0139 (1.04)
>> >> >> > fop 3.5537 3.4281 (1.04)
>> >> >> > graphchi-evalN 13.8668 10.3411 (1.34)
>> >> >> > h2 75.5018 62.2993 (1.21)
>> >> >> > jme-eval 94.9531 89.5722 (1.06)
>> >> >> > jython 23.5789 23.0603 (1.02)
>> >> >> > kafka-eval 60.2784 59.2057 (1.02)
>> >> >> > luindex 5.3537 5.1190 (1.05)
>> >> >> > lusearch-fix 3.5956 3.3628 (1.07)
>> >> >> > lusearch 3.5396 3.5204 (1.01)
>> >> >> > pmd 13.3505 10.8795 (1.23)
>> >> >> > sunflow 7.5932 7.4899 (1.01)
>> >> >> > tomcat-eval 39.6568 31.4844 (1.26)
>> >> >> > tradebeans 118.9918 99.3932 (1.20)
>> >> >> > tradesoap-eval 56.9113 54.7567 (1.04)
>> >> >> > tradesoap 50.7779 44.5169 (1.14)
>> >> >> > xalan 5.0711 4.8879 (1.04)
>> >> >> > zxing-eval 10.5532 10.2435 (1.03)
>> >> >> >
>> >> >> > make 45.5977 45.3454 (1.01)
>> >> >> > make sched 3.4318 3.3450 (1.03)
>> >> >> > make fair.o 2.9611 2.8464 (1.04)
>> >> >> >
>> >> >> > CPU energy consumption (J):
>> >> >> >
>> >> >> > avrora 4740.4813 3585.5843 (1.32)
>> >> >> > batik-eval 13361.34 13278.74 (1.01)
>> >> >> > biojava-eval 21608.70 21652.94 (1.00)
>> >> >> > cassandra-eval 3037.6907 2891.8117 (1.05)
>> >> >> > eclipse-eval 23528.15 23198.36 (1.01)
>> >> >> > fop 455.7363 441.6443 (1.03)
>> >> >> > graphchi-eval 999.9220 971.5633 (1.03)
>> >> >> > h2 5451.3093 4929.8383 (1.11)
>> >> >> > jme-eval 5343.7790 5143.8463 (1.04)
>> >> >> > jython 2685.3790 2623.1950 (1.02)
>> >> >> > kafka-eval 2715.6047 2548.7220 (1.07)
>> >> >> > luindex 597.7587 571.0387 (1.05)
>> >> >> > lusearch-fix 714.0340 692.4727 (1.03)
>> >> >> > lusearch 718.4863 704.3650 (1.02)
>> >> >> > pmd 1627.6377 1497.5437 (1.09)
>> >> >> > sunflow 1563.5173 1514.6013 (1.03)
>> >> >> > tomcat-eval 4740.1603 4539.1503 (1.04)
>> >> >> > tradebeans 8331.2260 7482.3737 (1.11)
>> >> >> > tradesoap-eval 6610.1040 6426.7077 (1.03)
>> >> >> > tradesoap 5641.9300 5544.3517 (1.02)
>> >> >> > xalan 1072.0363 1065.7957 (1.01)
>> >> >> > zxing-eval 2200.1883 2174.1137 (1.01)
>> >> >> >
>> >> >> > make 9788.9290 9777.5823 (1.00)
>> >> >> > make sched 501.0770 495.0600 (1.01)
>> >> >> > make fair.o 363.4570 352.8670 (1.03)
>> >> >> >
>> >> >> > RAM energy consumption (J):
>> >> >> >
>> >> >> > avrora 2508.5553 1844.5977 (1.36)
>> >> >> > batik-eval 5627.3327 5603.1820 (1.00)
>> >> >> > biojava-eval 9371.1417 9351.1543 (1.00)
>> >> >> > cassandra-eval 1398.0567 1289.8317 (1.08)
>> >> >> > eclipse-eval 10193.28 9952.3543 (1.02)
>> >> >> > fop 189.1927 184.0620 (1.03)
>> >> >> > graphchi-eval 539.3947 447.4557 (1.21)
>> >> >> > h2 2771.0573 2432.2587 (1.14)
>> >> >> > jme-eval 2702.4030 2504.0783 (1.08)
>> >> >> > jython 1135.7317 1114.5190 (1.02)
>> >> >> > kafka-eval 1320.6840 1220.6867 (1.08)
>> >> >> > luindex 246.6597 237.1593 (1.04)
>> >> >> > lusearch-fix 294.4317 282.2193 (1.04)
>> >> >> > lusearch 295.5400 284.3890 (1.04)
>> >> >> > pmd 721.7020 643.1280 (1.12)
>> >> >> > sunflow 568.6710 549.3780 (1.04)
>> >> >> > tomcat-eval 2305.8857 1995.8843 (1.16)
>> >> >> > tradebeans 4323.5243 3749.7033 (1.15)
>> >> >> > tradesoap-eval 2862.8047 2783.5733 (1.03)
>> >> >> > tradesoap 2717.3900 2519.9567 (1.08)
>> >> >> > xalan 430.6100 418.5797 (1.03)
>> >> >> > zxing-eval 732.2507 710.9423 (1.03)
>> >> >> >
>> >> >> > make 3362.8837 3356.2587 (1.00)
>> >> >> > make sched 191.7917 188.8863 (1.02)
>> >> >> > make fair.o 149.6850 145.8273 (1.03)
>> >> >> >
>> >> >> > 5128:
>> >> >> >
>> >> >> > Performance (sec):
>> >> >> >
>> >> >> > avrora 62.0511 43.9240 (1.41)
>> >> >> > batik-eval 111.6393 110.1999 (1.01)
>> >> >> > biojava-eval 241.4400 238.7388 (1.01)
>> >> >> > cassandra-eval 62.0185 58.9052 (1.05)
>> >> >> > eclipse-eval 240.9488 232.8944 (1.03)
>> >> >> > fop 3.8318 3.6408 (1.05)
>> >> >> > graphchi-eval 13.3911 10.4670 (1.28)
>> >> >> > h2 75.3658 62.8218 (1.20)
>> >> >> > jme-eval 95.0131 89.5635 (1.06)
>> >> >> > jython 28.1397 27.6802 (1.02)
>> >> >> > kafka-eval 60.4817 59.4780 (1.02)
>> >> >> > luindex 5.1994 4.9587 (1.05)
>> >> >> > lusearch-fix 3.8448 3.6519 (1.05)
>> >> >> > lusearch 3.8928 3.7068 (1.05)
>> >> >> > pmd 13.0990 10.8008 (1.21)
>> >> >> > sunflow 7.7983 7.8569 (0.99)
>> >> >> > tomcat-eval 39.2064 31.7629 (1.23)
>> >> >> > tradebeans 120.8676 100.9113 (1.20)
>> >> >> > tradesoap-eval 65.5552 63.3493 (1.03)
>> >> >> > xalan 5.4463 5.3576 (1.02)
>> >> >> > zxing-eval 9.8611 9.9692 (0.99)
>> >> >> >
>> >> >> > make 43.1852 43.1285 (1.00)
>> >> >> > make sched 3.2181 3.1706 (1.01)
>> >> >> > make fair.o 2.7584 2.6615 (1.04)
>> >> >> >
>> >> >> > CPU energy consumption (J):
>> >> >> >
>> >> >> > avrora 3979.5297 3049.3347 (1.31)
>> >> >> > batik-eval 12339.59 12413.41 (0.99)
>> >> >> > biojava-eval 23935.18 23931.61 (1.00)
>> >> >> > cassandra-eval 3552.2753 3380.4860 (1.05)
>> >> >> > eclipse-eval 24186.38 24076.57 (1.00)
>> >> >> > fop 441.0607 442.9647 (1.00)
>> >> >> > graphchi-eval 1021.1323 964.4800 (1.06)
>> >> >> > h2 5484.9667 4901.9067 (1.12)
>> >> >> > jme-eval 6167.5287 5909.5767 (1.04)
>> >> >> > jython 2956.7150 2986.3680 (0.99)
>> >> >> > kafka-eval 3229.9333 3197.7743 (1.01)
>> >> >> > luindex 537.0007 533.9980 (1.01)
>> >> >> > lusearch-fix 720.1830 699.2343 (1.03)
>> >> >> > lusearch 708.8190 700.7023 (1.01)
>> >> >> > pmd 1539.7463 1398.1850 (1.10)
>> >> >> > sunflow 1533.3367 1497.2863 (1.02)
>> >> >> > tomcat-eval 4551.9333 4289.2553 (1.06)
>> >> >> > tradebeans 8527.2623 7570.2933 (1.13)
>> >> >> > tradesoap-eval 6849.3213 6750.9687 (1.01)
>> >> >> > xalan 1013.2747 1019.1217 (0.99)
>> >> >> > zxing-eval 1852.9077 1943.1753 (0.95)
>> >> >> >
>> >> >> > make 9257.5547 9262.5993 (1.00)
>> >> >> > make sched 438.7123 435.9133 (1.01)
>> >> >> > make fair.o 315.6550 312.2280 (1.01)
>> >> >> >
>> >> >> > RAM energy consumption (J):
>> >> >> >
>> >> >> > avrora 16309.86 11458.08 (1.42)
>> >> >> > batik-eval 30107.11 29891.58 (1.01)
>> >> >> > biojava-eval 64290.01 63941.71 (1.01)
>> >> >> > cassandra-eval 13240.04 12403.19 (1.07)
>> >> >> > eclipse-eval 64188.41 62008.35 (1.04)
>> >> >> > fop 1052.2457 996.0907 (1.06)
>> >> >> > graphchi-eval 3622.5130 2856.1983 (1.27)
>> >> >> > h2 19965.58 16624.08 (1.20)
>> >> >> > jme-eval 21777.02 20211.06 (1.08)
>> >> >> > jython 7515.3843 7396.6437 (1.02)
>> >> >> > kafka-eval 12868.39 12577.32 (1.02)
>> >> >> > luindex 1387.7263 1328.8073 (1.04)
>> >> >> > lusearch-fix 1313.1220 1238.8813 (1.06)
>> >> >> > lusearch 1303.5597 1245.4130 (1.05)
>> >> >> > pmd 3650.6697 3049.8567 (1.20)
>> >> >> > sunflow 2460.8907 2380.3773 (1.03)
>> >> >> > tomcat-eval 11199.61 9232.8367 (1.21)
>> >> >> > tradebeans 32385.99 26901.40 (1.20)
>> >> >> > tradesoap-eval 17691.01 17006.95 (1.04)
>> >> >> > xalan 1783.7290 1735.1937 (1.03)
>> >> >> > zxing-eval 2812.9710 2952.2933 (0.95)
>> >> >> >
>> >> >> > make 13247.47 13258.64 (1.00)
>> >> >> > make sched 885.7790 877.1667 (1.01)
>> >> >> > make fair.o 741.2473 723.6313 (1.02)
>> >> >>
>> >> >> So the numbers look better after the change, because it makes the
>> >> >> driver ask the hardware for slightly more performance than the
>> >> >> governor requested.
>> >> >>
>> >> >> >
>> >> >> > Signed-off-by: Julia Lawall <[email protected]>
>> >> >> >
>> >> >> > ---
>> >> >> >
>> >> >> > min_pstate is defined in terms of cpu->pstate.min_pstate and
>> >> >> > cpu->min_perf_ratio. Maybe one of these values should be used instead.
>> >> >> > Likewise, perhaps cap_pstate should be max_pstate?
>> >> >>
>> >> >> I'm not sure if I understand this remark. cap_pstate is the max
>> >> >> performance level of the CPU and max_pstate is the current limit
>> >> >> imposed by the framework. They are different things.
>> >> >>
>> >> >> >
>> >> >> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> >> >> > index 8c176b7dae41..ba6a48959754 100644
>> >> >> > --- a/drivers/cpufreq/intel_pstate.c
>> >> >> > +++ b/drivers/cpufreq/intel_pstate.c
>> >> >> > @@ -2789,10 +2789,6 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
>> >> >> >
>> >> >> > /* Optimization: Avoid unnecessary divisions. */
>> >> >> >
>> >> >> > - target_pstate = cap_pstate;
>> >> >> > - if (target_perf < capacity)
>> >> >> > - target_pstate = DIV_ROUND_UP(cap_pstate * target_perf, capacity);
>> >> >> > -
>> >> >> > min_pstate = cap_pstate;
>> >> >> > if (min_perf < capacity)
>> >> >> > min_pstate = DIV_ROUND_UP(cap_pstate * min_perf, capacity);
>> >> >> > @@ -2807,6 +2803,10 @@ static void intel_cpufreq_adjust_perf(unsigned int cpunum,
>> >> >> > if (max_pstate < min_pstate)
>> >> >> > max_pstate = min_pstate;
>> >> >> >
>> >> >> > + target_pstate = cap_pstate;
>> >> >> > + if (target_perf < capacity)
>> >> >> > + target_pstate = DIV_ROUND_UP((cap_pstate - min_pstate) * target_perf, capacity) + min_pstate;
>> >> >>
>> >> >> So the driver is asked by the governor to deliver the fraction of the
>> >> >> max performance (cap_pstate) given by the target_perf / capacity ratio
>> >> >> with the floor given by min_perf / capacity. It cannot turn around
>> >> >> and do something else, because it thinks it knows better.
>> >> >>
>> >> >> > +
>> >> >> > target_pstate = clamp_t(int, target_pstate, min_pstate, max_pstate);
>> >> >> >
>> >> >> > intel_cpufreq_hwp_update(cpu, min_pstate, max_pstate, target_pstate, true);
>> >> >>
>> >>
>>
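The mapping change under discussion can be sketched outside the kernel. The following is an illustrative Python sketch, not kernel code; min_pstate = 10 and cap_pstate = 37 are assumed values taken from the P-state range 10..37 mentioned later in the thread, and capacity = 1024 stands in for the scheduler's capacity scale. `div_round_up` mimics the kernel's `DIV_ROUND_UP` integer division:

```python
def div_round_up(n, d):
    # kernel-style DIV_ROUND_UP: integer division rounding up
    return (n + d - 1) // d

def old_map(util, capacity, cap_pstate):
    # v5.15 behaviour: scale [0, capacity] onto [0, cap_pstate];
    # results below min_pstate are later clamped up to min_pstate,
    # so a wide band of low utilizations collapses onto the floor
    if util >= capacity:
        return cap_pstate
    return div_round_up(cap_pstate * util, capacity)

def new_map(util, capacity, cap_pstate, min_pstate):
    # patched behaviour: scale [0, capacity] onto [min_pstate, cap_pstate]
    if util >= capacity:
        return cap_pstate
    return div_round_up((cap_pstate - min_pstate) * util, capacity) + min_pstate

# With min_pstate = 10 and cap_pstate = 37, a 25% utilization (256/1024)
# maps to pstate 10 (the floor) before the patch and pstate 17 after it.
print(old_map(256, 1024, 37))      # 10
print(new_map(256, 1024, 37, 10))  # 17
```

Under these assumed numbers, any utilization below roughly 10/37 of capacity (a bit more than the bottom quarter) lands on the lowest P-state with the old mapping, which matches the observation in the original patch description.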

2021-12-18 11:08:03

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Sat, 18 Dec 2021, Francisco Jerez wrote:

> Julia Lawall <[email protected]> writes:
>
> >> As you can see in intel_pstate.c, min_pstate is initialized on core
> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> >> Ratio (R/O)". However that seems to deviate massively from the most
> >> efficient ratio on your system, which may indicate a firmware bug, some
> >> sort of clock gating problem, or an issue with the way that
> >> intel_pstate.c processes this information.
> >
> > I'm not sure to understand the bug part. min_pstate gives the frequency
> > that I find as the minimum frequency when I look for the specifications of
> > the CPU. Should one expect that it should be something different?
> >
>
> I'd expect the minimum frequency on your processor specification to
> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
> since there's little reason to claim your processor can be clocked down
> to a frequency which is inherently inefficient /and/ slower than the
> maximum efficiency ratio -- In fact they both seem to match in your
> system, they're just nowhere close to the frequency which is actually
> most efficient, which smells like a bug, like your processor
> misreporting what the most efficient frequency is, or it deviating from
> the expected one due to your CPU static power consumption being greater
> than it would be expected to be under ideal conditions -- E.g. due to
> some sort of clock gating issue, possibly due to a software bug, or due
> to our scheduling of such workloads with a large amount of lightly
> loaded threads being unnecessarily inefficient which could also be
> preventing most of your CPU cores from ever being clock-gated even
> though your processor may be sitting idle for a large fraction of their
> runtime.

The original mail has results from two different machines: Intel 6130
(skylake) and Intel 5218 (cascade lake). I have access to another cluster
of 6130s and 5218s. I can try them.

I tried 5.9 in which I just commented out the schedutil code that makes
frequency requests. I only tested avrora (tiny pauses) and h2 (longer
pauses), and in both cases the execution is almost entirely in the turbo
frequencies.

I'm not sure I understand the term "clock-gated". What C-state does that
correspond to? The turbostat output for one run of avrora is below.

julia

78.062895 sec
Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
- - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
0 0 32 1 0.09 1001 2095 37 0 0 0 0 42 0.00 0.00 0.00 100.00 9.08
0 1 4 0 0.04 1000 2095 57 0 0 0 1 133 0.00 0.00 0.00 99.96 0.08 99.88 38
0 1 36 0 0.00 1000 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.12
0 2 8 0 0.03 1000 2095 64 0 0 0 1 124 0.00 0.00 0.00 99.97 0.08 99.89 38
0 2 40 0 0.00 1000 2095 36 0 0 0 0 40 0.00 0.00 0.00 100.00 0.10
0 3 12 0 0.00 1000 2095 42 0 0 0 0 71 0.00 0.00 0.00 100.00 0.14 99.86 38
0 3 44 1 0.09 1000 2095 63 0 0 0 0 65 0.00 0.00 0.00 99.91 0.05
0 4 14 0 0.00 1010 2095 38 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04 99.96 39
0 4 46 0 0.00 1011 2095 36 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04
0 5 10 0 0.01 1084 2095 39 0 0 0 0 58 0.00 0.00 0.00 99.99 0.04 99.95 38
0 5 42 0 0.00 1114 2095 35 0 0 0 0 39 0.00 0.00 0.00 100.00 0.05
0 6 6 0 0.03 1005 2095 89 0 0 0 1 116 0.00 0.00 0.00 99.97 0.07 99.90 39
0 6 38 0 0.00 1000 2095 38 0 0 0 0 41 0.00 0.00 0.00 100.00 0.10
0 7 2 0 0.05 1001 2095 59 0 0 0 1 133 0.00 0.00 0.00 99.95 0.09 99.86 40
0 7 34 0 0.00 1000 2095 39 0 0 0 0 65 0.00 0.00 0.00 100.00 0.13
0 8 16 0 0.00 1000 2095 43 0 0 0 0 47 0.00 0.00 0.00 100.00 0.04 99.96 38
0 8 48 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.04
0 9 20 0 0.00 1000 2095 33 0 0 0 0 37 0.00 0.00 0.00 100.00 0.03 99.97 38
0 9 52 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
0 10 24 0 0.00 1000 2095 36 0 0 0 1 40 0.00 0.00 0.00 100.00 0.03 99.96 39
0 10 56 0 0.00 1000 2095 37 0 0 0 1 38 0.00 0.00 0.00 100.00 0.03
0 11 28 0 0.00 1002 2095 35 0 0 0 1 37 0.00 0.00 0.00 100.00 0.03 99.97 38
0 11 60 0 0.00 1004 2095 34 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
0 12 30 0 0.00 1001 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.11 99.88 38
0 12 62 0 0.01 1000 2095 197 0 0 0 0 197 0.00 0.00 0.00 99.99 0.10
0 13 26 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.03 99.97 39
0 13 58 0 0.00 1000 2095 38 0 0 0 0 40 0.00 0.00 0.00 100.00 0.03
0 14 22 0 0.01 1000 2095 149 0 1 2 0 142 0.00 0.01 0.00 99.99 0.07 99.92 39
0 14 54 0 0.00 1000 2095 35 0 0 0 0 38 0.00 0.00 0.00 100.00 0.07
0 15 18 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03 99.97 39
0 15 50 0 0.00 1000 2095 34 0 0 0 0 38 0.00 0.00 0.00 100.00 0.03
1 0 1 32 3.23 1008 2095 2385 0 31 3190 45025 10144 0.00 0.28 4.68 91.99 11.21 85.56 32 35 0.04 0.04 2481.49 1162.96 0.00 0.00
1 0 33 9 0.63 1404 2095 12206 0 5 162 2480 10283 0.00 0.04 0.75 98.64 13.81
1 1 5 1 0.07 1384 2095 236 0 0 38 24 314 0.00 0.09 0.06 99.77 4.66 95.27 33
1 1 37 81 3.93 2060 2095 1254 0 5 40 59 683 0.00 0.01 0.02 96.05 0.80
1 2 9 37 3.46 1067 2095 2396 0 29 2256 55406 11731 0.00 0.17 6.02 90.54 54.10 42.44 31
1 2 41 151 14.51 1042 2095 10447 0 135 10494 248077 42327 0.01 0.87 26.57 58.84 43.05
1 3 13 110 10.47 1053 2095 7120 0 120 9218 168938 33884 0.01 0.77 16.63 72.68 42.58 46.95 32
1 3 45 69 6.76 1021 2095 4730 0 66 5598 115410 23447 0.00 0.44 12.06 81.12 46.29
1 4 15 112 10.64 1056 2095 7204 0 116 8831 171423 37754 0.01 0.70 17.56 71.67 28.01 61.35 33
1 4 47 18 1.80 1006 2095 1771 0 13 915 29315 6564 0.00 0.07 3.20 95.03 36.85
1 5 11 63 5.96 1065 2095 4090 0 58 6449 99015 18955 0.00 0.45 10.27 83.64 31.24 62.80 31
1 5 43 72 7.11 1016 2095 4794 0 73 6203 115361 26494 0.00 0.48 11.79 81.02 30.09
1 6 7 35 3.39 1022 2095 2328 0 45 3377 52721 13759 0.00 0.27 5.10 91.43 25.84 70.77 32
1 6 39 67 6.09 1096 2095 4483 0 52 3696 94964 19366 0.00 0.30 10.32 83.61 23.14
1 7 3 1 0.06 1395 2095 91 0 0 0 1 167 0.00 0.00 0.00 99.95 25.36 74.58 35
1 7 35 83 8.16 1024 2095 5785 0 100 7398 134640 27428 0.00 0.56 13.39 78.34 17.26
1 8 17 46 4.49 1016 2095 3229 0 52 3048 74914 16010 0.00 0.27 8.29 87.19 29.71 65.80 33
1 8 49 64 6.12 1052 2095 4210 0 89 5782 100570 21463 0.00 0.42 10.63 83.17 28.08
1 9 21 73 7.02 1036 2095 4917 0 64 5786 109887 21939 0.00 0.55 11.61 81.18 22.10 70.88 33
1 9 53 64 6.33 1012 2095 4074 0 69 5957 97596 20580 0.00 0.51 9.78 83.74 22.79
1 10 25 26 2.58 1013 2095 1825 0 22 2124 42630 8627 0.00 0.17 4.17 93.24 53.91 43.52 33
1 10 57 159 15.59 1022 2095 10951 0 175 14237 256828 56810 0.01 1.10 26.00 58.16 40.89
1 11 29 112 10.54 1065 2095 7462 0 126 9548 179206 39821 0.01 0.85 18.49 70.71 29.46 60.00 31
1 11 61 29 2.89 1011 2095 2002 0 24 2468 45558 10288 0.00 0.20 4.71 92.36 37.11
1 12 31 37 3.66 1011 2095 2596 0 79 3161 61027 13292 0.00 0.24 6.48 89.79 23.75 72.59 32
1 12 63 56 5.08 1107 2095 3789 0 62 4777 79133 17089 0.00 0.41 7.91 86.86 22.31
1 13 27 12 1.14 1045 2095 1477 0 16 888 18744 3250 0.00 0.06 2.18 96.70 21.23 77.64 32
1 13 59 60 5.81 1038 2095 5230 0 60 4936 87225 21402 0.00 0.41 8.95 85.14 16.55
1 14 23 28 2.75 1024 2095 2008 0 20 1839 47417 9177 0.00 0.13 5.08 92.21 34.18 63.07 32
1 14 55 106 9.58 1105 2095 6292 0 89 7182 141379 31354 0.00 0.63 14.45 75.81 27.36
1 15 19 118 11.65 1012 2095 7872 0 121 10014 193186 40448 0.01 0.80 19.53 68.68 37.53 50.82 32
1 15 51 59 5.58 1059 2095 3967 0 54 5842 88063 21138 0.00 0.39 9.12 85.23 43.60

2021-12-18 22:13:08

by Francisco Jerez

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Sat, 18 Dec 2021, Francisco Jerez wrote:
>
>> Julia Lawall <[email protected]> writes:
>>
>> >> As you can see in intel_pstate.c, min_pstate is initialized on core
>> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
>> >> Ratio (R/O)". However that seems to deviate massively from the most
>> >> efficient ratio on your system, which may indicate a firmware bug, some
>> >> sort of clock gating problem, or an issue with the way that
>> >> intel_pstate.c processes this information.
>> >
>> > I'm not sure to understand the bug part. min_pstate gives the frequency
>> > that I find as the minimum frequency when I look for the specifications of
>> > the CPU. Should one expect that it should be something different?
>> >
>>
>> I'd expect the minimum frequency on your processor specification to
>> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
>> since there's little reason to claim your processor can be clocked down
>> to a frequency which is inherently inefficient /and/ slower than the
>> maximum efficiency ratio -- In fact they both seem to match in your
>> system, they're just nowhere close to the frequency which is actually
>> most efficient, which smells like a bug, like your processor
>> misreporting what the most efficient frequency is, or it deviating from
>> the expected one due to your CPU static power consumption being greater
>> than it would be expected to be under ideal conditions -- E.g. due to
>> some sort of clock gating issue, possibly due to a software bug, or due
>> to our scheduling of such workloads with a large amount of lightly
>> loaded threads being unnecessarily inefficient which could also be
>> preventing most of your CPU cores from ever being clock-gated even
>> though your processor may be sitting idle for a large fraction of their
>> runtime.
>
> The original mail has results from two different machines: Intel 6130
> (skylake) and Intel 5218 (cascade lake). I have access to another cluster
> of 6130s and 5218s. I can try them.
>
> I tried 5.9 in which I just commented out the schedutil code that makes
> frequency requests. I only tested avrora (tiny pauses) and h2 (longer
> pauses), and in both cases the execution is almost entirely in the turbo
> frequencies.
>
> I'm not sure I understand the term "clock-gated". What C-state does that
> correspond to? The turbostat output for one run of avrora is below.
>

I didn't have any specific C1+ state in mind, most of the deeper ones
implement some sort of clock gating among other optimizations, I was
just wondering whether some sort of software bug and/or the highly
intermittent CPU utilization pattern of these workloads are preventing
most of your CPU cores from entering deep sleep states. See below.

> julia
>
> 78.062895 sec
> Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
> - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
> 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00

This seems suspicious: ^^^^ ^^^^^^^

I hadn't understood that you're running this on a dual-socket system
until I looked at these results. It seems like package #0 is doing
pretty much nothing according to the stats below, but it's still
consuming nearly half of your energy, apparently because the idle
package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
0%). That could explain your unexpectedly high static power consumption
and the deviation of the real maximum efficiency frequency from the one
reported by your processor, since the reported maximum efficiency ratio
cannot possibly take into account the existence of a second CPU package
with dysfunctional idle management.

I'm guessing that if you fully disable one of your CPU packages and
repeat the previous experiment forcing various P-states between 10 and
37 you should get a maximum efficiency ratio closer to the theoretical
one for this CPU?
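To make the observation above concrete, the Pkg_J column of the turbostat summary quoted earlier can be used to compute package 0's share of the package energy (a quick sketch using only the numbers reported in the run above):

```python
# Pkg_J values from the turbostat run quoted above
pkg0_energy_j = 2266.04       # package 0: ~1% busy, Pkg%pc2 ~90%, Pkg%pc6 ~0%
total_pkg_energy_j = 4747.52  # system-wide package energy ("-" summary row)

share = pkg0_energy_j / total_pkg_energy_j
print(f"package 0 share of package energy: {share:.1%}")  # ~47.7%
```

So the nearly idle package accounts for almost half of the package energy over the 78-second run, which is what motivates the suspicion about package C-state residency.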


2021-12-19 06:42:13

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Sat, 18 Dec 2021, Francisco Jerez wrote:

> Julia Lawall <[email protected]> writes:
>
> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> >
> >> Julia Lawall <[email protected]> writes:
> >>
> >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
> >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> >> >> Ratio (R/O)". However that seems to deviate massively from the most
> >> >> efficient ratio on your system, which may indicate a firmware bug, some
> >> >> sort of clock gating problem, or an issue with the way that
> >> >> intel_pstate.c processes this information.
> >> >
> >> > I'm not sure to understand the bug part. min_pstate gives the frequency
> >> > that I find as the minimum frequency when I look for the specifications of
> >> > the CPU. Should one expect that it should be something different?
> >> >
> >>
> >> I'd expect the minimum frequency on your processor specification to
> >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
> >> since there's little reason to claim your processor can be clocked down
> >> to a frequency which is inherently inefficient /and/ slower than the
> >> maximum efficiency ratio -- In fact they both seem to match in your
> >> system, they're just nowhere close to the frequency which is actually
> >> most efficient, which smells like a bug, like your processor
> >> misreporting what the most efficient frequency is, or it deviating from
> >> the expected one due to your CPU static power consumption being greater
> >> than it would be expected to be under ideal conditions -- E.g. due to
> >> some sort of clock gating issue, possibly due to a software bug, or due
> >> to our scheduling of such workloads with a large amount of lightly
> >> loaded threads being unnecessarily inefficient which could also be
> >> preventing most of your CPU cores from ever being clock-gated even
> >> though your processor may be sitting idle for a large fraction of their
> >> runtime.
> >
> > The original mail has results from two different machines: Intel 6130
> > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
> > of 6130s and 5218s. I can try them.
> >
> > I tried 5.9 in which I just commented out the schedutil code that makes
> > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
> > pauses), and in both cases the execution is almost entirely in the turbo
> > frequencies.
> >
> > I'm not sure I understand the term "clock-gated". What C-state does that
> > correspond to? The turbostat output for one run of avrora is below.
> >
>
> I didn't have any specific C1+ state in mind, most of the deeper ones
> implement some sort of clock gating among other optimizations, I was
> just wondering whether some sort of software bug and/or the highly
> intermittent CPU utilization pattern of these workloads are preventing
> most of your CPU cores from entering deep sleep states. See below.
>
> > julia
> >
> > 78.062895 sec
> > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
> > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
> > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
>
> This seems suspicious: ^^^^ ^^^^^^^
>
> I hadn't understood that you're running this on a dual-socket system
> until I looked at these results.

Sorry not to have mentioned that.

> It seems like package #0 is doing
> pretty much nothing according to the stats below, but it's still
> consuming nearly half of your energy, apparently because the idle
> package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
> 0%). That could explain your unexpectedly high static power consumption
> and the deviation of the real maximum efficiency frequency from the one
> reported by your processor, since the reported maximum efficiency ratio
> cannot possibly take into account the existence of a second CPU package
> with dysfunctional idle management.

Our assumption was that if anything happens on any core, all of the
packages remain in a state that allows them to react in a reasonable
amount of time to any memory request.

> I'm guessing that if you fully disable one of your CPU packages and
> repeat the previous experiment forcing various P-states between 10 and
> 37 you should get a maximum efficiency ratio closer to the theoretical
> one for this CPU?

OK, but that's not really a natural usage context... I do have a
one-socket Intel 5220. I'll see what happens there.

I did some experiments with forcing different frequencies. I haven't
finished processing the results, but I notice that as the frequency goes
up, the utilization (specifically the value of
map_util_perf(sg_cpu->util) at the point of the call to
cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
Is this expected?

thanks,
julia


2021-12-19 14:14:16

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Fri, Dec 17, 2021 at 8:32 PM Julia Lawall <[email protected]> wrote:
>
>
>
> On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
> > >
> > > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
> > > between 0 and the capacity, and then maps everything below min_pstate to
> > > the lowest frequency.
> >
> > Well, it is not just intel_pstate with HWP. This is how schedutil
> > works in general; see get_next_freq() in there.
> >
> > > On my Intel Xeon Gold 6130 and Intel Xeon Gold
> > > 5218, this means that more than the bottom quarter of utilizations are all
> > > mapped to the lowest frequency. Running slowly doesn't necessarily save
> > > energy, because it takes more time.
> >
> > This is true, but the layout of the available range of performance
> > values is a property of the processor, not a driver issue.
> >
> > Moreover, the role of the driver is not to decide how to respond to
> > the given utilization value, that is the role of the governor. The
> > driver is expected to do what it is asked for by the governor.
>
> OK, but what exactly is the goal of schedutil?

The short answer is: minimizing the cost (in terms of energy) of
allocating an adequate amount of CPU time for a given workload.

Of course, this requires a bit of explanation, so bear with me.

It starts with a question:

Given a steady workload (ie. a workload that uses approximately the
same amount of CPU time to run in every sampling interval), what is
the most efficient frequency (or generally, performance level measured
in some abstract units) to run it at and still ensure that it will get
as much CPU time as it needs (or wants)?

To answer this question, let's first assume that

(1) Performance is a monotonically increasing (ideally, linear)
function of frequency.
(2) CPU idle states do not have enough impact on the energy usage for
them to matter.

Both of these assumptions may not be realistic, but that's how it goes.

Now, consider the "raw" frequency-dependent utilization

util(f) = util_max * (t_{total} - t_{idle}(f)) / t_{total}

where

t_{total} is the total CPU time available in the given time frame.
t_{idle}(f) is the idle CPU time appearing in the workload when run at
frequency f in that time frame.
util_max is a convenience constant allowing an integer data type to be
used for representing util(f) with sufficient approximation.

Notice that by assumption (1), util(f) is a monotonically decreasing
function, so if util(f_{max}) = util_max (where f_{max} is the maximum
frequency available from the hardware), which means that there is no
idle CPU time in the workload when run at the max available frequency,
there will be no idle CPU time in it when run at any frequency below
f_{max}. Hence, in that case the workload needs to be run at f_{max}.

If util(f_{max}) < util_max, there is some idle CPU time in the
workload at f_{max} and it may be run at a lower frequency without
sacrificing performance. Moreover, the cost should be minimum when
running the workload at the maximum frequency f_e for which
t_{idle}(f_e) = 0. IOW, that is the point at which the workload still
gets as much CPU time as needed, but the cost of running it is
maximally reduced.

In practice, it is better to look for a frequency slightly greater
than f_e to allow some performance margin to be there in case the
workload fluctuates or similar, so we get

C * util(f) / util_max = 1

where the constant C is slightly greater than 1.

This equation cannot be solved directly, because the util(f) graph is
not known, but util(f) can be computed (at least approximately) for a
given f and the solution can be approximated by computing a series of
frequencies f_n given by

f_{n+1} = C * f_n * util(f_n) / util_max

under certain additional assumptions regarding the convergence etc.

This is almost what schedutil does, but it also uses the observation
that if the frequency-invariant utilization util_inv is known, then
approximately

util(f) = util_inv * f_{max} / f

so finally

f = C * f_{max} * util_inv / util_max

and util_inv is provided by PELT.

This has a few interesting properties that are vitally important:

(a) The current frequency need not be known in order to compute the
next one (and it is hard to determine in general).
(b) The response is predictable by the CPU scheduler upfront, so it
can make decisions based on it in advance.
(c) If util_inv is properly scaled to reflect differences between
different types of CPUs in a hybrid system, the same formula can be
used for each of them regardless of where the workload was running
previously.

and they need to be maintained.

> I would have expected that it was to give good performance while saving
> energy, but it's not doing either in many of these cases.

The performance improvement after making the change in question means
that something is missing. The assumptions mentioned above (and there
are quite a few of them) may not hold or the hardware may not behave
exactly as anticipated.

Generally, there are three directions worth investigating IMV:

1. The scale-invariance mechanism may cause util_inv to be
underestimated. It may be worth trying to use the max non-turbo
performance instead of the 4-core-turbo performance level in it; see
intel_set_max_freq_ratio() in smpboot.c.

2. The hardware's response to the "desired" HWP value may depend on
some additional factors (e.g. the EPP value) that may need to be
adjusted.

3. The workloads are not actually steady and running them at higher
frequencies causes the sections that really need more CPU time to
complete faster.

At the same time, CPU idle states may actually have measurable impact
on energy usage which is why you may not see much difference in that
respect.

> Is it the intent of schedutil that the bottom quarter of utilizations
> should be mapped to the lowest frequency?

It is not the intent, but a consequence of the scaling algorithm used
by schedutil.

As you can see from the derivation of that algorithm outlined above,
if the utilization is mapped to a performance level below min_perf,
running the workload at min_perf or above it is not expected (under
all of the assumptions made) to improve performance, so mapping
all of the "low" utilization values to min_perf should not hurt
performance, as the CPU time required by the workload will still be
provided (and with a surplus for that matter).

The reason why the hardware refuses to run below a certain minimum
performance level is because it knows that running below that level
doesn't really improve energy usage (or at least the improvement
whatever it may be is not worth the effort). The CPU would run
slower, but it would still use (almost) as much energy as it uses at
the "hardware minimum" level, so it may as well run at the min level.

2021-12-19 14:19:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Sun, Dec 19, 2021 at 7:42 AM Julia Lawall <[email protected]> wrote:
>
>
>
> On Sat, 18 Dec 2021, Francisco Jerez wrote:
>
> > Julia Lawall <[email protected]> writes:
> >
> > > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> > >
> > >> Julia Lawall <[email protected]> writes:
> > >>
> > >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
> > >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> > >> >> Ratio (R/O)". However that seems to deviate massively from the most
> > >> >> efficient ratio on your system, which may indicate a firmware bug, some
> > >> >> sort of clock gating problem, or an issue with the way that
> > >> >> intel_pstate.c processes this information.
> > >> >
> > >> > I'm not sure I understand the bug part. min_pstate gives the frequency
> > >> > that I find as the minimum frequency when I look for the specifications of
> > >> > the CPU. Should one expect that it should be something different?
> > >> >
> > >>
> > >> I'd expect the minimum frequency on your processor specification to
> > >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
> > >> since there's little reason to claim your processor can be clocked down
> > >> to a frequency which is inherently inefficient /and/ slower than the
> > >> maximum efficiency ratio -- In fact they both seem to match in your
> > >> system, they're just nowhere close to the frequency which is actually
> > >> most efficient, which smells like a bug, like your processor
> > >> misreporting what the most efficient frequency is, or it deviating from
> > >> the expected one due to your CPU static power consumption being greater
> > >> than it would be expected to be under ideal conditions -- E.g. due to
> > >> some sort of clock gating issue, possibly due to a software bug, or due
> > >> to our scheduling of such workloads with a large amount of lightly
> > >> loaded threads being unnecessarily inefficient which could also be
> > >> preventing most of your CPU cores from ever being clock-gated even
> > >> though your processor may be sitting idle for a large fraction of their
> > >> runtime.
> > >
> > > The original mail has results from two different machines: Intel 6130
> > > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
> > > of 6130s and 5218s. I can try them.
> > >
> > > I tried 5.9 in which I just commented out the schedutil code to make
> > > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
> > > pauses) and in both case the execution is almost entirely in the turbo
> > > frequencies.
> > >
> > > I'm not sure I understand the term "clock-gated". What C state does that
> > > correspond to? The turbostat output for one run of avrora is below.
> > >
> >
> > I didn't have any specific C1+ state in mind, most of the deeper ones
> > implement some sort of clock gating among other optimizations, I was
> > just wondering whether some sort of software bug and/or the highly
> > intermittent CPU utilization pattern of these workloads are preventing
> > most of your CPU cores from entering deep sleep states. See below.
> >
> > > julia
> > >
> > > 78.062895 sec
> > > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
> > > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
> > > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
> >
> > This seems suspicious: ^^^^ ^^^^^^^
> >
> > I hadn't understood that you're running this on a dual-socket system
> > until I looked at these results.
>
> Sorry not to have mentioned that.
>
> > It seems like package #0 is doing
> > pretty much nothing according to the stats below, but it's still
> > consuming nearly half of your energy, apparently because the idle
> > package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
> > 0%). That could explain your unexpectedly high static power consumption
> > and the deviation of the real maximum efficiency frequency from the one
> > reported by your processor, since the reported maximum efficiency ratio
> > cannot possibly take into account the existence of a second CPU package
> > with dysfunctional idle management.
>
> Our assumption was that if anything happens on any core, all of the
> packages remain in a state that allows them to react in a reasonable
> amount of time to any memory request.
>
> > I'm guessing that if you fully disable one of your CPU packages and
> > repeat the previous experiment forcing various P-states between 10 and
> > 37 you should get a maximum efficiency ratio closer to the theoretical
> > one for this CPU?
>
> OK, but that's not really a natural usage context... I do have a
> one-socket Intel 5220. I'll see what happens there.
>
> I did some experiments with forcing different frequencies. I haven't
> finished processing the results, but I notice that as the frequency goes
> up, the utilization (specifically the value of
> map_util_perf(sg_cpu->util) at the point of the call to
> cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> Is this expected?

It isn't, as long as the scale-invariance mechanism mentioned in my
previous message works properly.

2021-12-19 14:30:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Sun, Dec 19, 2021 at 3:19 PM Rafael J. Wysocki <[email protected]> wrote:
>
> On Sun, Dec 19, 2021 at 7:42 AM Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> >
> > > Julia Lawall <[email protected]> writes:
> > >
> > > > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> > > >
> > > >> Julia Lawall <[email protected]> writes:
> > > >>
> > > >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
> > > >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> > > >> >> Ratio (R/O)". However that seems to deviate massively from the most
> > > >> >> efficient ratio on your system, which may indicate a firmware bug, some
> > > >> >> sort of clock gating problem, or an issue with the way that
> > > >> >> intel_pstate.c processes this information.
> > > >> >
> > > >> > I'm not sure I understand the bug part. min_pstate gives the frequency
> > > >> > that I find as the minimum frequency when I look for the specifications of
> > > >> > the CPU. Should one expect that it should be something different?
> > > >> >
> > > >>
> > > >> I'd expect the minimum frequency on your processor specification to
> > > >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
> > > >> since there's little reason to claim your processor can be clocked down
> > > >> to a frequency which is inherently inefficient /and/ slower than the
> > > >> maximum efficiency ratio -- In fact they both seem to match in your
> > > >> system, they're just nowhere close to the frequency which is actually
> > > >> most efficient, which smells like a bug, like your processor
> > > >> misreporting what the most efficient frequency is, or it deviating from
> > > >> the expected one due to your CPU static power consumption being greater
> > > >> than it would be expected to be under ideal conditions -- E.g. due to
> > > >> some sort of clock gating issue, possibly due to a software bug, or due
> > > >> to our scheduling of such workloads with a large amount of lightly
> > > >> loaded threads being unnecessarily inefficient which could also be
> > > >> preventing most of your CPU cores from ever being clock-gated even
> > > >> though your processor may be sitting idle for a large fraction of their
> > > >> runtime.
> > > >
> > > > The original mail has results from two different machines: Intel 6130
> > > > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
> > > > of 6130s and 5218s. I can try them.
> > > >
> > > > I tried 5.9 in which I just commented out the schedutil code to make
> > > > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
> > > > pauses) and in both cases the execution is almost entirely in the turbo
> > > > frequencies.
> > > >
> > > > I'm not sure I understand the term "clock-gated". What C state does that
> > > > correspond to? The turbostat output for one run of avrora is below.
> > > >
> > >
> > > I didn't have any specific C1+ state in mind, most of the deeper ones
> > > implement some sort of clock gating among other optimizations, I was
> > > just wondering whether some sort of software bug and/or the highly
> > > intermittent CPU utilization pattern of these workloads are preventing
> > > most of your CPU cores from entering deep sleep states. See below.
> > >
> > > > julia
> > > >
> > > > 78.062895 sec
> > > > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
> > > > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
> > > > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
> > >
> > > This seems suspicious: ^^^^ ^^^^^^^
> > >
> > > I hadn't understood that you're running this on a dual-socket system
> > > until I looked at these results.
> >
> > Sorry not to have mentioned that.
> >
> > > It seems like package #0 is doing
> > > pretty much nothing according to the stats below, but it's still
> > > consuming nearly half of your energy, apparently because the idle
> > > package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
> > > 0%). That could explain your unexpectedly high static power consumption
> > > and the deviation of the real maximum efficiency frequency from the one
> > > reported by your processor, since the reported maximum efficiency ratio
> > > cannot possibly take into account the existence of a second CPU package
> > > with dysfunctional idle management.
> >
> > Our assumption was that if anything happens on any core, all of the
> > packages remain in a state that allows them to react in a reasonable
> > amount of time to any memory request.
> >
> > > I'm guessing that if you fully disable one of your CPU packages and
> > > repeat the previous experiment forcing various P-states between 10 and
> > > 37 you should get a maximum efficiency ratio closer to the theoretical
> > > one for this CPU?
> >
> > OK, but that's not really a natural usage context... I do have a
> > one-socket Intel 5220. I'll see what happens there.
> >
> > I did some experiments with forcing different frequencies. I haven't
> > finished processing the results, but I notice that as the frequency goes
> > up, the utilization (specifically the value of
> > map_util_perf(sg_cpu->util) at the point of the call to
> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> > Is this expected?
>
> It isn't, as long as the scale-invariance mechanism mentioned in my
> previous message works properly.

But even if it doesn't, the utilization should decrease when the
frequency increases.

Increasing frequency should cause more instructions to be retired per
unit of time and so there should be more idle time in the workload.

2021-12-19 17:03:29

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Sun, 19 Dec 2021, Rafael J. Wysocki wrote:

> On Fri, Dec 17, 2021 at 8:32 PM Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
> >
> > > On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
> > > >
> > > > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
> > > > between 0 and the capacity, and then maps everything below min_pstate to
> > > > the lowest frequency.
> > >
> > > Well, it is not just intel_pstate with HWP. This is how schedutil
> > > works in general; see get_next_freq() in there.
> > >
> > > > On my Intel Xeon Gold 6130 and Intel Xeon Gold
> > > > 5218, this means that more than the bottom quarter of utilizations are all
> > > > mapped to the lowest frequency. Running slowly doesn't necessarily save
> > > > energy, because it takes more time.
> > >
> > > This is true, but the layout of the available range of performance
> > > values is a property of the processor, not a driver issue.
> > >
> > > Moreover, the role of the driver is not to decide how to respond to
> > > the given utilization value, that is the role of the governor. The
> > > driver is expected to do what it is asked for by the governor.
> >
> > OK, but what exactly is the goal of schedutil?
>
> The short answer is: minimizing the cost (in terms of energy) of
> allocating an adequate amount of CPU time for a given workload.
>
> Of course, this requires a bit of explanation, so bear with me.
>
> It starts with a question:
>
> Given a steady workload (ie. a workload that uses approximately the
> same amount of CPU time to run in every sampling interval), what is
> the most efficient frequency (or generally, performance level measured
> in some abstract units) to run it at and still ensure that it will get
> as much CPU time as it needs (or wants)?
>
> To answer this question, let's first assume that
>
> (1) Performance is a monotonically increasing (ideally, linear)
> function of frequency.
> (2) CPU idle states do not have enough impact on the energy usage for
> them to matter.
>
> Both of these assumptions may not be realistic, but that's how it goes.
>
> Now, consider the "raw" frequency-dependent utilization
>
> util(f) = util_max * (t_{total} - t_{idle}(f)) / t_{total}
>
> where
>
> t_{total} is the total CPU time available in the given time frame.
> t_{idle}(f) is the idle CPU time appearing in the workload when run at
> frequency f in that time frame.
> util_max is a convenience constant allowing an integer data type to be
> used for representing util(f) with sufficient approximation.
>
> Notice that by assumption (1), util(f) is a monotonically decreasing
> function, so if util(f_{max}) = util_max (where f_{max} is the maximum
> frequency available from the hardware), which means that there is no
> idle CPU time in the workload when run at the max available frequency,
> there will be no idle CPU time in it when run at any frequency below
> f_{max}. Hence, in that case the workload needs to be run at f_{max}.
>
> If util(f_{max}) < util_max, there is some idle CPU time in the
> workload at f_{max} and it may be run at a lower frequency without
> sacrificing performance. Moreover, the cost should be minimum when
> running the workload at the maximum frequency f_e for which
> t_{idle}(f_e) = 0. IOW, that is the point at which the workload still
> gets as much CPU time as needed, but the cost of running it is
> maximally reduced.

Thanks for the detailed explanation. I got lost at this point, though.

Idle time can be either due to I/O, or due to waiting for synchronization
from some other thread perhaps on another core. How can either of these
disappear? In the I/O case, no matter what the frequency, the idle time
will be the same (in a simplified world). In the case of waiting for
another thread on another core, assuming that all the cores are running at
the same frequency, lowering the frequency will cause both the running
time and the idle time to increase, and it will cause util(f) to stay the
same. Both cases seem to send the application directly to the lowest
frequency. I guess that's fine if we also assume that the lowest
frequency is also the most efficient one.

julia

>
> In practice, it is better to look for a frequency slightly greater
> than f_e to allow some performance margin to be there in case the
> workload fluctuates or similar, so we get
>
> C * util(f) / util_max = 1
>
> where the constant C is slightly greater than 1.
>
> This equation cannot be solved directly, because the util(f) graph is
> not known, but util(f) can be computed (at least approximately) for a
> given f and the solution can be approximated by computing a series of
> frequencies f_n given by
>
> f_{n+1} = C * f_n * util(f_n) / util_max
>
> under certain additional assumptions regarding the convergence etc.
>
> This is almost what schedutil does, but it also uses the observation
> that if the frequency-invariant utilization util_inv is known, then
> approximately
>
> util(f) = util_inv * f_{max} / f
>
> so finally
>
> f = C * f_{max} * util_inv / util_max
>
> and util_inv is provided by PELT.
>
> This has a few interesting properties that are vitally important:
>
> (a) The current frequency need not be known in order to compute the
> next one (and it is hard to determine in general).
> (b) The response is predictable by the CPU scheduler upfront, so it
> can make decisions based on it in advance.
> (c) If util_inv is properly scaled to reflect differences between
> different types of CPUs in a hybrid system, the same formula can be
> used for each of them regardless of where the workload was running
> previously.
>
> and they need to be maintained.
>
> > I would have expected that it was to give good performance while saving
> > energy, but it's not doing either in many of these cases.
>
> The performance improvement after making the change in question means
> that something is missing. The assumptions mentioned above (and there
> are quite a few of them) may not hold or the hardware may not behave
> exactly as anticipated.
>
> Generally, there are three directions worth investigating IMV:
>
> 1. The scale-invariance mechanism may cause util_inv to be
> underestimated. It may be worth trying to use the max non-turbo
> performance instead of the 4-core-turbo performance level in it; see
> intel_set_max_freq_ratio() in smpboot.c.
>
> 2. The hardware's response to the "desired" HWP value may depend on
> some additional factors (e.g. the EPP value) that may need to be
> adjusted.
>
> 3. The workloads are not actually steady and running them at higher
> frequencies causes the sections that really need more CPU time to
> complete faster.
>
> At the same time, CPU idle states may actually have measurable impact
> on energy usage which is why you may not see much difference in that
> respect.
>
> > Is it the intent of schedutil that the bottom quarter of utilizations
> > should be mapped to the lowest frequency?
>
> It is not the intent, but a consequence of the scaling algorithm used
> by schedutil.
>
> As you can see from the derivation of that algorithm outlined above,
> if the utilization is mapped to a performance level below min_perf,
> running the workload at min_perf or above it is not expected (under
> all of the assumptions made) to improve performance, so mapping
> all of the "low" utilization values to min_perf should not hurt
> performance, as the CPU time required by the workload will still be
> provided (and with a surplus for that matter).
>
> The reason why the hardware refuses to run below a certain minimum
> performance level is because it knows that running below that level
> doesn't really improve energy usage (or at least the improvement
> whatever it may be is not worth the effort). The CPU would run
> slower, but it would still use (almost) as much energy as it uses at
> the "hardware minimum" level, so it may as well run at the min level.
>

2021-12-19 21:47:49

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

> > I did some experiments with forcing different frequencies. I haven't
> > finished processing the results, but I notice that as the frequency goes
> > up, the utilization (specifically the value of
> > map_util_perf(sg_cpu->util) at the point of the call to
> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> > Is this expected?
>
> It isn't, as long as the scale-invariance mechanism mentioned in my
> previous message works properly.

The results are attached. I have now tried four machines:

2 x Intel Xeon Gold 6130, 16 cores/CPU
2 x Intel Xeon Gold 5218R, 20 cores/CPU
Intel Xeon Gold 5220, 18 cores/CPU
2 x Intel Xeon Gold 6130, 16 cores/CPU

In the graphs, for a particular execution, I take all of the reported
utilizations at the call to cpufreq_driver_adjust_perf associated with a
given application, and then take the average. Only the five applications
with the most data points are included. Typically, most of the execution
time is associated with the first one (blue). For each application, the
legend gives the range from the least to the greatest utilization observed
when fixing the pstate as indicated on the x axis.

julia


Attachments:
aa.pdf (174.04 kB)

2021-12-19 22:10:52

by Francisco Jerez

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Sat, 18 Dec 2021, Francisco Jerez wrote:
>
>> Julia Lawall <[email protected]> writes:
>>
>> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
>> >
>> >> Julia Lawall <[email protected]> writes:
>> >>
>> >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
>> >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
>> >> >> Ratio (R/O)". However that seems to deviate massively from the most
>> >> >> efficient ratio on your system, which may indicate a firmware bug, some
>> >> >> sort of clock gating problem, or an issue with the way that
>> >> >> intel_pstate.c processes this information.
>> >> >
>> >> > I'm not sure I understand the bug part. min_pstate gives the frequency
>> >> > that I find as the minimum frequency when I look for the specifications of
>> >> > the CPU. Should one expect that it should be something different?
>> >> >
>> >>
>> >> I'd expect the minimum frequency on your processor specification to
>> >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
>> >> since there's little reason to claim your processor can be clocked down
>> >> to a frequency which is inherently inefficient /and/ slower than the
>> >> maximum efficiency ratio -- In fact they both seem to match in your
>> >> system, they're just nowhere close to the frequency which is actually
>> >> most efficient, which smells like a bug, like your processor
>> >> misreporting what the most efficient frequency is, or it deviating from
>> >> the expected one due to your CPU static power consumption being greater
>> >> than it would be expected to be under ideal conditions -- E.g. due to
>> >> some sort of clock gating issue, possibly due to a software bug, or due
>> >> to our scheduling of such workloads with a large amount of lightly
>> >> loaded threads being unnecessarily inefficient which could also be
>> >> preventing most of your CPU cores from ever being clock-gated even
>> >> though your processor may be sitting idle for a large fraction of their
>> >> runtime.
>> >
>> > The original mail has results from two different machines: Intel 6130
>> > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
>> > of 6130s and 5218s. I can try them.
>> >
>> > I tried 5.9 in which I just commented out the schedutil code to make
>> > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
>> > pauses) and in both cases the execution is almost entirely in the turbo
>> > frequencies.
>> >
>> > I'm not sure I understand the term "clock-gated". What C state does that
>> > correspond to? The turbostat output for one run of avrora is below.
>> >
>>
>> I didn't have any specific C1+ state in mind, most of the deeper ones
>> implement some sort of clock gating among other optimizations, I was
>> just wondering whether some sort of software bug and/or the highly
>> intermittent CPU utilization pattern of these workloads are preventing
>> most of your CPU cores from entering deep sleep states. See below.
>>
>> > julia
>> >
>> > 78.062895 sec
>> > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
>> > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
>> > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
>>
>> This seems suspicious: ^^^^ ^^^^^^^
>>
>> I hadn't understood that you're running this on a dual-socket system
>> until I looked at these results.
>
> Sorry not to have mentioned that.
>
>> It seems like package #0 is doing
>> pretty much nothing according to the stats below, but it's still
>> consuming nearly half of your energy, apparently because the idle
>> package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
>> 0%). That could explain your unexpectedly high static power consumption
>> and the deviation of the real maximum efficiency frequency from the one
>> reported by your processor, since the reported maximum efficiency ratio
>> cannot possibly take into account the existence of a second CPU package
>> with dysfunctional idle management.
>
> Our assumption was that if anything happens on any core, all of the
> packages remain in a state that allows them to react in a reasonable
> amount of time to any memory request.

I can see how that might be helpful for workloads that need to be able
to unleash the whole processing power of your multi-socket system with
minimal latency, but the majority of multi-socket systems out there with
completely idle CPU packages are unlikely to notice any performance
difference as long as their idle CPU packages are idle, so the
environmentalist in me tells me that this is a bad idea. ;)

>
>> I'm guessing that if you fully disable one of your CPU packages and
>> repeat the previous experiment forcing various P-states between 10 and
>> 37 you should get a maximum efficiency ratio closer to the theoretical
>> one for this CPU?
>
> OK, but that's not really a natural usage context... I do have a
> one-socket Intel 5220. I'll see what happens there.
>

Fair, I didn't intend to suggest you take it offline manually every time
you don't plan to use it; my suggestion was just intended as an
experiment to help us confirm or disprove the theory that the reason your
reported maximum efficiency ratio deviates from reality is the presence
of that second CPU package with broken idle management. If that's the
case, the P-state vs. energy usage plot should show a minimum closer to
the ideal maximum efficiency ratio after disabling the second CPU
package.

> I did some experiments with forcing different frequencies. I haven't
> finished processing the results, but I notice that as the frequency goes
> up, the utilization (specifically the value of
> map_util_perf(sg_cpu->util) at the point of the call to
> cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> Is this expected?
>

Actually, it *is* expected based on our previous hypothesis that these
workloads are largely latency-bound: In cases where a given burst of CPU
work is not parallelizable with any other tasks the thread needs to
complete subsequently, its overall runtime will decrease monotonically
with increasing frequency, therefore the number of instructions executed
per unit of time will increase monotonically with increasing frequency,
and with it its frequency-invariant utilization.
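
A toy model of this effect (purely illustrative, with made-up numbers): a
single thread alternates a serial burst of W CPU cycles with a blocking
wait of fixed duration L that does not shrink as frequency rises:

```python
# Frequency-invariant utilization of a serial burst plus a fixed wait.
# One period = W/f seconds of CPU work followed by L seconds of
# blocking (latency that does not shrink with frequency).
# Busy fraction: (W/f) / (W/f + L).
# Frequency-invariant utilization ~ busy fraction * f / f_max, which
# simplifies to (W/f_max) / (W/f + L): it *rises* with f, matching the
# behavior described above. All numbers here are made up.

def util_inv(f, f_max, W=1e9, L=0.5):
    period = W / f + L
    busy = (W / f) / period
    return busy * f / f_max

f_max = 3.7e9
for f in (1.0e9, 2.1e9, 3.7e9):
    print(f / 1e9, round(util_inv(f, f_max), 3))
```

Under this (assumed) model the utilization grows from roughly 0.18 at 1.0
GHz to roughly 0.35 at 3.7 GHz, even though the amount of work per period
is constant.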

> thanks,
> julia
>
>> > 0 0 32 1 0.09 1001 2095 37 0 0 0 0 42 0.00 0.00 0.00 100.00 9.08
>> > 0 1 4 0 0.04 1000 2095 57 0 0 0 1 133 0.00 0.00 0.00 99.96 0.08 99.88 38
>> > 0 1 36 0 0.00 1000 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.12
>> > 0 2 8 0 0.03 1000 2095 64 0 0 0 1 124 0.00 0.00 0.00 99.97 0.08 99.89 38
>> > 0 2 40 0 0.00 1000 2095 36 0 0 0 0 40 0.00 0.00 0.00 100.00 0.10
>> > 0 3 12 0 0.00 1000 2095 42 0 0 0 0 71 0.00 0.00 0.00 100.00 0.14 99.86 38
>> > 0 3 44 1 0.09 1000 2095 63 0 0 0 0 65 0.00 0.00 0.00 99.91 0.05
>> > 0 4 14 0 0.00 1010 2095 38 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04 99.96 39
>> > 0 4 46 0 0.00 1011 2095 36 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04
>> > 0 5 10 0 0.01 1084 2095 39 0 0 0 0 58 0.00 0.00 0.00 99.99 0.04 99.95 38
>> > 0 5 42 0 0.00 1114 2095 35 0 0 0 0 39 0.00 0.00 0.00 100.00 0.05
>> > 0 6 6 0 0.03 1005 2095 89 0 0 0 1 116 0.00 0.00 0.00 99.97 0.07 99.90 39
>> > 0 6 38 0 0.00 1000 2095 38 0 0 0 0 41 0.00 0.00 0.00 100.00 0.10
>> > 0 7 2 0 0.05 1001 2095 59 0 0 0 1 133 0.00 0.00 0.00 99.95 0.09 99.86 40
>> > 0 7 34 0 0.00 1000 2095 39 0 0 0 0 65 0.00 0.00 0.00 100.00 0.13
>> > 0 8 16 0 0.00 1000 2095 43 0 0 0 0 47 0.00 0.00 0.00 100.00 0.04 99.96 38
>> > 0 8 48 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.04
>> > 0 9 20 0 0.00 1000 2095 33 0 0 0 0 37 0.00 0.00 0.00 100.00 0.03 99.97 38
>> > 0 9 52 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
>> > 0 10 24 0 0.00 1000 2095 36 0 0 0 1 40 0.00 0.00 0.00 100.00 0.03 99.96 39
>> > 0 10 56 0 0.00 1000 2095 37 0 0 0 1 38 0.00 0.00 0.00 100.00 0.03
>> > 0 11 28 0 0.00 1002 2095 35 0 0 0 1 37 0.00 0.00 0.00 100.00 0.03 99.97 38
>> > 0 11 60 0 0.00 1004 2095 34 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
>> > 0 12 30 0 0.00 1001 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.11 99.88 38
>> > 0 12 62 0 0.01 1000 2095 197 0 0 0 0 197 0.00 0.00 0.00 99.99 0.10
>> > 0 13 26 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.03 99.97 39
>> > 0 13 58 0 0.00 1000 2095 38 0 0 0 0 40 0.00 0.00 0.00 100.00 0.03
>> > 0 14 22 0 0.01 1000 2095 149 0 1 2 0 142 0.00 0.01 0.00 99.99 0.07 99.92 39
>> > 0 14 54 0 0.00 1000 2095 35 0 0 0 0 38 0.00 0.00 0.00 100.00 0.07
>> > 0 15 18 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03 99.97 39
>> > 0 15 50 0 0.00 1000 2095 34 0 0 0 0 38 0.00 0.00 0.00 100.00 0.03
>> > 1 0 1 32 3.23 1008 2095 2385 0 31 3190 45025 10144 0.00 0.28 4.68 91.99 11.21 85.56 32 35 0.04 0.04 2481.49 1162.96 0.00 0.00
>> > 1 0 33 9 0.63 1404 2095 12206 0 5 162 2480 10283 0.00 0.04 0.75 98.64 13.81
>> > 1 1 5 1 0.07 1384 2095 236 0 0 38 24 314 0.00 0.09 0.06 99.77 4.66 95.27 33
>> > 1 1 37 81 3.93 2060 2095 1254 0 5 40 59 683 0.00 0.01 0.02 96.05 0.80
>> > 1 2 9 37 3.46 1067 2095 2396 0 29 2256 55406 11731 0.00 0.17 6.02 90.54 54.10 42.44 31
>> > 1 2 41 151 14.51 1042 2095 10447 0 135 10494 248077 42327 0.01 0.87 26.57 58.84 43.05
>> > 1 3 13 110 10.47 1053 2095 7120 0 120 9218 168938 33884 0.01 0.77 16.63 72.68 42.58 46.95 32
>> > 1 3 45 69 6.76 1021 2095 4730 0 66 5598 115410 23447 0.00 0.44 12.06 81.12 46.29
>> > 1 4 15 112 10.64 1056 2095 7204 0 116 8831 171423 37754 0.01 0.70 17.56 71.67 28.01 61.35 33
>> > 1 4 47 18 1.80 1006 2095 1771 0 13 915 29315 6564 0.00 0.07 3.20 95.03 36.85
>> > 1 5 11 63 5.96 1065 2095 4090 0 58 6449 99015 18955 0.00 0.45 10.27 83.64 31.24 62.80 31
>> > 1 5 43 72 7.11 1016 2095 4794 0 73 6203 115361 26494 0.00 0.48 11.79 81.02 30.09
>> > 1 6 7 35 3.39 1022 2095 2328 0 45 3377 52721 13759 0.00 0.27 5.10 91.43 25.84 70.77 32
>> > 1 6 39 67 6.09 1096 2095 4483 0 52 3696 94964 19366 0.00 0.30 10.32 83.61 23.14
>> > 1 7 3 1 0.06 1395 2095 91 0 0 0 1 167 0.00 0.00 0.00 99.95 25.36 74.58 35
>> > 1 7 35 83 8.16 1024 2095 5785 0 100 7398 134640 27428 0.00 0.56 13.39 78.34 17.26
>> > 1 8 17 46 4.49 1016 2095 3229 0 52 3048 74914 16010 0.00 0.27 8.29 87.19 29.71 65.80 33
>> > 1 8 49 64 6.12 1052 2095 4210 0 89 5782 100570 21463 0.00 0.42 10.63 83.17 28.08
>> > 1 9 21 73 7.02 1036 2095 4917 0 64 5786 109887 21939 0.00 0.55 11.61 81.18 22.10 70.88 33
>> > 1 9 53 64 6.33 1012 2095 4074 0 69 5957 97596 20580 0.00 0.51 9.78 83.74 22.79
>> > 1 10 25 26 2.58 1013 2095 1825 0 22 2124 42630 8627 0.00 0.17 4.17 93.24 53.91 43.52 33
>> > 1 10 57 159 15.59 1022 2095 10951 0 175 14237 256828 56810 0.01 1.10 26.00 58.16 40.89
>> > 1 11 29 112 10.54 1065 2095 7462 0 126 9548 179206 39821 0.01 0.85 18.49 70.71 29.46 60.00 31
>> > 1 11 61 29 2.89 1011 2095 2002 0 24 2468 45558 10288 0.00 0.20 4.71 92.36 37.11
>> > 1 12 31 37 3.66 1011 2095 2596 0 79 3161 61027 13292 0.00 0.24 6.48 89.79 23.75 72.59 32
>> > 1 12 63 56 5.08 1107 2095 3789 0 62 4777 79133 17089 0.00 0.41 7.91 86.86 22.31
>> > 1 13 27 12 1.14 1045 2095 1477 0 16 888 18744 3250 0.00 0.06 2.18 96.70 21.23 77.64 32
>> > 1 13 59 60 5.81 1038 2095 5230 0 60 4936 87225 21402 0.00 0.41 8.95 85.14 16.55
>> > 1 14 23 28 2.75 1024 2095 2008 0 20 1839 47417 9177 0.00 0.13 5.08 92.21 34.18 63.07 32
>> > 1 14 55 106 9.58 1105 2095 6292 0 89 7182 141379 31354 0.00 0.63 14.45 75.81 27.36
>> > 1 15 19 118 11.65 1012 2095 7872 0 121 10014 193186 40448 0.01 0.80 19.53 68.68 37.53 50.82 32
>> > 1 15 51 59 5.58 1059 2095 3967 0 54 5842 88063 21138 0.00 0.39 9.12 85.23 43.60
>>

2021-12-19 22:30:54

by Francisco Jerez

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Sun, 19 Dec 2021, Rafael J. Wysocki wrote:
>
>> On Fri, Dec 17, 2021 at 8:32 PM Julia Lawall <[email protected]> wrote:
>> >
>> >
>> >
>> > On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
>> >
>> > > On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
>> > > >
>> > > > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
>> > > > between 0 and the capacity, and then maps everything below min_pstate to
>> > > > the lowest frequency.
>> > >
>> > > Well, it is not just intel_pstate with HWP. This is how schedutil
>> > > works in general; see get_next_freq() in there.
>> > >
>> > > > On my Intel Xeon Gold 6130 and Intel Xeon Gold
>> > > > 5218, this means that more than the bottom quarter of utilizations are all
>> > > > mapped to the lowest frequency. Running slowly doesn't necessarily save
>> > > > energy, because it takes more time.
>> > >
>> > > This is true, but the layout of the available range of performance
>> > > values is a property of the processor, not a driver issue.
>> > >
>> > > Moreover, the role of the driver is not to decide how to respond to
>> > > the given utilization value, that is the role of the governor. The
>> > > driver is expected to do what it is asked for by the governor.
>> >
>> > OK, but what exactly is the goal of schedutil?
>>
>> The short answer is: minimizing the cost (in terms of energy) of
>> allocating an adequate amount of CPU time for a given workload.
>>
>> Of course, this requires a bit of explanation, so bear with me.
>>
>> It starts with a question:
>>
>> Given a steady workload (ie. a workload that uses approximately the
>> same amount of CPU time to run in every sampling interval), what is
>> the most efficient frequency (or generally, performance level measured
>> in some abstract units) to run it at and still ensure that it will get
>> as much CPU time as it needs (or wants)?
>>
>> To answer this question, let's first assume that
>>
>> (1) Performance is a monotonically increasing (ideally, linear)
>> function of frequency.
>> (2) CPU idle states do not have enough impact on the energy usage
>> for them to matter.
>>
>> Both of these assumptions may not be realistic, but that's how it goes.
>>
>> Now, consider the "raw" frequency-dependent utilization
>>
>> util(f) = util_max * (t_{total} - t_{idle}(f)) / t_{total}
>>
>> where
>>
>> t_{total} is the total CPU time available in the given time frame.
>> t_{idle}(f) is the idle CPU time appearing in the workload when run at
>> frequency f in that time frame.
>> util_max is a convenience constant allowing an integer data type to be
>> used for representing util(f) with sufficient approximation.
>>
>> Notice that by assumption (1), util(f) is a monotonically decreasing
>> function, so if util(f_{max}) = util_max (where f_{max} is the maximum
>> frequency available from the hardware), which means that there is no
>> idle CPU time in the workload when run at the max available frequency,
>> there will be no idle CPU time in it when run at any frequency below
>> f_{max}. Hence, in that case the workload needs to be run at f_{max}.
>>
>> If util(f_{max}) < util_max, there is some idle CPU time in the
>> workload at f_{max} and it may be run at a lower frequency without
>> sacrificing performance. Moreover, the cost should be minimum when
>> running the workload at the maximum frequency f_e for which
>> t_{idle}(f_e) = 0. IOW, that is the point at which the workload still
>> gets as much CPU time as needed, but the cost of running it is
>> maximally reduced.
>
> Thanks for the detailed explanation. I got lost at this point, though.
>
> Idle time can be either due to I/O, or due to waiting for synchronization
> from some other thread perhaps on another core. How can either of these
> disappear? In the I/O case, no matter what the frequency, the idle time
> will be the same (in a simplified world). In the case of waiting for
> another thread on another core, assuming that all the cores are running at
> the same frequency, lowering the frequency will cause both the running
> time and the idle time to increase, and it will cause util(f) to stay the
> same. Both cases seem to send the application directly to the lowest
> frequency. I guess that's fine if we also assume that the lowest
> frequency is also the most efficient one.
>

Yeah, that's exactly what I was referring to with the optimality
assumptions of the schedutil heuristic breaking down in latency-bound
cases: they rely on the assumption that whatever task is keeping the CPU
thread idle for a fraction of the time can be parallelized with the CPU
work, so reducing the frequency will cause the execution time of that
CPU work to approach the runtime of those other parallel tasks without
affecting the overall runtime of the workload. That is obviously not
the case whenever a burst of CPU work is not parallelizable with
anything else the thread will subsequently block on; there, the factor
limiting your performance is the latency of that burst of work rather
than the total computational capacity of the CPU (which is largely
underutilized in your case).

> julia
>
>>
>> In practice, it is better to look for a frequency slightly greater
>> than f_e to allow some performance margin to be there in case the
>> workload fluctuates or similar, so we get
>>
>> C * util(f) / util_max = 1
>>
>> where the constant C is slightly greater than 1.
>>
>> This equation cannot be solved directly, because the util(f) graph is
>> not known, but util(f) can be computed (at least approximately) for a
>> given f and the solution can be approximated by computing a series of
>> frequencies f_n given by
>>
>> f_{n+1} = C * f_n * util(f_n) / util_max
>>
>> under certain additional assumptions regarding the convergence etc.
>>
>> This is almost what schedutil does, but it also uses the observation
>> that if the frequency-invariant utilization util_inv is known, then
>> approximately
>>
>> util(f) = util_inv * f_{max} / f
>>
>> so finally
>>
>> f = C * f_{max} * util_inv / util_max
>>
>> and util_inv is provided by PELT.
>>
>> This has a few interesting properties that are vitally important:
>>
>> (a) The current frequency need not be known in order to compute the
>> next one (and it is hard to determine in general).
>> (b) The response is predictable by the CPU scheduler upfront, so it
>> can make decisions based on it in advance.
>> (c) If util_inv is properly scaled to reflect differences between
>> different types of CPUs in a hybrid system, the same formula can be
>> used for each of them regardless of where the workload was running
>> previously.
>>
>> and they need to be maintained.
>>
>> > I would have expected that it was to give good performance while saving
>> > energy, but it's not doing either in many of these cases.
>>
>> The performance improvement after making the change in question means
>> that something is missing. The assumptions mentioned above (and there
>> are quite a few of them) may not hold or the hardware may not behave
>> exactly as anticipated.
>>
>> Generally, there are three directions worth investigating IMV:
>>
>> 1. The scale-invariance mechanism may cause util_inv to be
>> underestimated. It may be worth trying to use the max non-turbo
>> performance instead of the 4-core-turbo performance level in it; see
>> intel_set_max_freq_ratio() in smpboot.c.
>>
>> 2. The hardware's response to the "desired" HWP value may depend on
>> some additional factors (eg. the EPP value) that may need to be
>> adjusted.
>>
>> 3. The workloads are not actually steady and running them at higher
>> frequencies causes the sections that really need more CPU time to
>> complete faster.
>>
>> At the same time, CPU idle states may actually have measurable impact
>> on energy usage which is why you may not see much difference in that
>> respect.
>>
>> > Is it the intent of schedutil that the bottom quarter of utilizations
>> > should be mapped to the lowest frequency?
>>
>> It is not the intent, but a consequence of the scaling algorithm used
>> by schedutil.
>>
>> As you can see from the derivation of that algorithm outlined above,
>> if the utilization is mapped to a performance level below min_perf,
>> running the workload at min_perf or above it is not expected (under
>> all of the assumptions made) to improve performance, so mapping
>> all of the "low" utilization values to min_perf should not hurt
>> performance, as the CPU time required by the workload will still be
>> provided (and with a surplus for that matter).
>>
>> The reason why the hardware refuses to run below a certain minimum
>> performance level is because it knows that running below that level
>> doesn't really improve energy usage (or at least the improvement
>> whatever it may be is not worth the effort). The CPU would run
>> slower, but it would still use (almost) as much energy as it uses at
>> the "hardware minimum" level, so it may as well run at the min level.
>>
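
A minimal sketch of the scaling and clamping described in the quoted
derivation (illustrative only; C = 1.25 matches the ~25% margin schedutil
applies via map_util_perf(), and the 1.0/3.7 GHz limits are assumed
values roughly matching a Xeon Gold 6130, not taken from the thread):

```python
# Toy model of schedutil's frequency selection, as derived above:
#   f = C * f_max * util_inv / util_max
# C = 1.25 mirrors the kernel's 25% headroom (util + util/4); this is
# an illustrative sketch, not the kernel code.

def next_freq(util_inv, util_max, f_max, f_min, C=1.25):
    f = C * f_max * util_inv / util_max
    # Clamp into the supported range; every utilization mapping below
    # f_min is flattened to f_min, which is the behavior the patch at
    # the top of this thread changes.
    return max(f_min, min(f_max, f))

# Assumed limits: f_min = 1.0 GHz, f_max (turbo) = 3.7 GHz.
for util in (0.10, 0.20, 0.25, 0.50, 1.00):
    print(util, round(next_freq(util, 1.0, 3.7, 1.0), 2))
```

With these assumed limits, every utilization below about 0.22 is mapped
to the minimum frequency, which illustrates the "bottom quarter"
flattening discussed earlier in the thread.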

2021-12-19 22:41:13

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Sun, 19 Dec 2021, Francisco Jerez wrote:

> Julia Lawall <[email protected]> writes:
>
> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> >
> >> Julia Lawall <[email protected]> writes:
> >>
> >> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> >> >
> >> >> Julia Lawall <[email protected]> writes:
> >> >>
> >> >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
> >> >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
> >> >> >> Ratio (R/O)". However that seems to deviate massively from the most
> >> >> >> efficient ratio on your system, which may indicate a firmware bug, some
> >> >> >> sort of clock gating problem, or an issue with the way that
> >> >> >> intel_pstate.c processes this information.
> >> >> >
> >> >> > I'm not sure I understand the bug part. min_pstate gives the frequency
> >> >> > that I find as the minimum frequency when I look for the specifications of
> >> >> > the CPU. Should one expect that it should be something different?
> >> >> >
> >> >>
> >> >> I'd expect the minimum frequency on your processor specification to
> >> >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
> >> >> since there's little reason to claim your processor can be clocked down
> >> >> to a frequency which is inherently inefficient /and/ slower than the
> >> >> maximum efficiency ratio -- In fact they both seem to match in your
> >> >> system, they're just nowhere close to the frequency which is actually
> >> >> most efficient, which smells like a bug, like your processor
> >> >> misreporting what the most efficient frequency is, or it deviating from
> >> >> the expected one due to your CPU static power consumption being greater
> >> >> than it would be expected to be under ideal conditions -- E.g. due to
> >> >> some sort of clock gating issue, possibly due to a software bug, or due
> >> >> to our scheduling of such workloads with a large amount of lightly
> >> >> loaded threads being unnecessarily inefficient which could also be
> >> >> preventing most of your CPU cores from ever being clock-gated even
> >> >> though your processor may be sitting idle for a large fraction of their
> >> >> runtime.
> >> >
> >> > The original mail has results from two different machines: Intel 6130
> >> > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
> >> > of 6130s and 5218s. I can try them.
> >> >
> >> > I tried 5.9 in which I just commented out the schedutil code to make
> >> > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
> >> > pauses) and in both cases the execution is almost entirely in the turbo
> >> > frequencies.
> >> >
> >> > I'm not sure I understand the term "clock-gated". What C state does that
> >> > correspond to? The turbostat output for one run of avrora is below.
> >> >
> >>
> >> I didn't have any specific C1+ state in mind, most of the deeper ones
> >> implement some sort of clock gating among other optimizations, I was
> >> just wondering whether some sort of software bug and/or the highly
> >> intermittent CPU utilization pattern of these workloads are preventing
> >> most of your CPU cores from entering deep sleep states. See below.
> >>
> >> > julia
> >> >
> >> > 78.062895 sec
> >> > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
> >> > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
> >> > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
> >>
> >> This seems suspicious: ^^^^ ^^^^^^^
> >>
> >> I hadn't understood that you're running this on a dual-socket system
> >> until I looked at these results.
> >
> > Sorry not to have mentioned that.
> >
> >> It seems like package #0 is doing
> >> pretty much nothing according to the stats below, but it's still
> >> consuming nearly half of your energy, apparently because the idle
> >> package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
> >> 0%). That could explain your unexpectedly high static power consumption
> >> and the deviation of the real maximum efficiency frequency from the one
> >> reported by your processor, since the reported maximum efficiency ratio
> >> cannot possibly take into account the existence of a second CPU package
> >> with dysfunctional idle management.
> >
> > Our assumption was that if anything happens on any core, all of the
> > packages remain in a state that allows them to react in a reasonable
> > amount of time to any memory request.
>
> I can see how that might be helpful for workloads that need to be able
> to unleash the whole processing power of your multi-socket system with
> minimal latency, but the majority of multi-socket systems out there with
> completely idle CPU packages are unlikely to notice any performance
> difference as long as their idle CPU packages are idle, so the
> environmentalist in me tells me that this is a bad idea. ;)

Certainly it sounds like a bad idea from the point of view of anyone who
wants to save energy, but it's how the machine seems to work (at least in
its current configuration, which is not entirely under my control).

Note also that of the benchmarks, only avrora has the property of often
using only one of the sockets. The others let their threads drift around
more.

>
> >
> >> I'm guessing that if you fully disable one of your CPU packages and
> >> repeat the previous experiment forcing various P-states between 10 and
> >> 37 you should get a maximum efficiency ratio closer to the theoretical
> >> one for this CPU?
> >
> > OK, but that's not really a natural usage context... I do have a
> > one-socket Intel 5220. I'll see what happens there.
> >
>
> Fair, I didn't intend to suggest you take it offline manually every time
> you don't plan to use it; my suggestion was just intended as an
> experiment to help us confirm or disprove the theory that the reason your
> reported maximum efficiency ratio deviates from reality is the presence
> of that second CPU package with broken idle management. If that's the
> case, the P-state vs. energy usage plot should show a minimum closer to
> the ideal maximum efficiency ratio after disabling the second CPU
> package.

More numbers are attached. Pages 1-3 have two-socket machines. Page 4
has a one-socket machine. The values for P-state 20 are highlighted.
For avrora (the one-socket application) on page 2, 20 is not the pstate
with the lowest CPU energy consumption. 35 and 37 do better. Also for
xalan on page 4 (one-socket machine) 15 does slightly better than 20.
Otherwise, 20 always seems to be the best.

> > I did some experiments with forcing different frequencies. I haven't
> > finished processing the results, but I notice that as the frequency goes
> > up, the utilization (specifically the value of
> > map_util_perf(sg_cpu->util) at the point of the call to
> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> > Is this expected?
> >
>
> Actually, it *is* expected based on our previous hypothesis that these
> workloads are largely latency-bound: In cases where a given burst of CPU
> work is not parallelizable with any other tasks the thread needs to
> complete subsequently, its overall runtime will decrease monotonically
> with increasing frequency, therefore the number of instructions executed
> per unit of time will increase monotonically with increasing frequency,
> and with it its frequency-invariant utilization.

I'm not sure. If you have two tasks, each alternately waiting for the
other, then if the frequency doubles, they will each run faster and wait
less; but as long as one computes the utilization over a small interval,
i.e. before the application ends, the utilization will always be 50%. The
applications, however, are probably not as simple as this.
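
A minimal sketch of this two-task ping-pong (illustrative numbers only):

```python
# Two tasks ping-pong: each runs a burst of W cycles, then blocks
# until the other finishes its own W-cycle burst. Per CPU, each
# period is W/f busy plus W/f idle, so the busy fraction is 1/2 at
# any frequency -- the raw utilization is insensitive to f here.

def busy_fraction(f, W=1e9):
    run = W / f   # time spent executing the burst
    wait = W / f  # time spent waiting for the peer's burst
    return run / (run + wait)

for f in (1.0e9, 2.1e9, 3.7e9):
    print(f / 1e9, busy_fraction(f))
```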

julia

> > thanks,
> > julia
> >
> >> > 0 0 32 1 0.09 1001 2095 37 0 0 0 0 42 0.00 0.00 0.00 100.00 9.08
> >> > 0 1 4 0 0.04 1000 2095 57 0 0 0 1 133 0.00 0.00 0.00 99.96 0.08 99.88 38
> >> > 0 1 36 0 0.00 1000 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.12
> >> > 0 2 8 0 0.03 1000 2095 64 0 0 0 1 124 0.00 0.00 0.00 99.97 0.08 99.89 38
> >> > 0 2 40 0 0.00 1000 2095 36 0 0 0 0 40 0.00 0.00 0.00 100.00 0.10
> >> > 0 3 12 0 0.00 1000 2095 42 0 0 0 0 71 0.00 0.00 0.00 100.00 0.14 99.86 38
> >> > 0 3 44 1 0.09 1000 2095 63 0 0 0 0 65 0.00 0.00 0.00 99.91 0.05
> >> > 0 4 14 0 0.00 1010 2095 38 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04 99.96 39
> >> > 0 4 46 0 0.00 1011 2095 36 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04
> >> > 0 5 10 0 0.01 1084 2095 39 0 0 0 0 58 0.00 0.00 0.00 99.99 0.04 99.95 38
> >> > 0 5 42 0 0.00 1114 2095 35 0 0 0 0 39 0.00 0.00 0.00 100.00 0.05
> >> > 0 6 6 0 0.03 1005 2095 89 0 0 0 1 116 0.00 0.00 0.00 99.97 0.07 99.90 39
> >> > 0 6 38 0 0.00 1000 2095 38 0 0 0 0 41 0.00 0.00 0.00 100.00 0.10
> >> > 0 7 2 0 0.05 1001 2095 59 0 0 0 1 133 0.00 0.00 0.00 99.95 0.09 99.86 40
> >> > 0 7 34 0 0.00 1000 2095 39 0 0 0 0 65 0.00 0.00 0.00 100.00 0.13
> >> > 0 8 16 0 0.00 1000 2095 43 0 0 0 0 47 0.00 0.00 0.00 100.00 0.04 99.96 38
> >> > 0 8 48 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.04
> >> > 0 9 20 0 0.00 1000 2095 33 0 0 0 0 37 0.00 0.00 0.00 100.00 0.03 99.97 38
> >> > 0 9 52 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
> >> > 0 10 24 0 0.00 1000 2095 36 0 0 0 1 40 0.00 0.00 0.00 100.00 0.03 99.96 39
> >> > 0 10 56 0 0.00 1000 2095 37 0 0 0 1 38 0.00 0.00 0.00 100.00 0.03
> >> > 0 11 28 0 0.00 1002 2095 35 0 0 0 1 37 0.00 0.00 0.00 100.00 0.03 99.97 38
> >> > 0 11 60 0 0.00 1004 2095 34 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
> >> > 0 12 30 0 0.00 1001 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.11 99.88 38
> >> > 0 12 62 0 0.01 1000 2095 197 0 0 0 0 197 0.00 0.00 0.00 99.99 0.10
> >> > 0 13 26 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.03 99.97 39
> >> > 0 13 58 0 0.00 1000 2095 38 0 0 0 0 40 0.00 0.00 0.00 100.00 0.03
> >> > 0 14 22 0 0.01 1000 2095 149 0 1 2 0 142 0.00 0.01 0.00 99.99 0.07 99.92 39
> >> > 0 14 54 0 0.00 1000 2095 35 0 0 0 0 38 0.00 0.00 0.00 100.00 0.07
> >> > 0 15 18 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03 99.97 39
> >> > 0 15 50 0 0.00 1000 2095 34 0 0 0 0 38 0.00 0.00 0.00 100.00 0.03
> >> > 1 0 1 32 3.23 1008 2095 2385 0 31 3190 45025 10144 0.00 0.28 4.68 91.99 11.21 85.56 32 35 0.04 0.04 2481.49 1162.96 0.00 0.00
> >> > 1 0 33 9 0.63 1404 2095 12206 0 5 162 2480 10283 0.00 0.04 0.75 98.64 13.81
> >> > 1 1 5 1 0.07 1384 2095 236 0 0 38 24 314 0.00 0.09 0.06 99.77 4.66 95.27 33
> >> > 1 1 37 81 3.93 2060 2095 1254 0 5 40 59 683 0.00 0.01 0.02 96.05 0.80
> >> > 1 2 9 37 3.46 1067 2095 2396 0 29 2256 55406 11731 0.00 0.17 6.02 90.54 54.10 42.44 31
> >> > 1 2 41 151 14.51 1042 2095 10447 0 135 10494 248077 42327 0.01 0.87 26.57 58.84 43.05
> >> > 1 3 13 110 10.47 1053 2095 7120 0 120 9218 168938 33884 0.01 0.77 16.63 72.68 42.58 46.95 32
> >> > 1 3 45 69 6.76 1021 2095 4730 0 66 5598 115410 23447 0.00 0.44 12.06 81.12 46.29
> >> > 1 4 15 112 10.64 1056 2095 7204 0 116 8831 171423 37754 0.01 0.70 17.56 71.67 28.01 61.35 33
> >> > 1 4 47 18 1.80 1006 2095 1771 0 13 915 29315 6564 0.00 0.07 3.20 95.03 36.85
> >> > 1 5 11 63 5.96 1065 2095 4090 0 58 6449 99015 18955 0.00 0.45 10.27 83.64 31.24 62.80 31
> >> > 1 5 43 72 7.11 1016 2095 4794 0 73 6203 115361 26494 0.00 0.48 11.79 81.02 30.09
> >> > 1 6 7 35 3.39 1022 2095 2328 0 45 3377 52721 13759 0.00 0.27 5.10 91.43 25.84 70.77 32
> >> > 1 6 39 67 6.09 1096 2095 4483 0 52 3696 94964 19366 0.00 0.30 10.32 83.61 23.14
> >> > 1 7 3 1 0.06 1395 2095 91 0 0 0 1 167 0.00 0.00 0.00 99.95 25.36 74.58 35
> >> > 1 7 35 83 8.16 1024 2095 5785 0 100 7398 134640 27428 0.00 0.56 13.39 78.34 17.26
> >> > 1 8 17 46 4.49 1016 2095 3229 0 52 3048 74914 16010 0.00 0.27 8.29 87.19 29.71 65.80 33
> >> > 1 8 49 64 6.12 1052 2095 4210 0 89 5782 100570 21463 0.00 0.42 10.63 83.17 28.08
> >> > 1 9 21 73 7.02 1036 2095 4917 0 64 5786 109887 21939 0.00 0.55 11.61 81.18 22.10 70.88 33
> >> > 1 9 53 64 6.33 1012 2095 4074 0 69 5957 97596 20580 0.00 0.51 9.78 83.74 22.79
> >> > 1 10 25 26 2.58 1013 2095 1825 0 22 2124 42630 8627 0.00 0.17 4.17 93.24 53.91 43.52 33
> >> > 1 10 57 159 15.59 1022 2095 10951 0 175 14237 256828 56810 0.01 1.10 26.00 58.16 40.89
> >> > 1 11 29 112 10.54 1065 2095 7462 0 126 9548 179206 39821 0.01 0.85 18.49 70.71 29.46 60.00 31
> >> > 1 11 61 29 2.89 1011 2095 2002 0 24 2468 45558 10288 0.00 0.20 4.71 92.36 37.11
> >> > 1 12 31 37 3.66 1011 2095 2596 0 79 3161 61027 13292 0.00 0.24 6.48 89.79 23.75 72.59 32
> >> > 1 12 63 56 5.08 1107 2095 3789 0 62 4777 79133 17089 0.00 0.41 7.91 86.86 22.31
> >> > 1 13 27 12 1.14 1045 2095 1477 0 16 888 18744 3250 0.00 0.06 2.18 96.70 21.23 77.64 32
> >> > 1 13 59 60 5.81 1038 2095 5230 0 60 4936 87225 21402 0.00 0.41 8.95 85.14 16.55
> >> > 1 14 23 28 2.75 1024 2095 2008 0 20 1839 47417 9177 0.00 0.13 5.08 92.21 34.18 63.07 32
> >> > 1 14 55 106 9.58 1105 2095 6292 0 89 7182 141379 31354 0.00 0.63 14.45 75.81 27.36
> >> > 1 15 19 118 11.65 1012 2095 7872 0 121 10014 193186 40448 0.01 0.80 19.53 68.68 37.53 50.82 32
> >> > 1 15 51 59 5.58 1059 2095 3967 0 54 5842 88063 21138 0.00 0.39 9.12 85.23 43.60
> >>
>


Attachments:
h2.pdf (69.27 kB)

2021-12-19 23:31:17

by Francisco Jerez

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Sun, 19 Dec 2021, Francisco Jerez wrote:
>
>> Julia Lawall <[email protected]> writes:
>>
>> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
>> >
>> >> Julia Lawall <[email protected]> writes:
>> >>
>> >> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
>> >> >
>> >> >> Julia Lawall <[email protected]> writes:
>> >> >>
>> >> >> >> As you can see in intel_pstate.c, min_pstate is initialized on core
>> >> >> >> platforms from MSR_PLATFORM_INFO[47:40], which is "Maximum Efficiency
>> >> >> >> Ratio (R/O)". However that seems to deviate massively from the most
>> >> >> >> efficient ratio on your system, which may indicate a firmware bug, some
>> >> >> >> sort of clock gating problem, or an issue with the way that
>> >> >> >> intel_pstate.c processes this information.
>> >> >> >
>> >> >> > I'm not sure I understand the bug part. min_pstate gives the frequency
>> >> >> > that I find as the minimum frequency when I look for the specifications of
>> >> >> > the CPU. Should one expect that it should be something different?
>> >> >> >
>> >> >>
>> >> >> I'd expect the minimum frequency on your processor specification to
>> >> >> roughly match the "Maximum Efficiency Ratio (R/O)" value from that MSR,
>> >> >> since there's little reason to claim your processor can be clocked down
>> >> >> to a frequency which is inherently inefficient /and/ slower than the
>> >> >> maximum efficiency ratio -- In fact they both seem to match in your
>> >> >> system, they're just nowhere close to the frequency which is actually
>> >> >> most efficient, which smells like a bug, like your processor
>> >> >> misreporting what the most efficient frequency is, or it deviating from
>> >> >> the expected one due to your CPU static power consumption being greater
>> >> >> than it would be expected to be under ideal conditions -- E.g. due to
>> >> >> some sort of clock gating issue, possibly due to a software bug, or due
>> >> >> to our scheduling of such workloads with a large amount of lightly
>> >> >> loaded threads being unnecessarily inefficient which could also be
>> >> >> preventing most of your CPU cores from ever being clock-gated even
>> >> >> though your processor may be sitting idle for a large fraction of their
>> >> >> runtime.
>> >> >
>> >> > The original mail has results from two different machines: Intel 6130
>> >> > (skylake) and Intel 5218 (cascade lake). I have access to another cluster
>> >> > of 6130s and 5218s. I can try them.
>> >> >
>> >> > I tried 5.9 in which I just commented out the schedutil code to make
>> >> > frequency requests. I only tested avrora (tiny pauses) and h2 (longer
>> >> > pauses) and in both case the execution is almost entirely in the turbo
>> >> > frequencies.
>> >> >
>> >> > I'm not sure I understand the term "clock-gated". What C state does that
>> >> > correspond to? The turbostat output for one run of avrora is below.
>> >> >
>> >>
>> >> I didn't have any specific C1+ state in mind, most of the deeper ones
>> >> implement some sort of clock gating among other optimizations, I was
>> >> just wondering whether some sort of software bug and/or the highly
>> >> intermittent CPU utilization pattern of these workloads are preventing
>> >> most of your CPU cores from entering deep sleep states. See below.
>> >>
>> >> > julia
>> >> >
>> >> > 78.062895 sec
>> >> > Package Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6 POLL% C1% C1E% C6% CPU%c1 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc6 Pkg_J RAM_J PKG_% RAM_%
>> >> > - - - 31 2.95 1065 2096 156134 0 1971 155458 2956270 657130 0.00 0.20 4.78 92.26 14.75 82.31 40 41 45.14 0.04 4747.52 2509.05 0.00 0.00
>> >> > 0 0 0 13 1.15 1132 2095 11360 0 0 2 39 19209 0.00 0.00 0.01 99.01 8.02 90.83 39 41 90.24 0.04 2266.04 1346.09 0.00 0.00
>> >>
>> >> This seems suspicious: ^^^^ ^^^^^^^
>> >>
>> >> I hadn't understood that you're running this on a dual-socket system
>> >> until I looked at these results.
>> >
>> > Sorry not to have mentioned that.
>> >
>> >> It seems like package #0 is doing
>> >> pretty much nothing according to the stats below, but it's still
>> >> consuming nearly half of your energy, apparently because the idle
>> >> package #0 isn't entering deep sleep states (Pkg%pc6 above is close to
>> >> 0%). That could explain your unexpectedly high static power consumption
>> >> and the deviation of the real maximum efficiency frequency from the one
>> >> reported by your processor, since the reported maximum efficiency ratio
>> >> cannot possibly take into account the existence of a second CPU package
>> >> with dysfunctional idle management.
>> >
>> > Our assumption was that if anything happens on any core, all of the
>> > packages remain in a state that allows them to react in a reasonable
> > amount of time to any memory request.
>>
>> I can see how that might be helpful for workloads that need to be able
>> to unleash the whole processing power of your multi-socket system with
>> minimal latency, but the majority of multi-socket systems out there with
>> completely idle CPU packages are unlikely to notice any performance
>> difference as long as their idle CPU packages are idle, so the
>> environmentalist in me tells me that this is a bad idea. ;)
>
> Certainly it sounds like a bad idea from the point of view of anyone who
> wants to save energy, but it's how the machine seems to work (at least in
> its current configuration, which is not entirely under my control).
>

Yes that seems to be how it works right now, but honestly it seems like
an idle management bug to me.

> Note also that of the benchmarks, only avrora has the property of often
> using only one of the sockets. The others let their threads drift around
> more.
>
>>
>> >
>> >> I'm guessing that if you fully disable one of your CPU packages and
>> >> repeat the previous experiment forcing various P-states between 10 and
>> >> 37 you should get a maximum efficiency ratio closer to the theoretical
>> >> one for this CPU?
>> >
>> > OK, but that's not really a natural usage context... I do have a
>> > one-socket Intel 5220. I'll see what happens there.
>> >
>>
>> Fair, I didn't intend to suggest you take it offline manually every time
>> you don't plan to use it, my suggestion was just intended as an
>> experiment to help us confirm or disprove the theory that the reason for
>> the deviation from reality of your reported maximum efficiency ratio is
>> the presence of that second CPU package with broken idle management. If
>> that's the case the P-state vs. energy usage plot should show a minimum
>> closer to the ideal maximum efficiency ratio after disabling the second
>> CPU package.
>
> More numbers are attached. Pages 1-3 have two socket machines. Page 4
> has a one socket machine. The values for p state 20 are highlighted.
> For avrora (the one-socket application) on page 2, 20 is not the pstate
> with the lowest CPU energy consumption. 35 and 37 do better. Also for
> xalan on page 4 (one-socket machine) 15 does slightly better than 20.
> Otherwise, 20 always seems to be the best.
>

It seems like your results suggest that the presence of a second CPU
package cannot be the only factor leading to this deviation, however
it's hard to tell how much of an influence it's having on that
deviation, since your single- and dual-socket samples are taken from
machines with different CPUs so it's unclear whether moving to a single
CPU has led to a shift of the maximum efficiency frequency, and if it
has it may have had a smaller impact than the ~5 P-state granularity of
your samples.

Either way it seems like we're greatly underestimating the maximum
efficiency frequency even on your single-socket system. The reason may
still be suboptimal idle management -- I hope it is, since the
alternative that your processor is lying about its maximum efficiency
ratio seems far more difficult to deal with as some generic software
change...

>> > I did some experiments with forcing different frequencies. I haven't
>> > finished processing the results, but I notice that as the frequency goes
>> > up, the utilization (specifically the value of
>> > map_util_perf(sg_cpu->util) at the point of the call to
>> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
>> > Is this expected?
>> >
>>
>> Actually, it *is* expected based on our previous hypothesis that these
>> workloads are largely latency-bound: In cases where a given burst of CPU
>> work is not parallelizable with any other tasks the thread needs to
>> complete subsequently, its overall runtime will decrease monotonically
>> with increasing frequency, therefore the number of instructions executed
>> per unit of time will increase monotonically with increasing frequency,
>> and with it its frequency-invariant utilization.
>
> I'm not sure. If you have two tasks, each one alternately waiting for the
> other, if the frequency doubles, they will each run faster and wait less,
> but as long as one is computing the utilization in a small interval, ie
> before the application ends, the utilization will always be 50%.

Not really, because we're talking about frequency-invariant utilization
rather than just the CPU's duty cycle (which may indeed remain at 50%
regardless). If the frequency doubles and the thread is still active
50% of the time its frequency-invariant utilization will also double,
since the thread would be utilizing twice as many computational
resources per unit of time as before. As you can see in the definition
in [1], the frequency-invariant utilization is scaled by the running
frequency of the thread.

[1] Documentation/scheduler/sched-capacity.rst
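
As a rough illustration (plain Python, not kernel code), the distinction between the raw duty cycle and the frequency-invariant utilization described in [1] can be sketched as follows; the function name and the frequency values are purely illustrative:

```python
# Illustrative sketch only: frequency-invariant utilization is the raw
# duty cycle scaled by the ratio of the running frequency to the maximum
# frequency, per Documentation/scheduler/sched-capacity.rst.

def freq_invariant_util(busy_time, total_time, cur_freq, max_freq):
    duty_cycle = busy_time / total_time
    return duty_cycle * (cur_freq / max_freq)

# A thread busy 50% of the time at half the maximum frequency:
u_half = freq_invariant_util(0.5, 1.0, 1.0e9, 2.0e9)
# The same 50% duty cycle after the frequency doubles:
u_full = freq_invariant_util(0.5, 1.0, 2.0e9, 2.0e9)
assert u_full == 2 * u_half  # invariant utilization doubles with frequency
```

So a constant 50% duty cycle does not imply constant utilization: the scaled figure doubles when the frequency does, which is the point being made above.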

> The applications, however, are probably not as simple as this.
>
> julia
>
>> > thanks,
>> > julia
>> >
>> >> > 0 0 32 1 0.09 1001 2095 37 0 0 0 0 42 0.00 0.00 0.00 100.00 9.08
>> >> > 0 1 4 0 0.04 1000 2095 57 0 0 0 1 133 0.00 0.00 0.00 99.96 0.08 99.88 38
>> >> > 0 1 36 0 0.00 1000 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.12
>> >> > 0 2 8 0 0.03 1000 2095 64 0 0 0 1 124 0.00 0.00 0.00 99.97 0.08 99.89 38
>> >> > 0 2 40 0 0.00 1000 2095 36 0 0 0 0 40 0.00 0.00 0.00 100.00 0.10
>> >> > 0 3 12 0 0.00 1000 2095 42 0 0 0 0 71 0.00 0.00 0.00 100.00 0.14 99.86 38
>> >> > 0 3 44 1 0.09 1000 2095 63 0 0 0 0 65 0.00 0.00 0.00 99.91 0.05
>> >> > 0 4 14 0 0.00 1010 2095 38 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04 99.96 39
>> >> > 0 4 46 0 0.00 1011 2095 36 0 0 0 1 41 0.00 0.00 0.00 100.00 0.04
>> >> > 0 5 10 0 0.01 1084 2095 39 0 0 0 0 58 0.00 0.00 0.00 99.99 0.04 99.95 38
>> >> > 0 5 42 0 0.00 1114 2095 35 0 0 0 0 39 0.00 0.00 0.00 100.00 0.05
>> >> > 0 6 6 0 0.03 1005 2095 89 0 0 0 1 116 0.00 0.00 0.00 99.97 0.07 99.90 39
>> >> > 0 6 38 0 0.00 1000 2095 38 0 0 0 0 41 0.00 0.00 0.00 100.00 0.10
>> >> > 0 7 2 0 0.05 1001 2095 59 0 0 0 1 133 0.00 0.00 0.00 99.95 0.09 99.86 40
>> >> > 0 7 34 0 0.00 1000 2095 39 0 0 0 0 65 0.00 0.00 0.00 100.00 0.13
>> >> > 0 8 16 0 0.00 1000 2095 43 0 0 0 0 47 0.00 0.00 0.00 100.00 0.04 99.96 38
>> >> > 0 8 48 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.04
>> >> > 0 9 20 0 0.00 1000 2095 33 0 0 0 0 37 0.00 0.00 0.00 100.00 0.03 99.97 38
>> >> > 0 9 52 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
>> >> > 0 10 24 0 0.00 1000 2095 36 0 0 0 1 40 0.00 0.00 0.00 100.00 0.03 99.96 39
>> >> > 0 10 56 0 0.00 1000 2095 37 0 0 0 1 38 0.00 0.00 0.00 100.00 0.03
>> >> > 0 11 28 0 0.00 1002 2095 35 0 0 0 1 37 0.00 0.00 0.00 100.00 0.03 99.97 38
>> >> > 0 11 60 0 0.00 1004 2095 34 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03
>> >> > 0 12 30 0 0.00 1001 2095 35 0 0 0 0 40 0.00 0.00 0.00 100.00 0.11 99.88 38
>> >> > 0 12 62 0 0.01 1000 2095 197 0 0 0 0 197 0.00 0.00 0.00 99.99 0.10
>> >> > 0 13 26 0 0.00 1000 2095 37 0 0 0 0 41 0.00 0.00 0.00 100.00 0.03 99.97 39
>> >> > 0 13 58 0 0.00 1000 2095 38 0 0 0 0 40 0.00 0.00 0.00 100.00 0.03
>> >> > 0 14 22 0 0.01 1000 2095 149 0 1 2 0 142 0.00 0.01 0.00 99.99 0.07 99.92 39
>> >> > 0 14 54 0 0.00 1000 2095 35 0 0 0 0 38 0.00 0.00 0.00 100.00 0.07
>> >> > 0 15 18 0 0.00 1000 2095 33 0 0 0 0 36 0.00 0.00 0.00 100.00 0.03 99.97 39
>> >> > 0 15 50 0 0.00 1000 2095 34 0 0 0 0 38 0.00 0.00 0.00 100.00 0.03
>> >> > 1 0 1 32 3.23 1008 2095 2385 0 31 3190 45025 10144 0.00 0.28 4.68 91.99 11.21 85.56 32 35 0.04 0.04 2481.49 1162.96 0.00 0.00
>> >> > 1 0 33 9 0.63 1404 2095 12206 0 5 162 2480 10283 0.00 0.04 0.75 98.64 13.81
>> >> > 1 1 5 1 0.07 1384 2095 236 0 0 38 24 314 0.00 0.09 0.06 99.77 4.66 95.27 33
>> >> > 1 1 37 81 3.93 2060 2095 1254 0 5 40 59 683 0.00 0.01 0.02 96.05 0.80
>> >> > 1 2 9 37 3.46 1067 2095 2396 0 29 2256 55406 11731 0.00 0.17 6.02 90.54 54.10 42.44 31
>> >> > 1 2 41 151 14.51 1042 2095 10447 0 135 10494 248077 42327 0.01 0.87 26.57 58.84 43.05
>> >> > 1 3 13 110 10.47 1053 2095 7120 0 120 9218 168938 33884 0.01 0.77 16.63 72.68 42.58 46.95 32
>> >> > 1 3 45 69 6.76 1021 2095 4730 0 66 5598 115410 23447 0.00 0.44 12.06 81.12 46.29
>> >> > 1 4 15 112 10.64 1056 2095 7204 0 116 8831 171423 37754 0.01 0.70 17.56 71.67 28.01 61.35 33
>> >> > 1 4 47 18 1.80 1006 2095 1771 0 13 915 29315 6564 0.00 0.07 3.20 95.03 36.85
>> >> > 1 5 11 63 5.96 1065 2095 4090 0 58 6449 99015 18955 0.00 0.45 10.27 83.64 31.24 62.80 31
>> >> > 1 5 43 72 7.11 1016 2095 4794 0 73 6203 115361 26494 0.00 0.48 11.79 81.02 30.09
>> >> > 1 6 7 35 3.39 1022 2095 2328 0 45 3377 52721 13759 0.00 0.27 5.10 91.43 25.84 70.77 32
>> >> > 1 6 39 67 6.09 1096 2095 4483 0 52 3696 94964 19366 0.00 0.30 10.32 83.61 23.14
>> >> > 1 7 3 1 0.06 1395 2095 91 0 0 0 1 167 0.00 0.00 0.00 99.95 25.36 74.58 35
>> >> > 1 7 35 83 8.16 1024 2095 5785 0 100 7398 134640 27428 0.00 0.56 13.39 78.34 17.26
>> >> > 1 8 17 46 4.49 1016 2095 3229 0 52 3048 74914 16010 0.00 0.27 8.29 87.19 29.71 65.80 33
>> >> > 1 8 49 64 6.12 1052 2095 4210 0 89 5782 100570 21463 0.00 0.42 10.63 83.17 28.08
>> >> > 1 9 21 73 7.02 1036 2095 4917 0 64 5786 109887 21939 0.00 0.55 11.61 81.18 22.10 70.88 33
>> >> > 1 9 53 64 6.33 1012 2095 4074 0 69 5957 97596 20580 0.00 0.51 9.78 83.74 22.79
>> >> > 1 10 25 26 2.58 1013 2095 1825 0 22 2124 42630 8627 0.00 0.17 4.17 93.24 53.91 43.52 33
>> >> > 1 10 57 159 15.59 1022 2095 10951 0 175 14237 256828 56810 0.01 1.10 26.00 58.16 40.89
>> >> > 1 11 29 112 10.54 1065 2095 7462 0 126 9548 179206 39821 0.01 0.85 18.49 70.71 29.46 60.00 31
>> >> > 1 11 61 29 2.89 1011 2095 2002 0 24 2468 45558 10288 0.00 0.20 4.71 92.36 37.11
>> >> > 1 12 31 37 3.66 1011 2095 2596 0 79 3161 61027 13292 0.00 0.24 6.48 89.79 23.75 72.59 32
>> >> > 1 12 63 56 5.08 1107 2095 3789 0 62 4777 79133 17089 0.00 0.41 7.91 86.86 22.31
>> >> > 1 13 27 12 1.14 1045 2095 1477 0 16 888 18744 3250 0.00 0.06 2.18 96.70 21.23 77.64 32
>> >> > 1 13 59 60 5.81 1038 2095 5230 0 60 4936 87225 21402 0.00 0.41 8.95 85.14 16.55
>> >> > 1 14 23 28 2.75 1024 2095 2008 0 20 1839 47417 9177 0.00 0.13 5.08 92.21 34.18 63.07 32
>> >> > 1 14 55 106 9.58 1105 2095 6292 0 89 7182 141379 31354 0.00 0.63 14.45 75.81 27.36
>> >> > 1 15 19 118 11.65 1012 2095 7872 0 121 10014 193186 40448 0.01 0.80 19.53 68.68 37.53 50.82 32
>> >> > 1 15 51 59 5.58 1059 2095 3967 0 54 5842 88063 21138 0.00 0.39 9.12 85.23 43.60
>> >>
>>

2021-12-21 17:04:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Sun, Dec 19, 2021 at 11:10 PM Francisco Jerez <[email protected]> wrote:
>
> Julia Lawall <[email protected]> writes:
>
> > On Sat, 18 Dec 2021, Francisco Jerez wrote:

[cut]

> > I did some experiments with forcing different frequencies. I haven't
> > finished processing the results, but I notice that as the frequency goes
> > up, the utilization (specifically the value of
> > map_util_perf(sg_cpu->util) at the point of the call to
> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> > Is this expected?
> >
>
> Actually, it *is* expected based on our previous hypothesis that these
> workloads are largely latency-bound: In cases where a given burst of CPU
> work is not parallelizable with any other tasks the thread needs to
> complete subsequently, its overall runtime will decrease monotonically
> with increasing frequency, therefore the number of instructions executed
> per unit of time will increase monotonically with increasing frequency,
> and with it its frequency-invariant utilization.

But shouldn't these two effects cancel each other if the
frequency-invariance mechanism works well?

2021-12-21 18:10:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Sun, Dec 19, 2021 at 6:03 PM Julia Lawall <[email protected]> wrote:
>
>
>
> On Sun, 19 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Fri, Dec 17, 2021 at 8:32 PM Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
> > >
> > > > On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall <[email protected]> wrote:
> > > > >
> > > > > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
> > > > > between 0 and the capacity, and then maps everything below min_pstate to
> > > > > the lowest frequency.
> > > >
> > > > Well, it is not just intel_pstate with HWP. This is how schedutil
> > > > works in general; see get_next_freq() in there.
> > > >
> > > > > On my Intel Xeon Gold 6130 and Intel Xeon Gold
> > > > > 5218, this means that more than the bottom quarter of utilizations are all
> > > > > mapped to the lowest frequency. Running slowly doesn't necessarily save
> > > > > energy, because it takes more time.
> > > >
> > > > This is true, but the layout of the available range of performance
> > > > values is a property of the processor, not a driver issue.
> > > >
> > > > Moreover, the role of the driver is not to decide how to respond to
> > > > the given utilization value, that is the role of the governor. The
> > > > driver is expected to do what it is asked for by the governor.
> > >
> > > OK, but what exactly is the goal of schedutil?
> >
> > The short answer is: minimizing the cost (in terms of energy) of
> > allocating an adequate amount of CPU time for a given workload.
> >
> > Of course, this requires a bit of explanation, so bear with me.
> >
> > It starts with a question:
> >
> > Given a steady workload (ie. a workload that uses approximately the
> > same amount of CPU time to run in every sampling interval), what is
> > the most efficient frequency (or generally, performance level measured
> > in some abstract units) to run it at and still ensure that it will get
> > as much CPU time as it needs (or wants)?
> >
> > To answer this question, let's first assume that
> >
> > (1) Performance is a monotonically increasing (ideally, linear)
> > function of frequency.
> > (2) CPU idle states have not enough impact on the energy usage for
> > them to matter.
> >
> > Both of these assumptions may not be realistic, but that's how it goes.
> >
> > Now, consider the "raw" frequency-dependent utilization
> >
> > util(f) = util_max * (t_{total} - t_{idle}(f)) / t_{total}
> >
> > where
> >
> > t_{total} is the total CPU time available in the given time frame.
> > t_{idle}(f) is the idle CPU time appearing in the workload when run at
> > frequency f in that time frame.
> > util_max is a convenience constant allowing an integer data type to be
> > used for representing util(f) with sufficient approximation.
> >
> > Notice that by assumption (1), util(f) is a monotonically decreasing
> > function, so if util(f_{max}) = util_max (where f_{max} is the maximum
> > frequency available from the hardware), which means that there is no
> > idle CPU time in the workload when run at the max available frequency,
> > there will be no idle CPU time in it when run at any frequency below
> > f_{max}. Hence, in that case the workload needs to be run at f_{max}.
> >
> > If util(f_{max}) < util_max, there is some idle CPU time in the
> > workload at f_{max} and it may be run at a lower frequency without
> > sacrificing performance. Moreover, the cost should be minimum when
> > running the workload at the maximum frequency f_e for which
> > t_{idle}(f_e) = 0. IOW, that is the point at which the workload still
> > gets as much CPU time as needed, but the cost of running it is
> > maximally reduced.
>
> Thanks for the detailed explanation. I got lost at this point, though.
>
> Idle time can be either due to I/O, or due to waiting for synchronization
> from some other thread perhaps on another core. How can either of these
> disappear?

I guess the "due to I/O" case needs to be expanded a bit.

If a task waits for new data to appear (which in the steady case is
assumed to happen on a regular basis) and then wakes up, processes
them and goes back to sleep, then the data processing speed can be
adjusted to the rate at which new data appear so as to reduce the
sleep (or idle) time almost down to 0, at least theoretically.

Conversely, if a task submits data for I/O and waits for the I/O to
complete, it may as well start to prepare a new buffer for the next
I/O cycle as soon as the previous one has been submitted. In that
case, the speed at which the new buffer is prepared can be adjusted to
the time it takes to complete the I/O so as to reduce the task's sleep
(or idle) time, at least in theory.
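
The rate-matching argument can be made concrete with a toy model (the `work` and `period` values are arbitrary, chosen only for illustration): if one unit of data arrives every `period` seconds and takes `work` CPU clocks to process, the idle time per cycle shrinks to zero as the frequency is lowered toward `work / period`:

```python
# Toy model of rate matching (illustrative values): idle time per cycle
# is period - work / f, which vanishes when the processing speed is
# matched to the data arrival rate, i.e. at f = work / period.

def idle_per_cycle(f, work=1e9, period=1.0):
    return max(0.0, period - work / f)

assert idle_per_cycle(2e9) == 0.5   # half of each cycle idle at 2 GHz
assert idle_per_cycle(1e9) == 0.0   # no idle time at f = work / period
```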

But, yes, there are cases in which that cannot be done (see below).

> In the I/O case, no matter what the frequency, the idle time
> will be the same (in a simplified world).

Well, not necessarily.

> In the case of waiting for
> another thread on another core, assuming that all the cores are running at
> the same frequency, lowering the frequency will cause both the running
> time and the idle time to increase, and it will cause util(f) to stay the
> same.

The running and idle time can't both increase simultaneously, because
the sum of them is the total CPU time (in each sampling interval)
which can't expand.

Anyway, this case can only be addressed precisely if the exact type of
dependency between the tasks in question is taken into account (eg.
producer-consumer etc).

> Both cases seem to send the application directly to the lowest
> frequency. I guess that's fine if we also assume that the lowest
> frequency is also the most efficient one.

They may, but this is not what schedutil does. It simply tries to run
every task at a frequency proportional to its scale-invariant
utilization and the reason why it does that is because in the "steady"
cases where computations can be carried out in parallel with the I/O
doing that corresponds to looking for the maximum speed at which
there's no idle time in the workload.

In the cases in which reducing the speed of computations doesn't cause
the amount of idle time to decrease, the "right" speed to run the
workload depends on its performance requirements (because in practice
> running computations at lower speeds is more energy-efficient most of
the time in such cases) and they can be sort of expressed by
increasing scaling_min_freq or through utilization clamping.
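
The "frequency proportional to scale-invariant utilization" policy corresponds to get_next_freq() in kernel/sched/cpufreq_schedutil.c. A simplified model (the 1024 capacity and 3.7 GHz maximum are illustrative values, not taken from the thread):

```python
# Simplified model of schedutil's get_next_freq(): request a frequency
# proportional to utilization, with 25% headroom (freq + (freq >> 2)),
# capped at the maximum frequency. Frequencies in kHz, as in cpufreq.

def get_next_freq(util, max_util, max_freq_khz):
    raw = (max_freq_khz + (max_freq_khz >> 2)) * util // max_util
    return min(max_freq_khz, raw)

# With a capacity of 1024 and a 3.7 GHz maximum, 50% utilization
# requests 62.5% of the maximum frequency:
assert get_next_freq(512, 1024, 3700000) == 2312500
# Above ~80% utilization the headroom saturates the request at max:
assert get_next_freq(1024, 1024, 3700000) == 3700000
```

The 25% headroom means the request saturates at max frequency once utilization exceeds roughly 80% of capacity, which is why everything below min_pstate collapsing to the lowest frequency (the issue in the original patch) matters so much at the bottom of the range.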

2021-12-21 23:57:06

by Francisco Jerez

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

"Rafael J. Wysocki" <[email protected]> writes:

> On Sun, Dec 19, 2021 at 11:10 PM Francisco Jerez <[email protected]> wrote:
>>
>> Julia Lawall <[email protected]> writes:
>>
>> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
>
> [cut]
>
>> > I did some experiments with forcing different frequencies. I haven't
>> > finished processing the results, but I notice that as the frequency goes
>> > up, the utilization (specifically the value of
>> > map_util_perf(sg_cpu->util) at the point of the call to
>> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
>> > Is this expected?
>> >
>>
>> Actually, it *is* expected based on our previous hypothesis that these
>> workloads are largely latency-bound: In cases where a given burst of CPU
>> work is not parallelizable with any other tasks the thread needs to
>> complete subsequently, its overall runtime will decrease monotonically
>> with increasing frequency, therefore the number of instructions executed
>> per unit of time will increase monotonically with increasing frequency,
>> and with it its frequency-invariant utilization.
>
> But shouldn't these two effects cancel each other if the
> frequency-invariance mechanism works well?

No, they won't cancel each other out under our hypothesis that these
workloads are largely latency-bound, since the performance of the
application will increase steadily with increasing frequency, and with
it the amount of computational resources it utilizes per unit of time on
the average, and therefore its frequency-invariant utilization as well.

If you're not convinced by my argument, consider a simple latency-bound
application that repeatedly blocks for t0 on some external agent and
then requires the execution of n1 CPU clocks which cannot be
parallelized with any of the operations occurring during that t0 idle
time. The runtime of a single cycle of that application will be,
assuming that the CPU frequency is f:

T = t0 + n1/f

Its frequency-invariant utilization will approach on the average:

u = (T-t0) / T * f / f1 = n1/f / (t0 + n1/f) * f / f1 = (n1 / f1) / (t0 + n1/f)

with f1 a constant with units of frequency. As you can see the
denominator of the last expression above decreases with frequency,
therefore the frequency-invariant utilization increases, as expected for
an application whose performance is increasing.
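
The claimed monotonicity is easy to check numerically (t0 = 1 s, n1 = 1e9 clocks and f1 = 1 GHz are arbitrary illustrative values):

```python
# Numerical check of u(f) = (n1 / f1) / (t0 + n1 / f) for a
# latency-bound task that blocks for t0 and then runs n1 serial clocks.

def util(f, t0=1.0, n1=1e9, f1=1e9):
    return (n1 / f1) / (t0 + n1 / f)

u = [util(f) for f in (1e9, 2e9, 4e9)]
assert u[0] < u[1] < u[2]          # utilization rises with frequency
assert abs(u[0] - 0.5) < 1e-12     # 50% duty cycle when f == f1 here
```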

2021-12-22 14:55:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Wed, Dec 22, 2021 at 12:57 AM Francisco Jerez <[email protected]> wrote:
>
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > On Sun, Dec 19, 2021 at 11:10 PM Francisco Jerez <[email protected]> wrote:
> >>
> >> Julia Lawall <[email protected]> writes:
> >>
> >> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> >
> > [cut]
> >
> >> > I did some experiments with forcing different frequencies. I haven't
> >> > finished processing the results, but I notice that as the frequency goes
> >> > up, the utilization (specifically the value of
> >> > map_util_perf(sg_cpu->util) at the point of the call to
> >> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> >> > Is this expected?
> >> >
> >>
> >> Actually, it *is* expected based on our previous hypothesis that these
> >> workloads are largely latency-bound: In cases where a given burst of CPU
> >> work is not parallelizable with any other tasks the thread needs to
> >> complete subsequently, its overall runtime will decrease monotonically
> >> with increasing frequency, therefore the number of instructions executed
> >> per unit of time will increase monotonically with increasing frequency,
> >> and with it its frequency-invariant utilization.
> >
> > But shouldn't these two effects cancel each other if the
> > frequency-invariance mechanism works well?
>
> No, they won't cancel each other out under our hypothesis that these
> workloads are largely latency-bound, since the performance of the
> application will increase steadily with increasing frequency, and with
> it the amount of computational resources it utilizes per unit of time on
> the average, and therefore its frequency-invariant utilization as well.

OK, so this is a workload in which the maximum performance is only
achieved at the maximum available frequency. IOW, there's no
performance saturation point and increasing the frequency (if
possible) will always cause more work to be done per unit of time.

For this type of workloads, requirements regarding performance (for
example, upper bound on the expected time of computations) need to be
known in order to determine the "most suitable" frequency to run them
and I agree that schedutil doesn't help much in that respect.

It is probably better to run them with intel_pstate in the active mode
(ie. "pure HWP") or decrease EPP via sysfs to allow HWP to ramp up
turbo more aggressively.

2021-12-24 11:08:58

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Wed, 22 Dec 2021, Rafael J. Wysocki wrote:

> On Wed, Dec 22, 2021 at 12:57 AM Francisco Jerez <[email protected]> wrote:
> >
> > "Rafael J. Wysocki" <[email protected]> writes:
> >
> > > On Sun, Dec 19, 2021 at 11:10 PM Francisco Jerez <[email protected]> wrote:
> > >>
> > >> Julia Lawall <[email protected]> writes:
> > >>
> > >> > On Sat, 18 Dec 2021, Francisco Jerez wrote:
> > >
> > > [cut]
> > >
> > >> > I did some experiments with forcing different frequencies. I haven't
> > >> > finished processing the results, but I notice that as the frequency goes
> > >> > up, the utilization (specifically the value of
> > >> > map_util_perf(sg_cpu->util) at the point of the call to
> > >> > cpufreq_driver_adjust_perf in sugov_update_single_perf) goes up as well.
> > >> > Is this expected?
> > >> >
> > >>
> > >> Actually, it *is* expected based on our previous hypothesis that these
> > >> workloads are largely latency-bound: In cases where a given burst of CPU
> > >> work is not parallelizable with any other tasks the thread needs to
> > >> complete subsequently, its overall runtime will decrease monotonically
> > >> with increasing frequency, therefore the number of instructions executed
> > >> per unit of time will increase monotonically with increasing frequency,
> > >> and with it its frequency-invariant utilization.
> > >
> > > But shouldn't these two effects cancel each other if the
> > > frequency-invariance mechanism works well?
> >
> > No, they won't cancel each other out under our hypothesis that these
> > workloads are largely latency-bound, since the performance of the
> > application will increase steadily with increasing frequency, and with
> > it the amount of computational resources it utilizes per unit of time on
> > the average, and therefore its frequency-invariant utilization as well.
>
> OK, so this is a workload in which the maximum performance is only
> achieved at the maximum available frequency. IOW, there's no
> performance saturation point and increasing the frequency (if
> possible) will always cause more work to be done per unit of time.
>
> For this type of workloads, requirements regarding performance (for
> example, upper bound on the expected time of computations) need to be
> known in order to determine the "most suitable" frequency to run them
> and I agree that schedutil doesn't help much in that respect.
>
> It is probably better to run them with intel_pstate in the active mode
> (ie. "pure HWP") or decrease EPP via sysfs to allow HWP to ramp up
> turbo more aggressively.

active mode + powersave indeed both gives faster runtimes and less energy
consumption for these examples.

thanks,
julia

2021-12-28 16:58:29

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

I looked a bit more into why pstate 20 is always using the least energy. I
have just one thread spinning for 10 seconds, I use a fixed value for the
pstate, and I measure the energy usage with turbostat. I tried this on a
2-socket Intel 6130 and a 4-socket Intel 6130. The experiment runs 40
times.

There seem to be only two levels of CPU energy usage. On the 2-socket
machine the energy usage is around 600J up to pstate 20 and around 1000J
after that. On the 4-socket machine it is twice that.

The change in RAM energy usage is similar, eg around 320J for the 2-socket
machine up to pstate 20, and around 460J for higher pstates.

On the 6130, pstate 21 is 2.1GHz, which is the nominal frequency of the
machine. So it seems that the most efficient thing is to be just below
that. The reduced execution time with pstate 20 as compared to pstate 10
greatly outweighs any small increase in the energy usage due to changing
the frequency.
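
A crude two-level model of these measurements (numbers rounded from the figures above, and purely illustrative) shows why pstate 20 beats both its neighbors for a fixed amount of work:

```python
# Crude model of the spin test above: package power is ~60 W up to
# pstate 20 and ~100 W from pstate 21 on (600 J vs. 1000 J over the
# 10 s runs), and pstate p runs at roughly p / 10 GHz on the 6130.
# For a fixed amount of work, runtime scales as 1/f, so energy per
# unit of work is proportional to P(f) / f.

def power_w(pstate):
    return 60.0 if pstate <= 20 else 100.0

def energy_per_work(pstate):
    freq_ghz = pstate / 10.0
    return power_w(pstate) / freq_ghz

# Doubling the frequency inside the flat 60 W region halves the energy:
assert energy_per_work(20) == energy_per_work(10) / 2
# The power step at the nominal frequency outweighs the small speedup:
assert energy_per_work(20) < energy_per_work(21)
```

The flat two-level power curve is of course an oversimplification (in particular it understates power at the higher turbo pstates), but it captures why pstate 20, the fastest pstate before the power step, minimizes energy for a fixed workload.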

Perhaps there is something abnormal in how the machines are configured?

julia

2021-12-28 17:40:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Tue, Dec 28, 2021 at 5:58 PM Julia Lawall <[email protected]> wrote:
>
> I looked a bit more into why pstate 20 is always using the least energy. I
> have just one thread spinning for 10 seconds, I use a fixed value for the
> pstate, and I measure the energy usage with turbostat.

How exactly do you fix the pstate?

> I tried this on a
> 2-socket Intel 6130 and a 4-socket Intel 6130. The experiment runs 40
> times.
>
> There seem to be only two levels of CPU energy usage. On the 2-socket
> machine the energy usage is around 600J up to pstate 20 and around 1000J
> after that. On the 4-socket machine it is twice that.

These are the package power numbers from turbostat, aren't they?

> The change in RAM energy usage is similar, eg around 320J for the 2-socket
> machine up to pstate 20, and around 460J for higher pstates.
>
> On the 6130, pstate 21 is 2.1GHz, which is the nominal frequency of the
> machine. So it seems that the most efficient thing is to be just below
> that. The reduced execution time with pstate 20 as compared to pstate 10
> greatly outweighs any small increase in the energy usage due to changing
> the frequency.
>
> Perhaps there is something abnormal in how the machines are configured?

2021-12-28 17:46:28

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:

> On Tue, Dec 28, 2021 at 5:58 PM Julia Lawall <[email protected]> wrote:
> >
> > I looked a bit more into why pstate 20 is always using the least energy. I
> > have just one thread spinning for 10 seconds, I use a fixed value for the
> > pstate, and I measure the energy usage with turbostat.
>
> How exactly do you fix the pstate?

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e7af18857371..19440b15454c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -400,7 +402,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	sg_cpu->util = prev_util;
 
 	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
-				   map_util_perf(sg_cpu->util), sg_cpu->max);
+				   sysctl_sched_fixedfreq, sg_cpu->max);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }

------------------------------

sysctl_sched_fixedfreq is a variable that I added to sysfs.


>
> > I tried this on a
> > 2-socket Intel 6130 and a 4-socket Intel 6130. The experiment runs 40
> > times.
> >
> > There seem to be only two levels of CPU energy usage. On the 2-socket
> > machine the energy usage is around 600J up to pstate 20 and around 1000J
> > after that. On the 4-socket machine it is twice that.
>
> These are the package power numbers from turbostat, aren't they?

Yes.

julia

2021-12-28 18:06:58

by Rafael J. Wysocki

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Tue, Dec 28, 2021 at 6:46 PM Julia Lawall <[email protected]> wrote:
>
>
>
> On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Tue, Dec 28, 2021 at 5:58 PM Julia Lawall <[email protected]> wrote:
> > >
> > > I looked a bit more into why pstate 20 is always using the least energy. I
> > > have just one thread spinning for 10 seconds, I use a fixed value for the
> > > pstate, and I measure the energy usage with turbostat.
> >
> > How exactly do you fix the pstate?
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e7af18857371..19440b15454c 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -400,7 +402,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> sg_cpu->util = prev_util;
>
> cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
> - map_util_perf(sg_cpu->util), sg_cpu->max);
> + sysctl_sched_fixedfreq, sg_cpu->max);

This is just changing the "target" hint given to the processor, which
may very well ignore it, though.

>
> sg_cpu->sg_policy->last_freq_update_time = time;
> }
>
> ------------------------------
>
> sysctl_sched_fixedfreq is a variable that I added to sysfs.

If I were trying to fix a pstate, I would set scaling_max_freq and
scaling_min_freq in sysfs for all CPUs to the same value.

That would cause intel_pstate to set HWP min and max to the same value
which should really cause the pstate to be fixed, at least outside the
turbo range of pstates.

> >
> > > I tried this on a
> > > 2-socket Intel 6130 and a 4-socket Intel 6130. The experiment runs 40
> > > times.
> > >
> > > There seem to be only two levels of CPU energy usage. On the 2-socket
> > > machine the energy usage is around 600J up to pstate 20 and around 1000J
> > > after that. On the 4-socket machine it is twice that.
> >
> > These are the package power numbers from turbostat, aren't they?
>
> Yes.

OK

2021-12-28 18:16:49

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:

> On Tue, Dec 28, 2021 at 6:46 PM Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:
> >
> > > On Tue, Dec 28, 2021 at 5:58 PM Julia Lawall <[email protected]> wrote:
> > > >
> > > > I looked a bit more into why pstate 20 is always using the least energy. I
> > > > have just one thread spinning for 10 seconds, I use a fixed value for the
> > > > pstate, and I measure the energy usage with turbostat.
> > >
> > > How exactly do you fix the pstate?
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e7af18857371..19440b15454c 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -400,7 +402,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> > sg_cpu->util = prev_util;
> >
> > cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
> > - map_util_perf(sg_cpu->util), sg_cpu->max);
> > + sysctl_sched_fixedfreq, sg_cpu->max);
>
> This is just changing the "target" hint given to the processor which
> may very well ignore it, though.

It doesn't seem to ignore it. I also print the current frequency on every
clock tick, and it is as it should be. This is done in the function
arch_scale_freq_tick in arch/x86/kernel/smpboot.c, where I added:

trace_printk("freq %lld\n", div64_u64((cpu_khz * acnt), mcnt));


>
> >
> > sg_cpu->sg_policy->last_freq_update_time = time;
> > }
> >
> > ------------------------------
> >
> > sysctl_sched_fixedfreq is a variable that I added to sysfs.
>
> If I were trying to fix a pstate, I would set scaling_max_freq and
> scaling_min_freq in sysfs for all CPUs to the same value.
>
> That would cause intel_pstate to set HWP min and max to the same value
> which should really cause the pstate to be fixed, at least outside the
> turbo range of pstates.

OK, I can try that, thanks.

julia

2021-12-29 09:13:08

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:

> On Tue, Dec 28, 2021 at 6:46 PM Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:
> >
> > > On Tue, Dec 28, 2021 at 5:58 PM Julia Lawall <[email protected]> wrote:
> > > >
> > > > I looked a bit more into why pstate 20 is always using the least energy. I
> > > > have just one thread spinning for 10 seconds, I use a fixed value for the
> > > > pstate, and I measure the energy usage with turbostat.
> > >
> > > How exactly do you fix the pstate?
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index e7af18857371..19440b15454c 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -400,7 +402,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> > sg_cpu->util = prev_util;
> >
> > cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
> > - map_util_perf(sg_cpu->util), sg_cpu->max);
> > + sysctl_sched_fixedfreq, sg_cpu->max);
>
> This is just changing the "target" hint given to the processor which
> may very well ignore it, though.
>
> >
> > sg_cpu->sg_policy->last_freq_update_time = time;
> > }
> >
> > ------------------------------
> >
> > sysctl_sched_fixedfreq is a variable that I added to sysfs.
>
> If I were trying to fix a pstate, I would set scaling_max_freq and
> scaling_min_freq in sysfs for all CPUs to the same value.
>
> That would cause intel_pstate to set HWP min and max to the same value
> which should really cause the pstate to be fixed, at least outside the
> turbo range of pstates.

The effect is the same. But that approach is indeed simpler than patching
the kernel.

julia

>
> > >
> > > > I tried this on a
> > > > 2-socket Intel 6130 and a 4-socket Intel 6130. The experiment runs 40
> > > > times.
> > > >
> > > > There seem to be only two levels of CPU energy usage. On the 2-socket
> > > > machine the energy usage is around 600J up to pstate 20 and around 1000J
> > > > after that. On the 4-socket machine it is twice that.
> > >
> > > These are the package power numbers from turbostat, aren't they?
> >
> > Yes.
>
> OK
>

2021-12-30 17:03:48

by Rafael J. Wysocki

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Wed, Dec 29, 2021 at 10:13 AM Julia Lawall <[email protected]> wrote:
>
>
>
> On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Tue, Dec 28, 2021 at 6:46 PM Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Tue, 28 Dec 2021, Rafael J. Wysocki wrote:
> > >
> > > > On Tue, Dec 28, 2021 at 5:58 PM Julia Lawall <[email protected]> wrote:
> > > > >
> > > > > I looked a bit more into why pstate 20 is always using the least energy. I
> > > > > have just one thread spinning for 10 seconds, I use a fixed value for the
> > > > > pstate, and I measure the energy usage with turbostat.
> > > >
> > > > How exactly do you fix the pstate?
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index e7af18857371..19440b15454c 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -400,7 +402,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> > > sg_cpu->util = prev_util;
> > >
> > > cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
> > > - map_util_perf(sg_cpu->util), sg_cpu->max);
> > > + sysctl_sched_fixedfreq, sg_cpu->max);
> >
> > This is just changing the "target" hint given to the processor which
> > may very well ignore it, though.
> >
> > >
> > > sg_cpu->sg_policy->last_freq_update_time = time;
> > > }
> > >
> > > ------------------------------
> > >
> > > sysctl_sched_fixedfreq is a variable that I added to sysfs.
> >
> > If I were trying to fix a pstate, I would set scaling_max_freq and
> > scaling_min_freq in sysfs for all CPUs to the same value.
> >
> > That would cause intel_pstate to set HWP min and max to the same value
> > which should really cause the pstate to be fixed, at least outside the
> > turbo range of pstates.
>
> The effect is the same. But that approach is indeed simpler than patching
> the kernel.

It is also applicable when intel_pstate runs in the active mode.

As for the results that you have reported, it looks like the package
power on these systems is dominated by package voltage and going from
P-state 20 to P-state 21 causes that voltage to increase significantly
(the observed RAM energy usage pattern is consistent with that). This
means that running at P-states above 20 is only really justified if
there is a strict performance requirement that can't be met otherwise.

Can you please check what value is there in the base_frequency sysfs
attribute under cpuX/cpufreq/?

I'm guessing that the package voltage level for P-states 10 and 20 is
the same, so the power difference between them is not significant
relative to the difference between P-states 20 and 21, and if increasing
the P-state causes some extra idle time to appear in the workload
(even though there is not enough of it to prevent the overall
utilization from increasing), then the overall power draw when running
at P-state 10 may be greater than for P-state 20.

You can check if there is any C-state residency difference between
these two cases by running the workload under turbostat in each of
them.

Anyway, this is a configuration in which the HWP scaling algorithm
used when intel_pstate runs in the active mode is likely to work
better, because it should take the processor design into account.
That's why it is the default configuration of intel_pstate on systems
with HWP. There are cases in which schedutil helps, but that's mostly
when HWP without it tends to run the workload too fast, because it
lacks the utilization history provided by PELT.

2021-12-30 17:54:05

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

> > The effect is the same. But that approach is indeed simpler than patching
> > the kernel.
>
> It is also applicable when intel_pstate runs in the active mode.
>
> As for the results that you have reported, it looks like the package
> power on these systems is dominated by package voltage and going from
> P-state 20 to P-state 21 causes that voltage to increase significantly
> (the observed RAM energy usage pattern is consistent with that). This
> means that running at P-states above 20 is only really justified if
> there is a strict performance requirement that can't be met otherwise.
>
> Can you please check what value is there in the base_frequency sysfs
> attribute under cpuX/cpufreq/?

2100000, which should be pstate 21

>
> I'm guessing that the package voltage level for P-states 10 and 20 is
> the same, so the power difference between them is not significant
> relative to the difference between P-state 20 and 21 and if increasing
> the P-state causes some extra idle time to appear in the workload
> (even though there is not enough of it to prevent to overall
> utilization from increasing), then the overall power draw when running
> at P-state 10 may be greater that for P-state 20.

My impression is that the package voltage level for P-states 10 to 20 is
high enough that increasing the frequency has little impact. But the code
runs twice as fast, which reduces the execution time a lot, saving energy.

My first experiment had only one running thread. I also tried running 32
spinning threads for 10 seconds, i.e. using up one package and leaving the
other idle. In this case, instead of staying around 600J across pstates
10-20, the energy rises from 743J to 946J. But there is still a gap between
20 and 21, with 21 being 1392J.

> You can check if there is any C-state residency difference between
> these two cases by running the workload under turbostat in each of
> them.

The C1 and C6 residencies (CPU%c1 and CPU%c6) are about the same between 20 and
21, whether with 1 thread or with 32 threads.

> Anyway, this is a configuration in which the HWP scaling algorithm
> used when intel_pstate runs in the active mode is likely to work
> better, because it should take the processor design into account.
> That's why it is the default configuration of intel_pstate on systems
> with HWP. There are cases in which schedutil helps, but that's mostly
> when HWP without it tends to run the workload too fast, because it
> lacks the utilization history provided by PELT.

OK, I'll look into that case a bit more.

thanks,
julia

2021-12-30 17:58:56

by Rafael J. Wysocki

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Thu, Dec 30, 2021 at 6:54 PM Julia Lawall <[email protected]> wrote:
>
> > > The effect is the same. But that approach is indeed simpler than patching
> > > the kernel.
> >
> > It is also applicable when intel_pstate runs in the active mode.
> >
> > As for the results that you have reported, it looks like the package
> > power on these systems is dominated by package voltage and going from
> > P-state 20 to P-state 21 causes that voltage to increase significantly
> > (the observed RAM energy usage pattern is consistent with that). This
> > means that running at P-states above 20 is only really justified if
> > there is a strict performance requirement that can't be met otherwise.
> >
> > Can you please check what value is there in the base_frequency sysfs
> > attribute under cpuX/cpufreq/?
>
> 2100000, which should be pstate 21
>
> >
> > I'm guessing that the package voltage level for P-states 10 and 20 is
> > the same, so the power difference between them is not significant
> > relative to the difference between P-state 20 and 21 and if increasing
> > the P-state causes some extra idle time to appear in the workload
> > (even though there is not enough of it to prevent to overall
> > utilization from increasing), then the overall power draw when running
> > at P-state 10 may be greater that for P-state 20.
>
> My impression is that the package voltage level for P-states 10 to 20 is
> high enough that increasing the frequency has little impact. But the code
> runs twice as fast, which reduces the execution time a lot, saving energy.
>
> My first experiment had only one running thread. I also tried running 32
> spinning threads for 10 seconds, ie using up one package and leaving the
> other idle. In this case, instead of staying around 600J for pstates
> 10-20, the pstate rises from 743 to 946. But there is still a gap between
> 20 and 21, with 21 being 1392J.
>
> > You can check if there is any C-state residency difference between
> > these two cases by running the workload under turbostat in each of
> > them.
>
> The C1 and C6 cases (CPU%c1 and CPU%c6) are about the same between 20 and
> 21, whether with 1 thread or with 32 thread.

I meant to compare P-state 10 and P-state 20.

20 and 21 are really close as far as the performance is concerned, so
I wouldn't expect to see any significant C-state residency difference
between them.

2021-12-30 18:21:00

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:

> On Thu, Dec 30, 2021 at 6:54 PM Julia Lawall <[email protected]> wrote:
> >
> > > > The effect is the same. But that approach is indeed simpler than patching
> > > > the kernel.
> > >
> > > It is also applicable when intel_pstate runs in the active mode.
> > >
> > > As for the results that you have reported, it looks like the package
> > > power on these systems is dominated by package voltage and going from
> > > P-state 20 to P-state 21 causes that voltage to increase significantly
> > > (the observed RAM energy usage pattern is consistent with that). This
> > > means that running at P-states above 20 is only really justified if
> > > there is a strict performance requirement that can't be met otherwise.
> > >
> > > Can you please check what value is there in the base_frequency sysfs
> > > attribute under cpuX/cpufreq/?
> >
> > 2100000, which should be pstate 21
> >
> > >
> > > I'm guessing that the package voltage level for P-states 10 and 20 is
> > > the same, so the power difference between them is not significant
> > > relative to the difference between P-state 20 and 21 and if increasing
> > > the P-state causes some extra idle time to appear in the workload
> > > (even though there is not enough of it to prevent to overall
> > > utilization from increasing), then the overall power draw when running
> > > at P-state 10 may be greater that for P-state 20.
> >
> > My impression is that the package voltage level for P-states 10 to 20 is
> > high enough that increasing the frequency has little impact. But the code
> > runs twice as fast, which reduces the execution time a lot, saving energy.
> >
> > My first experiment had only one running thread. I also tried running 32
> > spinning threads for 10 seconds, ie using up one package and leaving the
> > other idle. In this case, instead of staying around 600J for pstates
> > 10-20, the pstate rises from 743 to 946. But there is still a gap between
> > 20 and 21, with 21 being 1392J.
> >
> > > You can check if there is any C-state residency difference between
> > > these two cases by running the workload under turbostat in each of
> > > them.
> >
> > The C1 and C6 cases (CPU%c1 and CPU%c6) are about the same between 20 and
> > 21, whether with 1 thread or with 32 thread.
>
> I meant to compare P-state 10 and P-state 20.
>
> 20 and 21 are really close as far as the performance is concerned, so
> I wouldn't expect to see any significant C-state residency difference
> between them.

There's also no difference between 10 and 20. This seems normal, because
the same cores are either fully used or fully idle in both cases. The
idle ones are almost always in C6.

julia

2021-12-30 18:37:16

by Rafael J. Wysocki

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Thu, Dec 30, 2021 at 7:21 PM Julia Lawall <[email protected]> wrote:
>
>
>
> On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Thu, Dec 30, 2021 at 6:54 PM Julia Lawall <[email protected]> wrote:
> > >
> > > > > The effect is the same. But that approach is indeed simpler than patching
> > > > > the kernel.
> > > >
> > > > It is also applicable when intel_pstate runs in the active mode.
> > > >
> > > > As for the results that you have reported, it looks like the package
> > > > power on these systems is dominated by package voltage and going from
> > > > P-state 20 to P-state 21 causes that voltage to increase significantly
> > > > (the observed RAM energy usage pattern is consistent with that). This
> > > > means that running at P-states above 20 is only really justified if
> > > > there is a strict performance requirement that can't be met otherwise.
> > > >
> > > > Can you please check what value is there in the base_frequency sysfs
> > > > attribute under cpuX/cpufreq/?
> > >
> > > 2100000, which should be pstate 21
> > >
> > > >
> > > > I'm guessing that the package voltage level for P-states 10 and 20 is
> > > > the same, so the power difference between them is not significant
> > > > relative to the difference between P-state 20 and 21 and if increasing
> > > > the P-state causes some extra idle time to appear in the workload
> > > > (even though there is not enough of it to prevent to overall
> > > > utilization from increasing), then the overall power draw when running
> > > > at P-state 10 may be greater that for P-state 20.
> > >
> > > My impression is that the package voltage level for P-states 10 to 20 is
> > > high enough that increasing the frequency has little impact. But the code
> > > runs twice as fast, which reduces the execution time a lot, saving energy.
> > >
> > > My first experiment had only one running thread. I also tried running 32
> > > spinning threads for 10 seconds, ie using up one package and leaving the
> > > other idle. In this case, instead of staying around 600J for pstates
> > > 10-20, the pstate rises from 743 to 946. But there is still a gap between
> > > 20 and 21, with 21 being 1392J.
> > >
> > > > You can check if there is any C-state residency difference between
> > > > these two cases by running the workload under turbostat in each of
> > > > them.
> > >
> > > The C1 and C6 cases (CPU%c1 and CPU%c6) are about the same between 20 and
> > > 21, whether with 1 thread or with 32 thread.
> >
> > I meant to compare P-state 10 and P-state 20.
> >
> > 20 and 21 are really close as far as the performance is concerned, so
> > I wouldn't expect to see any significant C-state residency difference
> > between them.
>
> There's also no difference between 10 and 20. This seems normal, because
> the same cores are either fully used or fully idle in both cases. The
> idle ones are almost always in C6.

The turbostat output sent by you previously shows that the CPUs doing
the work are only about 15-or-less percent busy, though, and you get
quite a bit of C-state residency on them. I'm assuming that this is
for 1 running thread.

Can you please run the 32 spinning threads workload (ie. on one
package) and with P-state locked to 10 and then to 20 under turbostat
and send me the turbostat output for both runs?

2021-12-30 18:44:12

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:

> On Thu, Dec 30, 2021 at 7:21 PM Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:
> >
> > > On Thu, Dec 30, 2021 at 6:54 PM Julia Lawall <[email protected]> wrote:
> > > >
> > > > > > The effect is the same. But that approach is indeed simpler than patching
> > > > > > the kernel.
> > > > >
> > > > > It is also applicable when intel_pstate runs in the active mode.
> > > > >
> > > > > As for the results that you have reported, it looks like the package
> > > > > power on these systems is dominated by package voltage and going from
> > > > > P-state 20 to P-state 21 causes that voltage to increase significantly
> > > > > (the observed RAM energy usage pattern is consistent with that). This
> > > > > means that running at P-states above 20 is only really justified if
> > > > > there is a strict performance requirement that can't be met otherwise.
> > > > >
> > > > > Can you please check what value is there in the base_frequency sysfs
> > > > > attribute under cpuX/cpufreq/?
> > > >
> > > > 2100000, which should be pstate 21
> > > >
> > > > >
> > > > > I'm guessing that the package voltage level for P-states 10 and 20 is
> > > > > the same, so the power difference between them is not significant
> > > > > relative to the difference between P-state 20 and 21 and if increasing
> > > > > the P-state causes some extra idle time to appear in the workload
> > > > > (even though there is not enough of it to prevent to overall
> > > > > utilization from increasing), then the overall power draw when running
> > > > > at P-state 10 may be greater that for P-state 20.
> > > >
> > > > My impression is that the package voltage level for P-states 10 to 20 is
> > > > high enough that increasing the frequency has little impact. But the code
> > > > runs twice as fast, which reduces the execution time a lot, saving energy.
> > > >
> > > > My first experiment had only one running thread. I also tried running 32
> > > > spinning threads for 10 seconds, ie using up one package and leaving the
> > > > other idle. In this case, instead of staying around 600J for pstates
> > > > 10-20, the pstate rises from 743 to 946. But there is still a gap between
> > > > 20 and 21, with 21 being 1392J.
> > > >
> > > > > You can check if there is any C-state residency difference between
> > > > > these two cases by running the workload under turbostat in each of
> > > > > them.
> > > >
> > > > The C1 and C6 cases (CPU%c1 and CPU%c6) are about the same between 20 and
> > > > 21, whether with 1 thread or with 32 thread.
> > >
> > > I meant to compare P-state 10 and P-state 20.
> > >
> > > 20 and 21 are really close as far as the performance is concerned, so
> > > I wouldn't expect to see any significant C-state residency difference
> > > between them.
> >
> > There's also no difference between 10 and 20. This seems normal, because
> > the same cores are either fully used or fully idle in both cases. The
> > idle ones are almost always in C6.
>
> The turbostat output sent by you previously shows that the CPUs doing
> the work are only about 15-or-less percent busy, though, and you get
> quite a bit of C-state residency on them. I'm assuming that this is
> for 1 running thread.
>
> Can you please run the 32 spinning threads workload (ie. on one
> package) and with P-state locked to 10 and then to 20 under turbostat
> and send me the turbostat output for both runs?

Attached.

Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo

julia


Attachments:
spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo (4.73 kB)
spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo (4.72 kB)

2022-01-03 15:50:58

by Rafael J. Wysocki

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Thu, Dec 30, 2021 at 7:44 PM Julia Lawall <[email protected]> wrote:
>
>
>
> On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Thu, Dec 30, 2021 at 7:21 PM Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Thu, 30 Dec 2021, Rafael J. Wysocki wrote:
> > >
> > > > On Thu, Dec 30, 2021 at 6:54 PM Julia Lawall <[email protected]> wrote:
> > > > >
> > > > > > > The effect is the same. But that approach is indeed simpler than patching
> > > > > > > the kernel.
> > > > > >
> > > > > > It is also applicable when intel_pstate runs in the active mode.
> > > > > >
> > > > > > As for the results that you have reported, it looks like the package
> > > > > > power on these systems is dominated by package voltage and going from
> > > > > > P-state 20 to P-state 21 causes that voltage to increase significantly
> > > > > > (the observed RAM energy usage pattern is consistent with that). This
> > > > > > means that running at P-states above 20 is only really justified if
> > > > > > there is a strict performance requirement that can't be met otherwise.
> > > > > >
> > > > > > Can you please check what value is there in the base_frequency sysfs
> > > > > > attribute under cpuX/cpufreq/?
> > > > >
> > > > > 2100000, which should be pstate 21
> > > > >
> > > > > >
> > > > > > I'm guessing that the package voltage level for P-states 10 and 20 is
> > > > > > the same, so the power difference between them is not significant
> > > > > > relative to the difference between P-state 20 and 21 and if increasing
> > > > > > the P-state causes some extra idle time to appear in the workload
> > > > > > (even though there is not enough of it to prevent to overall
> > > > > > utilization from increasing), then the overall power draw when running
> > > > > > at P-state 10 may be greater that for P-state 20.
> > > > >
> > > > > My impression is that the package voltage level for P-states 10 to 20 is
> > > > > high enough that increasing the frequency has little impact. But the code
> > > > > runs twice as fast, which reduces the execution time a lot, saving energy.
> > > > >
> > > > > My first experiment had only one running thread. I also tried running 32
> > > > > spinning threads for 10 seconds, ie using up one package and leaving the
> > > > > other idle. In this case, instead of staying around 600J for pstates
> > > > > 10-20, the pstate rises from 743 to 946. But there is still a gap between
> > > > > 20 and 21, with 21 being 1392J.
> > > > >
> > > > > > You can check if there is any C-state residency difference between
> > > > > > these two cases by running the workload under turbostat in each of
> > > > > > them.
> > > > >
> > > > > The C1 and C6 cases (CPU%c1 and CPU%c6) are about the same between 20 and
> > > > > 21, whether with 1 thread or with 32 thread.
> > > >
> > > > I meant to compare P-state 10 and P-state 20.
> > > >
> > > > 20 and 21 are really close as far as the performance is concerned, so
> > > > I wouldn't expect to see any significant C-state residency difference
> > > > between them.
> > >
> > > There's also no difference between 10 and 20. This seems normal, because
> > > the same cores are either fully used or fully idle in both cases. The
> > > idle ones are almost always in C6.
> >
> > The turbostat output sent by you previously shows that the CPUs doing
> > the work are only about 15-or-less percent busy, though, and you get
> > quite a bit of C-state residency on them. I'm assuming that this is
> > for 1 running thread.
> >
> > Can you please run the 32 spinning threads workload (ie. on one
> > package) and with P-state locked to 10 and then to 20 under turbostat
> > and send me the turbostat output for both runs?
>
> Attached.
>
> Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
> Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo

Well, in both cases there is only 1 CPU running and it is running at
1 GHz (ie. P-state 10) all the time as far as I can say.

2022-01-03 16:42:13

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

> > > Can you please run the 32 spinning threads workload (ie. on one
> > > package) and with P-state locked to 10 and then to 20 under turbostat
> > > and send me the turbostat output for both runs?
> >
> > Attached.
> >
> > Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
> > Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo
>
> Well, in both cases there is only 1 CPU running and it is running at
> 1 GHz (ie. P-state 10) all the time as far as I can say.

Indeed. I'll check on that.

julia

2022-01-03 18:23:42

by Julia Lawall

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

> > > Can you please run the 32 spinning threads workload (ie. on one
> > > package) and with P-state locked to 10 and then to 20 under turbostat
> > > and send me the turbostat output for both runs?
> >
> > Attached.
> >
> > Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
> > Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo
>
> Well, in both cases there is only 1 CPU running and it is running at
> 1 GHz (ie. P-state 10) all the time as far as I can say.

It looks better now. I included 1 core (core 0) for pstates 10, 20, and
21, and 32 cores (socket 0) for the same pstates.

julia


Attachments:
132.tar.gz (5.74 kB)

2022-01-03 19:59:02

by Rafael J. Wysocki

Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Mon, Jan 3, 2022 at 7:23 PM Julia Lawall <[email protected]> wrote:
>
> > > > Can you please run the 32 spinning threads workload (ie. on one
> > > > package) and with P-state locked to 10 and then to 20 under turbostat
> > > > and send me the turbostat output for both runs?
> > >
> > > Attached.
> > >
> > > Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
> > > Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo
> >
> > Well, in both cases there is only 1 CPU running and it is running at
> > 1 GHz (ie. P-state 10) all the time as far as I can say.
>
> It looks better now. I included 1 core (core 0) for pstates 10, 20, and
> 21, and 32 cores (socket 0) for the same pstates.

OK, so let's first consider the runs where 32 cores (entire socket 0)
are doing the work.

This set of data clearly shows that running the busy cores at 1 GHz
takes less energy than running them at 2 GHz (the ratio of these
numbers is roughly 2/3 if I got that right). This means that P-state
10 is more energy efficient than P-state 20, as expected.

However, the cost of running at 2.1 GHz is much greater than the cost
of running at 2 GHz and I'm still thinking that this is attributable
to some kind of voltage increase between P-state 20 and P-state 21
(which, interestingly enough, affects the second "idle" socket too).

In the other set of data, where only 1 CPU is doing the work, P-state
10 is still more energy-efficient than P-state 20, but it takes more
time to do the work at 1 GHz, so the energy lost due to leakage
increases too and it is "leaked" by all of the CPUs in the package
(including the idle ones in core C-states), so overall this loss
offsets the gain from using a more energy-efficient P-state. At the
same time, socket 1 can spend more time in PC2 when the busy CPU is
running at 2 GHz (which means less leakage in that socket), so with 1
CPU doing the work the total cost of running at 2 GHz is slightly
smaller than the total cost of running at 1 GHz. [Note how important
it is to take the other CPUs in the system into account in this case,
because there are simply enough of them to affect one-CPU measurements
in a significant way.]

Still, when going from 2 GHz to 2.1 GHz, the voltage jump causes the
energy to increase significantly again.

2022-01-03 20:51:38

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Mon, 3 Jan 2022, Rafael J. Wysocki wrote:

> On Mon, Jan 3, 2022 at 7:23 PM Julia Lawall <[email protected]> wrote:
> >
> > > > > Can you please run the 32 spinning threads workload (ie. on one
> > > > > package) and with P-state locked to 10 and then to 20 under turbostat
> > > > > and send me the turbostat output for both runs?
> > > >
> > > > Attached.
> > > >
> > > > Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
> > > > Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo
> > >
> > > Well, in both cases there is only 1 CPU running and it is running at
> > > 1 GHz (ie. P-state 10) all the time as far as I can say.
> >
> > It looks better now. I included 1 core (core 0) for pstates 10, 20, and
> > 21, and 32 cores (socket 0) for the same pstates.
>
> OK, so let's first consider the runs where 32 cores (entire socket 0)
> are doing the work.
>
> This set of data clearly shows that running the busy cores at 1 GHz
> takes less energy than running them at 2 GHz (the ratio of these
> numbers is roughly 2/3 if I got that right). This means that P-state
> 10 is more energy efficient than P-state 20, as expected.

Here all the threads always spin for 10 seconds. But if they had a fixed
amount of work to do, they should finish twice as fast at pstate 20.
Currently, we have 708J at pstate 10 and 905J at pstate 20, but if we can
divide the time at pstate 20 by 2, we should be around 450J, which is much
less than 708J.

turbostat -J sleep 5 shows 105J, so we're still ahead.

I haven't yet tried the actual experiment of spinning for 5 seconds and
then sleeping for 5 seconds, though.
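For reference, the fixed-work arithmetic above can be sketched in a few lines (a back-of-the-envelope check; the 708 J, 905 J, and 105 J figures are the ones quoted in this message):

```python
# Energies quoted above, measured with turbostat over 10 s runs.
E_p10 = 708.0        # J: 10 s of spinning at P-state 10 (1 GHz)
E_p20_10s = 905.0    # J: 10 s of spinning at P-state 20 (2 GHz)
E_idle_5s = 105.0    # J: 5 s of idling, per "turbostat -J sleep 5"

# For a fixed amount of work, P-state 20 should finish in about 5 s.
E_p20_work = E_p20_10s / 2            # ~452.5 J for the work itself
E_p20_total = E_p20_work + E_idle_5s  # ~557.5 J including 5 s of idling

print(E_p20_work, E_p20_total, E_p10)
```

Even after paying for the idle tail, the 2 GHz estimate stays below the 708 J measured at 1 GHz.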

>
> However, the cost of running at 2.1 GHz is much greater than the cost
> of running at 2 GHz and I'm still thinking that this is attributable
> to some kind of voltage increase between P-state 20 and P-state 21
> (which, interestingly enough, affects the second "idle" socket too).
>
> In the other set of data, where only 1 CPU is doing the work, P-state
> 10 is still more energy-efficient than P-state 20,

Actually, this doesn't seem to be the case. It's surely due to the
approximation of the result, but the consumption is slightly lower for
pstate 20. With more runs it probably averages out to around the same.

julia

> but it takes more
> time to do the work at 1 GHz, so the energy lost due to leakage
> increases too and it is "leaked" by all of the CPUs in the package
> (including the idle ones in core C-states), so overall this loss
> offsets the gain from using a more energy-efficient P-state. At the
> same time, socket 1 can spend more time in PC2 when the busy CPU is
> running at 2 GHz (which means less leakage in that socket), so with 1
> CPU doing the work the total cost of running at 2 GHz is slightly
> smaller than the total cost of running at 1 GHz. [Note how important
> it is to take the other CPUs in the system into account in this case,
> because there are simply enough of them to affect one-CPU measurements
> in a significant way.]
>
> Still, when going from 2 GHz to 2.1 GHz, the voltage jump causes the
> energy to increase significantly again.
>

2022-01-04 14:10:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Mon, Jan 3, 2022 at 9:51 PM Julia Lawall <[email protected]> wrote:
>
>
>
> On Mon, 3 Jan 2022, Rafael J. Wysocki wrote:
>
> > On Mon, Jan 3, 2022 at 7:23 PM Julia Lawall <[email protected]> wrote:
> > >
> > > > > > Can you please run the 32 spinning threads workload (ie. on one
> > > > > > package) and with P-state locked to 10 and then to 20 under turbostat
> > > > > > and send me the turbostat output for both runs?
> > > > >
> > > > > Attached.
> > > > >
> > > > > Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
> > > > > Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo
> > > >
> > > > Well, in both cases there is only 1 CPU running and it is running at
> > > > 1 GHz (ie. P-state 10) all the time as far as I can say.
> > >
> > > It looks better now. I included 1 core (core 0) for pstates 10, 20, and
> > > 21, and 32 cores (socket 0) for the same pstates.
> >
> > OK, so let's first consider the runs where 32 cores (entire socket 0)
> > are doing the work.
> >
> > This set of data clearly shows that running the busy cores at 1 GHz
> > takes less energy than running them at 2 GHz (the ratio of these
> > numbers is roughly 2/3 if I got that right). This means that P-state
> > 10 is more energy efficient than P-state 20, as expected.
>
> Here all the threads always spin for 10 seconds.

That escaped me, sorry.

> But if they had a fixed
> amount of work to do, they should finish twice as fast at pstate 20.
> Currently, we have 708J at pstate 10 and 905J at pstate 20, but if we can
> divide the time at pstate 20 by 2, we should be around 450J, which is much
> less than 708J.

But socket 1 is idle and only slightly affected by P-state changes in
the range below P-state 21, so the difference that matters here is
between socket 0 running at 1 GHz and that socket running at 2 GHz,
which is 420 J vs 620 J (rounded to the closest multiple of 10 J).

> turbostat -J sleep 5 shows 105J, so we're still ahead.
>
> I haven't yet tried the actual experiment of spinning for 5 seconds and
> then sleeping for 5 seconds, though.
>
> >
> > However, the cost of running at 2.1 GHz is much greater than the cost
> > of running at 2 GHz and I'm still thinking that this is attributable
> > to some kind of voltage increase between P-state 20 and P-state 21
> > (which, interestingly enough, affects the second "idle" socket too).
> >
> > In the other set of data, where only 1 CPU is doing the work, P-state
> > 10 is still more energy-efficient than P-state 20,
>
> Actually, this doesn't seem to be the case. It's surely due to the
> approximation of the result, but the consumption is slightly lower for
> pstate 20. With more runs it probably averages out to around the same.

First of all, the cost of keeping a socket in the state in which CPUs
can execute code (referred to as PS0 sometimes) is relatively large on
that system.

Because socket 1 spending the vast majority of time in PC2 (in which
instructions cannot be executed by the CPUs in it) consistently draws
around 29 W when CPUs in socket 0 run at 1-2 GHz, the power needed to
keep socket 0 in PC0 must be larger than this and it looks like it is
around 30 W for the given range of P-states (because it cannot exceed
the total power needed to run 1 CPU at 1 GHz). Running 1 CPU 100% busy
on top of that makes around 1% of a difference which is likely below
the accuracy of the power meter (ie. in the noise).

In the case when all of the 16 cores (32 CPUs) in socket 0 are running,
we have the 29 W drawn by socket 1 (idle), around 30 W drawn by the
memory (on both sockets), 30 W drawn by socket 0 just because it is in
PC0 all the time, and the power drawn because the cores are actually
running. That last part is around 12 W when they are running at 1 GHz
or around 32 W when they are running at 2 GHz, so if the running cores
alone are taken into consideration, the latter is still more expensive
after all, even though the work is done twice as fast.
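The cores-only part of that comparison can be checked with a quick calculation (a sketch using the wattages estimated above; the variable names are mine):

```python
# Cores-only energy for a fixed amount of work, using the estimated
# per-core power figures from above: ~12 W at 1 GHz, ~32 W at 2 GHz.
T = 1.0                      # normalized time the work takes at 1 GHz
P_cores_1GHz = 12.0          # W, cores running at 1 GHz
P_cores_2GHz = 32.0          # W, cores running at 2 GHz

E_cores_1GHz = P_cores_1GHz * T          # energy over the full time T
E_cores_2GHz = P_cores_2GHz * (T / 2)    # twice as fast, but 16 > 12

print(E_cores_1GHz, E_cores_2GHz)
```

So for the running cores alone, 2 GHz costs more per unit of work, even though the rest of the package's fixed draw can reverse that conclusion for the system as a whole.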

However, in practice everything counts, not just the running cores
alone, so what is more efficient really depends on the use case.

For example, if it is known that at least 1 CPU will be 100% busy all
the time, the power drawn by socket 1 (mostly in PC2), by the memory
and in order to hold socket 0 in PC0 will need to be drawn anyway and
in that case 1 GHz is more efficient.

If the system as a whole can be completely idle at least from time to
time (in which state it will draw much less power as a whole), though,
it is likely more efficient to run the CPUs at 2 GHz.

2022-01-04 15:49:12

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

I tried the whole experiment again on an Intel w2155 (one socket, 10
physical cores, pstates 12, 33, and 45).

For the CPU there is a small jump between 32 and 33 - less than for the
6130.

For the RAM, there is a big jump between 21 and 22.

Combining them leaves a big jump between 21 and 22.

It seems that the definition of efficient is that there is no more cost
for the computation than the cost of simply having the machine doing any
computation at all. It doesn't take into account the time and energy
required to do some actual amount of work.

julia

2022-01-04 19:22:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Tue, Jan 4, 2022 at 4:49 PM Julia Lawall <[email protected]> wrote:
>
> I tried the whole experiment again on an Intel w2155 (one socket, 10
> physical cores, pstates 12, 33, and 45).
>
> For the CPU there is a small jump between 32 and 33 - less than for the
> 6130.
>
> For the RAM, there is a big jump between 21 and 22.
>
> Combining them leaves a big jump between 21 and 22.

These jumps are most likely related to voltage increases.

> It seems that the definition of efficient is that there is no more cost
> for the computation than the cost of simply having the machine doing any
> computation at all. It doesn't take into account the time and energy
> required to do some actual amount of work.

Well, that's not what I wanted to say.

Of course, the configuration that requires less energy to be spent to
do a given amount of work is more energy-efficient. To measure this,
the system needs to be given exactly the same amount of work for each
run and the energy spent by it during each run needs to be compared.

However, I think that you are interested in answering a different
question: Given a specific amount of time (say T) to run the workload,
what frequency to run the CPUs doing the work at in order to get the
maximum amount of work done per unit of energy spent by the system (as
a whole)? Or, given 2 different frequency levels, which of them to
run the CPUs at to get more work done per energy unit?

The work / energy ratio can be estimated as

W / E = C * f / P(f)

where C is a constant and P(f) is the power drawn by the whole system
while the CPUs doing the work are running at frequency f, and of
course for the system discussed previously it is greater in the 2 GHz
case.

However P(f) can be divided into two parts, P_1(f) that really depends
on the frequency and P_0 that does not depend on it. If P_0 is large
enough to dominate P(f), which is the case in the 10-20 range of
P-states on the system in question, it is better to run the CPUs doing
the work faster (as long as there is always enough work to do for
them; see below). This doesn't mean that P(f) is not a convex
function of f, though.

Moreover, this assumes that there will always be enough work for the
system to do when running the busy CPUs at 2 GHz, or that it can go
completely idle when it doesn't do any work, but let's see what
happens if the amount of work to do is W_1 = C * 1 GHz * T and the
system cannot go completely idle when the work is done.

Then, nothing changes for the busy CPUs running at 1 GHz, but in the 2
GHz case we get W = W_1 and E = P(2 GHz) * T/2 + P_0 * T/2, because
the busy CPUs are only busy 1/2 of the time, but power P_0 is drawn by
the system regardless. Hence, in the 2 GHz case (assuming P(2 GHz) =
120 W and P_0 = 90 W), we get

W / E = 2 * C * 1 GHz / (P(2 GHz) + P_0) = 0.0095 * C * 1 GHz

which is slightly less than the W / E ratio at 1 GHz approximately
equal to 0.01 * C * 1 GHz (assuming P(1 GHz) = 100 W), so in these
conditions it would be better to run the busy CPUs at 1 GHz.
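For reference, the arithmetic above can be checked with a short script (the constant C and the 1 GHz factor are common to both ratios and are dropped; the power values are the assumed ones from this message):

```python
# Numerical check of the W/E comparison: assumed whole-system power at
# each frequency, plus the frequency-independent floor P_0 drawn even
# when the busy CPUs are done with the work.
P_1GHz = 100.0   # W, assumed P(1 GHz)
P_2GHz = 120.0   # W, assumed P(2 GHz)
P_0 = 90.0       # W, assumed frequency-independent draw

# 1 GHz case: the CPUs are busy for the whole interval T.
we_1GHz = 1.0 / P_1GHz            # 0.01 (in units of C * 1 GHz)

# 2 GHz case: busy for T/2, then the system draws P_0 for the other T/2,
# so W / E = 2 / (P(2 GHz) + P_0).
we_2GHz = 2.0 / (P_2GHz + P_0)    # ~0.0095

print(we_1GHz, round(we_2GHz, 4))
```

With these assumptions the 2 GHz ratio comes out slightly below the 1 GHz one, matching the conclusion above.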

2022-01-05 00:38:45

by Francisco Jerez

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Mon, 3 Jan 2022, Rafael J. Wysocki wrote:
>
>> On Mon, Jan 3, 2022 at 7:23 PM Julia Lawall <[email protected]> wrote:
>> >
>> > > > > Can you please run the 32 spinning threads workload (ie. on one
>> > > > > package) and with P-state locked to 10 and then to 20 under turbostat
>> > > > > and send me the turbostat output for both runs?
>> > > >
>> > > > Attached.
>> > > >
>> > > > Pstate 10: spin_minmax_10_dahu-9_5.15.0freq_schedutil_11.turbo
>> > > > Pstate 20: spin_minmax_20_dahu-9_5.15.0freq_schedutil_11.turbo
>> > >
>> > > Well, in both cases there is only 1 CPU running and it is running at
>> > > 1 GHz (ie. P-state 10) all the time as far as I can say.
>> >
>> > It looks better now. I included 1 core (core 0) for pstates 10, 20, and
>> > 21, and 32 cores (socket 0) for the same pstates.
>>
>> OK, so let's first consider the runs where 32 cores (entire socket 0)
>> are doing the work.
>>
>> This set of data clearly shows that running the busy cores at 1 GHz
>> takes less energy than running them at 2 GHz (the ratio of these
>> numbers is roughly 2/3 if I got that right). This means that P-state
>> 10 is more energy efficient than P-state 20, as expected.

Uhm, that's not what I'm seeing, Rafael.

>
> Here all the threads always spin for 10 seconds. But if they had a fixed
> amount of work to do, they should finish twice as fast at pstate 20.
> Currently, we have 708J at pstate 10 and 905J at pstate 20, but if we can
> divide the time at pstate 20 by 2, we should be around 450J, which is much
> less than 708J.
>

I agree with Julia on this: According to the last turbostat logs
attached to this thread, CPU package #0 consumes 618 J for 32 threads
spinning at 2GHz for 10s, and 421 J for the same number of threads
spinning at 1GHz for roughly the same time, therefore at P-state 10 we
observe an energy efficiency (based on Rafael's own definition of energy
efficiency elsewhere in this thread) of 1GHz*10s/421J = 24 Mclocks/J,
while at P-state 20 we observe an energy efficiency of 2GHz*10s/618J =
32 Mclocks/J, so P-state 20 is clearly the most energy-efficient in
Julia's setup, even if we only consider one of the CPU packages in her
system (considering the effect of the second CPU package would further
bias the result in favor of P-state 20).
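Those efficiency figures can be reproduced with a short script (energies taken from the turbostat logs as quoted above; the helper name is mine):

```python
# Work done (in millions of clock cycles) per joule consumed, for a
# spinning workload of known frequency and duration.
def mclocks_per_joule(freq_hz, seconds, joules):
    return freq_hz * seconds / joules / 1e6

eff_p10 = mclocks_per_joule(1e9, 10, 421)   # ~24 Mclocks/J at P-state 10
eff_p20 = mclocks_per_joule(2e9, 10, 618)   # ~32 Mclocks/J at P-state 20

print(round(eff_p10), round(eff_p20))
```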

Since her latest experiment is utilizing all 16 cores of the package
close to 100% of the time, I think this rules out our earlier theory of
this being the result of broken idle management. Two alternative
explanations I can think of:

- Voltage scaling isn't functioning as expected: your CPU's reported
maximum efficiency ratio may be calculated based on the assumption
that your CPU would be running at a lower voltage around P-state 10,
which for some reason isn't the case in your system.

- MSR_PLATFORM_INFO is misreporting the maximum efficiency ratio as
suggested earlier.

> turbostat -J sleep 5 shows 105J, so we're still ahead.
>
> I haven't yet tried the actual experiment of spinning for 5 seconds and
> then sleeping for 5 seconds, though.
>
>>
>> However, the cost of running at 2.1 GHz is much greater than the cost
>> of running at 2 GHz and I'm still thinking that this is attributable
>> to some kind of voltage increase between P-state 20 and P-state 21
>> (which, interestingly enough, affects the second "idle" socket too).
>>
>> In the other set of data, where only 1 CPU is doing the work, P-state
>> 10 is still more energy-efficient than P-state 20,
>
> Actually, this doesn't seem to be the case. It's surely due to the
> approximation of the result, but the consumption is slightly lower for
> pstate 20. With more runs it probably averages out to around the same.
>

Yeah, I agree that the data seems to confirm P-state 20 being truly more
efficient than P-state 10, whether 1 or 16 cores are in use.

> julia
>
>> but it takes more
>> time to do the work at 1 GHz, so the energy lost due to leakage
>> increases too and it is "leaked" by all of the CPUs in the package
>> (including the idle ones in core C-states), so overall this loss
>> offsets the gain from using a more energy-efficient P-state. At the
>> same time, socket 1 can spend more time in PC2 when the busy CPU is
>> running at 2 GHz (which means less leakage in that socket), so with 1
>> CPU doing the work the total cost of running at 2 GHz is slightly
>> smaller than the total cost of running at 1 GHz. [Note how important
>> it is to take the other CPUs in the system into account in this case,
>> because there are simply enough of them to affect one-CPU measurements
>> in a significant way.]
>>
>> Still, when going from 2 GHz to 2.1 GHz, the voltage jump causes the
>> energy to increase significantly again.
>>

2022-01-05 20:19:47

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Tue, 4 Jan 2022, Rafael J. Wysocki wrote:

> On Tue, Jan 4, 2022 at 4:49 PM Julia Lawall <[email protected]> wrote:
> >
> > I tried the whole experiment again on an Intel w2155 (one socket, 10
> > physical cores, pstates 12, 33, and 45).
> >
> > For the CPU there is a small jump between 32 and 33 - less than for the
> > 6130.
> >
> > For the RAM, there is a big jump between 21 and 22.
> >
> > Combining them leaves a big jump between 21 and 22.
>
> These jumps are most likely related to voltage increases.
>
> > It seems that the definition of efficient is that there is no more cost
> > for the computation than the cost of simply having the machine doing any
> > computation at all. It doesn't take into account the time and energy
> > required to do some actual amount of work.
>
> Well, that's not what I wanted to say.

I was referring to Francisco's comment that the lowest indicated frequency
should be the most efficient one. Turbostat also reports the lowest
frequency as the most efficient one. In my graph, pstates 7 and 10 give
exactly the same energy consumption as 12. 7 and 10
are certainly less efficient, because the energy consumption is the same,
but the execution speed is lower.

> Of course, the configuration that requires less energy to be spent to
> do a given amount of work is more energy-efficient. To measure this,
> the system needs to be given exactly the same amount of work for each
> run and the energy spent by it during each run needs to be compared.

This is basically my point of view, but there is a question about it. If
over 10 seconds you consume 10J and by running twice as fast you would
consume only 6J, then how do you account for the next 5 seconds? If the
machine is then idle for the next 5 seconds, maybe you would end up
consuming 8J in total over the 10 seconds. But if you take advantage of
the free 5 seconds to pack in another job, then you end up consuming 12J.

> However, I think that you are interested in answering a different
> question: Given a specific amount of time (say T) to run the workload,
> what frequency to run the CPUs doing the work at in order to get the
> maximum amount of work done per unit of energy spent by the system (as
> a whole)? Or, given 2 different frequency levels, which of them to
> run the CPUs at to get more work done per energy unit?

This is the approach where you assume that the machine will be idle in any
leftover time. And it accounts for the energy consumed in that idle time.

> The work / energy ratio can be estimated as
>
> W / E = C * f / P(f)
>
> where C is a constant and P(f) is the power drawn by the whole system
> while the CPUs doing the work are running at frequency f, and of
> course for the system discussed previously it is greater in the 2 GHz
> case.
>
> However P(f) can be divided into two parts, P_1(f) that really depends
> on the frequency and P_0 that does not depend on it. If P_0 is large
> enough to dominate P(f), which is the case in the 10-20 range of
> P-states on the system in question, it is better to run the CPUs doing
> the work faster (as long as there is always enough work to do for
> them; see below). This doesn't mean that P(f) is not a convex
> function of f, though.
>
> Moreover, this assumes that there will always be enough work for the
> system to do when running the busy CPUs at 2 GHz, or that it can go
> completely idle when it doesn't do any work, but let's see what
> happens if the amount of work to do is W_1 = C * 1 GHz * T and the
> system cannot go completely idle when the work is done.
>
> Then, nothing changes for the busy CPUs running at 1 GHz, but in the 2
> GHz case we get W = W_1 and E = P(2 GHz) * T/2 + P_0 * T/2, because
> the busy CPUs are only busy 1/2 of the time, but power P_0 is drawn by
> the system regardless. Hence, in the 2 GHz case (assuming P(2 GHz) =
> 120 W and P_0 = 90 W), we get
>
> W / E = 2 * C * 1 GHz / (P(2 GHz) + P_0) = 0.0095 * C * 1 GHz
>
> which is slightly less than the W / E ratio at 1 GHz approximately
> equal to 0.01 * C * 1 GHz (assuming P(1 GHz) = 100 W), so in these
> conditions it would be better to run the busy CPUs at 1 GHz.

OK, I'll try to measure this.

thanks,
julia

2022-01-05 23:47:01

by Francisco Jerez

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

Julia Lawall <[email protected]> writes:

> On Tue, 4 Jan 2022, Rafael J. Wysocki wrote:
>
>> On Tue, Jan 4, 2022 at 4:49 PM Julia Lawall <[email protected]> wrote:
>> >
>> > I tried the whole experiment again on an Intel w2155 (one socket, 10
>> > physical cores, pstates 12, 33, and 45).
>> >
>> > For the CPU there is a small jump between 32 and 33 - less than for the
>> > 6130.
>> >
>> > For the RAM, there is a big jump between 21 and 22.
>> >
>> > Combining them leaves a big jump between 21 and 22.
>>
>> These jumps are most likely related to voltage increases.
>>
>> > It seems that the definition of efficient is that there is no more cost
>> > for the computation than the cost of simply having the machine doing any
>> > computation at all. It doesn't take into account the time and energy
>> > required to do some actual amount of work.
>>
>> Well, that's not what I wanted to say.
>
> I was referring to Francisco's comment that the lowest indicated frequency
> should be the most efficient one. Turbostat also reports the lowest
> frequency as the most efficient one. In my graph, there are the pstates 7
> and 10, which give exactly the same energy consumption as 12. 7 and 10
> are certainly less efficient, because the energy consumption is the same,
> but the execution speed is lower.
>
>> Of course, the configuration that requires less energy to be spent to
>> do a given amount of work is more energy-efficient. To measure this,
>> the system needs to be given exactly the same amount of work for each
>> run and the energy spent by it during each run needs to be compared.

I disagree that the system needs to be given the exact same amount of
work in order to measure differences in energy efficiency. The average
energy efficiency of Julia's 10s workloads can be calculated easily in
both cases (e.g. as the W/E ratio below, W will just be a different
value for each run), and the result will likely approximate the
instantaneous energy efficiency of the fixed P-states we're comparing,
since her workload seems to be fairly close to a steady state.

>
> This is basically my point of view, but there is a question about it. If
> over 10 seconds you consume 10J and by running twice as fast you would
> consume only 6J, then how do you account for the next 5 seconds? If the
> machine is then idle for the next 5 seconds, maybe you would end up
> consuming 8J in total over the 10 seconds. But if you take advantage of
> the free 5 seconds to pack in another job, then you end up consuming 12J.
>

Geometrically, such an oscillatory workload with periods of idling and
periods of activity would give an average power consumption along the
line that passes through the points corresponding to both states on the
CPU's power curve -- IOW your average power consumption will just be the
weighted average of the power consumption of each state (with the duty
cycle t_i/t_total of each state being its weight):

P_avg = t_0/t_total * P_0 + t_1/t_total * P_1

Your energy usage would just be 10s times that P_avg, since you're
assuming that the total runtime of the workload is fixed at 10s
independent of how long the CPU actually takes to complete the
computation. In cases where the P-state during the period of activity
t_1 is equal to or lower than the maximum efficiency P-state, that line
segment is guaranteed to lie below the power curve, indicating that such
oscillation is more efficient than running the workload fixed to its
average P-state.
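A minimal sketch of that weighted average (the duty-cycle and power numbers below are hypothetical, not taken from the traces):

```python
# Average power of an oscillatory workload: the weighted average of the
# power of each state, weighted by its duty cycle t_i / t_total.
def p_avg(durations, powers):
    t_total = sum(durations)
    return sum(t / t_total * p for t, p in zip(durations, powers))

# Hypothetical example: 5 s idle at 30 W, 5 s busy at 120 W.
P = p_avg([5.0, 5.0], [30.0, 120.0])
print(P)          # average power over the 10 s window
print(P * 10.0)   # total energy, since the runtime is fixed at 10 s
```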

That said, this scenario doesn't really seem very relevant to your case,
since the last workload you've provided turbostat traces for seems to
show almost no oscillation. If there was such an oscillation, your
total energy usage would still be greater for oscillations between idle
and some P-state different from the most efficient one. Such an
oscillation doesn't explain the anomaly we're seeing on your traces,
which show more energy-efficient instantaneous behavior for a P-state 2x
the one reported by your processor as the most energy-efficient.

>> However, I think that you are interested in answering a different
>> question: Given a specific amount of time (say T) to run the workload,
>> what frequency to run the CPUs doing the work at in order to get the
>> maximum amount of work done per unit of energy spent by the system (as
>> a whole)? Or, given 2 different frequency levels, which of them to
>> run the CPUs at to get more work done per energy unit?
>
> This is the approach where you assume that the machine will be idle in any
> leftover time. And it accounts for the energy consumed in that idle time.
>
>> The work / energy ratio can be estimated as
>>
>> W / E = C * f / P(f)
>>
>> where C is a constant and P(f) is the power drawn by the whole system
>> while the CPUs doing the work are running at frequency f, and of
>> course for the system discussed previously it is greater in the 2 GHz
>> case.
>>
>> However P(f) can be divided into two parts, P_1(f) that really depends
>> on the frequency and P_0 that does not depend on it. If P_0 is large
>> enough to dominate P(f), which is the case in the 10-20 range of
>> P-states on the system in question, it is better to run the CPUs doing
>> the work faster (as long as there is always enough work to do for
>> them; see below). This doesn't mean that P(f) is not a convex
>> function of f, though.
>>
>> Moreover, this assumes that there will always be enough work for the
>> system to do when running the busy CPUs at 2 GHz, or that it can go
>> completely idle when it doesn't do any work, but let's see what
>> happens if the amount of work to do is W_1 = C * 1 GHz * T and the
>> system cannot go completely idle when the work is done.
>>
>> Then, nothing changes for the busy CPUs running at 1 GHz, but in the 2
>> GHz case we get W = W_1 and E = P(2 GHz) * T/2 + P_0 * T/2, because
>> the busy CPUs are only busy 1/2 of the time, but power P_0 is drawn by
>> the system regardless. Hence, in the 2 GHz case (assuming P(2 GHz) =
>> 120 W and P_0 = 90 W), we get
>>
>> W / E = 2 * C * 1 GHz / (P(2 GHz) + P_0) = 0.0095 * C * 1 GHz
>>
>> which is slightly less than the W / E ratio at 1 GHz approximately
>> equal to 0.01 * C * 1 GHz (assuming P(1 GHz) = 100 W), so in these
>> conditions it would be better to run the busy CPUs at 1 GHz.
>
> OK, I'll try to measure this.
>
> thanks,
> julia

2022-01-06 19:49:19

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Wed, 5 Jan 2022, Francisco Jerez wrote:

> Julia Lawall <[email protected]> writes:
>
> > On Tue, 4 Jan 2022, Rafael J. Wysocki wrote:
> >
> >> On Tue, Jan 4, 2022 at 4:49 PM Julia Lawall <[email protected]> wrote:
> >> >
> >> > I tried the whole experiment again on an Intel w2155 (one socket, 10
> >> > physical cores, pstates 12, 33, and 45).
> >> >
> >> > For the CPU there is a small jump between 32 and 33 - less than for the
> >> > 6130.
> >> >
> >> > For the RAM, there is a big jump between 21 and 22.
> >> >
> >> > Combining them leaves a big jump between 21 and 22.
> >>
> >> These jumps are most likely related to voltage increases.
> >>
> >> > It seems that the definition of efficient is that there is no more cost
> >> > for the computation than the cost of simply having the machine doing any
> >> > computation at all. It doesn't take into account the time and energy
> >> > required to do some actual amount of work.
> >>
> >> Well, that's not what I wanted to say.
> >
> > I was referring to Francisco's comment that the lowest indicated frequency
> > should be the most efficient one. Turbostat also reports the lowest
> > frequency as the most efficient one. In my graph, there are the pstates 7
> > and 10, which give exactly the same energy consumption as 12. 7 and 10
> > are certainly less efficient, because the energy consumption is the same,
> > but the execution speed is lower.
> >
> >> Of course, the configuration that requires less energy to be spent to
> >> do a given amount of work is more energy-efficient. To measure this,
> >> the system needs to be given exactly the same amount of work for each
> >> run and the energy spent by it during each run needs to be compared.
>
> I disagree that the system needs to be given the exact same amount of
> work in order to measure differences in energy efficiency. The average
> energy efficiency of Julia's 10s workloads can be calculated easily in
> both cases (e.g. as the W/E ratio below, W will just be a different
> value for each run), and the result will likely approximate the
> instantaneous energy efficiency of the fixed P-states we're comparing,
> since her workload seems to be fairly close to a steady state.
>
> >
> > This is basically my point of view, but there is a question about it. If
> > over 10 seconds you consume 10J and by running twice as fast you would
> > consume only 6J, then how do you account for the next 5 seconds? If the
> > machine is then idle for the next 5 seconds, maybe you would end up
> > consuming 8J in total over the 10 seconds. But if you take advantage of
> > the free 5 seconds to pack in another job, then you end up consuming 12J.
> >
>
> Geometrically, such an oscillatory workload with periods of idling and
> periods of activity would give an average power consumption along the
> line that passes through the points corresponding to both states on the
> CPU's power curve -- IOW your average power consumption will just be the
> weighted average of the power consumption of each state (with the duty
> cycle t_i/t_total of each state being its weight):
>
> P_avg = t_0/t_total * P_0 + t_1/t_total * P_1
>
> Your energy usage would just be 10s times that P_avg, since you're
> assuming that the total runtime of the workload is fixed at 10s
> independent of how long the CPU actually takes to complete the
> computation. In cases where the P-state during the period of activity
> t_1 is equal to or lower than the maximum-efficiency P-state, that line
> segment is guaranteed to lie below the power curve, indicating that such
> oscillation is more efficient than running the workload fixed to its
> average P-state.
>
> That said, this scenario doesn't really seem very relevant to your case,
> since the last workload you've provided turbostat traces for seems to
> show almost no oscillation. If there was such an oscillation, your
> total energy usage would still be greater for oscillations between idle
> and some P-state different from the most efficient one. Such an
> oscillation doesn't explain the anomaly we're seeing on your traces,
> which show more energy-efficient instantaneous behavior for a P-state 2x
> the one reported by your processor as the most energy-efficient.

All the turbostat output and graphs I have sent recently were just for
continuous spinning:

for(;;);

Now I am trying to run for the fraction of the time corresponding to
10 / P for pstate P (i.e., 0.5 of the time for pstate 20), and then sleep,
to see whether one can just add the sleeping power consumption of the
machine to compute the efficiency, as Rafael suggested.

julia

>
> >> However, I think that you are interested in answering a different
> >> question: Given a specific amount of time (say T) to run the workload,
> >> what frequency to run the CPUs doing the work at in order to get the
> >> maximum amount of work done per unit of energy spent by the system (as
> >> a whole)? Or, given 2 different frequency levels, which of them to
> >> run the CPUs at to get more work done per energy unit?
> >
> > This is the approach where you assume that the machine will be idle in any
> > leftover time. And it accounts for the energy consumed in that idle time.
> >
> >> The work / energy ratio can be estimated as
> >>
> >> W / E = C * f / P(f)
> >>
> >> where C is a constant and P(f) is the power drawn by the whole system
> >> while the CPUs doing the work are running at frequency f, and of
> >> course for the system discussed previously it is greater in the 2 GHz
> >> case.
> >>
> >> However P(f) can be divided into two parts, P_1(f) that really depends
> >> on the frequency and P_0 that does not depend on it. If P_0 is large
> >> enough to dominate P(f), which is the case in the 10-20 range of
> >> P-states on the system in question, it is better to run the CPUs doing
> >> the work faster (as long as there is always enough work to do for
> >> them; see below). This doesn't mean that P(f) is not a convex
> >> function of f, though.
> >>
> >> Moreover, this assumes that there will always be enough work for the
> >> system to do when running the busy CPUs at 2 GHz, or that it can go
> >> completely idle when it doesn't do any work, but let's see what
> >> happens if the amount of work to do is W_1 = C * 1 GHz * T and the
> >> system cannot go completely idle when the work is done.
> >>
> >> Then, nothing changes for the busy CPUs running at 1 GHz, but in the 2
> >> GHz case we get W = W_1 and E = P(2 GHz) * T/2 + P_0 * T/2, because
> >> the busy CPUs are only busy 1/2 of the time, but power P_0 is drawn by
> >> the system regardless. Hence, in the 2 GHz case (assuming P(2 GHz) =
> >> 120 W and P_0 = 90 W), we get
> >>
> >> W / E = 2 * C * 1 GHz / (P(2 GHz) + P_0) = 0.0095 * C * 1 GHz
> >>
> >> which is slightly less than the W / E ratio at 1 GHz approximately
> >> equal to 0.01 * C * 1 GHz (assuming P(1 GHz) = 100 W), so in these
> >> conditions it would be better to run the busy CPUs at 1 GHz.
> >
> > OK, I'll try to measure this.
> >
> > thanks,
> > julia
>

2022-01-06 20:28:49

by Srinivas Pandruvada

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Thu, 2022-01-06 at 20:49 +0100, Julia Lawall wrote:
>
> [...]
>
> All the turbostat output and graphs I have sent recently were just
> for
> continuous spinning:
>
> for(;;);
>
> Now I am trying running for the percentage of the time corresponding
> to
> 10 / P for pstate P (ie 0.5 of the time for pstate 20), and then
> sleeping,
> to see whether one can just add the sleeping power consumption of the
> machine to compute the efficiency as Rafael suggested.
>
Before doing the comparison, try freezing the uncore:

wrmsr -a 0x620 0x0808

to freeze the uncore at 800 MHz. Any other value is fine.

Thanks,
Srinivas

> julia


2022-01-06 20:43:16

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

> > [...]
> >
> Before doing the comparison, try freezing the uncore:
>
> wrmsr -a 0x620 0x0808
>
> to freeze the uncore at 800 MHz. Any other value is fine.

Thanks for the suggestion. What is the impact of this?

julia

2022-01-06 21:56:15

by Srinivas Pandruvada

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range

On Thu, 2022-01-06 at 21:43 +0100, Julia Lawall wrote:
> > > [...]
> > >
> > Before doing the comparison, try freezing the uncore:
> >
> > wrmsr -a 0x620 0x0808
> >
> > to freeze the uncore at 800 MHz. Any other value is fine.
>
> Thanks for the suggestion.  What is the impact of this?
The uncore scales based on its own heuristics in response to P-state
changes, and it operates at package scope. So to actually see the effect
of a P-state change on energy, you can remove the variability of the
uncore power.

Thanks,
Srinivas

>
> julia



2022-01-06 21:58:35

by Julia Lawall

[permalink] [raw]
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range



On Thu, 6 Jan 2022, srinivas pandruvada wrote:

> On Thu, 2022-01-06 at 21:43 +0100, Julia Lawall wrote:
> > > > [...]
> > > >
> > > Before doing the comparison, try freezing the uncore:
> > >
> > > wrmsr -a 0x620 0x0808
> > >
> > > to freeze the uncore at 800 MHz. Any other value is fine.
> >
> > Thanks for the suggestion.  What is the impact of this?
> The uncore scales based on its own heuristics in response to P-state
> changes, and it operates at package scope. So to actually see the effect
> of a P-state change on energy, you can remove the variability of the
> uncore power.

OK, thanks. I will try both options.

julia