2019-10-23 01:24:43

by Thara Gopinath

Subject: [Patch v4 0/6] Introduce Thermal Pressure

Thermal governors can respond to an overheat event on a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today the kernel's task scheduler is
not notified when the maximum frequency of a cpu is capped.
In other words, the scheduler is unaware of the maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefit identified is better task placement among the available
cpus in the event of overheating, which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous,
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. The solution is therefore to maintain
a weighted average per-cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. The existing algorithm
in the kernel scheduler's PELT framework is re-used to calculate
the weighted average. This patch series also defines a sysctl
interface to allow for a configurable decay period.
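
To illustrate the general idea, here is a rough, self-contained C
sketch of such a geometrically decayed average. It is only an
illustration under assumed names and constants (thermal_pressure_update(),
DECAY_MULT, a 1 ms period), not the code in this series; the series
itself reuses the scheduler's PELT implementation.

#include <stdio.h>

/*
 * Toy sketch of a PELT-style geometrically decayed average, for
 * illustration only.  Every 1 ms period the accumulated sum is decayed
 * by a constant factor and the capacity currently capped off (the
 * "pressure", 0..1024) is added with full weight.  The names and the
 * fixed-point decay factor are assumptions of this sketch, not the
 * kernel implementation posted in this series.
 */

#define SCHED_CAPACITY_SCALE	1024UL

/* y in fixed point, with y^32 ~= 0.5, i.e. a 32 ms half-life */
#define DECAY_MULT		1002UL	/* ~0.97857 * 1024 */

struct thermal_avg {
	unsigned long sum;	/* decayed accumulation */
	unsigned long avg;	/* normalized average pressure */
};

static void thermal_pressure_update(struct thermal_avg *a,
				    unsigned long capped_capacity,
				    unsigned int periods_ms)
{
	unsigned int i;

	for (i = 0; i < periods_ms; i++) {
		a->sum = (a->sum * DECAY_MULT) >> 10;	/* decay old history */
		a->sum += capped_capacity;		/* weight = pressure */
	}
	/* the geometric series converges to pressure / (1 - y) */
	a->avg = (a->sum * (SCHED_CAPACITY_SCALE - DECAY_MULT)) >> 10;
}

int main(void)
{
	struct thermal_avg a = { 0, 0 };

	/* 100 ms capped by a quarter of the max capacity, then 100 ms uncapped */
	thermal_pressure_update(&a, SCHED_CAPACITY_SCALE / 4, 100);
	printf("avg thermal pressure while capped: %lu\n", a.avg);

	thermal_pressure_update(&a, 0, 100);
	printf("avg thermal pressure after decay:  %lu\n", a.avg);

	return 0;
}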

Regarding testing, basic build, boot and sanity testing have been
performed on the db845c platform with a Debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step-wise governor in dealing with big.LITTLE systems,
the trip point 0 temperature was made asymmetric between the cpus in the
little cluster and the big cluster; the idea being that the
big cores will heat up and the cpu cooling device will throttle the
frequency of the big cores faster, thereby limiting the maximum available
capacity, and the scheduler will spread out tasks to the little cores as well.

Test Results

Hackbench: 1 group, 30000 loops, 10 runs
                                                Result     SD
                                                (Secs)     (% of mean)
No Thermal Pressure                             14.03      2.69%
Thermal Pressure PELT Algo. Decay : 32 ms       13.29      0.56%
Thermal Pressure PELT Algo. Decay : 64 ms       12.57      1.56%
Thermal Pressure PELT Algo. Decay : 128 ms      12.71      1.04%
Thermal Pressure PELT Algo. Decay : 256 ms      12.29      1.42%
Thermal Pressure PELT Algo. Decay : 512 ms      12.42      1.15%

Dhrystone Run Time: 20 threads, 3000 MLOOPS
                                                Result     SD
                                                (Secs)     (% of mean)
No Thermal Pressure                             9.452      4.49%
Thermal Pressure PELT Algo. Decay : 32 ms       8.793      5.30%
Thermal Pressure PELT Algo. Decay : 64 ms       8.981      5.29%
Thermal Pressure PELT Algo. Decay : 128 ms      8.647      6.62%
Thermal Pressure PELT Algo. Decay : 256 ms      8.774      6.45%
Thermal Pressure PELT Algo. Decay : 512 ms      8.603      5.41%

A Brief History

The first version of this patch series was posted reusing the
PELT algorithm to decay the thermal pressure signal. The discussions
that followed were around whether an instantaneous thermal pressure
solution is better and whether a stand-alone algorithm to accumulate
and decay thermal pressure is more appropriate than re-using the
PELT framework.
Tests on Hikey960 showed the stand-alone algorithm performing slightly
better than reusing the PELT algorithm, and V2 was posted with the
stand-alone algorithm. Test results were shared as part of that series.
Discussions were around re-using the PELT algorithm and running
further tests with more granular decay periods.

For some time after this, development was impeded due to hardware
unavailability and some other unforeseen and possibly unfortunate events.
For this version, the h/w was switched from Hikey960 to db845c.
Instantaneous thermal pressure was not tested as part of this
cycle, as it is clear that a weighted average is the better implementation.
In this round of testing, the non-PELT algorithm never gave any
conclusive results to prove that it is better than reusing the PELT
algorithm. Also, reusing the PELT algorithm means thermal pressure
tracks the other utilization signals in the scheduler.

v3->v4:
- "Patch 3/7: sched: Initialize per cpu thermal pressure structure"
  is dropped as it is no longer needed following changes in the
  other patches.
- The rest of the changes are mentioned in the specific patches.

Thara Gopinath (6):
sched/pelt.c: Add support to track thermal pressure
sched: Add infrastructure to store and update instantaneous thermal
pressure
sched/fair: Enable CFS periodic tick to update thermal pressure
sched/fair: update cpu_capcity to reflect thermal pressure
thermal/cpu-cooling: Update thermal pressure in case of a maximum
frequency capping
sched: thermal: Enable tuning of decay period

Documentation/admin-guide/kernel-parameters.txt | 5 ++
drivers/thermal/cpu_cooling.c | 31 ++++++++++-
include/linux/sched.h | 8 +++
kernel/sched/Makefile | 2 +-
kernel/sched/fair.c | 6 +++
kernel/sched/pelt.c | 13 +++++
kernel/sched/pelt.h | 7 +++
kernel/sched/sched.h | 1 +
kernel/sched/thermal.c | 68 +++++++++++++++++++++++++
kernel/sched/thermal.h | 13 +++++
10 files changed, 151 insertions(+), 3 deletions(-)
create mode 100644 kernel/sched/thermal.c
create mode 100644 kernel/sched/thermal.h

--
2.1.4


2019-10-23 01:25:07

by Thara Gopinath

Subject: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

cpu_capacity reflects the maximum available capacity of a cpu. Thermal
pressure on a cpu means this maximum available capacity is reduced. This
patch subtracts the average thermal pressure of a cpu from its maximum
available capacity so that cpu_capacity reflects the actual
available capacity.

Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/fair.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f9c2cb..be3e802 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
 
 	used = READ_ONCE(rq->avg_rt.util_avg);
 	used += READ_ONCE(rq->avg_dl.util_avg);
+	used += READ_ONCE(rq->avg_thermal.load_avg);
 
 	if (unlikely(used >= max))
 		return 1;
--
2.1.4
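
To make the effect of this one-line hunk concrete, here is a small
stand-alone C sketch of the arithmetic around it: scale_rt_capacity()
subtracts the rt, dl and (with this patch) thermal-pressure averages
from the cpu's maximum capacity. The helper name (remaining_capacity)
and the inputs are made up for illustration; this is not the kernel
function itself.

#include <stdio.h>

/*
 * Illustration only: the remaining CFS capacity is the maximum capacity
 * minus the rt, dl and (new) thermal-pressure averages, floored at 1.
 */
static unsigned long remaining_capacity(unsigned long max,
					unsigned long rt_util,
					unsigned long dl_util,
					unsigned long thermal_pressure)
{
	unsigned long used = rt_util + dl_util + thermal_pressure;

	if (used >= max)
		return 1;
	return max - used;
}

int main(void)
{
	/* a big cpu (1024) capped by ~300 capacity units of thermal pressure */
	printf("cpu_capacity = %lu\n", remaining_capacity(1024, 50, 0, 300));
	return 0;
}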

2019-10-23 17:04:03

by Qais Yousef

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On 10/22/19 16:34, Thara Gopinath wrote:
> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> pressure on a cpu means this maximum available capacity is reduced. This
> patch reduces the average thermal pressure for a cpu from its maximum
> available capacity so that cpu_capacity reflects the actual
> available capacity.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
> kernel/sched/fair.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4f9c2cb..be3e802 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>
> used = READ_ONCE(rq->avg_rt.util_avg);
> used += READ_ONCE(rq->avg_dl.util_avg);
> + used += READ_ONCE(rq->avg_thermal.load_avg);

Maybe a naive question - but can we add util_avg with load_avg without
a conversion? I thought the 2 signals have different properties.

Cheers

--
Qais Yousef

>
> if (unlikely(used >= max))
> return 1;
> --
> 2.1.4
>

2019-10-28 21:12:03

by Peter Zijlstra

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> On 10/22/19 16:34, Thara Gopinath wrote:
> > cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> > pressure on a cpu means this maximum available capacity is reduced. This
> > patch reduces the average thermal pressure for a cpu from its maximum
> > available capacity so that cpu_capacity reflects the actual
> > available capacity.
> >
> > Signed-off-by: Thara Gopinath <[email protected]>
> > ---
> > kernel/sched/fair.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4f9c2cb..be3e802 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> >
> > used = READ_ONCE(rq->avg_rt.util_avg);
> > used += READ_ONCE(rq->avg_dl.util_avg);
> > + used += READ_ONCE(rq->avg_thermal.load_avg);
>
> Maybe a naive question - but can we add util_avg with load_avg without
> a conversion? I thought the 2 signals have different properties.

Changelog of patch #1 explains, it's in that dense blob of text.

But yes, you're quite right that that wants a comment here.

2019-10-29 15:35:53

by Daniel Lezcano

Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

Hi Thara,

On 22/10/2019 22:34, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, task scheduler is
> not notified of capping of maximum frequency of a cpu.
> In other words, scheduler is unware of maximum capacity
> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The reduction in the maximum possible capacity of a cpu due to a
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometime be erroneous
> as there can be mismatch between the actual capping of capacity
> and scheduler recording it. Thus solution is to have a weighted
> average per cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. Exisiting algorithm
> in the kernel scheduler pelt framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> inerface to allow for a configurable decay period.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on db845c platform with debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,
> trip point 0 temperature was made assymetric between cpus in little
> cluster and big cluster; the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores faster, there by limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.
>
> Test Results
>
> Hackbench: 1 group , 30000 loops, 10 runs
> Result SD
> (Secs) (% of mean)
> No Thermal Pressure 14.03 2.69%
> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
>
> Dhrystone Run Time : 20 threads, 3000 MLOOPS
> Result SD
> (Secs) (% of mean)
> No Thermal Pressure 9.452 4.49%
> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%

I took the opportunity to try glmark2 on the db845c platform with the
default decay and got the following glmark2 scores:

Without thermal pressure:

# NumSamples = 9; Min = 790.00; Max = 805.00
# Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
# each ∎ represents a count of 1
790.0000 - 791.5000 [ 2]: ∎∎
791.5000 - 793.0000 [ 2]: ∎∎
793.0000 - 794.5000 [ 2]: ∎∎
794.5000 - 796.0000 [ 1]: ∎
796.0000 - 797.5000 [ 0]:
797.5000 - 799.0000 [ 1]: ∎
799.0000 - 800.5000 [ 0]:
800.5000 - 802.0000 [ 0]:
802.0000 - 803.5000 [ 0]:
803.5000 - 805.0000 [ 1]: ∎


With thermal pressure:

# NumSamples = 9; Min = 933.00; Max = 960.00
# Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
# each ∎ represents a count of 1
933.0000 - 935.7000 [ 3]: ∎∎∎
935.7000 - 938.4000 [ 2]: ∎∎
938.4000 - 941.1000 [ 2]: ∎∎
941.1000 - 943.8000 [ 0]:
943.8000 - 946.5000 [ 0]:
946.5000 - 949.2000 [ 1]: ∎
949.2000 - 951.9000 [ 0]:
951.9000 - 954.6000 [ 0]:
954.6000 - 957.3000 [ 0]:
957.3000 - 960.0000 [ 1]: ∎




2019-10-31 09:45:20

by Ionela Voinescu

Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

Hi Thara,

On Tuesday 22 Oct 2019 at 16:34:19 (-0400), Thara Gopinath wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, task scheduler is
> not notified of capping of maximum frequency of a cpu.
> In other words, scheduler is unware of maximum capacity

Nit: s/unware/unaware

> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The reduction in the maximum possible capacity of a cpu due to a
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometime be erroneous
> as there can be mismatch between the actual capping of capacity
> and scheduler recording it. Thus solution is to have a weighted
> average per cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. Exisiting algorithm
> in the kernel scheduler pelt framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> inerface to allow for a configurable decay period.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on db845c platform with debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,
> trip point 0 temperature was made assymetric between cpus in little
> cluster and big cluster; the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores faster, there by limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.
>

Can you please share the changes you've made to sdm845.dtsi and a kernel
base on top of which to apply your patches? I would like to reproduce
your results and run more tests and it would be good if our setups were
as close as possible.

> Test Results
>
> Hackbench: 1 group , 30000 loops, 10 runs
> Result SD
> (Secs) (% of mean)
> No Thermal Pressure 14.03 2.69%
> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
>
> Dhrystone Run Time : 20 threads, 3000 MLOOPS
> Result SD
> (Secs) (% of mean)
> No Thermal Pressure 9.452 4.49%
> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%
>

Do you happen to know by how much the CPUs were capped during these
experiments?

Thanks,
Ionela.

> A Brief History
>
> The first version of this patch-series was posted with resuing
> PELT algorithm to decay thermal pressure signal. The discussions
> that followed were around whether intanteneous thermal pressure
> solution is better and whether a stand-alone algortihm to accumulate
> and decay thermal pressure is more appropriate than re-using the
> PELT framework.
> Tests on Hikey960 showed the stand-alone algorithm performing slightly
> better than resuing PELT algorithm and V2 was posted with the stand
> alone algorithm. Test results were shared as part of this series.
> Discussions were around re-using PELT algorithm and running
> further tests with more granular decay period.
>
> For some time after this development was impeded due to hardware
> unavailability, some other unforseen and possibly unfortunate events.
> For this version, h/w was switched from hikey960 to db845c.
> Also Instantaneous thermal pressure was never tested as part of this
> cycle as it is clear that weighted average is a better implementation.
> The non-PELT algorithm never gave any conclusive results to prove that it
> is better than reusing PELT algorithm, in this round of testing.
> Also reusing PELT algorithm means thermal pressure tracks the
> other utilization signals in the scheduler.
>
> v3->v4:
> - "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
> is dropped as it is no longer needed following changes in other
> other patches.
> - rest of the change log mentioned in specific patches.
>
> Thara Gopinath (6):
> sched/pelt.c: Add support to track thermal pressure
> sched: Add infrastructure to store and update instantaneous thermal
> pressure
> sched/fair: Enable CFS periodic tick to update thermal pressure
> sched/fair: update cpu_capcity to reflect thermal pressure
> thermal/cpu-cooling: Update thermal pressure in case of a maximum
> frequency capping
> sched: thermal: Enable tuning of decay period
>
> Documentation/admin-guide/kernel-parameters.txt | 5 ++
> drivers/thermal/cpu_cooling.c | 31 ++++++++++-
> include/linux/sched.h | 8 +++
> kernel/sched/Makefile | 2 +-
> kernel/sched/fair.c | 6 +++
> kernel/sched/pelt.c | 13 +++++
> kernel/sched/pelt.h | 7 +++
> kernel/sched/sched.h | 1 +
> kernel/sched/thermal.c | 68 +++++++++++++++++++++++++
> kernel/sched/thermal.h | 13 +++++
> 10 files changed, 151 insertions(+), 3 deletions(-)
> create mode 100644 kernel/sched/thermal.c
> create mode 100644 kernel/sched/thermal.h
>
> --
> 2.1.4
>

2019-10-31 10:09:30

by Ionela Voinescu

Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

Hi Daniel,

On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
> Hi Thara,
>
> On 22/10/2019 22:34, Thara Gopinath wrote:
> > Thermal governors can respond to an overheat event of a cpu by
> > capping the cpu's maximum possible frequency. This in turn
> > means that the maximum available compute capacity of the
> > cpu is restricted. But today in the kernel, task scheduler is
> > not notified of capping of maximum frequency of a cpu.
> > In other words, scheduler is unware of maximum capacity
> > restrictions placed on a cpu due to thermal activity.
> > This patch series attempts to address this issue.
> > The benefits identified are better task placement among available
> > cpus in event of overheating which in turn leads to better
> > performance numbers.
> >
> > The reduction in the maximum possible capacity of a cpu due to a
> > thermal event can be considered as thermal pressure. Instantaneous
> > thermal pressure is hard to record and can sometime be erroneous
> > as there can be mismatch between the actual capping of capacity
> > and scheduler recording it. Thus solution is to have a weighted
> > average per cpu value for thermal pressure over time.
> > The weight reflects the amount of time the cpu has spent at a
> > capped maximum frequency. Since thermal pressure is recorded as
> > an average, it must be decayed periodically. Exisiting algorithm
> > in the kernel scheduler pelt framework is re-used to calculate
> > the weighted average. This patch series also defines a sysctl
> > inerface to allow for a configurable decay period.
> >
> > Regarding testing, basic build, boot and sanity testing have been
> > performed on db845c platform with debian file system.
> > Further, dhrystone and hackbench tests have been
> > run with the thermal pressure algorithm. During testing, due to
> > constraints of step wise governor in dealing with big little systems,
> > trip point 0 temperature was made assymetric between cpus in little
> > cluster and big cluster; the idea being that
> > big core will heat up and cpu cooling device will throttle the
> > frequency of the big cores faster, there by limiting the maximum available
> > capacity and the scheduler will spread out tasks to little cores as well.
> >
> > Test Results
> >
> > Hackbench: 1 group , 30000 loops, 10 runs
> > Result SD
> > (Secs) (% of mean)
> > No Thermal Pressure 14.03 2.69%
> > Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
> > Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
> > Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
> > Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
> > Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
> >
> > Dhrystone Run Time : 20 threads, 3000 MLOOPS
> > Result SD
> > (Secs) (% of mean)
> > No Thermal Pressure 9.452 4.49%
> > Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
> > Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
> > Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
> > Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
> > Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%
>
> I took the opportunity to try glmark2 on the db845c platform with the
> default decay and got the following glmark2 scores:
>
> Without thermal pressure:
>
> # NumSamples = 9; Min = 790.00; Max = 805.00
> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
> # each ∎ represents a count of 1
> 790.0000 - 791.5000 [ 2]: ∎∎
> 791.5000 - 793.0000 [ 2]: ∎∎
> 793.0000 - 794.5000 [ 2]: ∎∎
> 794.5000 - 796.0000 [ 1]: ∎
> 796.0000 - 797.5000 [ 0]:
> 797.5000 - 799.0000 [ 1]: ∎
> 799.0000 - 800.5000 [ 0]:
> 800.5000 - 802.0000 [ 0]:
> 802.0000 - 803.5000 [ 0]:
> 803.5000 - 805.0000 [ 1]: ∎
>
>
> With thermal pressure:
>
> # NumSamples = 9; Min = 933.00; Max = 960.00
> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
> # each ∎ represents a count of 1
> 933.0000 - 935.7000 [ 3]: ∎∎∎
> 935.7000 - 938.4000 [ 2]: ∎∎
> 938.4000 - 941.1000 [ 2]: ∎∎
> 941.1000 - 943.8000 [ 0]:
> 943.8000 - 946.5000 [ 0]:
> 946.5000 - 949.2000 [ 1]: ∎
> 949.2000 - 951.9000 [ 0]:
> 951.9000 - 954.6000 [ 0]:
> 954.6000 - 957.3000 [ 0]:
> 957.3000 - 960.0000 [ 1]: ∎
>

Interesting! If I'm interpreting these correctly there seems to be
significant improvement when applying thermal pressure.

I'm not familiar with glmark2, can you tell me more about the process
and the work that the benchmark does? I assume this is a GPU benchmark,
but not knowing more about it I fail to see the correlation between
applying thermal pressure to CPU capacities and the improvement of GPU
performance.

Do you happen to know more about the behaviour that resulted in these
benchmark scores?

Thanks,
Ionela.

>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2019-10-31 10:54:49

by Qais Yousef

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On 10/28/19 16:30, Peter Zijlstra wrote:
> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> > On 10/22/19 16:34, Thara Gopinath wrote:
> > > cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> > > pressure on a cpu means this maximum available capacity is reduced. This
> > > patch reduces the average thermal pressure for a cpu from its maximum
> > > available capacity so that cpu_capacity reflects the actual
> > > available capacity.
> > >
> > > Signed-off-by: Thara Gopinath <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 4f9c2cb..be3e802 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> > >
> > > used = READ_ONCE(rq->avg_rt.util_avg);
> > > used += READ_ONCE(rq->avg_dl.util_avg);
> > > + used += READ_ONCE(rq->avg_thermal.load_avg);
> >
> > Maybe a naive question - but can we add util_avg with load_avg without
> > a conversion? I thought the 2 signals have different properties.
>
> Changelog of patch #1 explains, it's in that dense blob of text.
>
> But yes, you're quite right that that wants a comment here.

Thanks for the pointer! A comment would be nice indeed.

To make sure I got this correctly - it's because avg_thermal.load_avg
represents delta_capacity which is already a 'converted' form of load. So this
makes avg_thermal.load_avg a util_avg really. Correct?

If I managed to get it right, it'd be nice if we could do an inverse
conversion on delta_capacity so that the meaning of
avg_thermal.{load_avg, util_avg} is consistent across the board. But I
don't feel strongly about it if this gets documented properly.

Thanks

--
Qais Yousef

2019-10-31 11:57:30

by Daniel Lezcano

Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

Hi Ionela,

On 31/10/2019 11:07, Ionela Voinescu wrote:
> Hi Daniel,
>
> On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
>> Hi Thara,
>>
>> On 22/10/2019 22:34, Thara Gopinath wrote:
>>> Thermal governors can respond to an overheat event of a cpu by
>>> capping the cpu's maximum possible frequency. This in turn
>>> means that the maximum available compute capacity of the
>>> cpu is restricted. But today in the kernel, task scheduler is
>>> not notified of capping of maximum frequency of a cpu.
>>> In other words, scheduler is unware of maximum capacity
>>> restrictions placed on a cpu due to thermal activity.
>>> This patch series attempts to address this issue.
>>> The benefits identified are better task placement among available
>>> cpus in event of overheating which in turn leads to better
>>> performance numbers.
>>>
>>> The reduction in the maximum possible capacity of a cpu due to a
>>> thermal event can be considered as thermal pressure. Instantaneous
>>> thermal pressure is hard to record and can sometime be erroneous
>>> as there can be mismatch between the actual capping of capacity
>>> and scheduler recording it. Thus solution is to have a weighted
>>> average per cpu value for thermal pressure over time.
>>> The weight reflects the amount of time the cpu has spent at a
>>> capped maximum frequency. Since thermal pressure is recorded as
>>> an average, it must be decayed periodically. Exisiting algorithm
>>> in the kernel scheduler pelt framework is re-used to calculate
>>> the weighted average. This patch series also defines a sysctl
>>> inerface to allow for a configurable decay period.
>>>
>>> Regarding testing, basic build, boot and sanity testing have been
>>> performed on db845c platform with debian file system.
>>> Further, dhrystone and hackbench tests have been
>>> run with the thermal pressure algorithm. During testing, due to
>>> constraints of step wise governor in dealing with big little systems,
>>> trip point 0 temperature was made assymetric between cpus in little
>>> cluster and big cluster; the idea being that
>>> big core will heat up and cpu cooling device will throttle the
>>> frequency of the big cores faster, there by limiting the maximum available
>>> capacity and the scheduler will spread out tasks to little cores as well.
>>>
>>> Test Results
>>>
>>> Hackbench: 1 group , 30000 loops, 10 runs
>>> Result SD
>>> (Secs) (% of mean)
>>> No Thermal Pressure 14.03 2.69%
>>> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
>>> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
>>> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
>>> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
>>> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
>>>
>>> Dhrystone Run Time : 20 threads, 3000 MLOOPS
>>> Result SD
>>> (Secs) (% of mean)
>>> No Thermal Pressure 9.452 4.49%
>>> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
>>> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
>>> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
>>> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
>>> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%
>>
>> I took the opportunity to try glmark2 on the db845c platform with the
>> default decay and got the following glmark2 scores:
>>
>> Without thermal pressure:
>>
>> # NumSamples = 9; Min = 790.00; Max = 805.00
>> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
>> # each ∎ represents a count of 1
>> 790.0000 - 791.5000 [ 2]: ∎∎
>> 791.5000 - 793.0000 [ 2]: ∎∎
>> 793.0000 - 794.5000 [ 2]: ∎∎
>> 794.5000 - 796.0000 [ 1]: ∎
>> 796.0000 - 797.5000 [ 0]:
>> 797.5000 - 799.0000 [ 1]: ∎
>> 799.0000 - 800.5000 [ 0]:
>> 800.5000 - 802.0000 [ 0]:
>> 802.0000 - 803.5000 [ 0]:
>> 803.5000 - 805.0000 [ 1]: ∎
>>
>>
>> With thermal pressure:
>>
>> # NumSamples = 9; Min = 933.00; Max = 960.00
>> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
>> # each ∎ represents a count of 1
>> 933.0000 - 935.7000 [ 3]: ∎∎∎
>> 935.7000 - 938.4000 [ 2]: ∎∎
>> 938.4000 - 941.1000 [ 2]: ∎∎
>> 941.1000 - 943.8000 [ 0]:
>> 943.8000 - 946.5000 [ 0]:
>> 946.5000 - 949.2000 [ 1]: ∎
>> 949.2000 - 951.9000 [ 0]:
>> 951.9000 - 954.6000 [ 0]:
>> 954.6000 - 957.3000 [ 0]:
>> 957.3000 - 960.0000 [ 1]: ∎
>>
>
> Interesting! If I'm interpreting these correctly there seems to be
> significant improvement when applying thermal pressure.
>
> I'm not familiar with glmark2, can you tell me more about the process
> and the work that the benchmark does?

glmark2 is a 3D benchmark. I ran it without parameters, so all tests are
run. At the end, it gives a score; the values given above are those scores.

> I assume this is a GPU benchmark,
> but not knowing more about it I fail to see the correlation between
> applying thermal pressure to CPU capacities and the improvement of GPU
> performance.
> Do you happen to know more about the behaviour that resulted in these
> benchmark scores?

My hypothesis is that glmark2 makes the GPU contribute a lot to the
heating effect, thus increasing the temperature of the CPUs close to it.





2019-10-31 12:59:54

by Ionela Voinescu

Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

On Thursday 31 Oct 2019 at 12:54:03 (+0100), Daniel Lezcano wrote:
> Hi Ionela,
>
> On 31/10/2019 11:07, Ionela Voinescu wrote:
> > Hi Daniel,
> >
> > On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
> >> Hi Thara,
> >>
> >> On 22/10/2019 22:34, Thara Gopinath wrote:
> >>> Thermal governors can respond to an overheat event of a cpu by
> >>> capping the cpu's maximum possible frequency. This in turn
> >>> means that the maximum available compute capacity of the
> >>> cpu is restricted. But today in the kernel, task scheduler is
> >>> not notified of capping of maximum frequency of a cpu.
> >>> In other words, scheduler is unware of maximum capacity
> >>> restrictions placed on a cpu due to thermal activity.
> >>> This patch series attempts to address this issue.
> >>> The benefits identified are better task placement among available
> >>> cpus in event of overheating which in turn leads to better
> >>> performance numbers.
> >>>
> >>> The reduction in the maximum possible capacity of a cpu due to a
> >>> thermal event can be considered as thermal pressure. Instantaneous
> >>> thermal pressure is hard to record and can sometime be erroneous
> >>> as there can be mismatch between the actual capping of capacity
> >>> and scheduler recording it. Thus solution is to have a weighted
> >>> average per cpu value for thermal pressure over time.
> >>> The weight reflects the amount of time the cpu has spent at a
> >>> capped maximum frequency. Since thermal pressure is recorded as
> >>> an average, it must be decayed periodically. Exisiting algorithm
> >>> in the kernel scheduler pelt framework is re-used to calculate
> >>> the weighted average. This patch series also defines a sysctl
> >>> inerface to allow for a configurable decay period.
> >>>
> >>> Regarding testing, basic build, boot and sanity testing have been
> >>> performed on db845c platform with debian file system.
> >>> Further, dhrystone and hackbench tests have been
> >>> run with the thermal pressure algorithm. During testing, due to
> >>> constraints of step wise governor in dealing with big little systems,
> >>> trip point 0 temperature was made assymetric between cpus in little
> >>> cluster and big cluster; the idea being that
> >>> big core will heat up and cpu cooling device will throttle the
> >>> frequency of the big cores faster, there by limiting the maximum available
> >>> capacity and the scheduler will spread out tasks to little cores as well.
> >>>
> >>> Test Results
> >>>
> >>> Hackbench: 1 group , 30000 loops, 10 runs
> >>> Result SD
> >>> (Secs) (% of mean)
> >>> No Thermal Pressure 14.03 2.69%
> >>> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
> >>> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
> >>> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
> >>> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
> >>> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
> >>>
> >>> Dhrystone Run Time : 20 threads, 3000 MLOOPS
> >>> Result SD
> >>> (Secs) (% of mean)
> >>> No Thermal Pressure 9.452 4.49%
> >>> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
> >>> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
> >>> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
> >>> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
> >>> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%
> >>
> >> I took the opportunity to try glmark2 on the db845c platform with the
> >> default decay and got the following glmark2 scores:
> >>
> >> Without thermal pressure:
> >>
> >> # NumSamples = 9; Min = 790.00; Max = 805.00
> >> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
> >> # each ∎ represents a count of 1
> >> 790.0000 - 791.5000 [ 2]: ∎∎
> >> 791.5000 - 793.0000 [ 2]: ∎∎
> >> 793.0000 - 794.5000 [ 2]: ∎∎
> >> 794.5000 - 796.0000 [ 1]: ∎
> >> 796.0000 - 797.5000 [ 0]:
> >> 797.5000 - 799.0000 [ 1]: ∎
> >> 799.0000 - 800.5000 [ 0]:
> >> 800.5000 - 802.0000 [ 0]:
> >> 802.0000 - 803.5000 [ 0]:
> >> 803.5000 - 805.0000 [ 1]: ∎
> >>
> >>
> >> With thermal pressure:
> >>
> >> # NumSamples = 9; Min = 933.00; Max = 960.00
> >> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
> >> # each ∎ represents a count of 1
> >> 933.0000 - 935.7000 [ 3]: ∎∎∎
> >> 935.7000 - 938.4000 [ 2]: ∎∎
> >> 938.4000 - 941.1000 [ 2]: ∎∎
> >> 941.1000 - 943.8000 [ 0]:
> >> 943.8000 - 946.5000 [ 0]:
> >> 946.5000 - 949.2000 [ 1]: ∎
> >> 949.2000 - 951.9000 [ 0]:
> >> 951.9000 - 954.6000 [ 0]:
> >> 954.6000 - 957.3000 [ 0]:
> >> 957.3000 - 960.0000 [ 1]: ∎
> >>
> >
> > Interesting! If I'm interpreting these correctly there seems to be
> > significant improvement when applying thermal pressure.
> >
> > I'm not familiar with glmark2, can you tell me more about the process
> > and the work that the benchmark does?
>
> glmark2 is a 3D benchmark. I ran it without parameters, so all tests are
> run. At the end, it gives a score which are the values given above.
>
> > I assume this is a GPU benchmark,
> > but not knowing more about it I fail to see the correlation between
> > applying thermal pressure to CPU capacities and the improvement of GPU
> > performance.
> > Do you happen to know more about the behaviour that resulted in these
> > benchmark scores?
>
> My hypothesis is glmark2 makes the GPU to contribute a lot to the
> heating effect, thus increasing the temperature to the CPU close to it.
>

Hhmm.. yes, I am assuming that there is some thermal mitigation (CPU
frequency capping) done as a result of the heat inflicted by the work
on the GPU, but these patches do not result in better thermal
management such that the GPU performs better. They only inform the
scheduler about the reduced capacity of the CPUs so it can decide to
make better use of the compute capacity that it has available.

There could be a secondary effect where the more efficient use of the
CPUs releases thermal headroom for the GPU to use, but I would not
expect the differences to be as high as in the results above.

Another possibility is that work on the CPUs impacts the scores more
than I would expect for such a benchmark, but again I would not
expect the work on the CPUs to be significant enough to result in such
differences in the scores.

If you have the chance to look more into exactly what the behaviour is,
with and without thermal pressure - cooling states, average frequency,
use of CPUs, use of GPU, etc. - it would be very valuable.

Thank you,
Ionela.

>
>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2019-10-31 15:40:13

by Dietmar Eggemann

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On 31.10.19 11:53, Qais Yousef wrote:
> On 10/28/19 16:30, Peter Zijlstra wrote:
>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>> On 10/22/19 16:34, Thara Gopinath wrote:
>>>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
>>>> pressure on a cpu means this maximum available capacity is reduced. This
>>>> patch reduces the average thermal pressure for a cpu from its maximum
>>>> available capacity so that cpu_capacity reflects the actual
>>>> available capacity.
>>>>
>>>> Signed-off-by: Thara Gopinath <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4f9c2cb..be3e802 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>>>>
>>>> used = READ_ONCE(rq->avg_rt.util_avg);
>>>> used += READ_ONCE(rq->avg_dl.util_avg);
>>>> + used += READ_ONCE(rq->avg_thermal.load_avg);
>>>
>>> Maybe a naive question - but can we add util_avg with load_avg without
>>> a conversion? I thought the 2 signals have different properties.
>>
>> Changelog of patch #1 explains, it's in that dense blob of text.
>>
>> But yes, you're quite right that that wants a comment here.
>
> Thanks for the pointer! A comment would be nice indeed.
>
> To make sure I got this correctly - it's because avg_thermal.load_avg
> represents delta_capacity which is already a 'converted' form of load. So this
> makes avg_thermal.load_avg a util_avg really. Correct?
>
> If I managed to get it right somehow. It'd be nice if we can do inverse
> conversion on delta_capacity so that avg_thermal.{load_avg, util_avg} meaning
> is consistent across the board. But I don't feel strongly about it if this gets
> documented properly.

So why can't we use rq->avg_thermal.util_avg here? Since capacity is
closer to util than to load?

Is it because you want to use the influence of ___update_load_sum()'s
'unsigned long load' argument (here equal to the per-cpu delta_capacity)
in your signal?

Why not call it this way then?

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 38210691c615..d3035457483f 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
 {
 	if (___update_load_sum(now, &rq->avg_thermal,
 			       capacity,
-			       capacity,
-			       capacity)) {
-		___update_load_avg(&rq->avg_thermal, 1, 1);
+			       0,
+			       0)) {
+		___update_load_avg(&rq->avg_thermal, 1, 0);
 		return 1;
 	}

2019-10-31 15:50:27

by Vincent Guittot

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <[email protected]> wrote:
>
> On 31.10.19 11:53, Qais Yousef wrote:
> > On 10/28/19 16:30, Peter Zijlstra wrote:
> >> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> >>> On 10/22/19 16:34, Thara Gopinath wrote:
> >>>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> >>>> pressure on a cpu means this maximum available capacity is reduced. This
> >>>> patch reduces the average thermal pressure for a cpu from its maximum
> >>>> available capacity so that cpu_capacity reflects the actual
> >>>> available capacity.
> >>>>
> >>>> Signed-off-by: Thara Gopinath <[email protected]>
> >>>> ---
> >>>> kernel/sched/fair.c | 1 +
> >>>> 1 file changed, 1 insertion(+)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 4f9c2cb..be3e802 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> >>>>
> >>>> used = READ_ONCE(rq->avg_rt.util_avg);
> >>>> used += READ_ONCE(rq->avg_dl.util_avg);
> >>>> + used += READ_ONCE(rq->avg_thermal.load_avg);
> >>>
> >>> Maybe a naive question - but can we add util_avg with load_avg without
> >>> a conversion? I thought the 2 signals have different properties.
> >>
> >> Changelog of patch #1 explains, it's in that dense blob of text.
> >>
> >> But yes, you're quite right that that wants a comment here.
> >
> > Thanks for the pointer! A comment would be nice indeed.
> >
> > To make sure I got this correctly - it's because avg_thermal.load_avg
> > represents delta_capacity which is already a 'converted' form of load. So this
> > makes avg_thermal.load_avg a util_avg really. Correct?
> >
> > If I managed to get it right somehow. It'd be nice if we can do inverse
> > conversion on delta_capacity so that avg_thermal.{load_avg, util_avg} meaning
> > is consistent across the board. But I don't feel strongly about it if this gets
> > documented properly.
>
> So why can't we use rq->avg_thermal.util_avg here? Since capacity is
> closer to util than to load?
>
> Is it because you want to use the influence of ___update_load_sum(...,
> unsigned long load eq. per-cpu delta_capacity in your signal?
>
> Why not call it this way then?

util_avg tracks a binary state with 2 fixed weights: running (1024) vs.
not running (0).
In the case of thermal pressure, we want to track how much pressure is
put on the CPU: capping to half the max frequency is not the same as
capping by only 10%.
load_avg is not boolean; you set the weight you want to apply, and
this weight reflects the amount of pressure.
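
To make the distinction concrete, here is a toy, self-contained C
sketch (assumed names and constants, not the kernel's
___update_load_sum()) contrasting a binary util-style accumulation with
a weighted load-style one; only the latter distinguishes a 10% cap from
a 50% cap:

#include <stdio.h>

/*
 * Toy contrast between a util_avg-style accumulation (binary: 1024
 * while "capped", 0 otherwise) and a load_avg-style accumulation
 * (weighted by the amount of capping).  Purely illustrative.
 */
static const unsigned long y1024 = 1002;	/* ~0.5^(1/32) in fixed point */

static unsigned long accumulate(unsigned long weight, unsigned int periods)
{
	unsigned long sum = 0;
	unsigned int i;

	for (i = 0; i < periods; i++)
		sum = ((sum * y1024) >> 10) + weight;
	return (sum * (1024 - y1024)) >> 10;	/* normalized average */
}

int main(void)
{
	/* util-style: both capping levels look identical ("capped or not") */
	printf("util-style, 10%% cap: %lu\n", accumulate(1024, 64));
	printf("util-style, 50%% cap: %lu\n", accumulate(1024, 64));
	/* load-style: the weight carries the size of the cap (102 vs 512 of 1024) */
	printf("load-style, 10%% cap: %lu\n", accumulate(102, 64));
	printf("load-style, 50%% cap: %lu\n", accumulate(512, 64));
	return 0;
}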

>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 38210691c615..d3035457483f 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
> u64 capacity)
> {
> if (___update_load_sum(now, &rq->avg_thermal,
> capacity,
> - capacity,
> - capacity)) {
> - ___update_load_avg(&rq->avg_thermal, 1, 1);
> + 0,
> + 0)) {
> + ___update_load_avg(&rq->avg_thermal, 1, 0);
> return 1;
> }

2019-10-31 16:20:07

by Dietmar Eggemann

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On 31.10.19 16:48, Vincent Guittot wrote:
> On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 31.10.19 11:53, Qais Yousef wrote:
>>> On 10/28/19 16:30, Peter Zijlstra wrote:
>>>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>>>> On 10/22/19 16:34, Thara Gopinath wrote:

[...]

>>> To make sure I got this correctly - it's because avg_thermal.load_avg
>>> represents delta_capacity which is already a 'converted' form of load. So this
>>> makes avg_thermal.load_avg a util_avg really. Correct?
>>>
>>> If I managed to get it right somehow. It'd be nice if we can do inverse
>>> conversion on delta_capacity so that avg_thermal.{load_avg, util_avg} meaning
>>> is consistent across the board. But I don't feel strongly about it if this gets
>>> documented properly.
>>
>> So why can't we use rq->avg_thermal.util_avg here? Since capacity is
>> closer to util than to load?
>>
>> Is it because you want to use the influence of ___update_load_sum(...,
>> unsigned long load eq. per-cpu delta_capacity in your signal?
>>
>> Why not call it this way then?
>
> util_avg tracks a binary state with 2 fixed weights: running(1024) vs
> not running (0)
> In the case of thermal pressure, we want to track how much pressure is
> put on the CPU: capping to half the max frequency is not the same as
> capping only 10%
> load_avg is not boolean but you set the weight you want to apply and
> this weight reflects the amount of pressure.

I see. This is what I meant by 'load (weight) eq. per-cpu delta_capacity
(pressure)'.


>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index 38210691c615..d3035457483f 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
>> u64 capacity)
>> {
>> if (___update_load_sum(now, &rq->avg_thermal,
>> capacity,
>> - capacity,
>> - capacity)) {
>> - ___update_load_avg(&rq->avg_thermal, 1, 1);
>> + 0,
>> + 0)) {
>> + ___update_load_avg(&rq->avg_thermal, 1, 0);
>> return 1;
>> }

So we could call it this way since we don't care about runnable_load or
util?

2019-10-31 16:36:05

by Vincent Guittot

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On Thu, 31 Oct 2019 at 17:17, Dietmar Eggemann <[email protected]> wrote:
>
> On 31.10.19 16:48, Vincent Guittot wrote:
> > On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 31.10.19 11:53, Qais Yousef wrote:
> >>> On 10/28/19 16:30, Peter Zijlstra wrote:
> >>>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> >>>>> On 10/22/19 16:34, Thara Gopinath wrote:
>
> [...]
>
> >>> To make sure I got this correctly - it's because avg_thermal.load_avg
> >>> represents delta_capacity which is already a 'converted' form of load. So this
> >>> makes avg_thermal.load_avg a util_avg really. Correct?
> >>>
> >>> If I managed to get it right somehow. It'd be nice if we can do inverse
> >>> conversion on delta_capacity so that avg_thermal.{load_avg, util_avg} meaning
> >>> is consistent across the board. But I don't feel strongly about it if this gets
> >>> documented properly.
> >>
> >> So why can't we use rq->avg_thermal.util_avg here? Since capacity is
> >> closer to util than to load?
> >>
> >> Is it because you want to use the influence of ___update_load_sum(...,
> >> unsigned long load eq. per-cpu delta_capacity in your signal?
> >>
> >> Why not call it this way then?
> >
> > util_avg tracks a binary state with 2 fixed weights: running(1024) vs
> > not running (0)
> > In the case of thermal pressure, we want to track how much pressure is
> > put on the CPU: capping to half the max frequency is not the same as
> > capping only 10%
> > load_avg is not boolean but you set the weight you want to apply and
> > this weight reflects the amount of pressure.
>
> I see. This is what I meant by 'load (weight) eq. per-cpu delta_capacity
> (pressure)'.
>
>
> >> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> >> index 38210691c615..d3035457483f 100644
> >> --- a/kernel/sched/pelt.c
> >> +++ b/kernel/sched/pelt.c
> >> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
> >> u64 capacity)
> >> {
> >> if (___update_load_sum(now, &rq->avg_thermal,
> >> capacity,
> >> - capacity,
> >> - capacity)) {
> >> - ___update_load_avg(&rq->avg_thermal, 1, 1);
> >> + 0,
> >> + 0)) {
> >> + ___update_load_avg(&rq->avg_thermal, 1, 0);
> >> return 1;
> >> }
>
> So we could call it this way since we don't care about runnable_load or
> util?

One way or the other is quite similar, but the current solution is
aligned with the other irq, rt and dl signals, which duplicate the same
state in each field.

2019-10-31 16:44:33

by Thara Gopinath

Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

On 10/31/2019 05:44 AM, Ionela Voinescu wrote:
> Hi Thara,
>
> On Tuesday 22 Oct 2019 at 16:34:19 (-0400), Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event of a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in the kernel, task scheduler is
>> not notified of capping of maximum frequency of a cpu.
>> In other words, scheduler is unware of maximum capacity
>
> Nit: s/unware/unaware
>
>> restrictions placed on a cpu due to thermal activity.
>> This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The reduction in the maximum possible capacity of a cpu due to a
>> thermal event can be considered as thermal pressure. Instantaneous
>> thermal pressure is hard to record and can sometime be erroneous
>> as there can be mismatch between the actual capping of capacity
>> and scheduler recording it. Thus solution is to have a weighted
>> average per cpu value for thermal pressure over time.
>> The weight reflects the amount of time the cpu has spent at a
>> capped maximum frequency. Since thermal pressure is recorded as
>> an average, it must be decayed periodically. Exisiting algorithm
>> in the kernel scheduler pelt framework is re-used to calculate
>> the weighted average. This patch series also defines a sysctl
>> inerface to allow for a configurable decay period.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on db845c platform with debian file system.
>> Further, dhrystone and hackbench tests have been
>> run with the thermal pressure algorithm. During testing, due to
>> constraints of step wise governor in dealing with big little systems,
>> trip point 0 temperature was made assymetric between cpus in little
>> cluster and big cluster; the idea being that
>> big core will heat up and cpu cooling device will throttle the
>> frequency of the big cores faster, there by limiting the maximum available
>> capacity and the scheduler will spread out tasks to little cores as well.
>>
>
> Can you please share the changes you've made to sdm845.dtsi and a kernel
> base on top of which to apply your patches? I would like to reproduce
> your results and run more tests and it would be good if our setups were
> as close as possible.
Hi Ionela,
Thank you for the review.
I tested this on the 5.4-rc1 kernel. The dtsi change is to reduce the
thermal trip points for the big CPUs to 60000 or 70000 from the default
90000. I did this for 2 reasons:
1. I could never get the db845 to heat up sufficiently for my test cases
with the default trip.
2. I was using the default step-wise governor for thermal. I did not
want little and big to start throttling by the same % because then the
task placement ratio would remain the same between little and big cores.


>
>> Test Results
>>
>> Hackbench: 1 group , 30000 loops, 10 runs
>> Result SD
>> (Secs) (% of mean)
>> No Thermal Pressure 14.03 2.69%
>> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
>> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
>> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
>> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
>> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
>>
>> Dhrystone Run Time : 20 threads, 3000 MLOOPS
>> Result SD
>> (Secs) (% of mean)
>> No Thermal Pressure 9.452 4.49%
>> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
>> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
>> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
>> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
>> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%
>>
>
> Do you happen to know by how much the CPUs were capped during these
> experiments?

I don't have any captured results here. I know that big cores were
capped and at times there was capacity inversion.

Also I will fix the nit comments above.

>
> Thanks,
> Ionela.
>



--
Warm Regards
Thara

2019-10-31 16:46:28

by Dietmar Eggemann

Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On 31.10.19 17:31, Vincent Guittot wrote:
> On Thu, 31 Oct 2019 at 17:17, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 31.10.19 16:48, Vincent Guittot wrote:
>>> On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 31.10.19 11:53, Qais Yousef wrote:
>>>>> On 10/28/19 16:30, Peter Zijlstra wrote:
>>>>>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>>>>>> On 10/22/19 16:34, Thara Gopinath wrote:

[...]

>>>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>>>> index 38210691c615..d3035457483f 100644
>>>> --- a/kernel/sched/pelt.c
>>>> +++ b/kernel/sched/pelt.c
>>>> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
>>>> u64 capacity)
>>>> {
>>>> if (___update_load_sum(now, &rq->avg_thermal,
>>>> capacity,
>>>> - capacity,
>>>> - capacity)) {
>>>> - ___update_load_avg(&rq->avg_thermal, 1, 1);
>>>> + 0,
>>>> + 0)) {
>>>> + ___update_load_avg(&rq->avg_thermal, 1, 0);
>>>> return 1;
>>>> }
>>
>> So we could call it this way since we don't care about runnable_load or
>> util?
>
> one way or the other is quite similar but the current solution is
> aligned with other irq, rt, dl signals which duplicates the same state
> in each fields

I see. But there is a subtle difference. For irq, rt and dl, we have to
also set load (even if we only use util) because of:

___update_load_sum() {

	...
	if (!load)
		runnable = running = 0;
	...
}

which is there for se's only.

I like self-explanatory code but I agree in this case it's not obvious.

2019-10-31 16:53:09

by Thara Gopinath

Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

On 10/31/2019 12:41 PM, Thara Gopinath wrote:
> On 10/31/2019 05:44 AM, Ionela Voinescu wrote:
>> Hi Thara,
>>
>> On Tuesday 22 Oct 2019 at 16:34:19 (-0400), Thara Gopinath wrote:
>>> Thermal governors can respond to an overheat event of a cpu by
>>> capping the cpu's maximum possible frequency. This in turn
>>> means that the maximum available compute capacity of the
>>> cpu is restricted. But today in the kernel, task scheduler is
>>> not notified of capping of maximum frequency of a cpu.
>>> In other words, scheduler is unware of maximum capacity
>>
>> Nit: s/unware/unaware
>>
>>> restrictions placed on a cpu due to thermal activity.
>>> This patch series attempts to address this issue.
>>> The benefits identified are better task placement among available
>>> cpus in event of overheating which in turn leads to better
>>> performance numbers.
>>>
>>> The reduction in the maximum possible capacity of a cpu due to a
>>> thermal event can be considered as thermal pressure. Instantaneous
>>> thermal pressure is hard to record and can sometime be erroneous
>>> as there can be mismatch between the actual capping of capacity
>>> and scheduler recording it. Thus solution is to have a weighted
>>> average per cpu value for thermal pressure over time.
>>> The weight reflects the amount of time the cpu has spent at a
>>> capped maximum frequency. Since thermal pressure is recorded as
>>> an average, it must be decayed periodically. Exisiting algorithm
>>> in the kernel scheduler pelt framework is re-used to calculate
>>> the weighted average. This patch series also defines a sysctl
>>> inerface to allow for a configurable decay period.
>>>
>>> Regarding testing, basic build, boot and sanity testing have been
>>> performed on db845c platform with debian file system.
>>> Further, dhrystone and hackbench tests have been
>>> run with the thermal pressure algorithm. During testing, due to
>>> constraints of step wise governor in dealing with big little systems,
>>> trip point 0 temperature was made assymetric between cpus in little
>>> cluster and big cluster; the idea being that
>>> big core will heat up and cpu cooling device will throttle the
>>> frequency of the big cores faster, there by limiting the maximum available
>>> capacity and the scheduler will spread out tasks to little cores as well.
>>>
>>
>> Can you please share the changes you've made to sdm845.dtsi and a kernel
>> base on top of which to apply your patches? I would like to reproduce
>> your results and run more tests and it would be good if our setups were
>> as close as possible.
> Hi Ionela
> Thank you for the review.
> So I tested this on a 5.4-rc1 kernel. The dtsi change is to reduce the
> thermal trip points for the big CPUs to 60000 or 70000 from the default
> 90000. I did this for two reasons:
> 1. I could never get the db845 to heat up sufficiently for my test cases
> with the default trip.
> 2. I was using the default step-wise governor for thermal. I did not
> want little and big to start throttling by the same % because then the
> task placement ratio will remain the same between little and big cores.
>

I am not sure, though, whether this is the setup under which Daniel ran
glmark2. I will let him comment on it.

>
>
>


--
Warm Regards
Thara

2019-10-31 17:00:47

by Qais Yousef

[permalink] [raw]
Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On 10/31/19 12:03, Thara Gopinath wrote:
> On 10/31/2019 06:53 AM, Qais Yousef wrote:
> > On 10/28/19 16:30, Peter Zijlstra wrote:
> >> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> >>> On 10/22/19 16:34, Thara Gopinath wrote:
> >>>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> >>>> pressure on a cpu means this maximum available capacity is reduced. This
> >>>> patch reduces the average thermal pressure for a cpu from its maximum
> >>>> available capacity so that cpu_capacity reflects the actual
> >>>> available capacity.
> >>>>
> >>>> Signed-off-by: Thara Gopinath <[email protected]>
> >>>> ---
> >>>> kernel/sched/fair.c | 1 +
> >>>> 1 file changed, 1 insertion(+)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 4f9c2cb..be3e802 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> >>>>
> >>>>  	used = READ_ONCE(rq->avg_rt.util_avg);
> >>>>  	used += READ_ONCE(rq->avg_dl.util_avg);
> >>>> +	used += READ_ONCE(rq->avg_thermal.load_avg);
> >>>
> >>> Maybe a naive question - but can we add util_avg with load_avg without
> >>> a conversion? I thought the 2 signals have different properties.
> >>
> >> Changelog of patch #1 explains, it's in that dense blob of text.
> >>
> >> But yes, you're quite right that that wants a comment here.
> >
> > Thanks for the pointer! A comment would be nice indeed.
> >
> > To make sure I got this correctly - it's because avg_thermal.load_avg
> > represents delta_capacity which is already a 'converted' form of load. So this
> > makes avg_thermal.load_avg a util_avg really. Correct?
> Hello Qais,
>
> Sorry for not replying to your earlier email. Thanks for the review.
> So if you look at the code, util_sum is calculated as a binary signal
> converted into capacity. Check out the below snippet from accumulate_sum():
>
> 	if (load)
> 		sa->load_sum += load * contrib;
> 	if (runnable)
> 		sa->runnable_load_sum += runnable * contrib;
> 	if (running)
> 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>
> So the actual delta for the thermal pressure will never be considered
> if util_avg is used.
>
> I will update this patch with a relevant comment.

Okay thanks for the explanation.

--
Qais Yousef

2019-10-31 17:50:37

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

On 31/10/2019 13:57, Ionela Voinescu wrote:
> On Thursday 31 Oct 2019 at 12:54:03 (+0100), Daniel Lezcano wrote:
>> Hi Ionela,
>>
>> On 31/10/2019 11:07, Ionela Voinescu wrote:
>>> Hi Daniel,
>>>
>>> On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
>>>> Hi Thara,
>>>>
>>>> On 22/10/2019 22:34, Thara Gopinath wrote:
>>>>> Thermal governors can respond to an overheat event of a cpu by
>>>>> capping the cpu's maximum possible frequency. This in turn
>>>>> means that the maximum available compute capacity of the
>>>>> cpu is restricted. But today in the kernel, task scheduler is
>>>>> not notified of capping of maximum frequency of a cpu.
>>>>> In other words, scheduler is unware of maximum capacity
>>>>> restrictions placed on a cpu due to thermal activity.
>>>>> This patch series attempts to address this issue.
>>>>> The benefits identified are better task placement among available
>>>>> cpus in event of overheating which in turn leads to better
>>>>> performance numbers.
>>>>>
>>>>> The reduction in the maximum possible capacity of a cpu due to a
>>>>> thermal event can be considered as thermal pressure. Instantaneous
>>>>> thermal pressure is hard to record and can sometime be erroneous
>>>>> as there can be mismatch between the actual capping of capacity
>>>>> and scheduler recording it. Thus solution is to have a weighted
>>>>> average per cpu value for thermal pressure over time.
>>>>> The weight reflects the amount of time the cpu has spent at a
>>>>> capped maximum frequency. Since thermal pressure is recorded as
>>>>> an average, it must be decayed periodically. Exisiting algorithm
>>>>> in the kernel scheduler pelt framework is re-used to calculate
>>>>> the weighted average. This patch series also defines a sysctl
>>>>> inerface to allow for a configurable decay period.
>>>>>
>>>>> Regarding testing, basic build, boot and sanity testing have been
>>>>> performed on db845c platform with debian file system.
>>>>> Further, dhrystone and hackbench tests have been
>>>>> run with the thermal pressure algorithm. During testing, due to
>>>>> constraints of step wise governor in dealing with big little systems,
>>>>> trip point 0 temperature was made assymetric between cpus in little
>>>>> cluster and big cluster; the idea being that
>>>>> big core will heat up and cpu cooling device will throttle the
>>>>> frequency of the big cores faster, there by limiting the maximum available
>>>>> capacity and the scheduler will spread out tasks to little cores as well.
>>>>>
>>>>> Test Results
>>>>>
>>>>> Hackbench: 1 group , 30000 loops, 10 runs
>>>>> Result SD
>>>>> (Secs) (% of mean)
>>>>> No Thermal Pressure 14.03 2.69%
>>>>> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
>>>>> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
>>>>> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
>>>>> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
>>>>> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
>>>>>
>>>>> Dhrystone Run Time : 20 threads, 3000 MLOOPS
>>>>> Result SD
>>>>> (Secs) (% of mean)
>>>>> No Thermal Pressure 9.452 4.49%
>>>>> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
>>>>> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
>>>>> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
>>>>> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
>>>>> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%
>>>>
>>>> I took the opportunity to try glmark2 on the db845c platform with the
>>>> default decay and got the following glmark2 scores:
>>>>
>>>> Without thermal pressure:
>>>>
>>>> # NumSamples = 9; Min = 790.00; Max = 805.00
>>>> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
>>>> # each ∎ represents a count of 1
>>>> 790.0000 - 791.5000 [ 2]: ∎∎
>>>> 791.5000 - 793.0000 [ 2]: ∎∎
>>>> 793.0000 - 794.5000 [ 2]: ∎∎
>>>> 794.5000 - 796.0000 [ 1]: ∎
>>>> 796.0000 - 797.5000 [ 0]:
>>>> 797.5000 - 799.0000 [ 1]: ∎
>>>> 799.0000 - 800.5000 [ 0]:
>>>> 800.5000 - 802.0000 [ 0]:
>>>> 802.0000 - 803.5000 [ 0]:
>>>> 803.5000 - 805.0000 [ 1]: ∎
>>>>
>>>>
>>>> With thermal pressure:
>>>>
>>>> # NumSamples = 9; Min = 933.00; Max = 960.00
>>>> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
>>>> # each ∎ represents a count of 1
>>>> 933.0000 - 935.7000 [ 3]: ∎∎∎
>>>> 935.7000 - 938.4000 [ 2]: ∎∎
>>>> 938.4000 - 941.1000 [ 2]: ∎∎
>>>> 941.1000 - 943.8000 [ 0]:
>>>> 943.8000 - 946.5000 [ 0]:
>>>> 946.5000 - 949.2000 [ 1]: ∎
>>>> 949.2000 - 951.9000 [ 0]:
>>>> 951.9000 - 954.6000 [ 0]:
>>>> 954.6000 - 957.3000 [ 0]:
>>>> 957.3000 - 960.0000 [ 1]: ∎
>>>>
>>>
>>> Interesting! If I'm interpreting these correctly there seems to be
>>> significant improvement when applying thermal pressure.
>>>
>>> I'm not familiar with glmark2, can you tell me more about the process
>>> and the work that the benchmark does?
>>
>> glmark2 is a 3D benchmark. I ran it without parameters, so all tests are
>> run. At the end, it gives a score which are the values given above.
>>
>>> I assume this is a GPU benchmark,
>>> but not knowing more about it I fail to see the correlation between
>>> applying thermal pressure to CPU capacities and the improvement of GPU
>>> performance.
>>> Do you happen to know more about the behaviour that resulted in these
>>> benchmark scores?
>>
>> My hypothesis is glmark2 makes the GPU to contribute a lot to the
>> heating effect, thus increasing the temperature to the CPU close to it.
>>
>
> Hhmm.. yes, I am assuming that there is some thermal mitigation (CPU
> frequency capping) done as a result of the heat inflicted by the work
> on the GPU, but these patches do not result in better thermal
> management as for the GPU to perform better. They only inform the
> scheduler in regards to reduced capacity of CPUs so it can decide to
> better use the compute capacity that it has available.
>
> There could be a second hand effect of the more efficient use of the
> CPUs which would release thermal headroom for the GPU to use, but I
> would not expect the differences to be as high as in the results above.

Indeed, you may be right.

> Another possibility is that work on the CPUs impacts the scores more
> than I would expect for such a benchmark but again I would not
> expect the work on the CPUs to be significant as to result in such
> differences in the scores.
>
> If you have the chance to look more into exactly what is the behaviour,
> with and without thermal pressure - cooling states, average frequency,
> use of CPUs, use of GPU, etc, it would be very valuable.

Not sure I have enough bandwidth to do all of that. I'll double-check
whether there is a difference when testing both versions.



--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2019-10-31 21:10:32

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure

On 10/31/2019 06:53 AM, Qais Yousef wrote:
> On 10/28/19 16:30, Peter Zijlstra wrote:
>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>> On 10/22/19 16:34, Thara Gopinath wrote:
>>>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
>>>> pressure on a cpu means this maximum available capacity is reduced. This
>>>> patch reduces the average thermal pressure for a cpu from its maximum
>>>> available capacity so that cpu_capacity reflects the actual
>>>> available capacity.
>>>>
>>>> Signed-off-by: Thara Gopinath <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4f9c2cb..be3e802 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>>>>
> >>>>  	used = READ_ONCE(rq->avg_rt.util_avg);
> >>>>  	used += READ_ONCE(rq->avg_dl.util_avg);
> >>>> +	used += READ_ONCE(rq->avg_thermal.load_avg);
>>>
>>> Maybe a naive question - but can we add util_avg with load_avg without
>>> a conversion? I thought the 2 signals have different properties.
>>
>> Changelog of patch #1 explains, it's in that dense blob of text.
>>
>> But yes, you're quite right that that wants a comment here.
>
> Thanks for the pointer! A comment would be nice indeed.
>
> To make sure I got this correctly - it's because avg_thermal.load_avg
> represents delta_capacity which is already a 'converted' form of load. So this
> makes avg_thermal.load_avg a util_avg really. Correct?
Hello Qais,

Sorry for not replying to your earlier email. Thanks for the review.
So if you look at the code, util_sum is calculated as a binary signal
converted into capacity. Check out the below snippet from accumulate_sum():

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_load_sum += runnable * contrib;
	if (running)
		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

So the actual delta for the thermal pressure will never be considered
if util_avg is used.

I will update this patch with a relevant comment.
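As a toy calculation of the above (made-up numbers, not from the series): a
load-style sum multiplies in the value passed as load, so the size of the
capped-capacity delta is preserved, whereas a util-style sum only records that
something was running, scaled to full capacity, so the size of the delta is
lost.

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10

int main(void)
{
	/* illustrative: 200 units of capacity (out of 1024) capped by thermal */
	unsigned long delta = 200;
	unsigned long contrib = 1024;	/* pretend one full accumulation segment */

	/* load-style accumulation keeps the magnitude of the delta */
	unsigned long load_sum = delta * contrib;
	/* util-style accumulation only records "running", scaled to capacity */
	unsigned long util_sum = contrib << SCHED_CAPACITY_SHIFT;

	printf("load-style avg ~ %lu\n", load_sum / contrib);		/* ~200  */
	printf("util-style avg ~ %lu\n", util_sum >> SCHED_CAPACITY_SHIFT); /* ~1024 */
	return 0;
}

This is consistent with the patch reading the thermal signal back through
rq->avg_thermal.load_avg in scale_rt_capacity() rather than through util_avg.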




--
Warm Regards
Thara

2019-11-05 21:07:41

by Ionela Voinescu

[permalink] [raw]
Subject: Re: [Patch v4 0/6] Introduce Thermal Pressure

Hi Thara,

On Thursday 31 Oct 2019 at 12:41:20 (-0400), Thara Gopinath wrote:
[...]
> >> Regarding testing, basic build, boot and sanity testing have been
> >> performed on db845c platform with debian file system.
> >> Further, dhrystone and hackbench tests have been
> >> run with the thermal pressure algorithm. During testing, due to
> >> constraints of step wise governor in dealing with big little systems,
> >> trip point 0 temperature was made assymetric between cpus in little
> >> cluster and big cluster; the idea being that
> >> big core will heat up and cpu cooling device will throttle the
> >> frequency of the big cores faster, there by limiting the maximum available
> >> capacity and the scheduler will spread out tasks to little cores as well.
> >>
> >
> > Can you please share the changes you've made to sdm845.dtsi and a kernel
> > base on top of which to apply your patches? I would like to reproduce
> > your results and run more tests and it would be good if our setups were
> > as close as possible.
> Hi Ionela
> Thank you for the review.
> So I tested this on a 5.4-rc1 kernel. The dtsi change is to reduce the
> thermal trip points for the big CPUs to 60000 or 70000 from the default
> 90000. I did this for two reasons:
> 1. I could never get the db845 to heat up sufficiently for my test cases
> with the default trip.
> 2. I was using the default step-wise governor for thermal. I did not
> want little and big to start throttling by the same % because then the
> task placement ratio will remain the same between little and big cores.
>
>

Some early testing on this showed that, when setting the trip point to
60000 for the big CPUs and the big cluster and running hackbench (1
group, 30000 loops), the cooling state of the big cluster ends up
always being set to the maximum (the lowest OPP), which results in
capacity inversion (almost) continuously.

For 70000 the average cooling state of the bigs is around 20, so a few
more OPPs are left available on the bigs more of the time, but the
capacity of the bigs is probably still mostly lower than the capacity
of the little CPUs during this test as well.

I think that explains the difference in results that you obtained
below. This is good as it shows that thermal pressure is useful but
it shouldn't show much difference between the different decay
periods, as can also be observed in your results below.
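For reference, a small sketch of what capacity inversion means here (the
capacity numbers are made up; real values depend on the platform): once the
thermal pressure subtracted from a big CPU exceeds the capacity gap to the
littles, the bigs end up with less usable capacity than the littles.

#include <stdio.h>

int main(void)
{
	/* made-up capacities; real values depend on the platform */
	unsigned long big_max = 1024, little_cap = 460;
	unsigned long pressure;

	/* 'pressure' plays the role of rq->avg_thermal.load_avg on a big CPU */
	for (pressure = 0; pressure <= 700; pressure += 100) {
		unsigned long big_cap = big_max - pressure;

		printf("pressure=%4lu big=%4lu little=%4lu%s\n",
		       pressure, big_cap, little_cap,
		       big_cap < little_cap ? "  <-- capacity inversion" : "");
	}
	return 0;
}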

This being said, I did not obtain such significant results on my side,
but I'll try again with the kernel you've pointed me to offline.

Thanks,
Ionela.

> >
> >> Test Results
> >>
> >> Hackbench: 1 group , 30000 loops, 10 runs
> >> Result SD
> >> (Secs) (% of mean)
> >> No Thermal Pressure 14.03 2.69%
> >> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
> >> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
> >> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
> >> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
> >> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
> >>
> >> Dhrystone Run Time : 20 threads, 3000 MLOOPS
> >> Result SD
> >> (Secs) (% of mean)
> >> No Thermal Pressure 9.452 4.49%
> >> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
> >> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
> >> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
> >> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
> >> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%
> >>
> >
> > Do you happen to know by how much the CPUs were capped during these
> > experiments?
>
> I don't have any captured results here. I know that big cores were
> capped and at times there was capacity inversion.
>
> Also I will fix the nit comments above.
>
> >
> > Thanks,
> > Ionela.
> >
>
>
>
> --
> Warm Regards
> Thara