Thermal governors can respond to an overheat event of a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the kernel, the task scheduler is
not notified of capping of the maximum frequency of a cpu.
In other words, the scheduler is unaware of the maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefit identified is better task placement among available
cpus in the event of overheating, which in turn leads to better
performance numbers.
The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. Thus the solution is to have a weighted
average per-cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. To this end, this
patch series defines a configurable decay period.
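As a rough illustration of the averaging, the short user-space sketch
below mirrors the windowed weighted average used by calculate_simple()
in the thermal.c patch later in this series (the capping numbers are
made up purely for illustration and are not measured data):

/*
 * Minimal user-space sketch of the windowed weighted average; not part
 * of the patches, numbers are made up for illustration only.
 */
#include <stdio.h>

#define DECAY_PERIOD_MS	500UL

int main(void)
{
	unsigned long old_scale = 1024;	/* average from the previous window */
	unsigned long acc_scale = 0;	/* time-weighted sum for this window */
	unsigned long scale;

	/* 200 ms spent capped at scale 512, then 300 ms uncapped at 1024 */
	acc_scale += 200 * 512;
	acc_scale += 300 * 1024;

	/* Blend the previous window's average with this window's accumulation */
	scale = (old_scale * DECAY_PERIOD_MS + acc_scale) /
		(DECAY_PERIOD_MS + 500);
	printf("averaged scale = %lu\n", scale);	/* (512000 + 409600) / 1000 = 921 */
	return 0;
}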
Regarding testing, basic build, boot and sanity testing have been
performed on a hikey960 running a mainline kernel with a Debian file system.
Further, aobench (an occlusion renderer for benchmarking real-world
floating point performance), dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step-wise governor in dealing with big.LITTLE systems,
cpu cooling was disabled on the little cores, the idea being that the
big cores will heat up and the cpu cooling device will throttle the
frequency of the big cores, thereby limiting the maximum available
capacity, and the scheduler will spread out tasks to the little cores as well.
Finally, this patch series has been boot tested on a db410c running a v5.1-rc4
kernel.
During the course of development various methods of capturing
and reflecting thermal pressure were implemented.
The first method to be evaluated was to convert the
capped max frequency into capacity and have the scheduler use the
instantaneous value when updating cpu_capacity.
This method is referenced as "Instantaneous Thermal Pressure" in the
test results below.
The next two methods employ different ways of averaging the
thermal pressure before applying it when updating cpu_capacity.
The first of these methods re-used the PELT algorithm already present
in the kernel that does the averaging of rt and dl load and utilization.
This method is referenced as "Thermal Pressure Averaging using PELT fmwk"
in the test results below.
The final method employs an averaging algorithm that collects and
decays thermal pressure based on the decay period. In this method,
the decay period is configurable. This method is referenced as
"Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the
test results below.
The test results below show a 3-5% improvement in performance when
using the third solution compared to the default system today where
the scheduler is unaware of cpu capacity limitations due to thermal events.
Hackbench: (1 group, 30000 loops, 10 runs)
                                         Result        Standard Deviation
                                         (Time Secs)   (% of mean)
No Thermal Pressure                      10.21         7.99%
Instantaneous thermal pressure           10.16         5.36%
Thermal Pressure Averaging
    using PELT fmwk                       9.88         3.94%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 500 ms         9.94         4.59%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 250 ms         7.52         5.42%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 125 ms         9.87         3.94%
Aobench: Size 2048 * 2048
                                         Result        Standard Deviation
                                         (Time Secs)   (% of mean)
No Thermal Pressure                      141.58        15.85%
Instantaneous thermal pressure           141.63        15.03%
Thermal Pressure Averaging
    using PELT fmwk                      134.48        13.16%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 500 ms        133.62        13.00%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 250 ms        137.22        15.30%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 125 ms        137.55        13.26%
Dhrystone was run 10 times with each run spawning 20 threads of
500 MLOOPS. The idea here is to measure the total dhrystone run
time and not look at individual processor performance.
Dhrystone Run Time
                                         Result        Standard Deviation
                                         (Time Secs)   (% of mean)
No Thermal Pressure                      1.14          10.04%
Instantaneous thermal pressure           1.15          9%
Thermal Pressure Averaging
    using PELT fmwk                      1.19          11.60%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 500 ms        1.09          7.51%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 250 ms        1.012         7.02%
Thermal Pressure Averaging
    non-PELT Algo. Decay : 125 ms        1.12          9.02%
V1->V2: Removed the use of the PELT framework for thermal pressure accumulation
and averaging. Instead, implemented a weighted average algorithm.
Thara Gopinath (3):
Calculate Thermal Pressure
  sched/fair: update cpu_capacity to reflect thermal pressure
thermal/cpu-cooling: Update thermal pressure in case of a maximum
frequency capping
drivers/thermal/cpu_cooling.c | 4 +
include/linux/sched/thermal.h | 11 +++
kernel/sched/Makefile | 2 +-
kernel/sched/fair.c | 4 +
kernel/sched/thermal.c | 220 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 240 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sched/thermal.h
create mode 100644 kernel/sched/thermal.c
--
2.1.4
cpu_capacity reflects the maximum available capacity of a cpu. Thermal
pressure on a cpu means this maximum available capacity is reduced. This
patch applies the average thermal pressure of a cpu to its maximum
available capacity so that cpu_capacity reflects the actual
available capacity.
Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8213ff6..c5454d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@
#include "sched.h"
#include <trace/events/sched.h>
+#include <linux/sched/thermal.h>
/*
* Targeted preemption latency for CPU-bound tasks:
@@ -7967,6 +7968,9 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
if (!capacity)
capacity = 1;
+ capacity *= sched_get_thermal_pressure(cpu);
+ capacity >>= SCHED_CAPACITY_SHIFT;
+
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
--
2.1.4
Enable the cpufreq cooling device to update the thermal pressure in the
event of a capped maximum frequency or removal of a capped maximum
frequency.
Signed-off-by: Thara Gopinath <[email protected]>
---
drivers/thermal/cpu_cooling.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 6fff161..d5cc3c3 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -31,6 +31,7 @@
#include <linux/slab.h>
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
+#include <linux/sched/thermal.h>
#include <trace/events/thermal.h>
@@ -177,6 +178,9 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
if (policy->max > clipped_freq)
cpufreq_verify_within_limits(policy, 0, clipped_freq);
+
+ sched_update_thermal_pressure(policy->cpus,
+ policy->max, policy->cpuinfo.max_freq);
break;
}
mutex_unlock(&cooling_list_lock);
--
2.1.4
Add thermal.c and thermal.h files that provide interface
APIs to initialize, update/average, track, accumulate and decay
thermal pressure on a per-cpu basis. A per-cpu structure, thermal_pressure, is
introduced to keep track of the instantaneous per-cpu thermal pressure.
Per-cpu timers are scheduled to accumulate and decay thermal pressure
periodically. Two interfaces are introduced: sched_update_thermal_pressure,
to be called from any entity that caps the maximum frequency of a cpu,
and sched_get_thermal_pressure, to be called by the scheduler to get the
thermal pressure of the cpu.
Signed-off-by: Thara Gopinath <[email protected]>
---
include/linux/sched/thermal.h | 11 +++
kernel/sched/Makefile | 2 +-
kernel/sched/thermal.c | 220 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 232 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sched/thermal.h
create mode 100644 kernel/sched/thermal.c
diff --git a/include/linux/sched/thermal.h b/include/linux/sched/thermal.h
new file mode 100644
index 0000000..cda158e
--- /dev/null
+++ b/include/linux/sched/thermal.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SCHED_THERMAL_H
+#define _LINUX_SCHED_THERMAL_H
+
+void sched_update_thermal_pressure(struct cpumask *cpus,
+ unsigned long cap_max_freq,
+ unsigned long max_freq);
+
+unsigned long sched_get_thermal_pressure(int cpu);
+
+#endif /* _LINUX_SCHED_THERMAL_H */
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 21fb5a5..4d3b820 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
new file mode 100644
index 0000000..1acee52
--- /dev/null
+++ b/kernel/sched/thermal.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Scheduler Thermal Interactions
+ *
+ * Copyright (C) 2018 Linaro, Inc., Thara Gopinath <[email protected]>
+ */
+
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include "sched.h"
+
+/* Per cpu structure to keep track of Thermal Pressure */
+struct thermal_pressure {
+ unsigned long scale; /* scale reflecting average cpu max capacity*/
+ unsigned long acc_scale; /* Accumulated scale for this time window */
+ unsigned long old_scale; /* Scale value for the previous window */
+ unsigned long raw_scale; /* Raw max capacity */
+ unsigned long age_stamp; /* Last time old_scale was updated */
+ unsigned long last_update; /* Last time acc_scale was updated */
+ spinlock_t lock; /* Lock for protecting from simultaneous access*/
+ /* Timer for periodic update of thermal pressure */
+ struct timer_list timer;
+ int cpu;
+};
+
+DEFINE_PER_CPU(struct thermal_pressure *, thermal_pressure_cpu);
+
+#define THERMAL_PRESSURE_DECAY_PERIOD (NSEC_PER_SEC / 2)
+
+static unsigned long calculate_simple(struct thermal_pressure *cpu_thermal,
+ s64 delta, s64 period)
+{
+ unsigned long scale;
+ s64 decay_period = THERMAL_PRESSURE_DECAY_PERIOD;
+
+ cpu_thermal->acc_scale += delta * cpu_thermal->raw_scale;
+ scale = cpu_thermal->old_scale * decay_period;
+ scale += cpu_thermal->acc_scale;
+ scale /= (decay_period + period);
+ cpu_thermal->last_update += delta;
+
+ return scale;
+}
+
+/*
+ * Calculate thermal pressure.
+ * At the crux this is an averaging algorithm. Initially a tunable
+ * decay period (D) is defined. Thermal pressure at the end of a decay
+ * period D is the average of thermal pressure of period D-1 and D.
+ *
+ * Time D-2 D-1 D
+ * ----------------------------------------------------------
+ * Raw Thermal r1 r2 r3
+ * Pressure
+ *
+ * Average Thermal r1 (r1+r2)/2 ((r1+r2)/2 + r3)/2
+ * Pressure.
+ */
+static void calculate_thermal_pressure(struct thermal_pressure *cpu_thermal)
+{
+ unsigned long scale;
+ s64 now, delta, decay_period, period;
+ int cpu;
+
+ if (!cpu_thermal)
+ return;
+
+ cpu = cpu_thermal->cpu;
+ now = sched_clock_cpu(cpu);
+ period = now - cpu_thermal->age_stamp;
+ decay_period = THERMAL_PRESSURE_DECAY_PERIOD;
+
+ if (period <= 0)
+ return;
+
+ /*
+ * If period is less than decay_period,
+ * just accumulate thermal pressure
+ */
+ if (period < decay_period) {
+ delta = now - cpu_thermal->last_update;
+ scale = calculate_simple(cpu_thermal, delta, period);
+ } else {
+ /* delta here is the remaining time in the last time window */
+ delta = decay_period -
+ (cpu_thermal->last_update - cpu_thermal->age_stamp);
+ scale = calculate_simple(cpu_thermal, delta, decay_period);
+ cpu_thermal->acc_scale = 0;
+ cpu_thermal->age_stamp += decay_period;
+ /* Decay thermal pressure for every decay period remaining */
+ while ((sched_clock_cpu(cpu) - cpu_thermal->age_stamp)
+ > decay_period) {
+ scale += cpu_thermal->raw_scale;
+ scale /= 2;
+ cpu_thermal->age_stamp += decay_period;
+ cpu_thermal->last_update += decay_period;
+ }
+ cpu_thermal->old_scale = scale;
+ delta = sched_clock_cpu(cpu) - cpu_thermal->age_stamp;
+ if (delta > 0)
+ scale = calculate_simple(cpu_thermal, delta, delta);
+ }
+ cpu_thermal->scale = scale;
+}
+
+static void thermal_pressure_update(struct thermal_pressure *cpu_thermal,
+ unsigned long cap_max_freq,
+ unsigned long max_freq, bool change_scale)
+{
+ unsigned long flags = 0;
+
+ calculate_thermal_pressure(cpu_thermal);
+ if (change_scale)
+ cpu_thermal->raw_scale =
+ (cap_max_freq << SCHED_CAPACITY_SHIFT) / max_freq;
+
+ mod_timer(&cpu_thermal->timer, jiffies +
+ usecs_to_jiffies(TICK_USEC));
+
+ spin_unlock_irqrestore(&cpu_thermal->lock, flags);
+}
+
+/**
+ * Function for the tick update of the thermal pressure.
+ * The thermal pressure update is aborted if already an update is
+ * happening.
+ */
+static void thermal_pressure_timeout(struct timer_list *timer)
+{
+ struct thermal_pressure *cpu_thermal = from_timer(cpu_thermal, timer,
+ timer);
+ unsigned long flags = 0;
+
+ if (!cpu_thermal)
+ return;
+
+ if (!spin_trylock_irqsave(&cpu_thermal->lock, flags))
+ return;
+
+ thermal_pressure_update(cpu_thermal, 0, 0, 0);
+}
+
+/**
+ * Function to update thermal pressure from cooling device
+ * or any framework responsible for capping cpu maximum
+ * capacity.
+ */
+void sched_update_thermal_pressure(struct cpumask *cpus,
+ unsigned long cap_max_freq,
+ unsigned long max_freq)
+{
+ int cpu;
+ unsigned long flags = 0;
+ struct thermal_pressure *cpu_thermal;
+
+ for_each_cpu(cpu, cpus) {
+ cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
+ if (!cpu_thermal)
+ return;
+ spin_lock_irqsave(&cpu_thermal->lock, flags);
+ thermal_pressure_update(cpu_thermal, cap_max_freq, max_freq, 1);
+ }
+}
+
+/**
+ * Function to be called from scheduler to get thermal pressure
+ * of a cpu
+ */
+unsigned long sched_get_thermal_pressure(int cpu)
+{
+ struct thermal_pressure *cpu_thermal = per_cpu(thermal_pressure_cpu,
+ cpu);
+
+ if (!cpu_thermal)
+ return SCHED_CAPACITY_SCALE;
+ else
+ return cpu_thermal->scale;
+}
+
+static void __init init_thermal_pressure(void)
+{
+ struct thermal_pressure *cpu_thermal;
+ unsigned long scale;
+ int cpu;
+
+ pr_debug("Init thermal pressure\n");
+ for_each_possible_cpu(cpu) {
+ cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
+ if (cpu_thermal)
+ continue;
+
+ cpu_thermal = kzalloc(sizeof(*cpu_thermal), GFP_KERNEL);
+ if (!cpu_thermal)
+ continue;
+ scale = SCHED_CAPACITY_SCALE;
+ cpu_thermal->scale = scale;
+ cpu_thermal->old_scale = scale;
+ cpu_thermal->raw_scale = scale;
+ cpu_thermal->age_stamp = sched_clock_cpu(cpu);
+ cpu_thermal->last_update = sched_clock_cpu(cpu);
+ cpu_thermal->cpu = cpu;
+ spin_lock_init(&cpu_thermal->lock);
+ timer_setup(&cpu_thermal->timer, thermal_pressure_timeout,
+ TIMER_DEFERRABLE);
+ per_cpu(thermal_pressure_cpu, cpu) = cpu_thermal;
+ pr_debug("cpu %d thermal scale = %ld\n", cpu, cpu_thermal->scale);
+ }
+
+ for_each_possible_cpu(cpu) {
+ cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
+ if (!cpu_thermal)
+ continue;
+ cpu_thermal->timer.expires = jiffies +
+ usecs_to_jiffies(TICK_USEC);
+ add_timer(&cpu_thermal->timer);
+ }
+}
+
+late_initcall(init_thermal_pressure);
--
2.1.4
* Thara Gopinath <[email protected]> wrote:
> The test results below show a 3-5% improvement in performance when
> using the third solution compared to the default system today where
> the scheduler is unaware of cpu capacity limitations due to thermal events.
The numbers look very promising!
I've rearranged the results to make the performance properties of the
various approaches and parameters easier to see:
(seconds, lower is better)
Hackbench Aobench Dhrystone
========= ======= =========
Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14
Instantaneous thermal pressure 10.16 141.63 1.15
Thermal Pressure Averaging:
- PELT fmwk 9.88 134.48 1.19
- non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09
- non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012
- non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12
Firstly, a couple of questions about the numbers:
1)
Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
You reported it as:
non-PELT Algo. Decay : 250 ms 1.012 7.02%
But the formatting shows 3 significant digits versus only two for all
the other results.
2)
You reported the hackbench numbers with "10 runs" - did the other
benchmarks use 10 runs as well? Maybe you used fewer runs for the
longest benchmark, Aobench?
Secondly, it appears the non-PELT decaying average is the best approach,
but the results are a bit coarse around the ~250 msecs peak. Maybe it
would be good to measure it in 50 msecs steps between 50 msecs and 1000
msecs - but only if it can be scripted sanely:
A possible approach would be to add a debug sysctl for the tuning period,
and script all these benchmark runs and the printing of the results. You
could add another (debug) sysctl to turn the 'instant' logic on, and to
restore vanilla kernel behavior as well - this makes it all much easier
to script and measure with a single kernel image, without having to
reboot the kernel. The sysctl overhead will not be measurable for
workloads like this.
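Just to illustrate the idea (completely untested, and the variable and
file names below are made up, not something from this series), the debug
knob could be as simple as:

#include <linux/init.h>
#include <linux/sysctl.h>

/* Hypothetical decay period knob, in milliseconds */
static unsigned int sysctl_sched_thermal_decay_period_ms = 500;

static struct ctl_table sched_thermal_sysctls[] = {
	{
		.procname	= "sched_thermal_decay_period_ms",
		.data		= &sysctl_sched_thermal_decay_period_ms,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,
	},
	{ }
};

static int __init sched_thermal_sysctl_init(void)
{
	/* Shows up as /proc/sys/kernel/sched_thermal_decay_period_ms */
	register_sysctl("kernel", sched_thermal_sysctls);
	return 0;
}
late_initcall(sched_thermal_sysctl_init);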
Then you can use "perf stat --null --table" to measure runtime and stddev
easily and with a single tool, for example:
dagon:~> perf stat --null --sync --repeat 10 --table ./hackbench 20 >benchmark.out
Performance counter stats for './hackbench 20' (10 runs):
# Table of individual measurements:
0.15246 (-0.03960) ######
0.20832 (+0.01627) ##
0.17895 (-0.01310) ##
0.19791 (+0.00585) #
0.19209 (+0.00004) #
0.19406 (+0.00201) #
0.22484 (+0.03278) ###
0.18695 (-0.00511) #
0.19032 (-0.00174) #
0.19464 (+0.00259) #
# Final result:
0.19205 +- 0.00592 seconds time elapsed ( +- 3.08% )
Note how all the individual measurements can be captured this way,
without seeing the benchmark output itself. So different benchmarks can
be measured this way, assuming they don't have too long setup time.
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> * Thara Gopinath <[email protected]> wrote:
>
> > The test results below show a 3-5% improvement in performance when
> > using the third solution compared to the default system today where
> > the scheduler is unaware of cpu capacity limitations due to thermal events.
>
> The numbers look very promising!
>
> I've rearranged the results to make the performance properties of the
> various approaches and parameters easier to see:
>
> (seconds, lower is better)
>
> Hackbench Aobench Dhrystone
> ========= ======= =========
> Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14
> Instantaneous thermal pressure 10.16 141.63 1.15
> Thermal Pressure Averaging:
> - PELT fmwk 9.88 134.48 1.19
> - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09
> - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012
> - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12
So what I forgot to say is that IMO your results show robust improvements
over the vanilla kernel of around 5%, with a relatively straightforward
thermal pressure metric. So I suspect we could do something like this, if
there was a bit more measurements done to get the best decay period
established - the 125-250-500 msecs results seem a bit coarse and not
entirely unambiguous.
In terms of stddev: the perf stat --pre hook could be used to add a dummy
benchmark run, to heat up the test system, to get more reliable, less
noisy numbers?
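Something like this (untested) ought to do a warm-up run before each
measured iteration:

  perf stat --null --sync --repeat 10 --pre './hackbench 20 >/dev/null' --table ./hackbench 20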
BTW., that big improvement in hackbench results to 7.52 at 250 msecs, is
that real, or a fluke perhaps?
Thanks,
Ingo
On 04/17/2019 01:36 AM, Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
>> The test results below show a 3-5% improvement in performance when
>> using the third solution compared to the default system today where
>> the scheduler is unaware of cpu capacity limitations due to thermal events.
>
> The numbers look very promising!
Hello Ingo,
Thank you for the review.
>
> I've rearranged the results to make the performance properties of the
> various approaches and parameters easier to see:
>
> (seconds, lower is better)
>
> Hackbench Aobench Dhrystone
> ========= ======= =========
> Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14
> Instantaneous thermal pressure 10.16 141.63 1.15
> Thermal Pressure Averaging:
> - PELT fmwk 9.88 134.48 1.19
> - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09
> - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012
> - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12
>
>
> Firstly, a couple of questions about the numbers:
>
> 1)
>
> Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
> You reported it as:
>
> non-PELT Algo. Decay : 250 ms 1.012 7.02%
It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
multiple times because of the anomalies noticed. 1.012 is a formatting
error on my part when I copy pasted the results into a google sheet I am
maintaining to capture the test results. Sorry about the confusion.
>
> But the formatting shows 3 significant digits versus only two for all
> the other results.
>
> 2)
>
> You reported the hackbench numbers with "10 runs" - did the other
> benchmarks use 10 runs as well? Maybe you used fewer runs for the
> longest benchmark, Aobench?
Hackbench and dhrystone are 10 runs each. Aobench is part of the Phoronix
test suite and the test suite runs it six times and gives the per-run
results, mean and stddev. On my part, I ran aobench just once per
configuration.
>
> Secondly, it appears the non-PELT decaying average is the best approach,
> but the results are a bit coarse around the ~250 msecs peak. Maybe it
> would be good to measure it in 50 msecs steps between 50 msecs and 1000
> msecs - but only if it can be scripted sanely:
non-PELT looks better overall because the test results are quite
comparable (if not better) between the two solutions and it takes care
of concerns people raised when I posted V1 using PELT-fmwk algo
regarding reuse of utilization signal to track thermal pressure.
Regarding the decay period, I agree that more testing can be done. I
like your suggestions below and I am going to try implementing them
sometime next week. Once I have some solid results, I will send them out.
My concern regarding getting hung up too much on decay period is that I
think it could vary from SoC to SoC depending on the type and number of
cores and thermal characteristics. So I was thinking eventually the
decay period should be configurable via a config option or by any other
means. Testing on different systems will definitely help and maybe I am
wrong and there is not much variation between systems.
Regards
Thara
>
> A possible approach would be to add a debug sysctl for the tuning period,
> and script all these benchmark runs and the printing of the results. You
> could add another (debug) sysctl to turn the 'instant' logic on, and to
> restore vanilla kernel behavior as well - this makes it all much easier
> to script and measure with a single kernel image, without having to
> reboot the kernel. The sysctl overhead will not be measurable for
> workloads like this.
>
> Then you can use "perf stat --null --table" to measure runtime and stddev
> easily and with a single tool, for example:
>
> dagon:~> perf stat --null --sync --repeat 10 --table ./hackbench 20 >benchmark.out
>
> Performance counter stats for './hackbench 20' (10 runs):
>
> # Table of individual measurements:
> 0.15246 (-0.03960) ######
> 0.20832 (+0.01627) ##
> 0.17895 (-0.01310) ##
> 0.19791 (+0.00585) #
> 0.19209 (+0.00004) #
> 0.19406 (+0.00201) #
> 0.22484 (+0.03278) ###
> 0.18695 (-0.00511) #
> 0.19032 (-0.00174) #
> 0.19464 (+0.00259) #
>
> # Final result:
> 0.19205 +- 0.00592 seconds time elapsed ( +- 3.08% )
>
> Note how all the individual measurements can be captured this way,
> without seeing the benchmark output itself. So different benchmarks can
> be measured this way, assuming they don't have too long setup time.
>
> Thanks,
>
> Ingo
>
--
Regards
Thara
On 04/17/2019 01:55 AM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> * Thara Gopinath <[email protected]> wrote:
>>
>>> The test results below show a 3-5% improvement in performance when
>>> using the third solution compared to the default system today where
>>> the scheduler is unaware of cpu capacity limitations due to thermal events.
>>
>> The numbers look very promising!
>>
>> I've rearranged the results to make the performance properties of the
>> various approaches and parameters easier to see:
>>
>> (seconds, lower is better)
>>
>> Hackbench Aobench Dhrystone
>> ========= ======= =========
>> Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14
>> Instantaneous thermal pressure 10.16 141.63 1.15
>> Thermal Pressure Averaging:
>> - PELT fmwk 9.88 134.48 1.19
>> - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09
>> - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012
>> - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12
>
> So what I forgot to say is that IMO your results show robust improvements
> over the vanilla kernel of around 5%, with a relatively straightforward
> thermal pressure metric. So I suspect we could do something like this, if
> there was a bit more measurements done to get the best decay period
> established - the 125-250-500 msecs results seem a bit coarse and not
> entirely unambiguous.
To give you the background, I started with a decay period of 500 ms. No
other reason except that the previous version of rt-pressure that existed
in the scheduler employed a 500 ms decay period. Then the idea was to
decrease the decay period by half and see what happens and so on. But I
agree that it is a bit coarse. I will probably get around to
implementing some of your suggestions to capture more granular results
in the next few weeks.
>
> In terms of stddev: the perf stat --pre hook could be used to add a dummy
> benchmark run, to heat up the test system, to get more reliable, less
> noisy numbers?
>
> BTW., that big improvement in hackbench results to 7.52 at 250 msecs, is
> that real, or a fluke perhaps?
For me, it is an anomaly. Having said that, I did rerun the tests with
this configuration at least twice (if not more) and the results were
similar. It is an anomaly because I have no explanation as to why there
is so much improvement at the 250 ms decay period.
>
> Thanks,
>
> Ingo
>
--
Regards
Thara
* Thara Gopinath <[email protected]> wrote:
>
> On 04/17/2019 01:36 AM, Ingo Molnar wrote:
> >
> > * Thara Gopinath <[email protected]> wrote:
> >
> >> The test results below show a 3-5% improvement in performance when
> >> using the third solution compared to the default system today where
> >> the scheduler is unaware of cpu capacity limitations due to thermal events.
> >
> > The numbers look very promising!
>
> Hello Ingo,
> Thank you for the review.
> >
> > I've rearranged the results to make the performance properties of the
> > various approaches and parameters easier to see:
> >
> > (seconds, lower is better)
> >
> > Hackbench Aobench Dhrystone
> > ========= ======= =========
> > Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14
> > Instantaneous thermal pressure 10.16 141.63 1.15
> > Thermal Pressure Averaging:
> > - PELT fmwk 9.88 134.48 1.19
> > - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09
> > - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012
> > - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12
> >
> >
> > Firstly, a couple of questions about the numbers:
> >
> > 1)
> >
> > Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
> > You reported it as:
> >
> > non-PELT Algo. Decay : 250 ms 1.012 7.02%
>
> It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
> multiple times because of the anomalies noticed. 1.012 is a formatting
> error on my part when I copy pasted the results into a google sheet I am
> maintaining to capture the test results. Sorry about the confusion.
That's actually pretty good, because it suggests a 35% and 15%
improvement over the vanilla kernel - which is very good for such
CPU-bound workloads.
Not that 5% is bad in itself - but 15% is better ;-)
> Regarding the decay period, I agree that more testing can be done. I
> like your suggestions below and I am going to try implementing them
> sometime next week. Once I have some solid results, I will send them
> out.
Thanks!
> My concern regarding getting hung up too much on decay period is that I
> think it could vary from SoC to SoC depending on the type and number of
> cores and thermal characteristics. So I was thinking eventually the
> decay period should be configurable via a config option or by any other
> means. Testing on different systems will definitely help and maybe I am
> wrong and there is not much variation between systems.
Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so,
until there's a better understanding of how the physical properties of
the SoC map to an ideal decay period.
Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
tracking approach. I suppose there's some connection of this to Energy
Aware Scheduling? Or not ...
Thanks,
Ingo
On 04/17/2019 02:29 PM, Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
>>
>> On 04/17/2019 01:36 AM, Ingo Molnar wrote:
>>>
>>> * Thara Gopinath <[email protected]> wrote:
>>>
>>>> The test results below show a 3-5% improvement in performance when
>>>> using the third solution compared to the default system today where
>>>> the scheduler is unaware of cpu capacity limitations due to thermal events.
>>>
>>> The numbers look very promising!
>>
>> Hello Ingo,
>> Thank you for the review.
>>>
>>> I've rearranged the results to make the performance properties of the
>>> various approaches and parameters easier to see:
>>>
>>> (seconds, lower is better)
>>>
>>> Hackbench Aobench Dhrystone
>>> ========= ======= =========
>>> Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14
>>> Instantaneous thermal pressure 10.16 141.63 1.15
>>> Thermal Pressure Averaging:
>>> - PELT fmwk 9.88 134.48 1.19
>>> - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09
>>> - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012
>>> - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12
>>>
>>>
>>> Firstly, a couple of questions about the numbers:
>>>
>>> 1)
>>>
>>> Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
>>> You reported it as:
>>>
>>> non-PELT Algo. Decay : 250 ms 1.012 7.02%
>>
>> It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
>> multiple times because of the anomalies noticed. 1.012 is a formatting
>> error on my part when I copy pasted the results into a google sheet I am
>> maintaining to capture the test results. Sorry about the confusion.
>
> That's actually pretty good, because it suggests a 35% and 15%
> improvement over the vanilla kernel - which is very good for such
> CPU-bound workloads.
>
> Not that 5% is bad in itself - but 15% is better ;-)
>
>> Regarding the decay period, I agree that more testing can be done. I
>> like your suggestions below and I am going to try implementing them
>> sometime next week. Once I have some solid results, I will send them
>> out.
>
> Thanks!
>
>> My concern regarding getting hung up too much on decay period is that I
>> think it could vary from SoC to SoC depending on the type and number of
>> cores and thermal characteristics. So I was thinking eventually the
>> decay period should be configurable via a config option or by any other
>> means. Testing on different systems will definitely help and maybe I am
>> wrong and there is not much variation between systems.
>
> Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so,
> until there's a better understanding of how the physical properties of
> the SoC map to an ideal decay period.
>
> Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> tracking approach. I suppose there's some connection of this to Energy
> Aware Scheduling? Or not ...
Mmm.. Not so much. This does not have much to do with EAS. The feature
itself will be really useful if there are asymmetric cpus in the system
rather than symmetric cpus. In case of SMP, since all cores have the same
capacity and assuming any thermal mitigation will be implemented across
all the cpus, there won't be any different scheduler behavior.
Regards
Thara
>
> Thanks,
>
> Ingo
>
--
Regards
Thara
On Wednesday 17 Apr 2019 at 20:29:32 (+0200), Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
> >
> > On 04/17/2019 01:36 AM, Ingo Molnar wrote:
> > >
> > > * Thara Gopinath <[email protected]> wrote:
> > >
> > >> The test results below show a 3-5% improvement in performance when
> > >> using the third solution compared to the default system today where
> > >> the scheduler is unaware of cpu capacity limitations due to thermal events.
> > >
> > > The numbers look very promising!
> >
> > Hello Ingo,
> > Thank you for the review.
> > >
> > > I've rearranged the results to make the performance properties of the
> > > various approaches and parameters easier to see:
> > >
> > > (seconds, lower is better)
> > >
> > > Hackbench Aobench Dhrystone
> > > ========= ======= =========
> > > Vanilla kernel (No Thermal Pressure) 10.21 141.58 1.14
> > > Instantaneous thermal pressure 10.16 141.63 1.15
> > > Thermal Pressure Averaging:
> > > - PELT fmwk 9.88 134.48 1.19
> > > - non-PELT Algo. Decay : 500 ms 9.94 133.62 1.09
> > > - non-PELT Algo. Decay : 250 ms 7.52 137.22 1.012
> > > - non-PELT Algo. Decay : 125 ms 9.87 137.55 1.12
> > >
> > >
> > > Firstly, a couple of questions about the numbers:
> > >
> > > 1)
> > >
> > > Is the 1.012 result for "non-PELT 250 msecs Dhrystone" really 1.012?
> > > You reported it as:
> > >
> > > non-PELT Algo. Decay : 250 ms 1.012 7.02%
> >
> > It is indeed 1.012. So, I ran the "non-PELT Algo 250 ms" benchmarks
> > multiple times because of the anomalies noticed. 1.012 is a formatting
> > error on my part when I copy pasted the results into a google sheet I am
> > maintaining to capture the test results. Sorry about the confusion.
>
> That's actually pretty good, because it suggests a 35% and 15%
> improvement over the vanilla kernel - which is very good for such
> CPU-bound workloads.
>
> Not that 5% is bad in itself - but 15% is better ;-)
>
> > Regarding the decay period, I agree that more testing can be done. I
> > like your suggestions below and I am going to try implementing them
> > sometime next week. Once I have some solid results, I will send them
> > out.
>
> Thanks!
>
> > My concern regarding getting hung up too much on decay period is that I
> > think it could vary from SoC to SoC depending on the type and number of
> > cores and thermal characteristics. So I was thinking eventually the
> > decay period should be configurable via a config option or by any other
> > means. Testing on different systems will definitely help and maybe I am
> > wrong and there is not much variation between systems.
>
> Absolutely, so I'd not be against keeping it a SCHED_DEBUG tunable or so,
> until there's a better understanding of how the physical properties of
> the SoC map to an ideal decay period.
+1, that'd be really useful to try this out on several platforms.
> Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> tracking approach.
I certainly don't hate it :-) In fact we already have something in the
Android kernel to reflect thermal pressure into the CPU capacity using
the 'instantaneous' approach. I'm all in favour of replacing our
out-of-tree stuff by a mainline solution, and even more if that performs
better.
So yes, we need to discuss the implementation details and all, but I'd
personally be really happy to see something upstream in this area.
> I suppose there's some connection of this to Energy
> Aware Scheduling? Or not ...
Hmm, there isn't an immediate connection, I think. But that could
change.
FWIW I'm currently pushing a patch-set to make the thermal subsystem use
the same Energy Model as EAS ([1]) instead of its own. There are several
good reasons to do this, but one of them is to make sure the scheduler
and the thermal stuff (and the rest of the kernel) have a consistent
definition of what 'power' means. That might enable us to do smart things
in the scheduler, but that's really for later.
Thanks,
Quentin
[1] https://lore.kernel.org/lkml/[email protected]/
On Tuesday 16 Apr 2019 at 15:38:41 (-0400), Thara Gopinath wrote:
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> @@ -177,6 +178,9 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>
> if (policy->max > clipped_freq)
> cpufreq_verify_within_limits(policy, 0, clipped_freq);
> +
> + sched_update_thermal_pressure(policy->cpus,
> + policy->max, policy->cpuinfo.max_freq);
Is this something we could do in CPUFreq ? Directly in
cpufreq_verify_within_limits() perhaps ?
That would re-define the 'thermal pressure' framework in a more abstract
way and make the scheduler look at 'frequency capping' events,
regardless of the reason for capping.
That would reflect user-defined frequency constraint into cpu_capacity,
in addition to the thermal stuff. I'm not sure if there is another use
case for frequency capping ?
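Not a real patch, but just to illustrate where that could live, the
generic limits helper could do the update itself, so any capping source
gets reflected (the added call is purely a sketch):

static inline void
cpufreq_verify_within_limits(struct cpufreq_policy *policy, unsigned int min,
			     unsigned int max)
{
	if (policy->min < min)
		policy->min = min;
	if (policy->max < min)
		policy->max = min;
	if (policy->min > max)
		policy->min = max;
	if (policy->max > max)
		policy->max = max;
	if (policy->min > policy->max)
		policy->min = policy->max;

	/* Reflect the capping, whatever its reason, as pressure */
	sched_update_thermal_pressure(policy->cpus, policy->max,
				      policy->cpuinfo.max_freq);
}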
Perhaps the Intel boost stuff could be factored in there ? That is,
at times when the boost freq is not reachable capacity_of() would appear
smaller ... Unless this wants to be reflected instantaneously ?
Thoughts ?
Quentin
On Tuesday 16 Apr 2019 at 15:38:39 (-0400), Thara Gopinath wrote:
> +/**
> + * Function to update thermal pressure from cooling device
> + * or any framework responsible for capping cpu maximum
> + * capacity.
> + */
> +void sched_update_thermal_pressure(struct cpumask *cpus,
> + unsigned long cap_max_freq,
> + unsigned long max_freq)
> +{
> + int cpu;
> + unsigned long flags = 0;
> + struct thermal_pressure *cpu_thermal;
> +
> + for_each_cpu(cpu, cpus) {
Is it actually required to do this for each CPU ? You could calculate
the whole thing once for the first CPU, and apply the result to all CPUs
in the policy no ? All CPUs in a policy are capped and uncapped
synchronously.
> + cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
> + if (!cpu_thermal)
> + return;
> + spin_lock_irqsave(&cpu_thermal->lock, flags);
> + thermal_pressure_update(cpu_thermal, cap_max_freq, max_freq, 1);
> + }
> +}
On 04/18/2019 05:48 AM, Quentin Perret wrote:
> On Tuesday 16 Apr 2019 at 15:38:41 (-0400), Thara Gopinath wrote:
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> @@ -177,6 +178,9 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>>
>> if (policy->max > clipped_freq)
>> cpufreq_verify_within_limits(policy, 0, clipped_freq);
>> +
>> + sched_update_thermal_pressure(policy->cpus,
>> + policy->max, policy->cpuinfo.max_freq);
>
> Is this something we could do in CPUFreq ? Directly in
> cpufreq_verify_within_limits() perhaps ?
>
> That would re-define the 'thermal pressure' framework in a more abstract
> way and make the scheduler look at 'frequency capping' events,
> regardless of the reason for capping.
>
> That would reflect user-defined frequency constraint into cpu_capacity,
> in addition to the thermal stuff. I'm not sure if there is another use
> case for frequency capping ?
Hi Quentin,
Thanks for the review. Sorry for the delay in response as I was on
vacation for the past few days.
I think there is one major difference between user-defined frequency
constraints and frequency constraints due to thermal events in terms of
the time period the system spends in the constraint state.
Typically, a user constraint lasts for seconds if not minutes and I
think in this case cpu_capacity_orig should reflect this constraint and
not cpu_capacity like this patch set. Also, in case of the user
constraint, there is possibly no need to accumulate and average the
capacity constraints and instantaneous values can be directly applied to
cpu_capacity_orig. On the other hand thermal pressure is more spiky and
sometimes in the order of ms and us requiring the accumulating and
averaging.
>
> Perhaps the Intel boost stuff could be factored in there ? That is,
> at times when the boost freq is not reachable capacity_of() would appear
> smaller ... Unless this wants to be reflected instantaneously ?
Again, do you think intel boost is more applicable to be reflected in
cpu_capacity_orig and not cpu_capacity?
>
> Thoughts ?
> Quentin
>
--
Regards
Thara
On 04/18/2019 06:14 AM, Quentin Perret wrote:
> On Tuesday 16 Apr 2019 at 15:38:39 (-0400), Thara Gopinath wrote:
>> +/**
>> + * Function to update thermal pressure from cooling device
>> + * or any framework responsible for capping cpu maximum
>> + * capacity.
>> + */
>> +void sched_update_thermal_pressure(struct cpumask *cpus,
>> + unsigned long cap_max_freq,
>> + unsigned long max_freq)
>> +{
>> + int cpu;
>> + unsigned long flags = 0;
>> + struct thermal_pressure *cpu_thermal;
>> +
>> + for_each_cpu(cpu, cpus) {
>
> Is it actually required to do this for each CPU ? You could calculate
> the whole thing once for the first CPU, and apply the result to all CPUs
> in the policy no ? All CPUs in a policy are capped and uncapped
> synchronously.
Hmm. You are right that all cpus in a policy are capped and uncapped
synchronously from the thermal framework point of view. But the thermal
pressure decay can happen at different times for each cpu and hence the
update has to be done on a per-cpu basis (especially to keep track of
the age stamp and other variables in the averaging and accumulating
algorithm). It can be separated out but I think it will just make the
solution more complicated.
>
>> + cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
>> + if (!cpu_thermal)
>> + return;
>> + spin_lock_irqsave(&cpu_thermal->lock, flags);
>> + thermal_pressure_update(cpu_thermal, cap_max_freq, max_freq, 1);
>> + }
>> +}
--
Regards
Thara
Hi guys,
On 23/04/2019 23:38, Thara Gopinath wrote:
> On 04/18/2019 05:48 AM, Quentin Perret wrote:
>> On Tuesday 16 Apr 2019 at 15:38:41 (-0400), Thara Gopinath wrote:
>>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>>> @@ -177,6 +178,9 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>>>
>>> if (policy->max > clipped_freq)
>>> cpufreq_verify_within_limits(policy, 0, clipped_freq);
>>> +
>>> + sched_update_thermal_pressure(policy->cpus,
>>> + policy->max, policy->cpuinfo.max_freq);
>>
>> Is this something we could do in CPUFreq ? Directly in
>> cpufreq_verify_within_limits() perhaps ?
>>
>> That would re-define the 'thermal pressure' framework in a more abstract
>> way and make the scheduler look at 'frequency capping' events,
>> regardless of the reason for capping.
>>
>> That would reflect user-defined frequency constraint into cpu_capacity,
>> in addition to the thermal stuff. I'm not sure if there is another use
>> case for frequency capping ?
> Hi Quentin,
> Thanks for the review. Sorry for the delay in response as I was on
> vacation for the past few days.
> I think there is one major difference between user-defined frequency
> constraints and frequency constraints due to thermal events in terms of
> the time period the system spends in the constraint state.
> Typically, a user constraint lasts for seconds if not minutes and I
> think in this case cpu_capacity_orig should reflect this constraint and
> not cpu_capacity like this patch set. Also, in case of the user
> constraint, there is possibly no need to accumulate and average the
> capacity constraints and instantaneous values can be directly applied to
> cpu_capacity_orig. On the other hand thermal pressure is more spiky and
> sometimes in the order of ms and us requiring the accumulating and
> averaging.
I think we can't make any assumptions in regards to the intentions of
the user when restricting the OPP range through the cpufreq interface,
but it would still be nice to do something and reflecting it as thermal
pressure would be a good start. It might not be due to thermal, but it
is a capacity restriction that would have the same result. Also, if the
user has the ability to tune the decay period he has the control over
the behavior of the signal. Given that currently there isn't a smarter
mechanism (modifying capacity orig, re-normalising the capacity range)
for long-term capping, even treating it as short-term capping is a good
start. But this is a bigger exercise and it needs thorough
consideration, so it could be skipped, in my opinion, for now..
Also, if we want to stick with the "definition", userspace would still
be able to reflect thermal pressure through the thermal limits interface
by setting the cooling device state, which will be reflected in this
update as well. So userspace would have a mechanism to reflect thermal
pressure.
One addition.. I like that the thermal pressure framework is not tied to
cpufreq. There are firmware solutions that do not bother informing
cpufreq of limits being changed, and therefore all of this could be
skipped. But any firmware driver could call sched_update_thermal_pressure
on notifications for limits changing from firmware, which is an
important feature.
>>
>> Perhaps the Intel boost stuff could be factored in there ? That is,
>> at times when the boost freq is not reachable capacity_of() would appear
>> smaller ... Unless this wants to be reflected instantaneously ?
> Again, do you think intel boost is more applicable to be reflected in
> cpu_capacity_orig and not cpu_capacity?
>>
>> Thoughts ?
>> Quentin
>>
>
The changes here would happen even faster than thermal capping, same as
other restrictions imposed by firmware, so it would not seem right to me
to reflect it in capacity_orig. Reflecting it as thermal pressure is
another matter, which I'd say it should be up to the client. The big
disadvantage I'd see for this is coping with decisions made while being
capped, when you're not capped any longer, and the other way around. I
believe these changes would happen too often and they will not happen in
a ramp-up/ramp-down behavior that we expect from thermal mitigation.
That's why I believe averaging/regulation of the signal works well in
this case, and it might not for power related fast restrictions.
But given these three cases above, it might be that the ideal solution
is for this framework to be made more generic and for each client to be
able to obtain and configure a pressure signal to be reflected
separately in the capacity of each CPU.
My two pennies' worth,
Ionela.
On Tue, Apr 16, 2019 at 03:38:39PM -0400, Thara Gopinath wrote:
> +static void thermal_pressure_update(struct thermal_pressure *cpu_thermal,
> + unsigned long cap_max_freq,
> + unsigned long max_freq, bool change_scale)
> +{
> + unsigned long flags = 0;
> +
> + calculate_thermal_pressure(cpu_thermal);
> + if (change_scale)
> + cpu_thermal->raw_scale =
> + (cap_max_freq << SCHED_CAPACITY_SHIFT) / max_freq;
That needs {} per coding style.
> +
> + mod_timer(&cpu_thermal->timer, jiffies +
> + usecs_to_jiffies(TICK_USEC));
> +
> + spin_unlock_irqrestore(&cpu_thermal->lock, flags);
This is busted as heck, @flags should be the result of irqsave(.flags).
> +}
> +
> +/**
> + * Function for the tick update of the thermal pressure.
> + * The thermal pressure update is aborted if already an update is
> + * happening.
> + */
> +static void thermal_pressure_timeout(struct timer_list *timer)
> +{
> + struct thermal_pressure *cpu_thermal = from_timer(cpu_thermal, timer,
> + timer);
If you split after the = the result is so very much easier to read.
> + unsigned long flags = 0;
> +
> + if (!cpu_thermal)
> + return;
> +
> + if (!spin_trylock_irqsave(&cpu_thermal->lock, flags))
> + return;
> +
> + thermal_pressure_update(cpu_thermal, 0, 0, 0);
> +}
> +
> +/**
> + * Function to update thermal pressure from cooling device
> + * or any framework responsible for capping cpu maximum
> + * capacity.
Would be useful to know how wide @cpus is, typically. Is that the power
cluster?
> + */
> +void sched_update_thermal_pressure(struct cpumask *cpus,
> + unsigned long cap_max_freq,
> + unsigned long max_freq)
> +{
> + int cpu;
> + unsigned long flags = 0;
> + struct thermal_pressure *cpu_thermal;
You got them in exactly the wrong order here.
> +
> + for_each_cpu(cpu, cpus) {
> + cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
> + if (!cpu_thermal)
> + return;
> + spin_lock_irqsave(&cpu_thermal->lock, flags);
> + thermal_pressure_update(cpu_thermal, cap_max_freq, max_freq, 1);
> + }
> +}
That's just horrible style. Move the unlock out of
thermal_pressure_update() such that lock and unlock are in the same
function.
On Tue, Apr 16, 2019 at 03:38:41PM -0400, Thara Gopinath wrote:
> Enable the cpufreq cooling device to update the thermal pressure in the
> event of a capped maximum frequency or removal of a capped maximum
> frequency.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
> drivers/thermal/cpu_cooling.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 6fff161..d5cc3c3 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -31,6 +31,7 @@
> #include <linux/slab.h>
> #include <linux/cpu.h>
> #include <linux/cpu_cooling.h>
> +#include <linux/sched/thermal.h>
>
> #include <trace/events/thermal.h>
>
> @@ -177,6 +178,9 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>
> if (policy->max > clipped_freq)
> cpufreq_verify_within_limits(policy, 0, clipped_freq);
> +
> + sched_update_thermal_pressure(policy->cpus,
> + policy->max, policy->cpuinfo.max_freq);
If it's already telling the cpufreq thing, why not get it from sugov
instead?
Hi Thara,
The idea and the results look promising. I'm trying to understand better
the cause of the improvements so I've added below some questions that
would help me out with this.
> Regarding testing, basic build, boot and sanity testing have been
> performed on a hikey960 running a mainline kernel with a Debian file system.
> Further, aobench (an occlusion renderer for benchmarking real-world
> floating point performance), dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of the step-wise governor in dealing with big.LITTLE systems,
> cpu cooling was disabled on the little cores, the idea being that the
> big cores will heat up and the cpu cooling device will throttle the
> frequency of the big cores, thereby limiting the maximum available
> capacity, and the scheduler will spread out tasks to the little cores as well.
> Finally, this patch series has been boot tested on a db410c running a v5.1-rc4
> kernel.
>
Did you try using IPA as well? It is better equipped to deal with
big-LITTLE systems and it's more probable IPA will be used for these
systems, where your solution will have the biggest impact as well.
The difference will be that you'll have both the big cluster and the
LITTLE cluster capped in different proportions depending on their
utilization and their efficiency.
> During the course of development various methods of capturing
> and reflecting thermal pressure were implemented.
>
> The first method to be evaluated was to convert the
> capped max frequency into capacity and have the scheduler use the
> instantaneous value when updating cpu_capacity.
> This method is referenced as "Instantaneous Thermal Pressure" in the
> test results below.
>
> The next two methods employ different ways of averaging the
> thermal pressure before applying it when updating cpu_capacity.
> The first of these methods re-used the PELT algorithm already present
> in the kernel that does the averaging of rt and dl load and utilization.
> This method is referenced as "Thermal Pressure Averaging using PELT fmwk"
> in the test results below.
>
> The final method employs an averaging algorithm that collects and
> decays thermal pressure based on the decay period. In this method,
> the decay period is configurable. This method is referenced as
> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the
> test results below.
>
> The test results below show a 3-5% improvement in performance when
> using the third solution compared to the default system today where
> the scheduler is unaware of cpu capacity limitations due to thermal events.
>
Did you happen to record the amount of capping imposed on the big cores
when these results were obtained? Did you find scenarios where the
capacity of the bigs ended up lower than the capacity of the
LITTLEs (capacity inversion)?
This is one case where we'll see a big impact in considering thermal
pressure.
Also, given that these are more or less sustained workloads, I'm
wondering if there is any effect on workloads running on an uncapped
system following capping. I would image such a test being composed of a
single threaded period (no capping) followed by a multi-threaded period
(with capping), continued in a loop. It might be interesting to have
something like this as well, as part of your test coverage.
Thanks,
Ionela.
On Tue, Apr 16, 2019 at 03:38:39PM -0400, Thara Gopinath wrote:
> +static void __init init_thermal_pressure(void)
> +{
> + struct thermal_pressure *cpu_thermal;
> + unsigned long scale;
> + int cpu;
> +
> + pr_debug("Init thermal pressure\n");
> + for_each_possible_cpu(cpu) {
> + cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
> + if (cpu_thermal)
> + continue;
> +
> + cpu_thermal = kzalloc(sizeof(*cpu_thermal), GFP_KERNEL);
> + if (!cpu_thermal)
> + continue;
So if you unconditionally allocate this memory, why not just have:
DEFINE_PER_CPU(struct thermal_pressure, cpu_thermal_pressure); ?
Or stick it in struct rq or whatever.
> + scale = SCHED_CAPACITY_SCALE;
> + cpu_thermal->scale = scale;
> + cpu_thermal->old_scale = scale;
> + cpu_thermal->raw_scale = scale;
> + cpu_thermal->age_stamp = sched_clock_cpu(cpu);
> + cpu_thermal->last_update = sched_clock_cpu(cpu);
> + cpu_thermal->cpu = cpu;
> + spin_lock_init(&cpu_thermal->lock);
> + timer_setup(&cpu_thermal->timer, thermal_pressure_timeout,
> + TIMER_DEFERRABLE);
> + per_cpu(thermal_pressure_cpu, cpu) = cpu_thermal;
> + pr_debug("cpu %d thermal scale = %ld\n", cpu, cpu_thermal->scale);
> + }
> +
> + for_each_possible_cpu(cpu) {
You just done that, what gives?
> + cpu_thermal = per_cpu(thermal_pressure_cpu, cpu);
> + if (!cpu_thermal)
> + continue;
> + cpu_thermal->timer.expires = jiffies +
> + usecs_to_jiffies(TICK_USEC);
That's like a very complicated way of writing: '1', right?
> + add_timer(&cpu_thermal->timer);
> + }
> +}
So really what you want is a hook into the tick?
On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:
> Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> tracking approach.
I seem to remember competing proposals, and have forgotten everything
about them; the cover letter also didn't have references to them or
mention them in any way.
As to the averaging and period, I personally prefer a PELT signal with
the windows lined up; if that really is too short a window, then a
PELT-like signal with a natural multiple of the PELT period would make
sense, such that the windows still line up nicely.
Mixing different averaging methods and non-aligned windows just makes me
uncomfortable.
On Tuesday 16 Apr 2019 at 15:38:39 (-0400), Thara Gopinath wrote:
> +/* Per cpu structure to keep track of Thermal Pressure */
> +struct thermal_pressure {
> + unsigned long scale; /* scale reflecting average cpu max capacity*/
> + unsigned long acc_scale; /* Accumulated scale for this time window */
> + unsigned long old_scale; /* Scale value for the previous window */
> + unsigned long raw_scale; /* Raw max capacity */
> + unsigned long age_stamp; /* Last time old_scale was updated */
> + unsigned long last_update; /* Last time acc_scale was updated */
> + spinlock_t lock; /* Lock for protecting from simultaneous access*/
> + /* Timer for periodic update of thermal pressure */
> + struct timer_list timer;
Do you actually need the periodic update ? You only really need to
update the 'scale' value when updating the LB stats no ? Nobody
accesses that value in between two LBs.
Thanks,
Quentin
On Thu, 25 Apr 2019 at 12:57, Quentin Perret <[email protected]> wrote:
>
> On Tuesday 16 Apr 2019 at 15:38:39 (-0400), Thara Gopinath wrote:
> > +/* Per cpu structure to keep track of Thermal Pressure */
> > +struct thermal_pressure {
> > + unsigned long scale; /* scale reflecting average cpu max capacity*/
> > + unsigned long acc_scale; /* Accumulated scale for this time window */
> > + unsigned long old_scale; /* Scale value for the previous window */
> > + unsigned long raw_scale; /* Raw max capacity */
> > + unsigned long age_stamp; /* Last time old_scale was updated */
> > + unsigned long last_update; /* Last time acc_scale was updated */
> > + spinlock_t lock; /* Lock for protecting from simultaneous access*/
> > + /* Timer for periodic update of thermal pressure */
> > + struct timer_list timer;
>
> Do you actually need the periodic update ? You only really need to
> update the 'scale' value when updating the LB stats no ? Nobody
> accesses that value in between two LBs.
Do you mean calling a variant of sched_update_thermal_pressure() in
update_cpu_capacity() instead of periodic update ?
Yes , that should be enough
>
> Thanks,
> Quentin
On Thursday 25 Apr 2019 at 14:45:57 (+0200), Vincent Guittot wrote:
> On Thu, 25 Apr 2019 at 12:57, Quentin Perret <[email protected]> wrote:
> >
> > On Tuesday 16 Apr 2019 at 15:38:39 (-0400), Thara Gopinath wrote:
> > > +/* Per cpu structure to keep track of Thermal Pressure */
> > > +struct thermal_pressure {
> > > + unsigned long scale; /* scale reflecting average cpu max capacity*/
> > > + unsigned long acc_scale; /* Accumulated scale for this time window */
> > > + unsigned long old_scale; /* Scale value for the previous window */
> > > + unsigned long raw_scale; /* Raw max capacity */
> > > + unsigned long age_stamp; /* Last time old_scale was updated */
> > > + unsigned long last_update; /* Last time acc_scale was updated */
> > > + spinlock_t lock; /* Lock for protecting from simultaneous access*/
> > > + /* Timer for periodic update of thermal pressure */
> > > + struct timer_list timer;
> >
> > Do you actually need the periodic update ? You only really need to
> > update the 'scale' value when updating the LB stats no ? Nobody
> > accesses that value in between two LBs.
>
> Do you mean calling a variant of sched_update_thermal_pressure() in
> update_cpu_capacity() instead of periodic update ?
> Yes , that should be enough
Right something like this, and remove all the timers. Should be a bit
cleaner I guess.
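A rough sketch of that direction, with heavy caveats: thermal_pressure_scale()
and apply_thermal_pressure() are assumed names for whatever variant of
sched_update_thermal_pressure() ends up doing the decay/averaging and for the
hook point in the load-balance path, not settled API:

    /* Assumed helper: decay/average and return the current scale for @cpu,
     * in [0, SCHED_CAPACITY_SCALE]. */
    static unsigned long thermal_pressure_scale(int cpu);

    /* Called from update_cpu_capacity() in the load-balance path,
     * replacing the per-cpu deferrable timers. */
    static void apply_thermal_pressure(struct rq *rq, unsigned long *capacity)
    {
            /* Fold the averaged thermal pressure into the capacity the
             * load balancer sees. */
            *capacity = (*capacity * thermal_pressure_scale(cpu_of(rq)))
                            >> SCHED_CAPACITY_SHIFT;
    }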
Thanks,
Quentin
On Thu, 25 Apr 2019 at 12:45, Quentin Perret <[email protected]> wrote:
>
> On Tuesday 23 Apr 2019 at 18:38:46 (-0400), Thara Gopinath wrote:
> > I think there is one major difference between user-defined frequency
> > constraints and frequency constraints due to thermal events in terms of
> > the time period the system spends in the the constraint state.
> > Typically, a user constraint lasts for seconds if not minutes and I
> > think in this case cpu_capacity_orig should reflect this constraint and
> > not cpu_capacity like this patch set.
>
> That might not always be true I think. There's tons of userspace thermal
> deamons out there, and I wouldn't be suprised if they were writing into
> the cpufreq sysfs files, although I'm not sure.
They would be better off using the sysfs set_target interface of the
cpu_cooling device in this case.
>
> Another thing is, if you want to change the capacity_orig value, you'll
> need to rebuild the sched domains and all I believe. Otherwise there is
> a risk to 'break' the sd_asym flags. So we need to make sure we're happy
> to pay that price.
That would be the goal: if userspace uses the sysfs interface of
cpufreq to set a new max frequency, it should be considered a long-term
change with regard to the scheduling rate, and in this case it would be
interesting to update capacity_orig and rebuild the sched_domains.
>
> > Also, in case of the user
> > constraint, there is possibly no need to accumulate and average the
> > capacity constraints and instantaneous values can be directly applied to
> > cpu_capacity_orig. On the other hand thermal pressure is more spiky and
> > sometimes in the order of ms and us requiring the accumulating and
> > averaging.
> > >
> > > Perhaps the Intel boost stuff could be factored in there ? That is,
> > > at times when the boost freq is not reachable capacity_of() would appear
> > > smaller ... Unless this wants to be reflected instantaneously ?
> > Again, do you think intel boost is more applicable to be reflected in
> > cpu_capacity_orig and not cpu_capacity?
>
> I'm not even sure if we want to reflect it at all TBH, but I'd be
> interested to see what Intel folks think :-)
>
> Thanks,
> Quentin
On Tuesday 23 Apr 2019 at 18:38:46 (-0400), Thara Gopinath wrote:
> I think there is one major difference between user-defined frequency
> constraints and frequency constraints due to thermal events in terms of
> the time period the system spends in the the constraint state.
> Typically, a user constraint lasts for seconds if not minutes and I
> think in this case cpu_capacity_orig should reflect this constraint and
> not cpu_capacity like this patch set.
That might not always be true I think. There's tons of userspace thermal
daemons out there, and I wouldn't be surprised if they were writing into
the cpufreq sysfs files, although I'm not sure.
Another thing is, if you want to change the capacity_orig value, you'll
need to rebuild the sched domains and all I believe. Otherwise there is
a risk to 'break' the sd_asym flags. So we need to make sure we're happy
to pay that price.
> Also, in case of the user
> constraint, there is possibly no need to accumulate and average the
> capacity constraints and instantaneous values can be directly applied to
> cpu_capacity_orig. On the other hand thermal pressure is more spiky and
> sometimes in the order of ms and us requiring the accumulating and
> averaging.
> >
> > Perhaps the Intel boost stuff could be factored in there ? That is,
> > at times when the boost freq is not reachable capacity_of() would appear
> > smaller ... Unless this wants to be reflected instantaneously ?
> Again, do you think intel boost is more applicable to be reflected in
> cpu_capacity_orig and not cpu_capacity?
I'm not even sure if we want to reflect it at all TBH, but I'd be
interested to see what Intel folks think :-)
Thanks,
Quentin
On Thursday 25 Apr 2019 at 14:04:10 (+0200), Vincent Guittot wrote:
> On Thu, 25 Apr 2019 at 12:45, Quentin Perret <[email protected]> wrote:
> >
> > On Tuesday 23 Apr 2019 at 18:38:46 (-0400), Thara Gopinath wrote:
> > > I think there is one major difference between user-defined frequency
> > > constraints and frequency constraints due to thermal events in terms of
> > > the time period the system spends in the the constraint state.
> > > Typically, a user constraint lasts for seconds if not minutes and I
> > > think in this case cpu_capacity_orig should reflect this constraint and
> > > not cpu_capacity like this patch set.
> >
> > That might not always be true I think. There's tons of userspace thermal
> > deamons out there, and I wouldn't be suprised if they were writing into
> > the cpufreq sysfs files, although I'm not sure.
>
> They would better use the sysfs set_target interface of cpu_cooling
> device in this case.
Right
> > Another thing is, if you want to change the capacity_orig value, you'll
> > need to rebuild the sched domains and all I believe. Otherwise there is
> > a risk to 'break' the sd_asym flags. So we need to make sure we're happy
> > to pay that price.
>
> That would be the goal, if userspace uses the sysfs interface of
> cpufreq to set a new max frequency, it should be considered as a long
> change in regards to the scheduling rate and in this case it should be
> interesting to update cpacity_orig and rebuild sched_domain.
I guess as long as we don't rebuild too frequently that could work.
Perhaps we could put some rate limiting in there to enforce that. Though
we don't do it for hotplug so ... :/
* Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:
> > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> > tracking approach.
>
> I seem to remember competing proposals, and have forgotten everything
> about them; the cover letter also didn't have references to them or
> mention them in any way.
>
> As to the averaging and period, I personally prefer a PELT signal with
> the windows lined up, if that really is too short a window, then a PELT
> like signal with a natural multiple of the PELT period would make sense,
> such that the windows still line up nicely.
>
> Mixing different averaging methods and non-aligned windows just makes me
> uncomfortable.
Yeah, so the problem with PELT is that while it nicely approximates
variable-period decay calculations with plain additions, shifts and table
lookups (i.e. accelerates pow()), AFAICS the most important decay
parameter is fixed: the speed of decay, the dampening factor, which is
fixed at 32:
Documentation/scheduler/sched-pelt.c
#define HALFLIFE 32
Right?
Thara's numbers suggest that there's high sensitivity to the speed of
decay. By using PELT we'd be using whatever averaging speed there is
within PELT.
Now we could make that parametric of course, but that would both
complicate the PELT lookup code (one more dimension) and would negatively
affect code generation in a number of places.
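To make the fixed decay speed concrete, here is a standalone illustration of
what HALFLIFE = 32 means for how quickly old contributions fade (plain
userspace C, not kernel code; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    #define HALFLIFE 32     /* ~1ms periods after which a contribution halves */

    int main(void)
    {
            /* Per-period decay factor y, chosen so that y^HALFLIFE == 0.5 */
            double y = pow(0.5, 1.0 / HALFLIFE);

            /* After n periods, y^n of an old sample's weight remains. */
            for (int n = 32; n <= 256; n *= 2)
                    printf("after ~%3d ms, %5.1f%% of the old signal remains\n",
                           n, 100.0 * pow(y, n));
            return 0;
    }

This prints roughly 50%, 25%, 6% and 0.4% for 32/64/128/256 ms, i.e. the decay
speed is baked in by the half-life.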
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:
> > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> > > tracking approach.
> >
> > I seem to remember competing proposals, and have forgotten everything
> > about them; the cover letter also didn't have references to them or
> > mention them in any way.
> >
> > As to the averaging and period, I personally prefer a PELT signal with
> > the windows lined up, if that really is too short a window, then a PELT
> > like signal with a natural multiple of the PELT period would make sense,
> > such that the windows still line up nicely.
> >
> > Mixing different averaging methods and non-aligned windows just makes me
> > uncomfortable.
>
> Yeah, so the problem with PELT is that while it nicely approximates
> variable-period decay calculations with plain additions, shifts and table
> lookups (i.e. accelerates pow()), AFAICS the most important decay
> parameter is fixed: the speed of decay, the dampening factor, which is
> fixed at 32:
>
> Documentation/scheduler/sched-pelt.c
>
> #define HALFLIFE 32
>
> Right?
>
> Thara's numbers suggest that there's high sensitivity to the speed of
> decay. By using PELT we'd be using whatever averaging speed there is
> within PELT.
>
> Now we could make that parametric of course, but that would both
> complicate the PELT lookup code (one more dimension) and would negatively
> affect code generation in a number of places.
I missed the other solution, which is what you suggested: by
increasing/reducing the PELT window size we can effectively shift decay
speed and use just a single lookup table.
I.e. instead of the fixed period size of 1024 in accumulate_sum(), use
decay_load() directly but use a different (longer) window size from 1024
usecs to calculate 'periods', and make it a multiple of 1024.
This might just work out right: with a half-life of 32 the fastest decay
speed should be around ~20 msecs (?) - and Thara's numbers so far suggest
that the sweet spot averaging is significantly longer, at a couple of
hundred millisecs.
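As a back-of-the-envelope check of that (illustration only, the multiples are
hypothetical), stretching the window stretches the effective half-life by the
same factor:

    #include <stdio.h>

    #define HALFLIFE        32      /* periods */
    #define PELT_PERIOD_US  1024

    int main(void)
    {
            /* window = mult * 1024us  ->  half-life ~= HALFLIFE * window */
            for (int mult = 1; mult <= 8; mult <<= 1)
                    printf("window %d x 1024us -> effective half-life ~%d ms\n",
                           mult, HALFLIFE * PELT_PERIOD_US * mult / 1000);
            return 0;
    }

A multiple of 8 already lands the half-life in the couple-of-hundred-millisecs
range mentioned above.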
Thanks,
Ingo
On Thu, 25 Apr 2019 at 19:44, Ingo Molnar <[email protected]> wrote:
>
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:
> > > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> > > > tracking approach.
> > >
> > > I seem to remember competing proposals, and have forgotten everything
> > > about them; the cover letter also didn't have references to them or
> > > mention them in any way.
> > >
> > > As to the averaging and period, I personally prefer a PELT signal with
> > > the windows lined up, if that really is too short a window, then a PELT
> > > like signal with a natural multiple of the PELT period would make sense,
> > > such that the windows still line up nicely.
> > >
> > > Mixing different averaging methods and non-aligned windows just makes me
> > > uncomfortable.
> >
> > Yeah, so the problem with PELT is that while it nicely approximates
> > variable-period decay calculations with plain additions, shifts and table
> > lookups (i.e. accelerates pow()), AFAICS the most important decay
> > parameter is fixed: the speed of decay, the dampening factor, which is
> > fixed at 32:
> >
> > Documentation/scheduler/sched-pelt.c
> >
> > #define HALFLIFE 32
> >
> > Right?
> >
> > Thara's numbers suggest that there's high sensitivity to the speed of
> > decay. By using PELT we'd be using whatever averaging speed there is
> > within PELT.
> >
> > Now we could make that parametric of course, but that would both
> > complicate the PELT lookup code (one more dimension) and would negatively
> > affect code generation in a number of places.
>
> I missed the other solution, which is what you suggested: by
> increasing/reducing the PELT window size we can effectively shift decay
> speed and use just a single lookup table.
>
> I.e. instead of the fixed period size of 1024 in accumulate_sum(), use
> decay_load() directly but use a different (longer) window size from 1024
> usecs to calculate 'periods', and make it a multiple of 1024.
Can't we also scale the now parameter of ___update_load_sum() ?
If we right shift it before calling ___update_load_sum, it should be
the same as using a half period of 64, 128, 256ms ...
The main drawback would be a loss of precision, but we are in the range
of 2, 4, 8us compared to the 1ms window
This is quite similar to how we scale the utilization with frequency and uarch
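A sketch of what that shifting could look like at the call site (the 'shift'
tunable and the calling context are assumptions, not existing code):

    /* shift = 1, 2, 3 -> effective half-life ~64ms, ~128ms, ~256ms, at the
     * cost of the ~2us, ~4us, ~8us of time resolution mentioned above. */
    u64 scaled_now = now >> shift;

    /* ... then pass scaled_now instead of now into ___update_load_sum() ... */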
>
> This might just work out right: with a half-life of 32 the fastest decay
> speed should be around ~20 msecs (?) - and Thara's numbers so far suggest
> that the sweet spot averaging is significantly longer, at a couple of
> hundred millisecs.
>
> Thanks,
>
> Ingo
* Vincent Guittot <[email protected]> wrote:
> On Thu, 25 Apr 2019 at 19:44, Ingo Molnar <[email protected]> wrote:
> >
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > On Wed, Apr 17, 2019 at 08:29:32PM +0200, Ingo Molnar wrote:
> > > > > Assuming PeterZ & Rafael & Quentin doesn't hate the whole thermal load
> > > > > tracking approach.
> > > >
> > > > I seem to remember competing proposals, and have forgotten everything
> > > > about them; the cover letter also didn't have references to them or
> > > > mention them in any way.
> > > >
> > > > As to the averaging and period, I personally prefer a PELT signal with
> > > > the windows lined up, if that really is too short a window, then a PELT
> > > > like signal with a natural multiple of the PELT period would make sense,
> > > > such that the windows still line up nicely.
> > > >
> > > > Mixing different averaging methods and non-aligned windows just makes me
> > > > uncomfortable.
> > >
> > > Yeah, so the problem with PELT is that while it nicely approximates
> > > variable-period decay calculations with plain additions, shifts and table
> > > lookups (i.e. accelerates pow()), AFAICS the most important decay
> > > parameter is fixed: the speed of decay, the dampening factor, which is
> > > fixed at 32:
> > >
> > > Documentation/scheduler/sched-pelt.c
> > >
> > > #define HALFLIFE 32
> > >
> > > Right?
> > >
> > > Thara's numbers suggest that there's high sensitivity to the speed of
> > > decay. By using PELT we'd be using whatever averaging speed there is
> > > within PELT.
> > >
> > > Now we could make that parametric of course, but that would both
> > > complicate the PELT lookup code (one more dimension) and would negatively
> > > affect code generation in a number of places.
> >
> > I missed the other solution, which is what you suggested: by
> > increasing/reducing the PELT window size we can effectively shift decay
> > speed and use just a single lookup table.
> >
> > I.e. instead of the fixed period size of 1024 in accumulate_sum(), use
> > decay_load() directly but use a different (longer) window size from 1024
> > usecs to calculate 'periods', and make it a multiple of 1024.
>
> Can't we also scale the now parameter of ___update_load_sum() ?
> If we right shift it before calling ___update_load_sum, it should be
> the same as using a half period of 62, 128, 256ms ...
> The main drawback would be a lost of precision but we are in the range
> of 2, 4, 8us compared to the 1ms window
>
> This is quite similar to how we scale the utilization with frequency and uarch
Yeah, that would work too.
Thanks,
Ingo
On 04/24/2019 11:56 AM, Ionela Voinescu wrote:
> Hi guys,
>
> On 23/04/2019 23:38, Thara Gopinath wrote:
>> On 04/18/2019 05:48 AM, Quentin Perret wrote:
>>> On Tuesday 16 Apr 2019 at 15:38:41 (-0400), Thara Gopinath wrote:
>>>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>>>> @@ -177,6 +178,9 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>>>>
>>>> if (policy->max > clipped_freq)
>>>> cpufreq_verify_within_limits(policy, 0, clipped_freq);
>>>> +
>>>> + sched_update_thermal_pressure(policy->cpus,
>>>> + policy->max, policy->cpuinfo.max_freq);
>>>
>>> Is this something we could do this CPUFreq ? Directly in
>>> cpufreq_verify_within_limits() perhaps ?
>>>
>>> That would re-define the 'thermal pressure' framework in a more abstract
>>> way and make the scheduler look at 'frequency capping' events,
>>> regardless of the reason for capping.
>>>
>>> That would reflect user-defined frequency constraint into cpu_capacity,
>>> in addition to the thermal stuff. I'm not sure if there is another use
>>> case for frequency capping ?
>> Hi Quentin,
>> Thanks for the review. Sorry for the delay in response as I was on
>> vacation for the past few days.
>> I think there is one major difference between user-defined frequency
>> constraints and frequency constraints due to thermal events in terms of
>> the time period the system spends in the the constraint state.
>> Typically, a user constraint lasts for seconds if not minutes and I
>> think in this case cpu_capacity_orig should reflect this constraint and
>> not cpu_capacity like this patch set. Also, in case of the user
>> constraint, there is possibly no need to accumulate and average the
>> capacity constraints and instantaneous values can be directly applied to
>> cpu_capacity_orig. On the other hand thermal pressure is more spiky and
>> sometimes in the order of ms and us requiring the accumulating and
>> averaging.
>
> I think we can't make any assumptions in regards to the intentions of
> the user when restricting the OPP range though the cpufreq interface,
> but it would still be nice to do something and reflecting it as thermal
> pressure would be a good start. It might not be due to thermal, but it
> is a capacity restriction that would have the same result. Also, if the
> user has the ability to tune the decay period he has the control over
> the behavior of the signal. Given that currently there isn't a smarter
> mechanism (modifying capacity orig, re-normalising the capacity range)
> for long-term capping, even treating it as short-term capping is a good
> start. But this is a bigger exercise and it needs thorough
> consideration, so it could be skipped, in my opinion, for now..
>
> Also, if we want to stick with the "definition", userspace would still
> be able to reflect thermal pressure though the thermal limits interface
> by setting the cooling device state, which will be reflected in this
> update as well. So userspace would have a mechanism to reflect thermal
> pressure.
Yes, target_state under cooling devices can be set and this will be
reflected as thermal pressure.
>
> One addition.. I like that the thermal pressure framework is not tied to
> cpufreq. There are firmware solutions that do not bother informing
> cpufreq of limits being changed, and therefore all of this could be
> skipped. But any firmware driver could call sched_update_thermal_pressure
> on notifications for limits changing from firmware, which is an
> important feature.
For me, I am open to discussion on the best place to call
sched_update_thermal_pressure() from. Seeing the discussion and the
different opinions, I am wondering whether there should be a SoC- or
platform-specific hook provided for better abstraction.
Regards
Thara
>
>>>
>>> Perhaps the Intel boost stuff could be factored in there ? That is,
>>> at times when the boost freq is not reachable capacity_of() would appear
>>> smaller ... Unless this wants to be reflected instantaneously ?
>> Again, do you think intel boost is more applicable to be reflected in
>> cpu_capacity_orig and not cpu_capacity?
>>>
>>> Thoughts ?
>>> Quentin
>>>
>>
>
> The changes here would happen even faster than thermal capping, same as
> other restrictions imposed by firmware, so it would not seem right to me
> to reflect it in capacity_orig. Reflecting it as thermal pressure is
> another matter, which I'd say it should be up to the client. The big
> disadvantage I'd see for this is coping with decisions made while being
> capped, when you're not capped any longer, and the other way around. I
> believe these changes would happen too often and they will not happen in
> a ramp-up/ramp-down behavior that we expect from thermal mitigation.
> That's why I believe averaging/regulation of the signal works well in
> this case, and it might not for power related fast restrictions.
>
> But given these three cases above, it might be that the ideal solution
> is for this framework to be made more generic and for each client to be
> able to obtain and configure a pressure signal to be reflected
> separately in the capacity of each CPU.
>
> My two pennies' worth,
> Ionela.
>
>
>
--
Regards
Thara
On 04/24/2019 11:57 AM, Ionela Voinescu wrote:
> Hi Thara,
>
> The idea and the results look promising. I'm trying to understand better
> the cause of the improvements so I've added below some questions that
> would help me out with this.
Hi Ionela,
Thanks for the review.
>
>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further, aobench (An occlusion renderer for benchmarking realworld
>> floating point performance), dhrystone and hackbench test have been
>> run with the thermal pressure algorithm. During testing, due to
>> constraints of step wise governor in dealing with big little systems,
>> cpu cooling was disabled on little core, the idea being that
>> big core will heat up and cpu cooling device will throttle the
>> frequency of the big cores there by limiting the maximum available
>> capacity and the scheduler will spread out tasks to little cores as well.
>> Finally, this patch series has been boot tested on db410C running v5.1-rc4
>> kernel.
>>
>
> Did you try using IPA as well? It is better equipped to deal with
> big-LITTLE systems and it's more probable IPA will be used for these
> systems, where your solution will have the biggest impact as well.
> The difference will be that you'll have both the big cluster and the
> LITTLE cluster capped in different proportions depending on their
> utilization and their efficiency.
No. I did not use IPA simply because it was not enabled in mainline. I
agree it is better equipped to deal with big-little systems. The idea
behind removing cpu cooling on the little cluster was to mimic this in
some (not the cleanest) manner. But I agree that IPA testing is possibly
the next step. Any help in this regard is appreciated.
>
>> During the course of development various methods of capturing
>> and reflecting thermal pressure were implemented.
>>
>> The first method to be evaluated was to convert the
>> capped max frequency into capacity and have the scheduler use the
>> instantaneous value when updating cpu_capacity.
>> This method is referenced as "Instantaneous Thermal Pressure" in the
>> test results below.
>>
>> The next two methods employs different methods of averaging the
>> thermal pressure before applying it when updating cpu_capacity.
>> The first of these methods re-used the PELT algorithm already present
>> in the kernel that does the averaging of rt and dl load and utilization.
>> This method is referenced as "Thermal Pressure Averaging using PELT fmwk"
>> in the test results below.
>>
>> The final method employs an averaging algorithm that collects and
>> decays thermal pressure based on the decay period. In this method,
>> the decay period is configurable. This method is referenced as
>> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the
>> test results below.
>>
>> The test results below shows 3-5% improvement in performance when
>> using the third solution compared to the default system today where
>> scheduler is unware of cpu capacity limitations due to thermal events.
>>
>
> Did you happen to record the amount of capping imposed on the big cores
> when these results were obtained? Did you find scenarios where the
> capacity of the bigs resulted in being lower than the capacity of the
> LITTLEs (capacity inversion)?
> This is one case where we'll see a big impact in considering thermal
> pressure.
I think I saw capacity inversion in some scenarios. I did not
particularly capture them.
>
> Also, given that these are more or less sustained workloads, I'm
> wondering if there is any effect on workloads running on an uncapped
> system following capping. I would image such a test being composed of a
> single threaded period (no capping) followed by a multi-threaded period
> (with capping), continued in a loop. It might be interesting to have
> something like this as well, as part of your test coverage
I do not understand this. There is either capping for a workload or no
capping. There is no sysctl entry to turn on or off capping.
Regards
Thara
>
>
> Thanks,
> Ionela.
>
--
Regards
Thara
On 04/25/2019 06:45 AM, Quentin Perret wrote:
> On Tuesday 23 Apr 2019 at 18:38:46 (-0400), Thara Gopinath wrote:
>> I think there is one major difference between user-defined frequency
>> constraints and frequency constraints due to thermal events in terms of
>> the time period the system spends in the the constraint state.
>> Typically, a user constraint lasts for seconds if not minutes and I
>> think in this case cpu_capacity_orig should reflect this constraint and
>> not cpu_capacity like this patch set.
>
> That might not always be true I think. There's tons of userspace thermal
> deamons out there, and I wouldn't be suprised if they were writing into
> the cpufreq sysfs files, although I'm not sure.
>
> Another thing is, if you want to change the capacity_orig value, you'll
> need to rebuild the sched domains and all I believe. Otherwise there is
> a risk to 'break' the sd_asym flags. So we need to make sure we're happy
> to pay that price.
Hi Quentin,
I saw Vincent's reply on this and my answer is similar. I completely
agree that this will involve a rebuild of sched domains. My thought on
cpufreq capping the max frequency from user space is that the capping is
on a long-term basis and hence we could live with re-building sched
domains. If user space wants to control the max frequency of a cpu for
thermal reasons then the cooling device sysfs interface should be used. In
a practical scenario, I am interested in knowing why thermal daemons
control the cpufreq sysfs files instead of the cooling device files.
Regards
Thara
>
>> Also, in case of the user
>> constraint, there is possibly no need to accumulate and average the
>> capacity constraints and instantaneous values can be directly applied to
>> cpu_capacity_orig. On the other hand thermal pressure is more spiky and
>> sometimes in the order of ms and us requiring the accumulating and
>> averaging.
>>>
>>> Perhaps the Intel boost stuff could be factored in there ? That is,
>>> at times when the boost freq is not reachable capacity_of() would appear
>>> smaller ... Unless this wants to be reflected instantaneously ?
>> Again, do you think intel boost is more applicable to be reflected in
>> cpu_capacity_orig and not cpu_capacity?
>
> I'm not even sure if we want to reflect it at all TBH, but I'd be
> interested to see what Intel folks think :-)
>
> Thanks,
> Quentin
>
--
Regards
Thara
On 04/25/2019 08:45 AM, Vincent Guittot wrote:
> On Thu, 25 Apr 2019 at 12:57, Quentin Perret <[email protected]> wrote:
>>
>> On Tuesday 16 Apr 2019 at 15:38:39 (-0400), Thara Gopinath wrote:
>>> +/* Per cpu structure to keep track of Thermal Pressure */
>>> +struct thermal_pressure {
>>> + unsigned long scale; /* scale reflecting average cpu max capacity*/
>>> + unsigned long acc_scale; /* Accumulated scale for this time window */
>>> + unsigned long old_scale; /* Scale value for the previous window */
>>> + unsigned long raw_scale; /* Raw max capacity */
>>> + unsigned long age_stamp; /* Last time old_scale was updated */
>>> + unsigned long last_update; /* Last time acc_scale was updated */
>>> + spinlock_t lock; /* Lock for protecting from simultaneous access*/
>>> + /* Timer for periodic update of thermal pressure */
>>> + struct timer_list timer;
>>
>> Do you actually need the periodic update ? You only really need to
>> update the 'scale' value when updating the LB stats no ? Nobody
>> accesses that value in between two LBs.
>
> Do you mean calling a variant of sched_update_thermal_pressure() in
> update_cpu_capacity() instead of periodic update ?
> Yes , that should be enough
Hi,
I do have some concerns about doing this.
1. Updating thermal pressure does involve some calculations for
accumulating, averaging, decaying etc., which in turn means some
finite and measurable time spent in the function. I am not sure if this
delay will be acceptable for all systems during load balancing (I have
not measured the time involved). We need to decide if this is something
we can live with.
2. More importantly, since the update can happen from at least two paths
(thermal fw and the periodic timer in the case of this patch series), the
update is done under a spin lock to ensure mutual exclusion. Again, calling
from update_cpu_capacity() will involve holding the lock in the load
balance path, which is possibly not for the best.
For me, updating outside of load balance minimizes the disruption to the
scheduler on the whole.
But if there is overwhelming support for updating the statistics
from the LB, I can move the code.
Regards
Thara
>
>>
>> Thanks,
>> Quentin
--
Regards
Thara
Hi Thara,
>>> Regarding testing, basic build, boot and sanity testing have been
>>> performed on hikey960 mainline kernel with debian file system.
>>> Further, aobench (An occlusion renderer for benchmarking realworld
>>> floating point performance), dhrystone and hackbench test have been
>>> run with the thermal pressure algorithm. During testing, due to
>>> constraints of step wise governor in dealing with big little systems,
>>> cpu cooling was disabled on little core, the idea being that
>>> big core will heat up and cpu cooling device will throttle the
>>> frequency of the big cores there by limiting the maximum available
>>> capacity and the scheduler will spread out tasks to little cores as well.
>>> Finally, this patch series has been boot tested on db410C running v5.1-rc4
>>> kernel.
>>>
>>
>> Did you try using IPA as well? It is better equipped to deal with
>> big-LITTLE systems and it's more probable IPA will be used for these
>> systems, where your solution will have the biggest impact as well.
>> The difference will be that you'll have both the big cluster and the
>> LITTLE cluster capped in different proportions depending on their
>> utilization and their efficiency.
>
> No. I did not use IPA simply because it was not enabled in mainline. I
> agree it is better equipped to deal with big-little systems. The idea
> to remove cpu cooling on little cluster was to in some (not the
> cleanest) manner to mimic this. But I agree that IPA testing is possibly
> the next step.Any help in this regard is appreciated.
>
I see CONFIG_THERMAL_GOV_POWER_ALLOCATOR=y in the defconfig for arm64:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/configs/defconfig?h=v5.1-rc6#n413
You can enable the use of it or make it default in the defconfig.
Also, Hikey960 has the needed setup in DT:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/hisilicon/hi3660.dtsi?h=v5.1-rc6#n1093
This should work fine.
>>
>>> During the course of development various methods of capturing
>>> and reflecting thermal pressure were implemented.
>>>
>>> The first method to be evaluated was to convert the
>>> capped max frequency into capacity and have the scheduler use the
>>> instantaneous value when updating cpu_capacity.
>>> This method is referenced as "Instantaneous Thermal Pressure" in the
>>> test results below.
>>>
>>> The next two methods employs different methods of averaging the
>>> thermal pressure before applying it when updating cpu_capacity.
>>> The first of these methods re-used the PELT algorithm already present
>>> in the kernel that does the averaging of rt and dl load and utilization.
>>> This method is referenced as "Thermal Pressure Averaging using PELT fmwk"
>>> in the test results below.
>>>
>>> The final method employs an averaging algorithm that collects and
>>> decays thermal pressure based on the decay period. In this method,
>>> the decay period is configurable. This method is referenced as
>>> "Thermal Pressure Averaging non-PELT Algo. Decay : XXX ms" in the
>>> test results below.
>>>
>>> The test results below shows 3-5% improvement in performance when
>>> using the third solution compared to the default system today where
>>> scheduler is unware of cpu capacity limitations due to thermal events.
>>>
>>
>> Did you happen to record the amount of capping imposed on the big cores
>> when these results were obtained? Did you find scenarios where the
>> capacity of the bigs resulted in being lower than the capacity of the
>> LITTLEs (capacity inversion)?
>> This is one case where we'll see a big impact in considering thermal
>> pressure.
>
> I think I saw capacity inversion in some scenarios. I did not
> particularly capture them.
>
It would be good to observe this and possibly correlate the amount of
capping with resulting behavior and performance numbers. This would
give more confidence in the testing coverage.
You can create a specific testcase for capacity inversion by only
capping the big CPUs, as you've done for these tests, and by running
sysbench/dhrystone for example with at least nr_big_cpus tasks.
This assumes that the bigs, fully utilized, would generate enough heat and
would be capped enough to achieve a capacity lower than the littles, which
I don't doubt can be achieved on Hikey960.
>>
>> Also, given that these are more or less sustained workloads, I'm
>> wondering if there is any effect on workloads running on an uncapped
>> system following capping. I would image such a test being composed of a
>> single threaded period (no capping) followed by a multi-threaded period
>> (with capping), continued in a loop. It might be interesting to have
>> something like this as well, as part of your test coverage
>
> I do not understand this. There is either capping for a workload or no
> capping. There is no sysctl entry to turn on or off capping.
>
I was thinking of this as a second-order effect. If only one big CPU is
fully utilized, with the others quiet, you might not see any capping.
But when you have a multi-threaded workload, with all or at least the
bigs at a high OPP, the platform will definitely overheat and there will
be capping.
Thanks,
Ionela.
> Regards
> Thara
>>
>>
>> Thanks,
>> Ionela.
>>
>
>
Hi Thara,
>
> Hackbench: (1 group , 30000 loops, 10 runs)
> Result Standard Deviation
> (Time Secs) (% of mean)
>
> No Thermal Pressure 10.21 7.99%
>
> Instantaneous thermal pressure 10.16 5.36%
>
> Thermal Pressure Averaging
> using PELT fmwk 9.88 3.94%
>
> Thermal Pressure Averaging
> non-PELT Algo. Decay : 500 ms 9.94 4.59%
>
> Thermal Pressure Averaging
> non-PELT Algo. Decay : 250 ms 7.52 5.42%
>
> Thermal Pressure Averaging
> non-PELT Algo. Decay : 125 ms 9.87 3.94%
>
>
I'm trying your patches on my Hikey960 and I'm getting different results
than the ones here.
I'm running with the step-wise governor, enabled only on the big cores.
The decay period is set to 250ms.
The result for hackbench is:
# ./hackbench -g 1 -l 30000
Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 20.756
During the run I see the little cores running at maximum frequency
(1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped
at 1.42GHz. There should not be any capacity inversion.
The temperature is kept around 75 degrees (73 to 77 degrees).
I don't have any kind of active cooling (no fans on the board), only a
heatsink on the SoC.
But as you see my results(~20s) are very far from the 7-10s in your
results.
Do you see anything wrong with this process? Can you give me more
details on your setup that I can use to test on my board?
Thank you,
Ionela.
Hi Thara,
On 29/04/2019 14:29, Ionela Voinescu wrote:
> Hi Thara,
>
>>
>> Hackbench: (1 group , 30000 loops, 10 runs)
>> Result Standard Deviation
>> (Time Secs) (% of mean)
>>
>> No Thermal Pressure 10.21 7.99%
>>
>> Instantaneous thermal pressure 10.16 5.36%
>>
>> Thermal Pressure Averaging
>> using PELT fmwk 9.88 3.94%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 500 ms 9.94 4.59%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 250 ms 7.52 5.42%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 125 ms 9.87 3.94%
>>
>>
>
> I'm trying your patches on my Hikey960 and I'm getting different results
> than the ones here.
>
> I'm running with the step-wise governor, enabled only on the big cores.
> The decay period is set to 250ms.
>
> The result for hackbench is:
>
> # ./hackbench -g 1 -l 30000
> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 20.756
>
> During the run I see the little cores running at maximum frequency
> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped
> at 1.42GHz. There should not be any capacity inversion.
> The temperature is kept around 75 degrees (73 to 77 degrees).
>
> I don't have any kind of active cooling (no fans on the board), only a
> heatsink on the SoC.
>
> But as you see my results(~20s) are very far from the 7-10s in your
> results.
>
> Do you see anything wrong with this process? Can you give me more
> details on your setup that I can use to test on my board?
>
I've found that my poor results above were due to debug options
mistakenly left enabled in the defconfig. Sorry about that!
After cleaning it up I'm getting results around 5.6s for this test case.
I've run 50 iterations for each test, with 90s cool down period between
them.
Hackbench: (1 group , 30000 loops, 50 runs)
Result Standard Deviation
(Time Secs) (% of mean)
No Thermal Pressure(step_wise) 5.644 7.760%
No Thermal Pressure(IPA) 5.677 9.062%
Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms 5.627 5.593%
(step-wise, bigs capped only)
Thermal Pressure Averaging
non-PELT Algo. Decay : 250 ms 5.690 3.738%
(IPA)
All of the results above are within a 1.1% difference of each other,
with a significantly higher standard deviation.
I wanted to run this initially to validate my setup and understand
if there is any conclusion we can draw from a test like this, that
floods the CPUs with tasks. Looking over the traces, the tasks are
running almost back to back, trying to use all available resources,
on all the CPUs.
Therefore, I doubt that there could be better decisions that could be
made, knowing about thermal pressure, for this usecase.
I'll try next some capacity inversion usecase and post the results when
they are ready.
Hope it helps,
Ionela.
> Thank you,
> Ionela.
>
On 04/29/2019 09:29 AM, Ionela Voinescu wrote:
> Hi Thara,
>
>>
>> Hackbench: (1 group , 30000 loops, 10 runs)
>> Result Standard Deviation
>> (Time Secs) (% of mean)
>>
>> No Thermal Pressure 10.21 7.99%
>>
>> Instantaneous thermal pressure 10.16 5.36%
>>
>> Thermal Pressure Averaging
>> using PELT fmwk 9.88 3.94%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 500 ms 9.94 4.59%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 250 ms 7.52 5.42%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 125 ms 9.87 3.94%
>>
>>
>
> I'm trying your patches on my Hikey960 and I'm getting different results
> than the ones here.
>
> I'm running with the step-wise governor, enabled only on the big cores.
> The decay period is set to 250ms.
>
> The result for hackbench is:
>
> # ./hackbench -g 1 -l 30000
> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 20.756
>
> During the run I see the little cores running at maximum frequency
> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped
> at 1.42GHz. There should not be any capacity inversion.
> The temperature is kept around 75 degrees (73 to 77 degrees).
>
> I don't have any kind of active cooling (no fans on the board), only a
> heatsink on the SoC.
>
> But as you see my results(~20s) are very far from the 7-10s in your
> results.
>
> Do you see anything wrong with this process? Can you give me more
> details on your setup that I can use to test on my board?
Hi Ionela,
I used the latest mainline kernel with sched/ tip merged in for my
testing. My hikey960 did not have any fan or heat sink during testing. I
disabled cpu cooling for little cores in the dts files.
Also I have to warn you that I have managed to blow up my hikey960. So I
no longer have a functional board for the past two weeks or so.
I don't have my test scripts to send you, but I have some of the results
files downloaded which I can send you in a separate email.
I did run the test 10 rounds.
Also I think 20s is too much variation for the test results. Like I
mentioned in my previous emails, I think the 7.52 is an anomaly, but the
results should be in the range of 8-9 s.
Regards
Thara
>
> Thank you,
> Ionela.
>
--
Regards
Thara
On 04/30/2019 11:57 AM, Thara Gopinath wrote:
> On 04/29/2019 09:29 AM, Ionela Voinescu wrote:
>> Hi Thara,
>>
>>>
>>> Hackbench: (1 group , 30000 loops, 10 runs)
>>> Result Standard Deviation
>>> (Time Secs) (% of mean)
>>>
>>> No Thermal Pressure 10.21 7.99%
>>>
>>> Instantaneous thermal pressure 10.16 5.36%
>>>
>>> Thermal Pressure Averaging
>>> using PELT fmwk 9.88 3.94%
>>>
>>> Thermal Pressure Averaging
>>> non-PELT Algo. Decay : 500 ms 9.94 4.59%
>>>
>>> Thermal Pressure Averaging
>>> non-PELT Algo. Decay : 250 ms 7.52 5.42%
>>>
>>> Thermal Pressure Averaging
>>> non-PELT Algo. Decay : 125 ms 9.87 3.94%
>>>
>>>
>>
>> I'm trying your patches on my Hikey960 and I'm getting different results
>> than the ones here.
>>
>> I'm running with the step-wise governor, enabled only on the big cores.
>> The decay period is set to 250ms.
>>
>> The result for hackbench is:
>>
>> # ./hackbench -g 1 -l 30000
>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
>> Each sender will pass 30000 messages of 100 bytes
>> Time: 20.756
>>
>> During the run I see the little cores running at maximum frequency
>> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped
>> at 1.42GHz. There should not be any capacity inversion.
>> The temperature is kept around 75 degrees (73 to 77 degrees).
>>
>> I don't have any kind of active cooling (no fans on the board), only a
>> heatsink on the SoC.
>>
>> But as you see my results(~20s) are very far from the 7-10s in your
>> results.
>>
>> Do you see anything wrong with this process? Can you give me more
>> details on your setup that I can use to test on my board?
>
> Hi Ionela,
>
> I used the latest mainline kernel with sched/ tip merged in for my
> testing. My hikey960 did not have any fan or heat sink during testing. I
> disabled cpu cooling for little cores in the dts files.
> Also I have to warn you that I have managed to blow up my hikey960. So I
> no longer have a functional board for past two weeks or so.
>
> I don't have my test scripts to send you, but I have some of the results
> files downloaded which I can send you in a separate email.
> I did run the test 10 rounds.
Hi Ionela,
I failed to mention that I drop the first run when computing the average.
>
> Also I think 20s is too much of variation for the test results. Like I
> mentioned in my previous emails I think the 7.52 is an anomaly but the
> results should be around the range of 8-9 s.
Also, since we are more interested in comparison rather than absolute
numbers, did you run tests on a system with no thermal pressure (to see
if there are any improvements)?
Regards
Thara
>
> Regards
> Thara
>
>>
>> Thank you,
>> Ionela.
>>
>
>
--
Regards
Thara
On 04/30/2019 10:39 AM, Ionela Voinescu wrote:
> Hi Thara,
>
> On 29/04/2019 14:29, Ionela Voinescu wrote:
>> Hi Thara,
>>
>>>
>>> Hackbench: (1 group , 30000 loops, 10 runs)
>>> Result Standard Deviation
>>> (Time Secs) (% of mean)
>>>
>>> No Thermal Pressure 10.21 7.99%
>>>
>>> Instantaneous thermal pressure 10.16 5.36%
>>>
>>> Thermal Pressure Averaging
>>> using PELT fmwk 9.88 3.94%
>>>
>>> Thermal Pressure Averaging
>>> non-PELT Algo. Decay : 500 ms 9.94 4.59%
>>>
>>> Thermal Pressure Averaging
>>> non-PELT Algo. Decay : 250 ms 7.52 5.42%
>>>
>>> Thermal Pressure Averaging
>>> non-PELT Algo. Decay : 125 ms 9.87 3.94%
>>>
>>>
>>
>> I'm trying your patches on my Hikey960 and I'm getting different results
>> than the ones here.
>>
>> I'm running with the step-wise governor, enabled only on the big cores.
>> The decay period is set to 250ms.
>>
>> The result for hackbench is:
>>
>> # ./hackbench -g 1 -l 30000
>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
>> Each sender will pass 30000 messages of 100 bytes
>> Time: 20.756
>>
>> During the run I see the little cores running at maximum frequency
>> (1.84GHz) while the big cores run mostly at 1.8GHz, only sometimes capped
>> at 1.42GHz. There should not be any capacity inversion.
>> The temperature is kept around 75 degrees (73 to 77 degrees).
>>
>> I don't have any kind of active cooling (no fans on the board), only a
>> heatsink on the SoC.
>>
>> But as you see my results(~20s) are very far from the 7-10s in your
>> results.
>>
>> Do you see anything wrong with this process? Can you give me more
>> details on your setup that I can use to test on my board?
>>
>
> I've found that my poor results above were due to debug options
> mistakenly left enabled in the defconfig. Sorry about that!
>
> After cleaning it up I'm getting results around 5.6s for this test case.
> I've run 50 iterations for each test, with 90s cool down period between
> them.
>
>
> Hackbench: (1 group , 30000 loops, 50 runs)
> Result Standard Deviation
> (Time Secs) (% of mean)
>
> No Thermal Pressure(step_wise) 5.644 7.760%
> No Thermal Pressure(IPA) 5.677 9.062%
>
> Thermal Pressure Averaging
> non-PELT Algo. Decay : 250 ms 5.627 5.593%
> (step-wise, bigs capped only)
>
> Thermal Pressure Averaging
> non-PELT Algo. Decay : 250 ms 5.690 3.738%
> (IPA)
>
> All of the results above are within 1.1% difference with a
> significantly higher standard deviation.
Hi Ionela,
I replied to your original emails without seeing this one. So,
interesting results. I see IPA is (slightly) worse off than step-wise in
both the thermal pressure and non-thermal pressure scenarios. Did you try
a 500 ms decay period by any chance?
>
> I wanted to run this initially to validate my setup and understand
> if there is any conclusion we can draw from a test like this, that
> floods the CPUs with tasks. Looking over the traces, the tasks are
> running almost back to back, trying to use all available resources,
> on all the CPUs.
> Therefore, I doubt that there could be better decisions that could be
> made, knowing about thermal pressure, for this usecase.
>
> I'll try next some capacity inversion usecase and post the results when
> they are ready.
Sure. Let me know if I can help.
Regards
Thara
>
> Hope it helps,
> Ionela.
>
>
>> Thank you,
>> Ionela.
>>
--
Regards
Thara
Hi Thara,
>> After cleaning it up I'm getting results around 5.6s for this test case.
>> I've run 50 iterations for each test, with 90s cool down period between
>> them.
>>
>>
>> Hackbench: (1 group , 30000 loops, 50 runs)
>> Result Standard Deviation
>> (Time Secs) (% of mean)
>>
>> No Thermal Pressure(step_wise) 5.644 7.760%
>> No Thermal Pressure(IPA) 5.677 9.062%
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 250 ms 5.627 5.593%
>> (step-wise, bigs capped only)
>>
>> Thermal Pressure Averaging
>> non-PELT Algo. Decay : 250 ms 5.690 3.738%
>> (IPA)
>>
>> All of the results above are within 1.1% difference with a
>> significantly higher standard deviation.
>
> Hi Ionela,
>
> I have replied to your original emails without seeing this one. So,
> interesting results. I see IPA is worse off (Slightly) than step wise in
> both thermal pressure and non-thermal pressure scenarios. Did you try
> 500 ms decay period by any chance?
>
I don't think we can draw a conclusion on that given how close the
results are and given the high standard deviation. Probably if I run
them again the tables will be turned :).
I have not run experiments with different decay periods yet, as I first
want to have a list of experiments that are relevant for thermal pressure,
which can help later with refining the solution: that could mean either
deciding on a decay period or possibly going with the instantaneous
thermal pressure. Please find more details below.
>>
>> I wanted to run this initially to validate my setup and understand
>> if there is any conclusion we can draw from a test like this, that
>> floods the CPUs with tasks. Looking over the traces, the tasks are
>> running almost back to back, trying to use all available resources,
>> on all the CPUs.
>> Therefore, I doubt that there could be better decisions that could be
>> made, knowing about thermal pressure, for this usecase.
>>
>> I'll try next some capacity inversion usecase and post the results when
>> they are ready.
>
I've started looking into this, starting from the most obvious case of
capacity inversion: using the user-space thermal governor and capping
the bigs to their lowest OPP. The LITTLEs are left uncapped.
This was not enough on the Hikey960 as the bigs at their lowest OPP were
within the capacity margin of the LITTLEs at their highest OPP. That meant
that the LITTLEs would not pull tasks from the bigs, even if they had
higher capacity, as the capacity was within the 25% margin. So another
change I've made was to set the capacity margin in fair.c to 10%.
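For reference, the margin being changed here is the capacity_margin tunable in
kernel/sched/fair.c; the value below is only illustrative of the ~10% setting
described (the mainline default is larger):

    /* kernel/sched/fair.c */
    unsigned int capacity_margin = 1126;    /* ~10%: 1126/1024 ~= 1.10 */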
I've run both sysbench and dhrystone. I'll put here only the results for
sysbench, interleaved, with and without considering thermal pressure (TP
and !TP).
As before, the TP solution uses averaging with a 250ms decay period.
Sysbench: (500000 req, 4 runs)
Result Standard Deviation
(Time Secs) (% of mean)
!TP/4 threads 146.46 0.063%
TP/4 threads 136.36 0.002%
!TP/5 threads 115.38 0.028%
TP/5 threads 110.62 0.006%
!TP/6 threads 95.38 0.051%
TP/6 threads 93.07 0.054%
!TP/7 threads 81.19 0.012%
TP/7 threads 80.32 0.028%
!TP/8 threads 72.58 2.295%
TP/8 threads 71.37 0.044%
As expected, the results are significantly improved when the scheduler
is made aware of the reduced capacity on the bigs, which results in tasks
being placed on or migrated to the littles, which are able to provide
better performance. Traces nicely confirm this.
Note that these results only show that reflecting thermal
pressure in the capacity of the CPUs is useful and that the scheduler is
equipped to make proper use of this information.
Possibly a thing to consider is whether or not to reduce the capacity
margin, but that's for another discussion.
This does not reflect the benefits of averaging, as, with the bigs
always being capped to the lowest OPP, the thermal pressure value will
be constant over the duration of the workload. The same results would
have been obtained with instantaneous thermal pressure.
Secondly, I've tried to use the step-wise governor, modified to only
cap the big CPUs, with the intention to obtain smaller periods of
capacity inversion for which a thermal pressure solution would show its
benefits.
Unfortunately dhrystone was misbehaving for some reason and was
giving me a high variation between results for the same test case.
Also, sysbench, ran with the same arguments as above, was not creating
enough load and thermal capping as to show the benefits of considering
thermal pressure.
So my recommendation is to continue exploring more test cases like these.
I would continue with sysbench as it looks more stable, but modify the
temperature threshold to determine periods of drastic capping of the
bigs. Once a dynamic test case and setup like this (no fixed
frequencies) is identified, it can be used to understand whether averaging
is needed, to refine the decay period, and to establish a good default.
What do you think? Does this make sense as a direction for obtaining
test cases? In my opinion the previous test cases were not triggering
the right behaviors that can help prove the need for thermal pressure,
or help refine it.
I will try to continue in this direction, but I won't be able to get to
it for a few days.
You'll find more results at:
https://docs.google.com/spreadsheets/d/1ibxDSSSLTodLzihNAw6jM36eVZABuPMMnjvV-Xh4NEo/edit?usp=sharing
> Sure. let me know if I can help.
Any test results or recommendations for test cases would be helpful.
The need for thermal pressure is obvious, but the way that thermal
pressure is reflected in the capacity of the CPUs could be supported by
more thorough testing.
Regards,
Ionela.
>
> Regards
> Thara
>
>>
>> Hope it helps,
>> Ionela.
>>
>>
>>> Thank you,
>>> Ionela.
>>>
>
>
Hi Thara,
Sorry for the delayed response.
On Friday 26 Apr 2019 at 10:17:56 (-0400), Thara Gopinath wrote:
> On 04/25/2019 08:45 AM, Vincent Guittot wrote:
> > Do you mean calling a variant of sched_update_thermal_pressure() in
> > update_cpu_capacity() instead of periodic update ?
> > Yes , that should be enough
>
> Hi,
>
> I do have some concerns in doing this.
> 1. Updating thermal pressure does involve some calculations for
> accumulating, averaging, decaying etc which in turn could have some
> finite and measurable time spent in the function. I am not sure if this
> delay will be acceptable for all systems during load balancing (I have
> not measured the time involved). We need to decide if this is something
> we can live with.
>
> 2. More importantly, since update can happen from at least two paths (
> thermal fw and periodic timer in case of this patch series)to ensure
> mutual exclusion, the update is done under a spin lock. Again calling
> from update_cpu_capacity will involve holding the lock in the load
> balance path which is possible not for the best.
> For me, updating out of load balance minimizes the disruption to
> scheduler on the whole.
>
> But if there is an over whelming support for updating the statistics
> from the LB , I can move the code.
If I try to clarify my point a little bit, my observation is really that
it's a shame to update the thermal stats often, but to not reflect that
in capacity_of().
So in fact there are two alternatives: 1) do the update only during LB
(which is what I suggested first) to avoid 'useless' work; or 2) reflect
the thermal pressure in the CPU capacity every time the thermal stats
are updated.
And thinking more about it, perhaps 2) is actually a better option? With
this we could try smaller decay periods than the LB interval (which is
most likely useless otherwise) and make sure the capacity considered
during wake-up is up-to-date. This should be a good thing for latency
sensitive tasks I think. (If you consider a task in the Android display
pipeline for example, it needs to run within 16ms or the frame is
missed. So, on wake-up, we'd like to know where the task can run fast
_now_, not according to the capacities the CPUs had 200ms ago or so).
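A very rough sketch of option 2), with hypothetical helper names, just to show
where the write to the rq capacity would sit relative to the thermal update:

    /* Called whenever the thermal pressure signal for @cpu is refreshed,
     * e.g. on a notification from the thermal framework, so that wake-up
     * placement sees an up-to-date capacity instead of waiting for the
     * next load balance. Simplified: the real update would also account
     * for rt/dl/irq pressure. */
    static void reflect_thermal_pressure(int cpu, unsigned long scale)
    {
            struct rq *rq = cpu_rq(cpu);

            rq->cpu_capacity = (rq->cpu_capacity_orig * scale)
                                    >> SCHED_CAPACITY_SHIFT;
    }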
Thoughts ?
Quentin