2018-10-09 16:25:59

by Thara Gopinath

Subject: [RFC PATCH 0/7] Introduce thermal pressure

Thermal governors can respond to an overheat event for a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the Linux kernel, in the event of
maximum frequency capping of a cpu, the maximum available compute
capacity of the cpu is not adjusted at all. In other words, the
scheduler is unaware of maximum cpu capacity restrictions placed
due to thermal activity. This patch series attempts to address this
issue. The benefits identified are better task placement among
available cpus in the event of overheating, which in turn leads to
better performance numbers.

The delta between the maximum possible capacity of a cpu and the
maximum available capacity of a cpu due to a thermal event can
be considered thermal pressure. Instantaneous thermal pressure
is hard to record and can sometimes be erroneous, as there can be a
mismatch between the actual capping of capacity and the scheduler
recording it. The solution is therefore to have a weighted average
per cpu value for thermal pressure over time. The weight reflects
the amount of time the cpu has spent at a capped maximum frequency.
To accumulate, average and appropriately decay thermal pressure,
this patch series uses PELT signals and reuses the available
framework that does a similar bookkeeping of rt/dl task utilization.
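
As a rough illustration (the numbers here are made up, not from the
series): if a cpu with a maximum capacity of 1024 is capped to 75% of
its maximum frequency, the instantaneous thermal pressure is
1024 - 768 = 256. Rather than subtracting this raw value directly,
the series feeds the delta into the PELT machinery at every tick, so
the pressure seen by the scheduler ramps up while the cap is in place
and decays back down once it is lifted.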

Regarding testing, basic build, boot and sanity testing have been
performed on hikey960 with a mainline kernel and a debian file
system. Further, aobench (an occlusion renderer for benchmarking
real-world floating point performance) showed the following results
on hikey960 with debian.

                                          Result        Standard   Standard
                                          (Time secs)   Error      Deviation
Hikey 960 - no thermal pressure applied   138.67        6.52       11.52%
Hikey 960 - thermal pressure applied      122.37        5.78       11.57%

Thara Gopinath (7):
sched/pelt: Add option to make load and util calculations frequency
invariant
sched/pelt.c: Add support to track thermal pressure
sched: Add infrastructure to store and update instantaneous thermal
pressure
sched: Initialize per cpu thermal pressure structure
sched/fair: Enable CFS periodic tick to update thermal pressure
sched/fair: Update cpu_capacity to reflect thermal pressure
thermal/cpu-cooling: Update thermal pressure in case of a maximum
frequency capping

drivers/base/arch_topology.c | 1 +
drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
include/linux/sched.h | 14 +++++++++
kernel/sched/Makefile | 2 +-
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 4 +++
kernel/sched/pelt.c | 40 ++++++++++++++++++--------
kernel/sched/pelt.h | 7 +++++
kernel/sched/sched.h | 1 +
kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/thermal.h | 13 +++++++++
11 files changed, 157 insertions(+), 13 deletions(-)
create mode 100644 kernel/sched/thermal.c
create mode 100644 kernel/sched/thermal.h

--
2.1.4



2018-10-09 16:26:02

by Thara Gopinath

Subject: [RFC PATCH 1/7] sched/pelt.c: Add option to make load and util calculations frequency invariant

Add an additional parameter to accumulate_sum() to allow optional
frequency adjustment of load and utilization. When considering
rt/dl load/util, it is correct to scale it to the current cpu
frequency. On the other hand, thermal pressure (capped maximum
frequency) is frequency invariant.
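
To illustrate the distinction (hypothetical numbers): a tick's
contribution to rt/dl utilization is scaled by the current frequency
via cap_scale(), so a cpu running at half speed accrues roughly half
the contribution. The thermal pressure delta, on the other hand, is
already expressed in capacity units (maximum capacity minus capped
capacity), so scaling it again by the current frequency would
double-scale it. The new freq_adjusted parameter lets the thermal
signal skip that step.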

Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/pelt.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 35475c0..05b8798 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -107,7 +107,8 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
*/
static __always_inline u32
accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
+ unsigned long load, unsigned long runnable, int running,
+ int freq_adjusted)
{
unsigned long scale_freq, scale_cpu;
u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
@@ -137,7 +138,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
}
sa->period_contrib = delta;

- contrib = cap_scale(contrib, scale_freq);
+ if (freq_adjusted)
+ contrib = cap_scale(contrib, scale_freq);
if (load)
sa->load_sum += load * contrib;
if (runnable)
@@ -178,7 +180,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
*/
static __always_inline int
___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
+ unsigned long load, unsigned long runnable, int running,
+ int freq_adjusted)
{
u64 delta;

@@ -221,7 +224,8 @@ ___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
* Step 1: accumulate *_sum since last_update_time. If we haven't
* crossed period boundaries, finish.
*/
- if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+ if (!accumulate_sum(delta, cpu, sa, load, runnable, running,
+ freq_adjusted))
return 0;

return 1;
@@ -272,7 +276,7 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
if (entity_is_task(se))
se->runnable_weight = se->load.weight;

- if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+ if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0, 1)) {
___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
return 1;
}
@@ -286,7 +290,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
se->runnable_weight = se->load.weight;

if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
- cfs_rq->curr == se)) {
+ cfs_rq->curr == se, 1)) {

___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
cfs_se_util_change(&se->avg);
@@ -301,7 +305,7 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
if (___update_load_sum(now, cpu, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
scale_load_down(cfs_rq->runnable_weight),
- cfs_rq->curr != NULL)) {
+ cfs_rq->curr != NULL, 1)) {

___update_load_avg(&cfs_rq->avg, 1, 1);
return 1;
@@ -326,7 +330,7 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
running,
running,
- running)) {
+ running, 1)) {

___update_load_avg(&rq->avg_rt, 1, 1);
return 1;
@@ -349,7 +353,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
running,
running,
- running)) {
+ running, 1)) {

___update_load_avg(&rq->avg_dl, 1, 1);
return 1;
@@ -385,11 +389,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
0,
0,
- 0);
+ 0, 1);
ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
1,
1,
- 1);
+ 1, 1);

if (ret)
___update_load_avg(&rq->avg_irq, 1, 1);
--
2.1.4


2018-10-09 16:26:10

by Thara Gopinath

Subject: [RFC PATCH 7/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping

Thermal governors can request that a cpu's maximum supported frequency
be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular cpu is
reduced. The delta between the original maximum capacity and the
capped maximum capacity is known as thermal pressure. Enable the
cpufreq cooling device to update the thermal pressure in the event of
a capped maximum frequency.
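
As a worked example of the scaling in the hunk below (made-up
numbers): with cpuinfo.max_freq = 2000000 kHz and policy->max capped
to 1500000 kHz, the value passed to the scheduler is
(1500000 << SCHED_CAPACITY_SHIFT) / 2000000, i.e. roughly 768 out of
1024. update_maxcap_capacity() then normalizes this further against
the cpu's own maximum capacity.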

Signed-off-by: Thara Gopinath <[email protected]>
---
drivers/thermal/cpu_cooling.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index dfd2324..da8de66 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -31,6 +31,7 @@
#include <linux/slab.h>
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
+#include <linux/sched/topology.h>

#include <trace/events/thermal.h>

@@ -132,6 +133,21 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
}

/**
+ * cpufreq_update_sched_max_capacity - update scheduler about change in cpu
+ * max frequency.
+ * @policy - cpufreq policy whose max frequency is capped.
+ */
+static void cpufreq_update_sched_max_capacity(struct cpufreq_policy *policy)
+{
+ int cpu;
+ unsigned long capacity = (policy->max << SCHED_CAPACITY_SHIFT) /
+ policy->cpuinfo.max_freq;
+
+ for_each_cpu(cpu, policy->cpus)
+ update_maxcap_capacity(cpu, capacity);
+}
+
+/**
* cpufreq_thermal_notifier - notifier callback for cpufreq policy change.
* @nb: struct notifier_block * with callback info.
* @event: value showing cpufreq event for which this function invoked.
@@ -175,8 +191,10 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
*/
clipped_freq = cpufreq_cdev->clipped_freq;

- if (policy->max > clipped_freq)
+ if (policy->max > clipped_freq) {
cpufreq_verify_within_limits(policy, 0, clipped_freq);
+ cpufreq_update_sched_max_capacity(policy);
+ }
break;
}
mutex_unlock(&cooling_list_lock);
--
2.1.4


2018-10-09 16:26:16

by Thara Gopinath

Subject: [RFC PATCH 6/7] sched/fair: Update cpu_capacity to reflect thermal pressure

cpu_capacity reflects the maximum available capacity of a cpu. Thermal
pressure on a cpu means this maximum available capacity is reduced.
This patch subtracts the average thermal pressure of a cpu from its
maximum available capacity so that cpu_capacity reflects the actual
available capacity.
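
A rough example with made-up numbers: on a cpu with a maximum
capacity of 1024, avg_rt.util_avg = 80, avg_dl.util_avg = 0 and
avg_thermal.load_avg = 256 give used = 336, so scale_rt_capacity()
reports roughly 688 of remaining capacity (before the irq scaling
applied at the end of that function), and cpu_capacity is set
accordingly.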

Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/fair.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7deb1d0..8651e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu)

used = READ_ONCE(rq->avg_rt.util_avg);
used += READ_ONCE(rq->avg_dl.util_avg);
+ used += READ_ONCE(rq->avg_thermal.load_avg);

if (unlikely(used >= max))
return 1;
--
2.1.4


2018-10-09 16:26:31

by Thara Gopinath

Subject: [RFC PATCH 3/7] sched: Add infrastructure to store and update instantaneous thermal pressure

Add thermal.c and thermal.h files that provide interface
APIs to initialize, update/average, track, accumulate and decay
thermal pressure on a per cpu basis. A per cpu structure
max_capacity_info is introduced to keep track of instantaneous per
cpu thermal pressure. Thermal pressure is the delta between
max_capacity and cap_capacity.
The API update_periodic_maxcap is called for periodic accumulation
and decay of the thermal pressure. It is to be called from a periodic
tick function. The API calculates the delta between max_capacity and
cap_capacity and passes the delta on to update_thermal_avg to do the
necessary accumulation, decay and averaging. update_maxcap_capacity
is for the system to update the thermal pressure by updating
cap_capacity.
Considering that update_periodic_maxcap reads cap_capacity and
update_maxcap_capacity writes into cap_capacity, one can argue for
some sort of locking mechanism to avoid a stale value.
But considering that update_periodic_maxcap can be called from a
system critical path like the scheduler tick function, a locking
mechanism is not ideal. This means that possibly for one tick period
the value used to calculate the average thermal pressure for a cpu
can be stale.
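
For clarity, a sketch of the intended call flow across the series
(all names below are from the series itself; the cpufreq cooling
caller is added in patch 7 and the tick hook in patch 5):

    /* thermal framework caps a cpu's maximum frequency (patch 7) */
    cpufreq_update_sched_max_capacity(policy)
        -> update_maxcap_capacity(cpu, capacity)   /* writes cap_capacity */

    /* scheduler tick (patch 5) */
    task_tick_fair(rq, ...)
        -> update_periodic_maxcap(rq)              /* reads cap_capacity */
            -> update_thermal_avg(now, rq, max_capacity - cap_capacity)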

Signed-off-by: Thara Gopinath <[email protected]>
---
include/linux/sched.h | 14 +++++++++++
kernel/sched/Makefile | 2 +-
kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/thermal.h | 13 ++++++++++
4 files changed, 94 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/thermal.c
create mode 100644 kernel/sched/thermal.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..931b76d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1891,3 +1891,17 @@ static inline void rseq_syscall(struct pt_regs *regs)
#endif

#endif
+
+#ifdef CONFIG_SMP
+void update_maxcap_capacity(int cpu, u64 capacity);
+
+void populate_max_capacity_info(void);
+#else
+static inline void update_maxcap_capacity(int cpu, u64 capacity)
+{
+}
+
+static inline void populate_max_capacity_info(void)
+{
+}
+#endif
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe1834..232a0cf 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o

-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
new file mode 100644
index 0000000..dd8300d
--- /dev/null
+++ b/kernel/sched/thermal.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Scheduler Thermal Interactions
+ *
+ * Copyright (C) 2018 Linaro, Inc., Thara Gopinath <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "pelt.h"
+#include "thermal.h"
+
+struct max_capacity_info {
+ unsigned long max_capacity;
+ unsigned long cap_capacity;
+};
+
+static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
+
+void update_maxcap_capacity(int cpu, u64 capacity)
+{
+ struct max_capacity_info *__max_cap;
+ u64 __capacity;
+
+ __max_cap = (&per_cpu(max_cap, cpu));
+ if (!__max_cap) {
+ pr_err("no max_capacity_info structure for cpu %d\n", cpu);
+ return;
+ }
+
+ /* Normalize the capacity */
+ __capacity = (capacity * arch_scale_cpu_capacity(NULL, cpu)) >>
+ SCHED_CAPACITY_SHIFT;
+ pr_debug("updating cpu%d capped capacity from %ld to %ld\n", cpu, __max_cap->cap_capacity, __capacity);
+
+ __max_cap->cap_capacity = __capacity;
+}
+
+void populate_max_capacity_info(void)
+{
+ struct max_capacity_info *__max_cap;
+ u64 capacity;
+ int cpu;
+
+
+ for_each_possible_cpu(cpu) {
+ __max_cap = (&per_cpu(max_cap, cpu));
+ if (!__max_cap)
+ continue;
+ capacity = arch_scale_cpu_capacity(NULL, cpu);
+ __max_cap->max_capacity = __max_cap->cap_capacity = capacity;
+ pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
+ }
+}
+
+void update_periodic_maxcap(struct rq *rq)
+{
+ struct max_capacity_info *__max_cap = (&per_cpu(max_cap, cpu_of(rq)));
+ unsigned long delta;
+
+ if (!__max_cap)
+ return;
+
+ delta = __max_cap->max_capacity - __max_cap->cap_capacity;
+ update_thermal_avg(rq_clock_task(rq), rq, delta);
+}
diff --git a/kernel/sched/thermal.h b/kernel/sched/thermal.h
new file mode 100644
index 0000000..20a0270
--- /dev/null
+++ b/kernel/sched/thermal.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Scheduler thermal interaction internal methods.
+ */
+
+#ifdef CONFIG_SMP
+void update_periodic_maxcap(struct rq *rq);
+
+#else
+static inline void update_periodic_maxcap(struct rq *rq)
+{
+}
+#endif
--
2.1.4


2018-10-09 16:27:47

by Thara Gopinath

Subject: [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure

Introduce support in the CFS periodic tick to trigger the computation
of the average thermal pressure for a cpu.

Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb59..7deb1d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
*/
#include "sched.h"
+#include "thermal.h"

#include <trace/events/sched.h>

@@ -9557,6 +9558,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_periodic_maxcap(rq);
}

/*
--
2.1.4


2018-10-09 16:28:01

by Thara Gopinath

Subject: [RFC PATCH 4/7] sched: Initialize per cpu thermal pressure structure

Initialize per cpu max_capacity_info during scheduler init.

Signed-off-by: Thara Gopinath <[email protected]>
---
drivers/base/arch_topology.c | 1 +
kernel/sched/core.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6..542745f 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -126,6 +126,7 @@ void topology_normalize_cpu_scale(void)
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(NULL, cpu));
}
+ populate_max_capacity_info();
mutex_unlock(&cpu_scale_mutex);
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 625bc98..f0eed1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5868,6 +5868,8 @@ void __init sched_init_smp(void)
init_sched_rt_class();
init_sched_dl_class();

+ populate_max_capacity_info();
+
sched_smp_initialized = true;
}

--
2.1.4


2018-10-09 16:28:39

by Thara Gopinath

Subject: [RFC PATCH 2/7] sched/pelt.c: Add support to track thermal pressure

Extrapolating on the existing framework to track rt/dl utilization
using PELT signals, add a similar mechanism to track thermal
pressure. The difference here from rt/dl utilization tracking is
that, instead of tracking time spent by a cpu running an rt/dl task
through util_avg, the average thermal pressure is tracked through
load_avg.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. The function update_thermal_avg can be
called to do the periodic bookkeeping (accumulate, decay and average)
of the thermal pressure.
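
As a sketch of the resulting behaviour (assuming the standard PELT
geometric series with its ~32ms half life): if a cpu stays capped so
that the delta passed to update_thermal_avg() is a constant D,
avg_thermal.load_avg ramps up towards D over a few half lives, and
decays back towards zero at the same rate once the cap is lifted and
the delta becomes zero.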

Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/pelt.c | 14 ++++++++++++++
kernel/sched/pelt.h | 7 +++++++
kernel/sched/sched.h | 1 +
3 files changed, 22 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 05b8798..7034ede 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -362,6 +362,20 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
return 0;
}

+int update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ if (___update_load_sum(now, rq->cpu, &rq->avg_thermal,
+ capacity,
+ capacity,
+ capacity, 0)) {
+
+ ___update_load_avg(&rq->avg_thermal, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
+
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
/*
* irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index d2894db..cc5a3ad 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -5,6 +5,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_thermal_avg(u64 now, struct rq *rq, u64 capacity);

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
int update_irq_load_avg(struct rq *rq, u64 running);
@@ -67,6 +68,12 @@ update_irq_load_avg(struct rq *rq, u64 running)
{
return 0;
}
+
+static inline int
+update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ return 0;
+}
#endif


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..30b40a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -859,6 +859,7 @@ struct rq {
#define HAVE_SCHED_AVG_IRQ
struct sched_avg avg_irq;
#endif
+ struct sched_avg avg_thermal;
u64 idle_stamp;
u64 avg_idle;

--
2.1.4


2018-10-10 05:47:39

by Javi Merino

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event for a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in linux kernel, in event of maximum
> frequency capping of a cpu, the maximum available compute
> capacity of the cpu is not adjusted at all. In other words, scheduler
> is unware maximum cpu capacity restrictions placed due to thermal
> activity.

Interesting, I would have sworn that I tested this years ago by
lowering the maximum frequency of a cpufreq domain, and the scheduler
reacted accordingly to the new maximum capacities of the cpus.

> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The delta between the maximum possible capacity of a cpu and
> maximum available capacity of a cpu due to thermal event can
> be considered as thermal pressure. Instantaneous thermal pressure
> is hard to record and can sometime be erroneous as there can be mismatch
> between the actual capping of capacity and scheduler recording it.
> Thus solution is to have a weighted average per cpu value for thermal
> pressure over time. The weight reflects the amount of time the cpu has
> spent at a capped maximum frequency. To accumulate, average and
> appropriately decay thermal pressure, this patch series uses pelt
> signals and reuses the available framework that does a similar
> bookkeeping of rt/dl task utilization.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on hikey960 mainline kernel with debian file system.
> Further aobench (An occlusion renderer for benchmarking realworld
> floating point performance) showed the following results on hikey960
> with debain.
>
> Result Standard Standard
> (Time secs) Error Deviation
> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>
> Thara Gopinath (7):
> sched/pelt: Add option to make load and util calculations frequency
> invariant
> sched/pelt.c: Add support to track thermal pressure
> sched: Add infrastructure to store and update instantaneous thermal
> pressure
> sched: Initialize per cpu thermal pressure structure
> sched/fair: Enable CFS periodic tick to update thermal pressure
> sched/fair: update cpu_capcity to reflect thermal pressure
> thermal/cpu-cooling: Update thermal pressure in case of a maximum
> frequency capping
>
> drivers/base/arch_topology.c | 1 +
> drivers/thermal/cpu_cooling.c | 20 ++++++++++++-

thermal? There are other ways in which the maximum frequency of a cpu
can be limited, for example from userspace via scaling_max_freq.

When something (anything) changes the maximum frequency of a cpufreq
policy, the scheduler should be notified. I think this change should
be done in cpufreq instead to make it generic and not particular to
a given maximum frequency "capper".

Cheers,
Javi

2018-10-10 05:58:20

by Javi Merino

Subject: Re: [RFC PATCH 6/7] sched/fair: Update cpu_capacity to reflect thermal pressure

On Tue, Oct 09, 2018 at 12:25:01PM -0400, Thara Gopinath wrote:
> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> pressure on a cpu means this maximum available capacity is reduced. This
> patch reduces the average thermal pressure for a cpu from its maximum
> available capacity so that cpu_capacity reflects the actual
> available capacity.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
> kernel/sched/fair.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7deb1d0..8651e55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu)
>
> used = READ_ONCE(rq->avg_rt.util_avg);
> used += READ_ONCE(rq->avg_dl.util_avg);
> + used += READ_ONCE(rq->avg_thermal.load_avg);

IIUIC, you are treating thermal pressure as an artificial load on the
cpu. If so, this sounds like a hard to maintain hack. Thermal
pressure has different characteristics to utilization. What happens
if thermal sets the cpu cooling state back to 0 because there is
thermal headroom again? Do we keep adding this artificial load to the
cpu just because there was thermal pressure in the past and let it
decay as if it was cpu load?

Cheers,
Javi


2018-10-10 06:18:51

by Ingo Molnar

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


* Thara Gopinath <[email protected]> wrote:

> Thermal governors can respond to an overheat event for a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in linux kernel, in event of maximum
> frequency capping of a cpu, the maximum available compute
> capacity of the cpu is not adjusted at all. In other words, scheduler
> is unware maximum cpu capacity restrictions placed due to thermal
> activity. This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The delta between the maximum possible capacity of a cpu and
> maximum available capacity of a cpu due to thermal event can
> be considered as thermal pressure. Instantaneous thermal pressure
> is hard to record and can sometime be erroneous as there can be mismatch
> between the actual capping of capacity and scheduler recording it.
> Thus solution is to have a weighted average per cpu value for thermal
> pressure over time. The weight reflects the amount of time the cpu has
> spent at a capped maximum frequency. To accumulate, average and
> appropriately decay thermal pressure, this patch series uses pelt
> signals and reuses the available framework that does a similar
> bookkeeping of rt/dl task utilization.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on hikey960 mainline kernel with debian file system.
> Further aobench (An occlusion renderer for benchmarking realworld
> floating point performance) showed the following results on hikey960
> with debain.
>
> Result Standard Standard
> (Time secs) Error Deviation
> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%

Wow, +13% speedup, impressive! We definitely want this outcome.

I'm wondering what happens if we do not track and decay the thermal load at all at the PELT
level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal
events we receive from the CPU.

You describe the averaging as:

> Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can
> be mismatch between the actual capping of capacity and scheduler recording it.

Not sure I follow the argument here: are there bogus thermal throttling events? If so then
they are hopefully not frequent enough and should average out over time even if we follow
it instantly.

I.e. what is 'can sometimes be erroneous', exactly?

Thanks,

Ingo

2018-10-10 08:31:14

by Quentin Perret

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi Thara,

On Wednesday 10 Oct 2018 at 08:17:51 (+0200), Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
> > Thermal governors can respond to an overheat event for a cpu by
> > capping the cpu's maximum possible frequency. This in turn
> > means that the maximum available compute capacity of the
> > cpu is restricted. But today in linux kernel, in event of maximum
> > frequency capping of a cpu, the maximum available compute
> > capacity of the cpu is not adjusted at all. In other words, scheduler
> > is unware maximum cpu capacity restrictions placed due to thermal
> > activity. This patch series attempts to address this issue.
> > The benefits identified are better task placement among available
> > cpus in event of overheating which in turn leads to better
> > performance numbers.
> >
> > The delta between the maximum possible capacity of a cpu and
> > maximum available capacity of a cpu due to thermal event can
> > be considered as thermal pressure. Instantaneous thermal pressure
> > is hard to record and can sometime be erroneous as there can be mismatch
> > between the actual capping of capacity and scheduler recording it.
> > Thus solution is to have a weighted average per cpu value for thermal
> > pressure over time. The weight reflects the amount of time the cpu has
> > spent at a capped maximum frequency. To accumulate, average and
> > appropriately decay thermal pressure, this patch series uses pelt
> > signals and reuses the available framework that does a similar
> > bookkeeping of rt/dl task utilization.
> >
> > Regarding testing, basic build, boot and sanity testing have been
> > performed on hikey960 mainline kernel with debian file system.
> > Further aobench (An occlusion renderer for benchmarking realworld
> > floating point performance) showed the following results on hikey960
> > with debain.
> >
> > Result Standard Standard
> > (Time secs) Error Deviation
> > Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> > Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>
> Wow, +13% speedup, impressive! We definitely want this outcome.
>
> I'm wondering what happens if we do not track and decay the thermal load at all at the PELT
> level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal
> events we receive from the CPU.

+1, it's not that obvious (to me at least) that averaging the thermal
pressure over time is necessarily what we want. Say the thermal governor
caps a CPU and 'removes' 70% of its capacity, it will take forever for
the PELT signal to ramp-up to that level before the scheduler can react.
And the other way around, if you release the cap, it'll take a while
before we actually start using the newly available capacity. I can also
imagine how reacting too fast can be counter-productive, but I guess
having numbers and/or use-cases to show that would be great :-)

Thara, have you tried to experiment with a simpler implementation as
suggested by Ingo ?

Also, assuming that we do want to average things, do we actually want to
tie the thermal ramp-up time to the PELT half life ? That provides
nice maths properties wrt the other signals, but it's not obvious to me
that this thermal 'constant' should be the same on all platforms. Or
maybe it should ?

Thanks,
Quentin

>
> You describe the averaging as:
>
> > Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can
> > be mismatch between the actual capping of capacity and scheduler recording it.
>
> Not sure I follow the argument here: are there bogus thermal throttling events? If so then
> they are hopefully not frequent enough and should average out over time even if we follow
> it instantly.
>
> I.e. what is 'can sometimes be erroneous', exactly?
>
> Thanks,
>
> Ingo

2018-10-10 08:52:18

by Vincent Guittot

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wed, 10 Oct 2018 at 10:29, Quentin Perret <[email protected]> wrote:
>
> Hi Thara,
>
> On Wednesday 10 Oct 2018 at 08:17:51 (+0200), Ingo Molnar wrote:
> >
> > * Thara Gopinath <[email protected]> wrote:
> >
> > > Thermal governors can respond to an overheat event for a cpu by
> > > capping the cpu's maximum possible frequency. This in turn
> > > means that the maximum available compute capacity of the
> > > cpu is restricted. But today in linux kernel, in event of maximum
> > > frequency capping of a cpu, the maximum available compute
> > > capacity of the cpu is not adjusted at all. In other words, scheduler
> > > is unware maximum cpu capacity restrictions placed due to thermal
> > > activity. This patch series attempts to address this issue.
> > > The benefits identified are better task placement among available
> > > cpus in event of overheating which in turn leads to better
> > > performance numbers.
> > >
> > > The delta between the maximum possible capacity of a cpu and
> > > maximum available capacity of a cpu due to thermal event can
> > > be considered as thermal pressure. Instantaneous thermal pressure
> > > is hard to record and can sometime be erroneous as there can be mismatch
> > > between the actual capping of capacity and scheduler recording it.
> > > Thus solution is to have a weighted average per cpu value for thermal
> > > pressure over time. The weight reflects the amount of time the cpu has
> > > spent at a capped maximum frequency. To accumulate, average and
> > > appropriately decay thermal pressure, this patch series uses pelt
> > > signals and reuses the available framework that does a similar
> > > bookkeeping of rt/dl task utilization.
> > >
> > > Regarding testing, basic build, boot and sanity testing have been
> > > performed on hikey960 mainline kernel with debian file system.
> > > Further aobench (An occlusion renderer for benchmarking realworld
> > > floating point performance) showed the following results on hikey960
> > > with debain.
> > >
> > > Result Standard Standard
> > > (Time secs) Error Deviation
> > > Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> > > Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
> >
> > Wow, +13% speedup, impressive! We definitely want this outcome.
> >
> > I'm wondering what happens if we do not track and decay the thermal load at all at the PELT
> > level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal
> > events we receive from the CPU.
>
> +1, it's not that obvious (to me at least) that averaging the thermal
> pressure over time is necessarily what we want. Say the thermal governor
> caps a CPU and 'removes' 70% of its capacity, it will take forever for
> the PELT signal to ramp-up to that level before the scheduler can react.
> And the other way around, if you release the cap, it'll take a while
> before we actually start using the newly available capacity. I can also
> imagine how reacting too fast can be counter-productive, but I guess
> having numbers and/or use-cases to show that would be great :-)

The problem with reflecting the capping directly is that it happens
far more often than the pace at which cpu_capacity_orig is updated in
the scheduler. This means that at the moment the scheduler uses the
value, it might not be correct anymore. Then, this value is also used
when building the sched_domain and setting max_cpu_capacity, which
would also imply rebuilding the sched_domain topology ...

The pace of changing the capping is too fast to reflect that in the
whole scheduler topology.

>
> Thara, have you tried to experiment with a simpler implementation as
> suggested by Ingo ?
>
> Also, assuming that we do want to average things, do we actually want to
> tie the thermal ramp-up time to the PELT half life ? That provides
> nice maths properties wrt the other signals, but it's not obvious to me
> that this thermal 'constant' should be the same on all platforms. Or
> maybe it should ?

The main interest of using a PELT signal is that thermal pressure will
evolve at the same pace as other signals used in the scheduler. With
thermal pressure, we have the exact same problem as with RT tasks. The
thermal framework will cap the max frequency, which will cap the
utilization of the tasks running on the CPU.

>
> Thanks,
> Quentin
>
> >
> > You describe the averaging as:
> >
> > > Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can
> > > be mismatch between the actual capping of capacity and scheduler recording it.
> >
> > Not sure I follow the argument here: are there bogus thermal throttling events? If so then
> > they are hopefully not frequent enough and should average out over time even if we follow
> > it instantly.
> >
> > I.e. what is 'can sometimes be erroneous', exactly?
> >
> > Thanks,
> >
> > Ingo

2018-10-10 09:57:01

by Quentin Perret

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi Vincent,

On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> The problem with reflecting directly the capping is that it happens
> far more often than the pace at which cpu_capacity_orig is updated in
> the scheduler.

Hmm, how can you be so sure ? That most likely depends on the workload,
the platform and the thermal governor. Some platforms heat up slowly,
some quickly. The pace at which the thermal governor will change things
should depend on that I assume.

> This means that at the moment when scheduler uses the
> value, it might not be correct anymore.

And OTOH, when you remove a cap for example, it will take time before
the scheduler can see the newly available capacity if you need to wait
for the signal to decay. So you are using a wrong information too in
that scenario.

> Then, this value are also used
> when building the sched_domain and setting max_cpu_capacity which
> would also implies the rebuilt the sched_domain topology ...

Wait what ? I thought the thermal cap was reflected in capacity_of, not
capacity_orig_of ... You need to rebuild the sched_domain in case of
thermal pressure ?

Hmm, let me have a closer look at the patches, I must have missed
something ...

> The pace of changing the capping is to fast to reflect that in the
> whole scheduler topology

That's probably true in some cases, but it'd be cool to have numbers to
back up that statement, I think.

Now, if you do need to rebuild the sched domain topology every time you
update the thermal pressure, I think the PELT HL is _way_ too short for
that ... You can't rebuild the whole thing every 32ms or so. Or am I
misunderstanding something ?

> > Thara, have you tried to experiment with a simpler implementation as
> > suggested by Ingo ?
> >
> > Also, assuming that we do want to average things, do we actually want to
> > tie the thermal ramp-up time to the PELT half life ? That provides
> > nice maths properties wrt the other signals, but it's not obvious to me
> > that this thermal 'constant' should be the same on all platforms. Or
> > maybe it should ?
>
> The main interest of using PELT signal is that thermal pressure will
> evolve at the same pace as other signals used in the scheduler.

Right, I think this is a nice property too (assuming that we actually
want to average things out).

> With
> thermal pressure, we have the exact same problem as with RT tasks. The
> thermal will cap the max frequency which will cap the utilization of
> the tasks running on the CPU

Well, the nature of the signal is slightly different IMO. Yes it's
capacity, but you're not actually measuring time spent on the CPU. All
other PELT signals are based on time, this thermal thing isn't, so it is
kinda different in a way. And I'm still wondering if it could be helpful
to be able to have a different HL for that thermal signal. That would
'break' the nice maths properties we have, yes, but is it a problem or is
it actually helpful to cope with the thermal characteristics of
different platforms ?

Thanks,
Quentin

2018-10-10 10:15:34

by Vincent Guittot

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wed, 10 Oct 2018 at 11:55, Quentin Perret <[email protected]> wrote:
>
> Hi Vincent,
>
> On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> > The problem with reflecting directly the capping is that it happens
> > far more often than the pace at which cpu_capacity_orig is updated in
> > the scheduler.
>
> Hmm, how can you be so sure ? That most likely depends on the workload,
> the platform and the thermal governor. Some platforms heat up slowly,
> some quickly. The pace at which the thermal governor will change things
> should depend on that I assume.
>
> > This means that at the moment when scheduler uses the
> > value, it might not be correct anymore.
>
> And OTOH, when you remove a cap for example, it will take time before
> the scheduler can see the newly available capacity if you need to wait
> for the signal to decay. So you are using a wrong information too in
> that scenario.

But we stay consistent with all other metrics. The same happens when a
task decides to stay idle for a long time after some activity... You
will have to wait for the signal to decay.

>
> > Then, this value are also used
> > when building the sched_domain and setting max_cpu_capacity which
> > would also implies the rebuilt the sched_domain topology ...
>
> Wait what ? I thought the thermal cap was reflected in capacity_of, not
> capacity_orig_of ... You need to rebuild the sched_domain in case of
> thermal pressure ?
>
> Hmm, let me have a closer look at the patches, I must have missed
> something ...
>
> > The pace of changing the capping is to fast to reflect that in the
> > whole scheduler topology
>
> That's probably true in some cases, but it'd be cool to have numbers to
> back up that statement, I think.
>
> Now, if you do need to rebuild the sched domain topology every time you
> update the thermal pressure, I think the PELT HL is _way_ too short for
> that ... You can't rebuild the whole thing every 32ms or so. Or am I
> misunderstanding something ?
>
> > > Thara, have you tried to experiment with a simpler implementation as
> > > suggested by Ingo ?
> > >
> > > Also, assuming that we do want to average things, do we actually want to
> > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > nice maths properties wrt the other signals, but it's not obvious to me
> > > that this thermal 'constant' should be the same on all platforms. Or
> > > maybe it should ?
> >
> > The main interest of using PELT signal is that thermal pressure will
> > evolve at the same pace as other signals used in the scheduler.
>
> Right, I think this is a nice property too (assuming that we actually
> want to average things out).
>
> > With
> > thermal pressure, we have the exact same problem as with RT tasks. The
> > thermal will cap the max frequency which will cap the utilization of
> > the tasks running on the CPU
>
> Well, the nature of the signal is slightly different IMO. Yes it's
> capacity, but you're no actually measuring time spent on the CPU. All
> other PELT signals are based on time, this thermal thing isn't, so it is
> kinda different in a way. And I'm still wondering if it could be helpful

hmmm ... it is based on time too.
Both signals (the current ones and the thermal one) are really close.
The main difference from the other utilization signals is that,
instead of providing a running/not running boolean that is then
weighted by the current capacity, the signal directly uses the capped
max capacity to reflect the amount of cycles stolen by thermal
mitigation.

> to be able to have a different HL for that thermal signal. That would
> 'break' the nice maths properties we have, yes, but is it a problem or is
> it actually helpful to cope with the thermal characteristics of
> different platforms ?

If you don't use the same kind of signal with the same responsiveness,
you will start to see some OPP toggles, as an example, when the
thermal state changes, because one metric will change faster than the
other and you can't have a correct view of the system. The same
problem was happening with rt tasks.

I take the example of RT tasks because the effect is quite similar,
except that an RT task steals all cycles when it runs whereas thermal
mitigation steals only some by capping the frequency.
>
> Thanks,
> Quentin

2018-10-10 10:37:10

by Quentin Perret

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wednesday 10 Oct 2018 at 12:14:32 (+0200), Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 11:55, Quentin Perret <[email protected]> wrote:
> >
> > Hi Vincent,
> >
> > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> > > The problem with reflecting directly the capping is that it happens
> > > far more often than the pace at which cpu_capacity_orig is updated in
> > > the scheduler.
> >
> > Hmm, how can you be so sure ? That most likely depends on the workload,
> > the platform and the thermal governor. Some platforms heat up slowly,
> > some quickly. The pace at which the thermal governor will change things
> > should depend on that I assume.
> >
> > > This means that at the moment when scheduler uses the
> > > value, it might not be correct anymore.
> >
> > And OTOH, when you remove a cap for example, it will take time before
> > the scheduler can see the newly available capacity if you need to wait
> > for the signal to decay. So you are using a wrong information too in
> > that scenario.
>
> But we stay consistant with all other metrics. The same happen when a
> task decide to stay idle for a long time after some activity... You
> will have to wait for the signal to decay

Because you see that as a task going idle. You could also say that
removing the cap from a CPU is equivalent to migrating a task off that
CPU. In this case you should see the newly available cap pretty much
immediately.

Also, I feel like the whole thermal capping story could interest the
Intel folks who, IIRC, were discussing how to represent their 'boosted'
OPPs in the scheduler some time ago. You can see the Intel boost thing
as a cap I think -- some OPPs can be inaccessible in some case. I'd be
interested to see what's their take on this.

> > > Then, this value are also used
> > > when building the sched_domain and setting max_cpu_capacity which
> > > would also implies the rebuilt the sched_domain topology ...
> >
> > Wait what ? I thought the thermal cap was reflected in capacity_of, not
> > capacity_orig_of ... You need to rebuild the sched_domain in case of
> > thermal pressure ?

I can't locate where you need to do this in the series. Do you actually
need to rebuild the sd hierarchy ?

> > Hmm, let me have a closer look at the patches, I must have missed
> > something ...
> >
> > > The pace of changing the capping is to fast to reflect that in the
> > > whole scheduler topology
> >
> > That's probably true in some cases, but it'd be cool to have numbers to
> > back up that statement, I think.
> >
> > Now, if you do need to rebuild the sched domain topology every time you
> > update the thermal pressure, I think the PELT HL is _way_ too short for
> > that ... You can't rebuild the whole thing every 32ms or so. Or am I
> > misunderstanding something ?
> >
> > > > Thara, have you tried to experiment with a simpler implementation as
> > > > suggested by Ingo ?
> > > >
> > > > Also, assuming that we do want to average things, do we actually want to
> > > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > > nice maths properties wrt the other signals, but it's not obvious to me
> > > > that this thermal 'constant' should be the same on all platforms. Or
> > > > maybe it should ?
> > >
> > > The main interest of using PELT signal is that thermal pressure will
> > > evolve at the same pace as other signals used in the scheduler.
> >
> > Right, I think this is a nice property too (assuming that we actually
> > want to average things out).
> >
> > > With
> > > thermal pressure, we have the exact same problem as with RT tasks. The
> > > thermal will cap the max frequency which will cap the utilization of
> > > the tasks running on the CPU
> >
> > Well, the nature of the signal is slightly different IMO. Yes it's
> > capacity, but you're no actually measuring time spent on the CPU. All
> > other PELT signals are based on time, this thermal thing isn't, so it is
> > kinda different in a way. And I'm still wondering if it could be helpful
>
> hmmm ... it is based on time too.

You're not actually measuring the time spent on the CPU by the 'thermal
task'. There is no such thing as a 'thermal task'. You're just trying to
model things like that, but the thermal stuff isn't actually
interrupting running tasks to eat CPU cycles. It just makes things run
slower, which isn't exactly the same thing IMO.

But maybe that's a detail.

> Both signals (current ones and thermal one) are really close. The main
> difference with other utilization signal is that instead of providing
> a running/not running boolean that is then weighted by the current
> capacity, the signal uses direclty the capped max capacity to reflect
> the amount of cycle that is stolen by thermal mitigation.
>
> > to be able to have a different HL for that thermal signal. That would
> > 'break' the nice maths properties we have, yes, but is it a problem or is
> > it actually helpful to cope with the thermal characteristics of
> > different platforms ?
>
> If you don't use the sign kind of signal with the same responsiveness,
> you will start to see some OPP toggles as an example when the thermal
> state change because one metrics will change faster than the other one
> and you can't have a correct view of the system. Same problem was
> happening with rt task.

Well, that wasn't the problem with rt tasks. The problem with RT tasks
was that the time they spend on the CPU wasn't accounted _at all_ when
selecting frequency for CFS, not that the accounting was at a different
pace ...

Thanks,
Quentin

2018-10-10 12:07:11

by Vincent Guittot

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wed, 10 Oct 2018 at 12:36, Quentin Perret <[email protected]> wrote:
>
> On Wednesday 10 Oct 2018 at 12:14:32 (+0200), Vincent Guittot wrote:
> > On Wed, 10 Oct 2018 at 11:55, Quentin Perret <[email protected]> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> > > > The problem with reflecting directly the capping is that it happens
> > > > far more often than the pace at which cpu_capacity_orig is updated in
> > > > the scheduler.
> > >
> > > Hmm, how can you be so sure ? That most likely depends on the workload,
> > > the platform and the thermal governor. Some platforms heat up slowly,
> > > some quickly. The pace at which the thermal governor will change things
> > > should depend on that I assume.
> > >
> > > > This means that at the moment when scheduler uses the
> > > > value, it might not be correct anymore.
> > >
> > > And OTOH, when you remove a cap for example, it will take time before
> > > the scheduler can see the newly available capacity if you need to wait
> > > for the signal to decay. So you are using a wrong information too in
> > > that scenario.
> >
> > But we stay consistant with all other metrics. The same happen when a
> > task decide to stay idle for a long time after some activity... You
> > will have to wait for the signal to decay
>
> Because you see that as a task going idle. You could also say that
> removing the cap from a CPU is equivalent to migrating a task off that
> CPU. In this case you should see the newly available cap pretty much
> immediately.
>
> Also, I feel like the whole thermal capping story could interest the
> Intel folks who, IIRC, were discussing how to represent their 'boosted'
> OPPs in the scheduler some time ago. You can see the Intel boost thing
> as a cap I think -- some OPPs can be inaccessible in some case. I'd be
> interested to see what's their take on this.
>
> > > > Then, this value are also used
> > > > when building the sched_domain and setting max_cpu_capacity which
> > > > would also implies the rebuilt the sched_domain topology ...
> > >
> > > Wait what ? I thought the thermal cap was reflected in capacity_of, not
> > > capacity_orig_of ... You need to rebuild the sched_domain in case of
> > > thermal pressure ?
>
> I can't locate where you need to do this in the series. Do you actually
> need to rebuild the sd hierarchy ?

This patchset doesn't touch cpu_capacity_orig and doesn't need to, as
it assumes that the max capacity is unchanged but some capacity is
momentarily stolen by thermal mitigation.
If you want to immediately reflect all thermal capping changes, you
have to update this field and all related fields and structs around it.

>
> > > Hmm, let me have a closer look at the patches, I must have missed
> > > something ...
> > >
> > > > The pace of changing the capping is to fast to reflect that in the
> > > > whole scheduler topology
> > >
> > > That's probably true in some cases, but it'd be cool to have numbers to
> > > back up that statement, I think.
> > >
> > > Now, if you do need to rebuild the sched domain topology every time you
> > > update the thermal pressure, I think the PELT HL is _way_ too short for
> > > that ... You can't rebuild the whole thing every 32ms or so. Or am I
> > > misunderstanding something ?
> > >
> > > > > Thara, have you tried to experiment with a simpler implementation as
> > > > > suggested by Ingo ?
> > > > >
> > > > > Also, assuming that we do want to average things, do we actually want to
> > > > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > > > nice maths properties wrt the other signals, but it's not obvious to me
> > > > > that this thermal 'constant' should be the same on all platforms. Or
> > > > > maybe it should ?
> > > >
> > > > The main interest of using PELT signal is that thermal pressure will
> > > > evolve at the same pace as other signals used in the scheduler.
> > >
> > > Right, I think this is a nice property too (assuming that we actually
> > > want to average things out).
> > >
> > > > With
> > > > thermal pressure, we have the exact same problem as with RT tasks. The
> > > > thermal will cap the max frequency which will cap the utilization of
> > > > the tasks running on the CPU
> > >
> > > Well, the nature of the signal is slightly different IMO. Yes it's
> > > capacity, but you're no actually measuring time spent on the CPU. All
> > > other PELT signals are based on time, this thermal thing isn't, so it is
> > > kinda different in a way. And I'm still wondering if it could be helpful
> >
> > hmmm ... it is based on time too.
>
> You're not actually measuring the time spent on the CPU by the 'thermal
> task'. There is no such thing as a 'thermal task'. You're just trying to
> model things like that, but the thermal stuff isn't actually
> interrupting running tasks to eat CPU cycles. It just makes thing run
> slower, which isn't exactly the same thing IMO.
>
> But maybe that's a detail.
>
> > Both signals (current ones and thermal one) are really close. The main
> > difference with other utilization signal is that instead of providing
> > a running/not running boolean that is then weighted by the current
> > capacity, the signal uses direclty the capped max capacity to reflect
> > the amount of cycle that is stolen by thermal mitigation.
> >
> > > to be able to have a different HL for that thermal signal. That would
> > > 'break' the nice maths properties we have, yes, but is it a problem or is
> > > it actually helpful to cope with the thermal characteristics of
> > > different platforms ?
> >
> > If you don't use the sign kind of signal with the same responsiveness,
> > you will start to see some OPP toggles as an example when the thermal
> > state change because one metrics will change faster than the other one
> > and you can't have a correct view of the system. Same problem was
> > happening with rt task.
>
> Well, that wasn't the problem with rt tasks. The problem with RT tasks
> was that the time they spend on the CPU wasn't accounted _at all_ when
> selecting frequency for CFS, not that the accounting was at a different
> pace ...

The problem was the same with RT: the cfs utilization was lower than
reality because RT steals some cycles from CFS,
so schedutil was selecting a lower frequency when cfs was running
whereas the CPU was fully used.
The same can happen with thermal:
- cap the max freq because of thermal:
  the utilization will decrease.
- remove the cap:
  the utilization is still low and you will select a low OPP because
  you don't take into account the cycles stolen by thermal, like with RT.
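
As a concrete (made-up) example: an always-running task on a cpu
capped to 60% of its maximum frequency sees its frequency-invariant
utilization settle around 614 instead of 1024. If the cap is then
lifted and the stolen cycles are not accounted for, schedutil keeps
requesting roughly the 60% OPP until the utilization signal slowly
ramps back up.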

Regards,
Vincent
>
> Thanks,
> Quentin

2018-10-10 12:24:39

by Juri Lelli

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/10/18 14:04, Vincent Guittot wrote:

[...]

> The problem was the same with RT, the cfs utilization was lower than
> reality because RT steals soem cycle to CFS
> So schedutil was selecting a lower frequency when cfs was running
> whereas the CPU was fully used.
> The same can happen with thermal:
> cap the max freq because of thermal
> the utilization with decrease.
> remove the cap
> the utilization is still low and you will select a low OPP because you
> don't take into account cycle stolen by thermal like with RT

What if we scale the frequency component considering the capped
temporary max?

2018-10-10 12:35:50

by Vincent Guittot

Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi Juri,

On Wed, 10 Oct 2018 at 14:23, Juri Lelli <[email protected]> wrote:
>
> On 10/10/18 14:04, Vincent Guittot wrote:
>
> [...]
>
> > The problem was the same with RT, the cfs utilization was lower than
> > reality because RT steals soem cycle to CFS
> > So schedutil was selecting a lower frequency when cfs was running
> > whereas the CPU was fully used.
> > The same can happen with thermal:
> > cap the max freq because of thermal
> > the utilization with decrease.
> > remove the cap
> > the utilization is still low and you will select a low OPP because you
> > don't take into account cycle stolen by thermal like with RT
>
> What if we scale frequency component considering the capped temporary
> max?

Do you mean using a kind of scale_thermal_capacity in accumulate_sum
when computing utilization ?

2018-10-10 12:52:47

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/10/18 14:34, Vincent Guittot wrote:
> Hi Juri,
>
> On Wed, 10 Oct 2018 at 14:23, Juri Lelli <[email protected]> wrote:
> >
> > On 10/10/18 14:04, Vincent Guittot wrote:
> >
> > [...]
> >
> > > The problem was the same with RT, the cfs utilization was lower than
> > > reality because RT steals soem cycle to CFS
> > > So schedutil was selecting a lower frequency when cfs was running
> > > whereas the CPU was fully used.
> > > The same can happen with thermal:
> > > cap the max freq because of thermal
> > > the utilization with decrease.
> > > remove the cap
> > > the utilization is still low and you will select a low OPP because you
> > > don't take into account cycle stolen by thermal like with RT
> >
> > What if we scale frequency component considering the capped temporary
> > max?
>
> Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> when computing utilization ?

Yeah, something like that I guess. So that we account for temporary
"fake" 1024..

2018-10-10 13:08:24

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> This patchset doesn't touch cpu_capacity_orig and doesn't need to as
> it assume that the max capacity is unchanged but some capacity is
> momentary stolen by thermal.
> If you want to reflect immediately all thermal capping change, you
> have to update this field and all related fields and struct around

I don't follow you here. I never said I wanted to change
cpu_capacity_orig. I don't think we should do that actually. Changing
capacity_of (which is updated during LB IIRC) is just fine. The question
is about what you want to do there: reflect an averaged value or the
instantaneous one.

It's not obvious (to me) that the complex one (the averaged value) is
better than the other, simpler, one. All I'm saying from the beginning
is that it'd be nice to have numbers and use cases to discuss the pros
and cons of both approaches.

> > > > Hmm, let me have a closer look at the patches, I must have missed
> > > > something ...
> > > >
> > > > > The pace of changing the capping is to fast to reflect that in the
> > > > > whole scheduler topology
> > > >
> > > > That's probably true in some cases, but it'd be cool to have numbers to
> > > > back up that statement, I think.
> > > >
> > > > Now, if you do need to rebuild the sched domain topology every time you
> > > > update the thermal pressure, I think the PELT HL is _way_ too short for
> > > > that ... You can't rebuild the whole thing every 32ms or so. Or am I
> > > > misunderstanding something ?
> > > >
> > > > > > Thara, have you tried to experiment with a simpler implementation as
> > > > > > suggested by Ingo ?
> > > > > >
> > > > > > Also, assuming that we do want to average things, do we actually want to
> > > > > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > > > > nice maths properties wrt the other signals, but it's not obvious to me
> > > > > > that this thermal 'constant' should be the same on all platforms. Or
> > > > > > maybe it should ?
> > > > >
> > > > > The main interest of using PELT signal is that thermal pressure will
> > > > > evolve at the same pace as other signals used in the scheduler.
> > > >
> > > > Right, I think this is a nice property too (assuming that we actually
> > > > want to average things out).
> > > >
> > > > > With
> > > > > thermal pressure, we have the exact same problem as with RT tasks. The
> > > > > thermal will cap the max frequency which will cap the utilization of
> > > > > the tasks running on the CPU
> > > >
> > > > Well, the nature of the signal is slightly different IMO. Yes it's
> > > > capacity, but you're no actually measuring time spent on the CPU. All
> > > > other PELT signals are based on time, this thermal thing isn't, so it is
> > > > kinda different in a way. And I'm still wondering if it could be helpful
> > >
> > > hmmm ... it is based on time too.
> >
> > You're not actually measuring the time spent on the CPU by the 'thermal
> > task'. There is no such thing as a 'thermal task'. You're just trying to
> > model things like that, but the thermal stuff isn't actually
> > interrupting running tasks to eat CPU cycles. It just makes thing run
> > slower, which isn't exactly the same thing IMO.
> >
> > But maybe that's a detail.
> >
> > > Both signals (current ones and thermal one) are really close. The main
> > > difference with other utilization signal is that instead of providing
> > > a running/not running boolean that is then weighted by the current
> > > capacity, the signal uses direclty the capped max capacity to reflect
> > > the amount of cycle that is stolen by thermal mitigation.
> > >
> > > > to be able to have a different HL for that thermal signal. That would
> > > > 'break' the nice maths properties we have, yes, but is it a problem or is
> > > > it actually helpful to cope with the thermal characteristics of
> > > > different platforms ?
> > >
> > > If you don't use the sign kind of signal with the same responsiveness,
> > > you will start to see some OPP toggles as an example when the thermal
> > > state change because one metrics will change faster than the other one
> > > and you can't have a correct view of the system. Same problem was
> > > happening with rt task.
> >
> > Well, that wasn't the problem with rt tasks. The problem with RT tasks
> > was that the time they spend on the CPU wasn't accounted _at all_ when
> > selecting frequency for CFS, not that the accounting was at a different
> > pace ...
>
> The problem was the same with RT, the cfs utilization was lower than
> reality because RT steals soem cycle to CFS
> So schedutil was selecting a lower frequency when cfs was running
> whereas the CPU was fully used.
> The same can happen with thermal:
> cap the max freq because of thermal
> the utilization with decrease.
> remove the cap
> the utilization is still low and you will select a low OPP because you
> don't take into account cycle stolen by thermal like with RT

I'm not arguing with the fact that we need to reflect the thermal
pressure in the scheduler to see that a CPU is fully busy. I agree with
that.

I'm saying you don't necessarily have to update the thermal stuff and
the existing PELT signals *at the same pace*, because different
platforms have different thermal characteristics.

Thanks,
Quentin

2018-10-10 13:14:13

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wednesday 10 Oct 2018 at 14:50:33 (+0200), Juri Lelli wrote:
> On 10/10/18 14:34, Vincent Guittot wrote:
> > Hi Juri,
> >
> > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <[email protected]> wrote:
> > >
> > > On 10/10/18 14:04, Vincent Guittot wrote:
> > >
> > > [...]
> > >
> > > > The problem was the same with RT, the cfs utilization was lower than
> > > > reality because RT steals soem cycle to CFS
> > > > So schedutil was selecting a lower frequency when cfs was running
> > > > whereas the CPU was fully used.
> > > > The same can happen with thermal:
> > > > cap the max freq because of thermal
> > > > the utilization with decrease.
> > > > remove the cap
> > > > the utilization is still low and you will select a low OPP because you
> > > > don't take into account cycle stolen by thermal like with RT
> > >
> > > What if we scale frequency component considering the capped temporary
> > > max?
> >
> > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > when computing utilization ?
>
> Yeah, something like that I guess. So that we account for temporary
> "fake" 1024..

But wouldn't that break frequency invariance ? A task would look bigger
on a capped CPU than on a non-capped one, no ?

2018-10-10 13:15:05

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wed, 10 Oct 2018 at 14:50, Juri Lelli <[email protected]> wrote:
>
> On 10/10/18 14:34, Vincent Guittot wrote:
> > Hi Juri,
> >
> > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <[email protected]> wrote:
> > >
> > > On 10/10/18 14:04, Vincent Guittot wrote:
> > >
> > > [...]
> > >
> > > > The problem was the same with RT, the cfs utilization was lower than
> > > > reality because RT steals soem cycle to CFS
> > > > So schedutil was selecting a lower frequency when cfs was running
> > > > whereas the CPU was fully used.
> > > > The same can happen with thermal:
> > > > cap the max freq because of thermal
> > > > the utilization with decrease.
> > > > remove the cap
> > > > the utilization is still low and you will select a low OPP because you
> > > > don't take into account cycle stolen by thermal like with RT
> > >
> > > What if we scale frequency component considering the capped temporary
> > > max?
> >
> > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > when computing utilization ?
>
> Yeah, something like that I guess. So that we account for temporary
> "fake" 1024..

But the utilization will not be invariant anymore across the system

2018-10-10 13:28:47

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wed, 10 Oct 2018 at 15:05, Quentin Perret <[email protected]> wrote:
>
> On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> > This patchset doesn't touch cpu_capacity_orig and doesn't need to as
> > it assume that the max capacity is unchanged but some capacity is
> > momentary stolen by thermal.
> > If you want to reflect immediately all thermal capping change, you
> > have to update this field and all related fields and struct around
>
> I don't follow you here. I never said I wanted to change
> cpu_capacity_orig. I don't think we should do that actually. Changing
> capacity_of (which is updated during LB IIRC) is just fine. The question
> is about what you want to do there: reflect an averaged value or the
> instantaneous one.

Sorry, I thought you were speaking about updating cpu_capacity_orig.
By using the instantaneous max value in capacity_of(), we are back to
the problem raised by Thara that the value will most probably not
reflect the current capping value when it is used in LB, because the LB
period can be quite long on a busy CPU (the default max value is
32*sd_weight ms).

>
> It's not obvious (to me) that the complex one (the averaged value) is
> better than the other, simpler, one. All I'm saying from the beginning
> is that it'd be nice to have numbers and use cases to discuss the pros
> and cons of both approaches.
>
> > > > > Hmm, let me have a closer look at the patches, I must have missed
> > > > > something ...
> > > > >
> > > > > > The pace of changing the capping is to fast to reflect that in the
> > > > > > whole scheduler topology
> > > > >

[snip]

> > >
> > > Well, that wasn't the problem with rt tasks. The problem with RT tasks
> > > was that the time they spend on the CPU wasn't accounted _at all_ when
> > > selecting frequency for CFS, not that the accounting was at a different
> > > pace ...
> >
> > The problem was the same with RT, the cfs utilization was lower than
> > reality because RT steals soem cycle to CFS
> > So schedutil was selecting a lower frequency when cfs was running
> > whereas the CPU was fully used.
> > The same can happen with thermal:
> > cap the max freq because of thermal
> > the utilization with decrease.
> > remove the cap
> > the utilization is still low and you will select a low OPP because you
> > don't take into account cycle stolen by thermal like with RT
>
> I'm not arguing with the fact that we need to reflect the thermal
> pressure in the scheduler to see that a CPU is fully busy. I agree with
> that.
>
> I'm saying you don't necessarily have to update the thermal stuff and
> the existing PELT signals *at the same pace*, because different
> platforms have different thermal characteristics.

But you also need to take into account how fast the other metrics in the
scheduler are updated, otherwise one metric will reflect a change that is
not yet reflected in the other metrics and you might take a wrong
decision, as in my example above where the utilization is still low but
the thermal pressure is null and you assume that you have a lot of spare
capacity.
Having metrics that use the same responsiveness and are synced helps to
get a consolidated view of the system.

Vincent
>
> Thanks,
> Quentin

2018-10-10 13:35:47

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/10/18 15:08, Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 14:50, Juri Lelli <[email protected]> wrote:
> >
> > On 10/10/18 14:34, Vincent Guittot wrote:
> > > Hi Juri,
> > >
> > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <[email protected]> wrote:
> > > >
> > > > On 10/10/18 14:04, Vincent Guittot wrote:
> > > >
> > > > [...]
> > > >
> > > > > The problem was the same with RT, the cfs utilization was lower than
> > > > > reality because RT steals soem cycle to CFS
> > > > > So schedutil was selecting a lower frequency when cfs was running
> > > > > whereas the CPU was fully used.
> > > > > The same can happen with thermal:
> > > > > cap the max freq because of thermal
> > > > > the utilization with decrease.
> > > > > remove the cap
> > > > > the utilization is still low and you will select a low OPP because you
> > > > > don't take into account cycle stolen by thermal like with RT
> > > >
> > > > What if we scale frequency component considering the capped temporary
> > > > max?
> > >
> > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > > when computing utilization ?
> >
> > Yeah, something like that I guess. So that we account for temporary
> > "fake" 1024..
>
> But the utilization will not be invariant anymore across the system

Mmm, I guess I might be wrong, but I was thinking we should be able to
deal with this similarly to what we do with cpus with different max
capacities. So, another factor? Because then, how do we handle other
ways in which max freq can be restricted (e.g. from userspace as Javi
was also mentioning)?

2018-10-10 13:41:17

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wed, 10 Oct 2018 at 15:35, Juri Lelli <[email protected]> wrote:
>
> On 10/10/18 15:08, Vincent Guittot wrote:
> > On Wed, 10 Oct 2018 at 14:50, Juri Lelli <[email protected]> wrote:
> > >
> > > On 10/10/18 14:34, Vincent Guittot wrote:
> > > > Hi Juri,
> > > >
> > > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <[email protected]> wrote:
> > > > >
> > > > > On 10/10/18 14:04, Vincent Guittot wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > The problem was the same with RT, the cfs utilization was lower than
> > > > > > reality because RT steals soem cycle to CFS
> > > > > > So schedutil was selecting a lower frequency when cfs was running
> > > > > > whereas the CPU was fully used.
> > > > > > The same can happen with thermal:
> > > > > > cap the max freq because of thermal
> > > > > > the utilization with decrease.
> > > > > > remove the cap
> > > > > > the utilization is still low and you will select a low OPP because you
> > > > > > don't take into account cycle stolen by thermal like with RT
> > > > >
> > > > > What if we scale frequency component considering the capped temporary
> > > > > max?
> > > >
> > > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > > > when computing utilization ?
> > >
> > > Yeah, something like that I guess. So that we account for temporary
> > > "fake" 1024..
> >
> > But the utilization will not be invariant anymore across the system
>
> Mmm, I guess I might be wrong, but I was thinking we should be able to
> deal with this similarly to what we do with cpus with different max
> capacities. So, another factor? Because then, how do we handle other
> ways in which max freq can be restricted (e.g. from userspace as Javi
> was also mentioning)?

IMHO, userspace capping is a different story because it is not
expected to happen so often, but it should stay for a while, and in this
case a solution is probably to rebuild the sched_domains and update
all the cpu_capacity structs and fields

2018-10-10 13:48:43

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 15:05, Quentin Perret <[email protected]> wrote:
> >
> > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> > > This patchset doesn't touch cpu_capacity_orig and doesn't need to as
> > > it assume that the max capacity is unchanged but some capacity is
> > > momentary stolen by thermal.
> > > If you want to reflect immediately all thermal capping change, you
> > > have to update this field and all related fields and struct around
> >
> > I don't follow you here. I never said I wanted to change
> > cpu_capacity_orig. I don't think we should do that actually. Changing
> > capacity_of (which is updated during LB IIRC) is just fine. The question
> > is about what you want to do there: reflect an averaged value or the
> > instantaneous one.
>
> Sorry I though your were speaking about updating this cpu_capacity_orig.

N/p, communication via email can easily become confusing :-)

> With using instantaneous max value in capacity_of(), we are back to
> the problem raised by Thara that the value will most probably not
> reflect the current capping value when it is used in LB, because LB
> period can quite long on busy CPU (default max value is 32*sd_weight
> ms)

But averaging the capping value over time doesn't make LB happen more
often ... That will help you account for capping that happened in the
past, but it's not obvious this is actually a good thing. Probably not
all the time anyway.

Say a CPU was capped at 50% of its capacity, then the cap is removed.
At that point it'll take 100ms+ for the thermal signal to decay and let
the scheduler know about the newly available capacity. That can probably
be a performance hit in some use cases ... And the other way around, it
can also take forever for the scheduler to notice that a CPU has a
reduced capacity before reacting to it.
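
For reference, a rough back-of-the-envelope for that decay, assuming a
pure half-life decay every ~32ms and ignoring the per-period
accumulation the kernel actually does:

	#include <stdio.h>

	int main(void)
	{
		double pressure = 512.0;	/* CPU was capped at 50% of 1024 */
		int ms;

		for (ms = 32; ms <= 160; ms += 32) {
			pressure /= 2.0;	/* one PELT half-life per ~32ms */
			printf("after %3dms: thermal pressure ~%3.0f / 1024\n",
			       ms, pressure);
		}
		return 0;
	}

i.e. roughly 100ms or more before the leftover pressure drops below a
few percent of the CPU's capacity.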

If you want to filter out very short transient capping events to avoid
over-reacting in the scheduler (is this actually happening ?), then
maybe the average should be done on the cooling device side or something
like that ?

Thanks,
Quentin

2018-10-10 14:16:37

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hello Javi,

Thanks for the interest.

On 10/10/2018 01:44 AM, Javi Merino wrote:
> On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity.
>
> Interesting, I would have sworn that I tested this years ago by
> lowering the maximum frequency of a cpufreq domain, and the scheduler
> reacted accordingly to the new maximum capacities of the cpus.
>
>> This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>> Result Standard Standard
>> (Time secs) Error Deviation
>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>>
>> Thara Gopinath (7):
>> sched/pelt: Add option to make load and util calculations frequency
>> invariant
>> sched/pelt.c: Add support to track thermal pressure
>> sched: Add infrastructure to store and update instantaneous thermal
>> pressure
>> sched: Initialize per cpu thermal pressure structure
>> sched/fair: Enable CFS periodic tick to update thermal pressure
>> sched/fair: update cpu_capcity to reflect thermal pressure
>> thermal/cpu-cooling: Update thermal pressure in case of a maximum
>> frequency capping
>>
>> drivers/base/arch_topology.c | 1 +
>> drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>
> thermal? There are other ways in which the maximum frequency of a cpu
> can be limited, for example from userspace via scaling_max_freq.

Yes, there are other ways in which the maximum frequency of a cpu can be
capped. The difference probably is that in case of a user-space cap, the
time the cpu remains in the capped state is significantly longer than for
capping due to a thermal event. So maybe the response of the scheduler
should be different in that scenario (like rebuilding the capacities of
all cpus, etc.).

This patch series does not rebuild the scheduler structures. This just
tells the scheduler that some cpu capacity is stolen.
>
> When something (anything) changes the maximum frequency of a cpufreq
> policy, the scheduler should be notified. I think this change should
> be done in cpufreq instead to make it generic and not particular to
> a given maximum frequency "capper".

I am glad to hear you say this! So far, all I have heard whenever I
bring up this topic is issues with such an update from cpufreq (delays,
lost updates, etc.). Personally, I have not seen these issues enough to
comment on them. I would really like to hear more about these issues with
an update from cpufreq here on the list.

For me, the patch series is a mechanism to let the scheduler know that a
thermal event has stolen some cpu capacity. The update itself can happen
from any framework that can track the event and that we all mutually
agree on.

Regards
Thara

>
> Cheers,
> Javi
>


--
Regards
Thara

2018-10-10 14:24:46

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect thermal pressure

On 10/10/2018 01:57 AM, Javi Merino wrote:
> On Tue, Oct 09, 2018 at 12:25:01PM -0400, Thara Gopinath wrote:
>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
>> pressure on a cpu means this maximum available capacity is reduced. This
>> patch reduces the average thermal pressure for a cpu from its maximum
>> available capacity so that cpu_capacity reflects the actual
>> available capacity.
>>
>> Signed-off-by: Thara Gopinath <[email protected]>
>> ---
>> kernel/sched/fair.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7deb1d0..8651e55 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu)
>>
>> used = READ_ONCE(rq->avg_rt.util_avg);
>> used += READ_ONCE(rq->avg_dl.util_avg);
>> + used += READ_ONCE(rq->avg_thermal.load_avg);
>
> IIUIC, you are treating thermal pressure as an artificial load on the
> cpu. If so, this sounds like a hard to maintain hack. Thermal
> pressure have different characteristics to utilization. What happens
> if thermal sets the cpu cooling state back to 0 because there is
> thermal headroom again? Do we keep adding this artificial load to the
> cpu just because there was thermal pressure in the past and let it
> decay as if it was cpu load?

Setting the cpu cooling state back to 0 will decay the thermal pressure
back to 0. Yes, the cpu will not register an instantaneous drop in the
cap; it will be decayed down following the PELT signals.
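
For reference, a simplified sketch of what the capacity seen by the
load balancer looks like with the quoted hunk applied (capped_capacity()
is a made-up name; the real scale_rt_capacity() also folds in IRQ
pressure):

	static unsigned long capped_capacity(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long max = capacity_orig_of(cpu);
		unsigned long used;

		used  = READ_ONCE(rq->avg_rt.util_avg);
		used += READ_ONCE(rq->avg_dl.util_avg);
		/* averaged thermal pressure, decays after the cap is lifted */
		used += READ_ONCE(rq->avg_thermal.load_avg);

		if (used >= max)
			return 1;

		return max - used;
	}

So once the cooling state goes back to 0, the capacity creeps back up
as avg_thermal decays (roughly one halving per 32ms), rather than
jumping back instantly.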

>
> Cheers,
> Javi
>


--
Regards
Thara

2018-10-10 15:20:45

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Wed, 10 Oct 2018 at 15:48, Quentin Perret <[email protected]> wrote:
>
> On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:
> > On Wed, 10 Oct 2018 at 15:05, Quentin Perret <[email protected]> wrote:
> > >
> > > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> > > > This patchset doesn't touch cpu_capacity_orig and doesn't need to as
> > > > it assume that the max capacity is unchanged but some capacity is
> > > > momentary stolen by thermal.
> > > > If you want to reflect immediately all thermal capping change, you
> > > > have to update this field and all related fields and struct around
> > >
> > > I don't follow you here. I never said I wanted to change
> > > cpu_capacity_orig. I don't think we should do that actually. Changing
> > > capacity_of (which is updated during LB IIRC) is just fine. The question
> > > is about what you want to do there: reflect an averaged value or the
> > > instantaneous one.
> >
> > Sorry I though your were speaking about updating this cpu_capacity_orig.
>
> N/p, communication via email can easily become confusing :-)
>
> > With using instantaneous max value in capacity_of(), we are back to
> > the problem raised by Thara that the value will most probably not
> > reflect the current capping value when it is used in LB, because LB
> > period can quite long on busy CPU (default max value is 32*sd_weight
> > ms)
>
> But averaging the capping value over time doesn't make LB happen more
> often ... That will help you account for capping that happened in the

But you know what happens on average between two LBs

> past, but it's not obvious this is actually a good thing. Probably not
> all the time anyway.
>
> Say a CPU was capped at 50% of it's capacity, then the cap is removed.
> At that point it'll take 100ms+ for the thermal signal to decay and let
> the scheduler know about the newly available capacity. That can probably

But the point is that you don't know:
- whether the capping will not happen again soon. If the pressure has
reached 50%, it means that it already happened quite often in the past
100ms.
- whether there is really available capacity, as the current sum of
utilization reflects what was available to the tasks and not what the
tasks really want to use


> be a performance hit in some use cases ... And the other way around, it
> can also take forever for the scheduler to notice that a CPU has a

What do you mean by forever ?

> reduced capacity before reacting to it.
>
> If you want to filter out very short transient capping events to avoid
> over-reacting in the scheduler (is this actually happening ?), then
> maybe the average should be done on the cooling device side or something
> like that ?
>
> Thanks,
> Quentin

2018-10-10 15:38:00

by Lukasz Luba

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi Thara,

I have run it on Exynos5433 mainline.
When it is enabled with the step_wise thermal governor,
some of my tests are showing a ~30-50% regression (e.g. hackbench),
and dhrystone ~10%.

Could you tell me which thermal governor was used in your case?
Please also share the name of that benchmark, I will give it a try.
Is it single-threaded and compute-intensive?

Regards,
Lukasz

On 10/09/2018 06:24 PM, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event for a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in linux kernel, in event of maximum
> frequency capping of a cpu, the maximum available compute
> capacity of the cpu is not adjusted at all. In other words, scheduler
> is unware maximum cpu capacity restrictions placed due to thermal
> activity. This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The delta between the maximum possible capacity of a cpu and
> maximum available capacity of a cpu due to thermal event can
> be considered as thermal pressure. Instantaneous thermal pressure
> is hard to record and can sometime be erroneous as there can be mismatch
> between the actual capping of capacity and scheduler recording it.
> Thus solution is to have a weighted average per cpu value for thermal
> pressure over time. The weight reflects the amount of time the cpu has
> spent at a capped maximum frequency. To accumulate, average and
> appropriately decay thermal pressure, this patch series uses pelt
> signals and reuses the available framework that does a similar
> bookkeeping of rt/dl task utilization.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on hikey960 mainline kernel with debian file system.
> Further aobench (An occlusion renderer for benchmarking realworld
> floating point performance) showed the following results on hikey960
> with debain.
>
> Result Standard Standard
> (Time secs) Error Deviation
> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>
> Thara Gopinath (7):
> sched/pelt: Add option to make load and util calculations frequency
> invariant
> sched/pelt.c: Add support to track thermal pressure
> sched: Add infrastructure to store and update instantaneous thermal
> pressure
> sched: Initialize per cpu thermal pressure structure
> sched/fair: Enable CFS periodic tick to update thermal pressure
> sched/fair: update cpu_capcity to reflect thermal pressure
> thermal/cpu-cooling: Update thermal pressure in case of a maximum
> frequency capping
>
> drivers/base/arch_topology.c | 1 +
> drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
> include/linux/sched.h | 14 +++++++++
> kernel/sched/Makefile | 2 +-
> kernel/sched/core.c | 2 ++
> kernel/sched/fair.c | 4 +++
> kernel/sched/pelt.c | 40 ++++++++++++++++++--------
> kernel/sched/pelt.h | 7 +++++
> kernel/sched/sched.h | 1 +
> kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/thermal.h | 13 +++++++++
> 11 files changed, 157 insertions(+), 13 deletions(-)
> create mode 100644 kernel/sched/thermal.c
> create mode 100644 kernel/sched/thermal.h
>

2018-10-10 15:44:09

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hello Ingo,
Thank you for the review.

On 10/10/2018 02:17 AM, Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity. This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>> Result Standard Standard
>> (Time secs) Error Deviation
>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>
> Wow, +13% speedup, impressive! We definitely want this outcome.
>
> I'm wondering what happens if we do not track and decay the thermal load at all at the PELT
> level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal
> events we receive from the CPU.

The problem with an instantaneous update is that thermal events sometimes
happen at a much faster pace than cpu_capacity is updated in the
scheduler. This means that at the moment the scheduler uses the value, it
might not be correct anymore.

Having said that, today the Android common kernel has a solution which
instantaneously updates cpu_capacity in case of a thermal event.
To give a bit of background on the evolution of the solution I have
proposed, below is a timeline of the analysis I have done.

1. I started this activity by analyzing the existing framework on the
android common kernel. I ran android benchmark tests (Jankbench,
Vellamo, Geekbench) with and without the existing instantaneous update
mechanism. I found that there is no real performance difference to be
observed with an instantaneous update of cpu_capacity, at least in my
test scenarios.
2. Then I developed an algorithm to track, accumulate and decay the
capacity capping, i.e. an algorithm without using the pelt signals (this
was prior to the new pelt framework in mainline). With this, android
benchmarks showed performance improvements. At this point I also ported
the solution to the mainline kernel and ran the aobench analysis, which
again showed a performance improvement.
3. Finally with the new pelt framework in place, I replaced my algorithm
with the one used for rt and dl utilization tracking which is the
current patch series. I have not been able to run tests with this on
Android yet.

All tests were performed on hikey960.
I have a Google spreadsheet documenting results at various stages of the
analysis. I am not sure how to share it with the group here.


>
> You describe the averaging as:
>
>> Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can
>> be mismatch between the actual capping of capacity and scheduler recording it.
>
> Not sure I follow the argument here: are there bogus thermal throttling events? If so then
> they are hopefully not frequent enough and should average out over time even if we follow
> it instantly.

No bogus events. It is more that capping sometimes happens at a much
faster rate than cpu_capacity is updated, and the scheduler loses these
events.
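
A toy illustration of how such events get lost (userspace sketch, with
a simple EWMA standing in for PELT; the numbers are made up):

	#include <stdio.h>

	int main(void)
	{
		double avg = 0.0, alpha = 0.02;	/* crude EWMA, not PELT */
		int t;

		for (t = 0; t < 200; t++) {		/* 1ms steps */
			int capped = (t / 10) % 2;	/* cap toggles every 10ms */
			double pressure = capped ? 512 : 0;

			avg = alpha * pressure + (1.0 - alpha) * avg;
			if (t % 32 == 31)		/* capacity sampled every ~32ms */
				printf("t=%3dms instantaneous=%3.0f averaged=%3.0f\n",
				       t + 1, pressure, avg);
		}
		return 0;
	}

The instantaneous samples read either 0 or 512 depending on when they
land, while the averaged value converges towards roughly half the cap,
reflecting the ~50% of the time actually spent capped.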

>
> I.e. what is 'can sometimes be erroneous', exactly?
>
> Thanks,
>
> Ingo
>


--
Regards
Thara

2018-10-10 16:15:45

by Ionela Voinescu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi guys,

On 10/10/18 14:47, Quentin Perret wrote:
> On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:
>> On Wed, 10 Oct 2018 at 15:05, Quentin Perret <[email protected]> wrote:
>>>
>>> On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
>>>> This patchset doesn't touch cpu_capacity_orig and doesn't need to as
>>>> it assume that the max capacity is unchanged but some capacity is
>>>> momentary stolen by thermal.
>>>> If you want to reflect immediately all thermal capping change, you
>>>> have to update this field and all related fields and struct around
>>>
>>> I don't follow you here. I never said I wanted to change
>>> cpu_capacity_orig. I don't think we should do that actually. Changing
>>> capacity_of (which is updated during LB IIRC) is just fine. The question
>>> is about what you want to do there: reflect an averaged value or the
>>> instantaneous one.
>>
>> Sorry I though your were speaking about updating this cpu_capacity_orig.
>
> N/p, communication via email can easily become confusing :-)
>
>> With using instantaneous max value in capacity_of(), we are back to
>> the problem raised by Thara that the value will most probably not
>> reflect the current capping value when it is used in LB, because LB
>> period can quite long on busy CPU (default max value is 32*sd_weight
>> ms)
>
> But averaging the capping value over time doesn't make LB happen more
> often ... That will help you account for capping that happened in the
> past, but it's not obvious this is actually a good thing. Probably not
> all the time anyway.
>
> Say a CPU was capped at 50% of it's capacity, then the cap is removed.
> At that point it'll take 100ms+ for the thermal signal to decay and let
> the scheduler know about the newly available capacity. That can probably
> be a performance hit in some use cases ... And the other way around, it
> can also take forever for the scheduler to notice that a CPU has a
> reduced capacity before reacting to it.
>
> If you want to filter out very short transient capping events to avoid
> over-reacting in the scheduler (is this actually happening ?), then
> maybe the average should be done on the cooling device side or something
> like that ?
>

I think the issue isn't just the *occasional* overreaction of a thermal
governor, due to noise in the temperature sensors or some spike in
environmental temperature, which then results in a delayed reaction in
the scheduler because of when capacity is updated.

I'm seeing a bigger issue for *sustained* high temperatures that are not
treated effectively by governors. Depending on the platform, heat can
be dissipated over longer or shorter periods of time. This can cause a
seesaw effect on the maximum frequency: the governor determines that the
temperature is over a threshold and starts capping, but the heat is not
dissipated quickly enough for that to be reflected in the value of the
temperature sensor, so it continues to cap; when the temperature gets back
to normal, the capping is lifted, which in turn gives access to higher
OPPs and a return to high temperatures, etc.

What will happen is that, *depending on platform* and the moment when
capacity is updated, you can see either a CPU with a capacity of 1024, or
let's say 800, or (on hikey960 :)) around 500, and back and forth
between them.

Because of this I tend to think that a regulated (averaged) value of
thermal pressure is better than an instantaneous one. Thermal mitigation
measures are there for the well-being and safety of a device, not for
optimization, so they can and should be allowed to overreact or have a
delayed reaction. But ping-ponging tasks back and forth between CPUs
due to changes in CPU capacity is harmful for performance. What would be
awesome to achieve with this is (close to) optimal use of the restricted
capacities of the CPUs, and I tend to believe instantaneous and most
probably out-of-date capacity values would not lead to this.

But this is almost a gut feeling and of course it should be validated on
devices with different thermal characteristics. Given the high variation
between devices with regards to this I'd be reluctant to tie it to the
PELT half life.

Regards,
Ionela.

> Thanks,
> Quentin
>

2018-10-10 16:56:45

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/10/2018 17:35, Lukasz Luba wrote:
> Hi Thara,
>
> I have run it on Exynos5433 mainline.
> When it is enabled with step_wise thermal governor,
> some of my tests are showing ~30-50% regression (i.e. hackbench),
> dhrystone ~10%.
>
> Could you tell me which thermal governor was used in your case?
> Please also share the name of that benchmark, i will give it a try.
> Is it single threaded compute-intensive?

aobench AFAICT

It would be interesting if you can share the thermal profile of your board.


> On 10/09/2018 06:24 PM, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity. This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>> Result Standard Standard
>> (Time secs) Error Deviation
>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>>
>> Thara Gopinath (7):
>> sched/pelt: Add option to make load and util calculations frequency
>> invariant
>> sched/pelt.c: Add support to track thermal pressure
>> sched: Add infrastructure to store and update instantaneous thermal
>> pressure
>> sched: Initialize per cpu thermal pressure structure
>> sched/fair: Enable CFS periodic tick to update thermal pressure
>> sched/fair: update cpu_capcity to reflect thermal pressure
>> thermal/cpu-cooling: Update thermal pressure in case of a maximum
>> frequency capping
>>
>> drivers/base/arch_topology.c | 1 +
>> drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>> include/linux/sched.h | 14 +++++++++
>> kernel/sched/Makefile | 2 +-
>> kernel/sched/core.c | 2 ++
>> kernel/sched/fair.c | 4 +++
>> kernel/sched/pelt.c | 40 ++++++++++++++++++--------
>> kernel/sched/pelt.h | 7 +++++
>> kernel/sched/sched.h | 1 +
>> kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/thermal.h | 13 +++++++++
>> 11 files changed, 157 insertions(+), 13 deletions(-)
>> create mode 100644 kernel/sched/thermal.c
>> create mode 100644 kernel/sched/thermal.h
>>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-10-10 17:05:41

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hello Quentin,
On 10/10/2018 05:55 AM, Quentin Perret wrote:
> Hi Vincent,
>
> On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
>> The problem with reflecting directly the capping is that it happens
>> far more often than the pace at which cpu_capacity_orig is updated in
>> the scheduler.
>
> Hmm, how can you be so sure ? That most likely depends on the workload,
> the platform and the thermal governor. Some platforms heat up slowly,
> some quickly. The pace at which the thermal governor will change things
> should depend on that I assume.

Yes, I agree. How often a thermal event occurs is indeed dependent on
the workload, the platform and the governor. For example, on hikey960 the
same workload displays different behavior with and without a fan. What we
want is a generic solution that will work across "spiky" events as well
as more spread out events. An instantaneous update of cpu_capacity most
certainly does not capture spiky events. It can also lead to cpu_capacity
swinging wildly from the original value to a much lower value and back
again. Averaging the thermal capacity limitation (across any
pre-determined duration, i.e. using PELT signals or not) takes care of
smoothing the effect of thermal capping and makes sure that capping
events are not entirely missed. For me it is a much more elegant solution
than an instantaneous one.
Having said that, like you mentioned below, when the thermal capping goes
away, it is not reflected as an instantaneous availability of spare
capacity; it is slowly increased. But it is not necessarily wrong
information. All it tells the scheduler is that during the last
"pre-determined" duration, "X" was, on average, the capacity available
for a CFS task on this cpu.

>
>> This means that at the moment when scheduler uses the
>> value, it might not be correct anymore.
>
> And OTOH, when you remove a cap for example, it will take time before
> the scheduler can see the newly available capacity if you need to wait
> for the signal to decay. So you are using a wrong information too in
> that scenario.
>
>> Then, this value are also used
>> when building the sched_domain and setting max_cpu_capacity which
>> would also implies the rebuilt the sched_domain topology ...
>
> Wait what ? I thought the thermal cap was reflected in capacity_of, not
> capacity_orig_of ... You need to rebuild the sched_domain in case of
> thermal pressure ?
>
> Hmm, let me have a closer look at the patches, I must have missed
> something ...
>
>> The pace of changing the capping is to fast to reflect that in the
>> whole scheduler topology
>
> That's probably true in some cases, but it'd be cool to have numbers to
> back up that statement, I think.
>
> Now, if you do need to rebuild the sched domain topology every time you
> update the thermal pressure, I think the PELT HL is _way_ too short for
> that ... You can't rebuild the whole thing every 32ms or so. Or am I
> misunderstanding something ?
>
>>> Thara, have you tried to experiment with a simpler implementation as
>>> suggested by Ingo ?
>>>
>>> Also, assuming that we do want to average things, do we actually want to
>>> tie the thermal ramp-up time to the PELT half life ? That provides
>>> nice maths properties wrt the other signals, but it's not obvious to me
>>> that this thermal 'constant' should be the same on all platforms. Or
>>> maybe it should ?
>>
>> The main interest of using PELT signal is that thermal pressure will
>> evolve at the same pace as other signals used in the scheduler.
>
> Right, I think this is a nice property too (assuming that we actually
> want to average things out).
>
>> With
>> thermal pressure, we have the exact same problem as with RT tasks. The
>> thermal will cap the max frequency which will cap the utilization of
>> the tasks running on the CPU
>
> Well, the nature of the signal is slightly different IMO. Yes it's
> capacity, but you're no actually measuring time spent on the CPU. All
> other PELT signals are based on time, this thermal thing isn't, so it is
> kinda different in a way. And I'm still wondering if it could be helpful
> to be able to have a different HL for that thermal signal. That would
> 'break' the nice maths properties we have, yes, but is it a problem or is
> it actually helpful to cope with the thermal characteristics of
> different platforms ?


>
> Thanks,
> Quentin
>


--
Regards
Thara

2018-10-10 17:09:18

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/10/2018 09:34 AM, Juri Lelli wrote:
> On 10/10/18 15:08, Vincent Guittot wrote:
>> On Wed, 10 Oct 2018 at 14:50, Juri Lelli <[email protected]> wrote:
>>>
>>> On 10/10/18 14:34, Vincent Guittot wrote:
>>>> Hi Juri,
>>>>
>>>> On Wed, 10 Oct 2018 at 14:23, Juri Lelli <[email protected]> wrote:
>>>>>
>>>>> On 10/10/18 14:04, Vincent Guittot wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>> The problem was the same with RT, the cfs utilization was lower than
>>>>>> reality because RT steals soem cycle to CFS
>>>>>> So schedutil was selecting a lower frequency when cfs was running
>>>>>> whereas the CPU was fully used.
>>>>>> The same can happen with thermal:
>>>>>> cap the max freq because of thermal
>>>>>> the utilization with decrease.
>>>>>> remove the cap
>>>>>> the utilization is still low and you will select a low OPP because you
>>>>>> don't take into account cycle stolen by thermal like with RT
>>>>>
>>>>> What if we scale frequency component considering the capped temporary
>>>>> max?
>>>>
>>>> Do you mean using a kind of scale_thermal_capacity in accumulate_sum
>>>> when computing utilization ?
>>>
>>> Yeah, something like that I guess. So that we account for temporary
>>> "fake" 1024..
>>
>> But the utilization will not be invariant anymore across the system
>
> Mmm, I guess I might be wrong, but I was thinking we should be able to
> deal with this similarly to what we do with cpus with different max
> capacities. So, another factor? Because then, how do we handle other
> ways in which max freq can be restricted (e.g. from userspace as Javi
> was also mentioning)?

IMHO, user-space restrictions should be handled separately. They should
probably be reflected as an update of capacity_orig and a rebuild of the
scheduler structures, as such a restriction is meant to stay for a long
duration.

Regards
Thara
>


--
Regards
Thara

2018-10-10 17:31:50

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hello Lukasz,

On 10/10/2018 11:35 AM, Lukasz Luba wrote:
> Hi Thara,
>
> I have run it on Exynos5433 mainline.
> When it is enabled with step_wise thermal governor,
> some of my tests are showing ~30-50% regression (i.e. hackbench),
> dhrystone ~10%.

That is interesting. If I understand correctly, dhrystone spawns 16
threads or so and floods the system. In "theory", such a test should not
see any performance improvement or degradation. What is the thermal
activity like on your system? I will try running one of these tests on
hikey960.
>
> Could you tell me which thermal governor was used in your case?
> Please also share the name of that benchmark, i will give it a try.
> Is it single threaded compute-intensive?

Step-wise governor.
I use aobench, which is part of the phoronix-test-suite.

Regards
Thara

>
> Regards,
> Lukasz
>
> On 10/09/2018 06:24 PM, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity. This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>> Result Standard Standard
>> (Time secs) Error Deviation
>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>>
>> Thara Gopinath (7):
>> sched/pelt: Add option to make load and util calculations frequency
>> invariant
>> sched/pelt.c: Add support to track thermal pressure
>> sched: Add infrastructure to store and update instantaneous thermal
>> pressure
>> sched: Initialize per cpu thermal pressure structure
>> sched/fair: Enable CFS periodic tick to update thermal pressure
>> sched/fair: update cpu_capcity to reflect thermal pressure
>> thermal/cpu-cooling: Update thermal pressure in case of a maximum
>> frequency capping
>>
>> drivers/base/arch_topology.c | 1 +
>> drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>> include/linux/sched.h | 14 +++++++++
>> kernel/sched/Makefile | 2 +-
>> kernel/sched/core.c | 2 ++
>> kernel/sched/fair.c | 4 +++
>> kernel/sched/pelt.c | 40 ++++++++++++++++++--------
>> kernel/sched/pelt.h | 7 +++++
>> kernel/sched/sched.h | 1 +
>> kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/thermal.h | 13 +++++++++
>> 11 files changed, 157 insertions(+), 13 deletions(-)
>> create mode 100644 kernel/sched/thermal.c
>> create mode 100644 kernel/sched/thermal.h
>>


--
Regards
Thara

2018-10-11 07:36:12

by Lukasz Luba

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi Daniel,

On 10/10/2018 06:54 PM, Daniel Lezcano wrote:
> On 10/10/2018 17:35, Lukasz Luba wrote:
>> Hi Thara,
>>
>> I have run it on Exynos5433 mainline.
>> When it is enabled with step_wise thermal governor,
>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>> dhrystone ~10%.
>>
>> Could you tell me which thermal governor was used in your case?
>> Please also share the name of that benchmark, i will give it a try.
>> Is it single threaded compute-intensive?
>
> aobench AFAICT
>
> It would be interesting if you can share the thermal profile of your board.
>
Thanks for the benchmark name.
It was tested on a Samsung TM2 device with an Exynos 5433, running Debian.
The thermal description can be found in mainline:
arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi

Regards,
Lukasz

>
>> On 10/09/2018 06:24 PM, Thara Gopinath wrote:
>>> Thermal governors can respond to an overheat event for a cpu by
>>> capping the cpu's maximum possible frequency. This in turn
>>> means that the maximum available compute capacity of the
>>> cpu is restricted. But today in linux kernel, in event of maximum
>>> frequency capping of a cpu, the maximum available compute
>>> capacity of the cpu is not adjusted at all. In other words, scheduler
>>> is unware maximum cpu capacity restrictions placed due to thermal
>>> activity. This patch series attempts to address this issue.
>>> The benefits identified are better task placement among available
>>> cpus in event of overheating which in turn leads to better
>>> performance numbers.
>>>
>>> The delta between the maximum possible capacity of a cpu and
>>> maximum available capacity of a cpu due to thermal event can
>>> be considered as thermal pressure. Instantaneous thermal pressure
>>> is hard to record and can sometime be erroneous as there can be mismatch
>>> between the actual capping of capacity and scheduler recording it.
>>> Thus solution is to have a weighted average per cpu value for thermal
>>> pressure over time. The weight reflects the amount of time the cpu has
>>> spent at a capped maximum frequency. To accumulate, average and
>>> appropriately decay thermal pressure, this patch series uses pelt
>>> signals and reuses the available framework that does a similar
>>> bookkeeping of rt/dl task utilization.
>>>
>>> Regarding testing, basic build, boot and sanity testing have been
>>> performed on hikey960 mainline kernel with debian file system.
>>> Further aobench (An occlusion renderer for benchmarking realworld
>>> floating point performance) showed the following results on hikey960
>>> with debain.
>>>
>>> Result Standard Standard
>>> (Time secs) Error Deviation
>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>>>
>>> Thara Gopinath (7):
>>> sched/pelt: Add option to make load and util calculations frequency
>>> invariant
>>> sched/pelt.c: Add support to track thermal pressure
>>> sched: Add infrastructure to store and update instantaneous thermal
>>> pressure
>>> sched: Initialize per cpu thermal pressure structure
>>> sched/fair: Enable CFS periodic tick to update thermal pressure
>>> sched/fair: update cpu_capcity to reflect thermal pressure
>>> thermal/cpu-cooling: Update thermal pressure in case of a maximum
>>> frequency capping
>>>
>>> drivers/base/arch_topology.c | 1 +
>>> drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>>> include/linux/sched.h | 14 +++++++++
>>> kernel/sched/Makefile | 2 +-
>>> kernel/sched/core.c | 2 ++
>>> kernel/sched/fair.c | 4 +++
>>> kernel/sched/pelt.c | 40 ++++++++++++++++++--------
>>> kernel/sched/pelt.h | 7 +++++
>>> kernel/sched/sched.h | 1 +
>>> kernel/sched/thermal.c | 66 +++++++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/thermal.h | 13 +++++++++
>>> 11 files changed, 157 insertions(+), 13 deletions(-)
>>> create mode 100644 kernel/sched/thermal.c
>>> create mode 100644 kernel/sched/thermal.h
>>>
>
>

2018-10-11 08:28:41

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 11/10/2018 09:35, Lukasz Luba wrote:
> Hi Daniel,
>
> On 10/10/2018 06:54 PM, Daniel Lezcano wrote:
>> On 10/10/2018 17:35, Lukasz Luba wrote:
>>> Hi Thara,
>>>
>>> I have run it on Exynos5433 mainline.
>>> When it is enabled with step_wise thermal governor,
>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>> dhrystone ~10%.
>>>
>>> Could you tell me which thermal governor was used in your case?
>>> Please also share the name of that benchmark, i will give it a try.
>>> Is it single threaded compute-intensive?
>>
>> aobench AFAICT
>>
>> It would be interesting if you can share the thermal profile of your board.
>>
> Thanks for the benchmark name.
> It was tested on Samsung TM2 device with Exynos 5433 with debian.
> Thermal stuff you can find in mainline:
> arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi

By thermal profile, I meant a figure showing the temperature
regulation (idle temperature, then workload, then temperature increase,
mitigated temperature, end of workload, temperature decrease).

The thermal description looks wrong in the DT. I suggest trying the
series with the DT fixed.

e.g. from hi6220.dtsi:


	thermal-zones {

		cls0: cls0 {
			polling-delay = <1000>;
			polling-delay-passive = <100>;
			sustainable-power = <3326>;

			/* sensor ID */
			thermal-sensors = <&tsensor 2>;

			trips {
				threshold: trip-point@0 {
					temperature = <65000>;
					hysteresis = <0>;
					type = "passive";
				};

				target: trip-point@1 {
					temperature = <75000>;
					hysteresis = <0>;
					type = "passive";
				};
			};

			cooling-maps {
				map0 {
					trip = <&target>;
					cooling-device = <&cpu0
						THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
				};
			};
		};
	};

Note the cooling devices are *passive*




2018-10-11 11:36:36

by Lukasz Luba

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure



On 10/10/2018 07:30 PM, Thara Gopinath wrote:
> Hello Lukasz,
>
> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>> Hi Thara,
>>
>> I have run it on Exynos5433 mainline.
>> When it is enabled with step_wise thermal governor,
>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>> dhrystone ~10%.
>
> That is interesting. If I understand correctly, dhrystone spawns 16
> threads or so and floods the system. In "theory", such a test should not
> see any performance improvement and degradation. What is the thermal
> activity like in your system? I will try running one of these tests on
> hikey960.
I use this dhrystone implementation:
https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
It does not spawn new threads/processes and I pinned it to a single cpu.

My thermal setup is probably different from yours.
On hikey960 you probably have one sensor for the whole SoC and one
thermal zone (if it is this mainline file:
arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
That thermal zone has two cooling devices - the two clusters with dvfs.
The temperature signal read from that sensor is probably much smoother.
When the sensor is inside the cluster, the rise rate can be as high as
20 deg/s (for the big cores).
In my case, there are 4 thermal zones; each cluster has its private
sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
which is recommended for IPA.
>>
>> Could you tell me which thermal governor was used in your case?
>> Please also share the name of that benchmark, i will give it a try.
>> Is it single threaded compute-intensive?
>
> Step-wise governor.
> I use aobench which is part of phoronix-test-suite.
>
> Regards
> Thara
>
I have built this aobench and run it pinned to a single big cpu:
time taskset -c 4 ./aobench
The results:
3min-5:30min [mainline]
5:15min-5:50min [+patchset]

The idea is definitely worth investigating further.

Regards,
Lukasz




2018-10-12 09:38:31

by Lukasz Luba

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


On 10/11/2018 10:23 AM, Daniel Lezcano wrote:
> On 11/10/2018 09:35, Lukasz Luba wrote:
>> Hi Daniel,
>>
>> On 10/10/2018 06:54 PM, Daniel Lezcano wrote:
>>> On 10/10/2018 17:35, Lukasz Luba wrote:
>>>> Hi Thara,
>>>>
>>>> I have run it on Exynos5433 mainline.
>>>> When it is enabled with step_wise thermal governor,
>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>> dhrystone ~10%.
>>>>
>>>> Could you tell me which thermal governor was used in your case?
>>>> Please also share the name of that benchmark, i will give it a try.
>>>> Is it single threaded compute-intensive?
>>>
>>> aobench AFAICT
>>>
>>> It would be interesting if you can share the thermal profile of your board.
>>>
>> Thanks for the benchmark name.
>> It was tested on Samsung TM2 device with Exynos 5433 with debian.
>> Thermal stuff you can find in mainline:
>> arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
>
> By thermal profile, I was meaning a figure with the temperature
> regulation (temperature idle, then workload, then temperature increase,
> mitigated temperature, end of workload, temperature decrease).
Currently, I cannot share these data.
>
> The thermal description looks wrong in the DT. I suggest to experiment
> the series with the DT fixed.

Could you say more about what looks wrong, or maybe send a draft/patch?
> eg. from hi6220.dtsi
>
>
> thermal-zones {
>
> cls0: cls0 {
> polling-delay = <1000>;
> polling-delay-passive = <100>;
> sustainable-power = <3326>;
>
> /* sensor ID */
> thermal-sensors = <&tsensor 2>;
>
> trips {
> threshold: trip-point@0 {
> temperature = <65000>;
> hysteresis = <0>;
> type = "passive";
> };
>
> target: trip-point@1 {
> temperature = <75000>;
> hysteresis = <0>;
> type = "passive";
> };
> };
>
> cooling-maps {
> map0 {
> trip = <&target>;
> cooling-device = <&cpu0
> THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> };
> };
> };
> };
>
> Note the cooling devices are *passive*
>
>
To me this DT looks like it was copied from the ARM Juno board for IPA,
where AFAIR there were no interrupts for the temperature sensor
(due to some psci/scpi/firmware limitation).
The cooling map is also short. I am not sure it would work efficiently
with step_wise; with IPA it will work.

The DT configuration for Exynos has been working fine for years.
Exynos5433 supports 8 trip points, Exynos5422 has 4.
Only if you need more trip points than that do you have to start
'polling'. Therefore, for Exynos there is no need to run queued work
in the thermal framework every 1 s or 100 ms just to read the current
temperature. The exynos-tmu driver gets an irq and calls the thermal
framework when such a trip point is crossed. Then there is a
'monitoring window' for the trip point...
Long story short: 'active' is used for that reason.

Regards,
Lukasz


2018-10-16 07:33:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


* Thara Gopinath <[email protected]> wrote:

> >> Regarding testing, basic build, boot and sanity testing have been
> >> performed on hikey960 mainline kernel with debian file system.
> >> Further aobench (An occlusion renderer for benchmarking realworld
> >> floating point performance) showed the following results on hikey960
> >> with debain.
> >>
> >> Result Standard Standard
> >> (Time secs) Error Deviation
> >> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> >> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
> >
> > Wow, +13% speedup, impressive! We definitely want this outcome.
> >
> > I'm wondering what happens if we do not track and decay the thermal
> > load at all at the PELT level, but instantaneously decrease/increase
> > effective CPU capacity in reaction to thermal events we receive from
> > the CPU.
>
> The problem with instantaneous update is that sometimes thermal events
> happen at a much faster pace than cpu_capacity is updated in the
> scheduler. This means that at the moment when scheduler uses the
> value, it might not be correct anymore.

Let me offer a different interpretation: if we average throttling events
then we create a 'smooth' average of 'true CPU capacity' that doesn't
fluctuate much. This allows more stable yet asymmetric task placement if
the thermal characteristics of the different cores are different
(asymmetric). This, compared to instantaneous updates, would reduce
unnecessary task migrations between cores.

Is that accurate?

If the thermal characteristics of the cores are roughly symmetric and the
measured CPU-intense load itself is symmetric as well, then I have
trouble seeing why reacting to thermal events should make any difference
at all.

Are there any inherent asymmetries in the thermal properties of the
cores, or in the benchmarked workload itself?

Thanks,

Ingo

2018-10-16 09:30:02

by Lukasz Luba

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


On 10/16/2018 09:33 AM, Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
>>>> Regarding testing, basic build, boot and sanity testing have been
>>>> performed on hikey960 mainline kernel with debian file system.
>>>> Further aobench (An occlusion renderer for benchmarking realworld
>>>> floating point performance) showed the following results on hikey960
>>>> with debain.
>>>>
>>>> Result Standard Standard
>>>> (Time secs) Error Deviation
>>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>>>
>>> Wow, +13% speedup, impressive! We definitely want this outcome.
>>>
>>> I'm wondering what happens if we do not track and decay the thermal
>>> load at all at the PELT level, but instantaneously decrease/increase
>>> effective CPU capacity in reaction to thermal events we receive from
>>> the CPU.
>>
>> The problem with instantaneous update is that sometimes thermal events
>> happen at a much faster pace than cpu_capacity is updated in the
>> scheduler. This means that at the moment when scheduler uses the
>> value, it might not be correct anymore.
>
> Let me offer a different interpretation: if we average throttling events
> then we create a 'smooth' average of 'true CPU capacity' that doesn't
> fluctuate much. This allows more stable yet asymmetric task placement if
> the thermal characteristics of the different cores is different
> (asymmetric). This, compared to instantaneous updates, would reduce
> unnecessary task migrations between cores.
>
> Is that accurate?
>
> If the thermal characteristics of the cores is roughly symmetric and the
> measured CPU-intense load itself is symmetric as well, then I have
> trouble seeing why reacting to thermal events should make any difference
> at all.
>
> Are there any inherent asymmetries in the thermal properties of the
> cores, or in the benchmarked workload itself?
The aobench that I have built is, at least, a single-threaded app.
If the process migrates to a cluster and core which is on average
faster, then it will gain.
The hikey960 platform has a limited number of OPPs:
big cluster: 2.36, 2.1, 1.8, 1.4, 0.9 [GHz]
little cluster: 1.84, 1.7, 1.4, 1.0, 0.5 [GHz]
Compared to Exynos5433, which has 15 OPPs for the big cluster (one
every 100 MHz), it is harder to pick the right one.
I can imagine that the thermal governor is jumping around 1.8, 1.4, 0.9
for the big cluster. Maybe the little cluster is at a higher OPP and
running there longer would help. The thermal time slots are 100 ms
(based on this DT).

Regarding other asymmetries, different parts of the cluster and core
are utilized depending on the workload and data set.
There might be floating point or vectorized code utilizing the long
pipelines in NEON and also causing fewer cache misses.
That will warm up more than the integer unit, or than a copy using the
load/store unit (which occupies less silicon, and less 'C'
capacitance), at the same frequency.

There are also SoCs which have a single power rail from a DCDC in the
PMIC for both asymmetric clusters. In the SoC, in front of these
clusters, there is an internal LDO which reduces the voltage to the
cluster. In such a system the cpufreq driver chooses the max of the
cluster voltages and sets it in the PMIC, then sets the LDOx voltage
difference for the cluster with the smaller voltage. This causes
further asymmetry, because more current going through the LDO causes
more heat than a direct DCDC voltage (i.e. seen as heat on the big
cluster).

There are also asymmetries from powering down portions of the cache.
I have been developing such a driver: based on memory traffic and the
cache hit/miss ratio, it chooses how much of the cache can be powered
down. I can imagine that some HW does this without the need for SW
assistance.

There are SoCs with DDR modules mounted on top (PoP).
I still have to investigate how the SoC power budget differs in such a
setup (depending on the workload).

There are also UI workloads using the GPU, which can also be utilized
in 'portions' (from 1 to 32 shader cores).

These asymmetries mean that the simple assumption
P_dynamic = C * V^2 * f
is probably not enough.
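
Just to illustrate the model above, here is a standalone sketch with
made-up numbers (they are not taken from any real driver or datasheet):

#include <stdio.h>

/* Sketch of P_dynamic = C * V^2 * f with hypothetical numbers. */
static double p_dyn(double c_farads, double volts, double freq_hz)
{
        return c_farads * volts * volts * freq_hz;
}

int main(void)
{
        /* hypothetical effective capacitances at the same V/f point */
        double c_neon = 950e-12, c_int = 600e-12;   /* farads */
        double v = 1.0, f = 1.8e9;                  /* 1.0 V, 1.8 GHz */

        printf("NEON-heavy: %.2f W, integer-heavy: %.2f W\n",
               p_dyn(c_neon, v, f), p_dyn(c_int, v, f));
        return 0;
}

Two workloads at the same V/f point already differ through the
effective C alone, before any of the LDO or cache effects above are
taken into account.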

I would suggest choosing a platform with more fine-grained OPPs, or
adding more OPPs to hikey960, and repeating the tests.

Regards,
Lukasz Luba

>
> Thanks,
>
> Ingo
>
>

2018-10-16 17:46:51

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi Lukasz,

On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <[email protected]> wrote:
>
>
>
> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
> > Hello Lukasz,
> >
> > On 10/10/2018 11:35 AM, Lukasz Luba wrote:
> >> Hi Thara,
> >>
> >> I have run it on Exynos5433 mainline.
> >> When it is enabled with step_wise thermal governor,
> >> some of my tests are showing ~30-50% regression (i.e. hackbench),
> >> dhrystone ~10%.
> >
> > That is interesting. If I understand correctly, dhrystone spawns 16
> > threads or so and floods the system. In "theory", such a test should not
> > see any performance improvement and degradation. What is the thermal
> > activity like in your system? I will try running one of these tests on
> > hikey960.
> I use this dhrystone implementation:
> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
> It does not span new threads/processes and I pinned it to a single cpu.
>
> My thermal setup is probably different than yours.
> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
> zone (if it is this mainline file:
> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
> This thermal zone has two cooling devices - two clusters with dvfs.
> Your temperature signal read out from that sensor is probably much
> smoother. When you have sensor inside cluster, the rising factor
> can be even 20deg/s (for big cores).
> In my case, there are 4 thermal zones, each cluster has it's private
> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
> which is recommended for IPA.
> >>
> >> Could you tell me which thermal governor was used in your case?
> >> Please also share the name of that benchmark, i will give it a try.
> >> Is it single threaded compute-intensive?
> >
> > Step-wise governor.
> > I use aobench which is part of phoronix-test-suite.
> >
> > Regards
> > Thara
> >
> I have built this aobench and run it pinned to single big cpu:
> time taskset -c 4 ./aobench

Why have you pinned the test only to CPU4?
The goal of thermal pressure is to inform the scheduler of reduced
compute capacity and to help the scheduler take better decisions in
task placement.
So I would not expect a perf impact on your test, as the bench stays
pinned to CPU4.
That being said, you obviously do have a perf impact, as shown below in
your results, and other tasks on the system are not pinned and might
come and disturb your bench.
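
To make the point concrete, here is a toy placement rule (nothing to
do with the real fair.c code; names and numbers are made up) showing
where a thermally capped capacity would enter the decision:

#include <stdio.h>

/* Toy model of a cpu as seen by a placement decision. */
struct toy_cpu {
        int id;
        unsigned long capacity;   /* max capacity minus thermal pressure */
        unsigned long util;       /* current utilization */
};

/* Pick the cpu with the most spare capacity. */
static int pick_cpu(const struct toy_cpu *cpus, int nr)
{
        int i, best = 0;
        long best_spare = (long)cpus[0].capacity - (long)cpus[0].util;

        for (i = 1; i < nr; i++) {
                long spare = (long)cpus[i].capacity - (long)cpus[i].util;

                if (spare > best_spare) {
                        best_spare = spare;
                        best = i;
                }
        }
        return cpus[best].id;
}

int main(void)
{
        /* a throttled big cpu vs. a mostly idle little cpu */
        struct toy_cpu cpus[] = {
                { .id = 4, .capacity = 640, .util = 300 }, /* big, capped */
                { .id = 0, .capacity = 460, .util = 50 },  /* little */
        };

        printf("picked cpu%d\n", pick_cpu(cpus, 2));
        return 0;
}

With the task pinned, none of this can happen, so the averaged signal
has no placement decision left to influence.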

> The results:
> 3min-5:30min [mainline]
> 5:15min-5:50min [+patchset]
>
> The idea is definitely worth to investigate further.

Yes I agree

Vincent
>
> Regards,
> Lukasz
>
>
>

2018-10-17 16:22:17

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/16/2018 03:33 AM, Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
>>>> Regarding testing, basic build, boot and sanity testing have been
>>>> performed on hikey960 mainline kernel with debian file system.
>>>> Further aobench (An occlusion renderer for benchmarking realworld
>>>> floating point performance) showed the following results on hikey960
>>>> with debain.
>>>>
>>>> Result Standard Standard
>>>> (Time secs) Error Deviation
>>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>>>
>>> Wow, +13% speedup, impressive! We definitely want this outcome.
>>>
>>> I'm wondering what happens if we do not track and decay the thermal
>>> load at all at the PELT level, but instantaneously decrease/increase
>>> effective CPU capacity in reaction to thermal events we receive from
>>> the CPU.
>>
>> The problem with instantaneous update is that sometimes thermal events
>> happen at a much faster pace than cpu_capacity is updated in the
>> scheduler. This means that at the moment when scheduler uses the
>> value, it might not be correct anymore.
>
> Let me offer a different interpretation: if we average throttling events
> then we create a 'smooth' average of 'true CPU capacity' that doesn't
> fluctuate much. This allows more stable yet asymmetric task placement if
> the thermal characteristics of the different cores is different
> (asymmetric). This, compared to instantaneous updates, would reduce
> unnecessary task migrations between cores.
>
> Is that accurate?

Yes, I think it is accurate. I will also add that if we don't average
throttling events, we will miss the events that occur in between load
balancing (LB) periods.

>
> If the thermal characteristics of the cores is roughly symmetric and the
> measured CPU-intense load itself is symmetric as well, then I have
> trouble seeing why reacting to thermal events should make any difference
> at all.
In this scenario, I agree that the scheduler's reaction to thermal
events should not make any difference; in fact, we should not observe
any improvement or degradation in performance.

>
> Are there any inherent asymmetries in the thermal properties of the
> cores, or in the benchmarked workload itself?

The benchmarked workload, meaning aobench? I don't think there are any
asymmetries there. On hikey960, there are two clusters with different
frequency domains, so yes, I will say there is asymmetry there. Aside
from that, IMHO, any other tasks running on the system can create an
inherent asymmetry, as cpu utilizations can vary.

Regards
Thara
>
> Thanks,
>
> Ingo
>


--
Regards
Thara

2018-10-17 16:25:06

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/16/2018 01:11 PM, Vincent Guittot wrote:
> Hi Lukasz,
>
> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
>>> Hello Lukasz,
>>>
>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>>>> Hi Thara,
>>>>
>>>> I have run it on Exynos5433 mainline.
>>>> When it is enabled with step_wise thermal governor,
>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>> dhrystone ~10%.
>>>
>>> That is interesting. If I understand correctly, dhrystone spawns 16
>>> threads or so and floods the system. In "theory", such a test should not
>>> see any performance improvement and degradation. What is the thermal
>>> activity like in your system? I will try running one of these tests on
>>> hikey960.
>> I use this dhrystone implementation:
>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
>> It does not span new threads/processes and I pinned it to a single cpu.
>>
>> My thermal setup is probably different than yours.
>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
>> zone (if it is this mainline file:
>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
>> This thermal zone has two cooling devices - two clusters with dvfs.
>> Your temperature signal read out from that sensor is probably much
>> smoother. When you have sensor inside cluster, the rising factor
>> can be even 20deg/s (for big cores).
>> In my case, there are 4 thermal zones, each cluster has it's private
>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
>> which is recommended for IPA.
>>>>
>>>> Could you tell me which thermal governor was used in your case?
>>>> Please also share the name of that benchmark, i will give it a try.
>>>> Is it single threaded compute-intensive?
>>>
>>> Step-wise governor.
>>> I use aobench which is part of phoronix-test-suite.
>>>
>>> Regards
>>> Thara
>>>
>> I have built this aobench and run it pinned to single big cpu:
>> time taskset -c 4 ./aobench
>
> Why have you pinned the test only on CPU4 ?
> Goal of thermal pressure is to inform the scheduler of reduced compute
> capacity and help the scheduler to take better decision in tasks
> placement.

Hi Lukasz,

I agree with Vincent's observation; I had not seen this earlier.
Pinning a task to a cpu will obviously prevent migration. The
performance regression could be due to, as Vincent mentions below,
other tasks in the system. On another note, which cpufreq governor are
you using? Is the core being bumped up to the highest possible OPP
during the test?

Regards
Thara
> So I would not expect perf impact on your test as the bench will stay
> pinned on the cpu4
> That being said you obviously have perf impact as shown below in your results
> And other tasks on the system are not pinned and might come and
> disturb your bench
>
>> The results:
>> 3min-5:30min [mainline]
>> 5:15min-5:50min [+patchset]
>>
>> The idea is definitely worth to investigate further.
>
> Yes I agree
>
> Vincent
>>
>> Regards,
>> Lukasz
>>
>>
>>


--
Regards
Thara

2018-10-18 06:49:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


* Thara Gopinath <[email protected]> wrote:

> On 10/16/2018 03:33 AM, Ingo Molnar wrote:
> >
> > * Thara Gopinath <[email protected]> wrote:
> >
> >>>> Regarding testing, basic build, boot and sanity testing have been
> >>>> performed on hikey960 mainline kernel with debian file system.
> >>>> Further aobench (An occlusion renderer for benchmarking realworld
> >>>> floating point performance) showed the following results on hikey960
> >>>> with debain.
> >>>>
> >>>> Result Standard Standard
> >>>> (Time secs) Error Deviation
> >>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> >>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
> >>>
> >>> Wow, +13% speedup, impressive! We definitely want this outcome.
> >>>
> >>> I'm wondering what happens if we do not track and decay the thermal
> >>> load at all at the PELT level, but instantaneously decrease/increase
> >>> effective CPU capacity in reaction to thermal events we receive from
> >>> the CPU.
> >>
> >> The problem with instantaneous update is that sometimes thermal events
> >> happen at a much faster pace than cpu_capacity is updated in the
> >> scheduler. This means that at the moment when scheduler uses the
> >> value, it might not be correct anymore.
> >
> > Let me offer a different interpretation: if we average throttling events
> > then we create a 'smooth' average of 'true CPU capacity' that doesn't
> > fluctuate much. This allows more stable yet asymmetric task placement if
> > the thermal characteristics of the different cores is different
> > (asymmetric). This, compared to instantaneous updates, would reduce
> > unnecessary task migrations between cores.
> >
> > Is that accurate?
>
> Yes. I think it is accurate. I will also add that if we don't average
> throttling events, we will miss the events that occur in between load
> balancing(LB) period.

Yeah, so I'd definitely suggest not integrating this averaging into
pelt.c in the fashion presented, because:

- This couples your thermal throttling averaging to the PELT decay
half-time AFAICS, which would break the other user every time the
decay is changed/tuned.

- The boolean flag that changes behavior in pelt.c is not particularly
clean either and complicates the code.

- Instead maybe factor out a decaying average library into
kernel/sched/avg.h perhaps (if this truly improves the code), and use
those methods both in pelt.c and any future thermal.c - and maybe
other places where we do decaying averages.

- But simple decaying averages are not that complex either, so I think
your original solution of open coding it is probably fine as well.
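
For example, a minimal open-coded decaying average is only a handful
of lines. A sketch (the fixed-point shift and decay factor below are
placeholders, deliberately not the PELT constants):

#include <stdio.h>

#define AVG_SHIFT     10      /* fixed point: 1024 == 1.0 */
#define DECAY_FACTOR  972     /* ~95% of the old value kept per period */

/* new_avg = avg * d + sample * (1 - d), in fixed point */
static unsigned long decay_avg(unsigned long avg, unsigned long sample)
{
        return (avg * DECAY_FACTOR +
                sample * ((1UL << AVG_SHIFT) - DECAY_FACTOR)) >> AVG_SHIFT;
}

int main(void)
{
        unsigned long capped = 0;
        int i;

        /* ten periods with 256 capacity units capped by thermal */
        for (i = 0; i < 10; i++)
                capped = decay_avg(capped, 256);
        printf("averaged capped capacity: %lu\n", capped);

        /* ten quiet periods: the signal decays back towards zero */
        for (i = 0; i < 10; i++)
                capped = decay_avg(capped, 0);
        printf("after the quiet periods:  %lu\n", capped);
        return 0;
}

Whether something like that earns a shared kernel/sched/avg.h or stays
open coded next to its only user is mostly a matter of taste.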

Furthermore, any logic introduced by thermal.c and the resulting change
to load-balancing behavior would have to be in perfect sync with cpufreq
governor actions - one mechanism should not work against the other.

The only long term maintainable solution is to move all high level
cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
which has been done to a fair degree already in the past ~2 years - but
it's unclear to me to what extent this is true for thermal throttling
policy currently: there might be more governor surgery and code
reshuffling required?

The short term goal would be to at minimum have all the bugs lined up in
kernel/sched/* neatly, so that we have the chance to see and fix them in
a single place. ;-)

Thanks,

Ingo

2018-10-18 07:09:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Thu, Oct 18, 2018 at 8:48 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Thara Gopinath <[email protected]> wrote:
>
> > On 10/16/2018 03:33 AM, Ingo Molnar wrote:
> > >
> > > * Thara Gopinath <[email protected]> wrote:
> > >
> > >>>> Regarding testing, basic build, boot and sanity testing have been
> > >>>> performed on hikey960 mainline kernel with debian file system.
> > >>>> Further aobench (An occlusion renderer for benchmarking realworld
> > >>>> floating point performance) showed the following results on hikey960
> > >>>> with debain.
> > >>>>
> > >>>> Result Standard Standard
> > >>>> (Time secs) Error Deviation
> > >>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
> > >>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
> > >>>
> > >>> Wow, +13% speedup, impressive! We definitely want this outcome.
> > >>>
> > >>> I'm wondering what happens if we do not track and decay the thermal
> > >>> load at all at the PELT level, but instantaneously decrease/increase
> > >>> effective CPU capacity in reaction to thermal events we receive from
> > >>> the CPU.
> > >>
> > >> The problem with instantaneous update is that sometimes thermal events
> > >> happen at a much faster pace than cpu_capacity is updated in the
> > >> scheduler. This means that at the moment when scheduler uses the
> > >> value, it might not be correct anymore.
> > >
> > > Let me offer a different interpretation: if we average throttling events
> > > then we create a 'smooth' average of 'true CPU capacity' that doesn't
> > > fluctuate much. This allows more stable yet asymmetric task placement if
> > > the thermal characteristics of the different cores is different
> > > (asymmetric). This, compared to instantaneous updates, would reduce
> > > unnecessary task migrations between cores.
> > >
> > > Is that accurate?
> >
> > Yes. I think it is accurate. I will also add that if we don't average
> > throttling events, we will miss the events that occur in between load
> > balancing(LB) period.
>
> Yeah, so I'd definitely suggest to not integrate this averaging into
> pelt.c in the fashion presented, because:
>
> - This couples your thermal throttling averaging to the PELT decay
> half-time AFAICS, which would break the other user every time the
> decay is changed/tuned.
>
> - The boolean flag that changes behavior in pelt.c is not particularly
> clean either and complicates the code.
>
> - Instead maybe factor out a decaying average library into
> kernel/sched/avg.h perhaps (if this truly improves the code), and use
> those methods both in pelt.c and any future thermal.c - and maybe
> other places where we do decaying averages.
>
> - But simple decaying averages are not that complex either, so I think
> your original solution of open coding it is probably fine as well.
>
> Furthermore, any logic introduced by thermal.c and the resulting change
> to load-balancing behavior would have to be in perfect sync with cpufreq
> governor actions - one mechanism should not work against the other.

Right, that really is required.

> The only long term maintainable solution is to move all high level
> cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> which has been done to a fair degree already in the past ~2 years - but
> it's unclear to me to what extent this is true for thermal throttling
> policy currently: there might be more governor surgery and code
> reshuffling required?

It doesn't cover thermal management directly ATM.

The EAS work kind of hopes to make a connection in there by adding a
common energy model to underlie both the performance scaling and
thermal management, but it doesn't change the thermal decision making
part AFAICS.

So it is fair to say that additional governor surgery and code
reshuffling will be required IMO.

Thanks,
Rafael

2018-10-18 07:52:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


* Rafael J. Wysocki <[email protected]> wrote:

> > The only long term maintainable solution is to move all high level
> > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> > which has been done to a fair degree already in the past ~2 years - but
> > it's unclear to me to what extent this is true for thermal throttling
> > policy currently: there might be more governor surgery and code
> > reshuffling required?
>
> It doesn't cover thermal management directly ATM.
>
> The EAS work kind of hopes to make a connection in there by adding a
> common energy model to underlie both the performance scaling and
> thermal management, but it doesn't change the thermal decision making
> part AFAICS.
>
> So it is fair to say that additional governor surgery and code
> reshuffling will be required IMO.

BTW., when factoring out high level thermal management code it might make
sense to increase the prominence of the cpufreq code within the scheduler
and organize it a bit better, by introducing its own
kernel/sched/cpufreq/ directory and renaming things the following way:

kernel/sched/cpufreq.c => kernel/sched/cpufreq/core.c
kernel/sched/cpufreq_schedutil.c => kernel/sched/cpufreq/metrics.c
kernel/sched/thermal.c => kernel/sched/cpufreq/thermal.c

... or so?

With no change to functionality, this is just a re-organization and
expansion/preparation for the bright future. B-)

Thanks,

Ingo

2018-10-18 08:03:20

by Lukasz Luba

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


On 10/17/2018 06:24 PM, Thara Gopinath wrote:
> On 10/16/2018 01:11 PM, Vincent Guittot wrote:
>> Hi Lukasz,
>>
>> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <[email protected]> wrote:
>>>
>>>
>>>
>>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
>>>> Hello Lukasz,
>>>>
>>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>>>>> Hi Thara,
>>>>>
>>>>> I have run it on Exynos5433 mainline.
>>>>> When it is enabled with step_wise thermal governor,
>>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>>> dhrystone ~10%.
>>>>
>>>> That is interesting. If I understand correctly, dhrystone spawns 16
>>>> threads or so and floods the system. In "theory", such a test should not
>>>> see any performance improvement and degradation. What is the thermal
>>>> activity like in your system? I will try running one of these tests on
>>>> hikey960.
>>> I use this dhrystone implementation:
>>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
>>> It does not span new threads/processes and I pinned it to a single cpu.
>>>
>>> My thermal setup is probably different than yours.
>>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
>>> zone (if it is this mainline file:
>>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
>>> This thermal zone has two cooling devices - two clusters with dvfs.
>>> Your temperature signal read out from that sensor is probably much
>>> smoother. When you have sensor inside cluster, the rising factor
>>> can be even 20deg/s (for big cores).
>>> In my case, there are 4 thermal zones, each cluster has it's private
>>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
>>> which is recommended for IPA.
>>>>>
>>>>> Could you tell me which thermal governor was used in your case?
>>>>> Please also share the name of that benchmark, i will give it a try.
>>>>> Is it single threaded compute-intensive?
>>>>
>>>> Step-wise governor.
>>>> I use aobench which is part of phoronix-test-suite.
>>>>
>>>> Regards
>>>> Thara
>>>>
>>> I have built this aobench and run it pinned to single big cpu:
>>> time taskset -c 4 ./aobench
>>
>> Why have you pinned the test only on CPU4 ?
>> Goal of thermal pressure is to inform the scheduler of reduced compute
>> capacity and help the scheduler to take better decision in tasks
>> placement.
>
> Hi Lukasz,
>
> I agree with Vincent's observation. I had not seen this earlier. Pinning
> a task to a cpu will obviously prevent migration. The performance
> regression could be due to as Vincent mentioned below other tasks in the
> system. On another note, which cpufreq governor are you using? Is the
> core being bumped up to highest possible OPP during the test?
The governor is ondemand. No, it is not at the highest OPP.
Could you send me the needed test configuration and conditions?
We could then align in this area.

Regards,
Lukasz
>
> Regards
> Thara
>> So I would not expect perf impact on your test as the bench will stay
>> pinned on the cpu4
>> That being said you obviously have perf impact as shown below in your results
>> And other tasks on the system are not pinned and might come and
>> disturb your bench
>>
>>> The results:
>>> 3min-5:30min [mainline]
>>> 5:15min-5:50min [+patchset]
>>>
>>> The idea is definitely worth to investigate further.
>>
>> Yes I agree
>>
>> Vincent
>>>
>>> Regards,
>>> Lukasz
>>>
>>>
>>>
>
>

2018-10-18 08:13:20

by Lukasz Luba

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi Vincent,

On 10/16/2018 07:11 PM, Vincent Guittot wrote:
> Hi Lukasz,
>
> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
>>> Hello Lukasz,
>>>
>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>>>> Hi Thara,
>>>>
>>>> I have run it on Exynos5433 mainline.
>>>> When it is enabled with step_wise thermal governor,
>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>> dhrystone ~10%.
>>>
>>> That is interesting. If I understand correctly, dhrystone spawns 16
>>> threads or so and floods the system. In "theory", such a test should not
>>> see any performance improvement and degradation. What is the thermal
>>> activity like in your system? I will try running one of these tests on
>>> hikey960.
>> I use this dhrystone implementation:
>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
>> It does not span new threads/processes and I pinned it to a single cpu.
>>
>> My thermal setup is probably different than yours.
>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
>> zone (if it is this mainline file:
>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
>> This thermal zone has two cooling devices - two clusters with dvfs.
>> Your temperature signal read out from that sensor is probably much
>> smoother. When you have sensor inside cluster, the rising factor
>> can be even 20deg/s (for big cores).
>> In my case, there are 4 thermal zones, each cluster has it's private
>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
>> which is recommended for IPA.
>>>>
>>>> Could you tell me which thermal governor was used in your case?
>>>> Please also share the name of that benchmark, i will give it a try.
>>>> Is it single threaded compute-intensive?
>>>
>>> Step-wise governor.
>>> I use aobench which is part of phoronix-test-suite.
>>>
>>> Regards
>>> Thara
>>>
>> I have built this aobench and run it pinned to single big cpu:
>> time taskset -c 4 ./aobench
>
> Why have you pinned the test only on CPU4 ?
> Goal of thermal pressure is to inform the scheduler of reduced compute
> capacity and help the scheduler to take better decision in tasks
> placement.
> So I would not expect perf impact on your test as the bench will stay
> pinned on the cpu4
> That being said you obviously have perf impact as shown below in your results
> And other tasks on the system are not pinned and might come and
> disturb your bench
Unpinned runs on this platform show a lot of variation. The tests can
take as long as 16 min to finish.

Regards,
Lukasz
>
>> The results:
>> 3min-5:30min [mainline]
>> 5:15min-5:50min [+patchset]
>>
>> The idea is definitely worth to investigate further.
>
> Yes I agree
>
> Vincent
>>
>> Regards,
>> Lukasz
>>
>>
>>
>
>

2018-10-18 08:17:10

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On Thu, Oct 18, 2018 at 9:50 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > > The only long term maintainable solution is to move all high level
> > > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> > > which has been done to a fair degree already in the past ~2 years - but
> > > it's unclear to me to what extent this is true for thermal throttling
> > > policy currently: there might be more governor surgery and code
> > > reshuffling required?
> >
> > It doesn't cover thermal management directly ATM.
> >
> > The EAS work kind of hopes to make a connection in there by adding a
> > common energy model to underlie both the performance scaling and
> > thermal management, but it doesn't change the thermal decision making
> > part AFAICS.
> >
> > So it is fair to say that additional governor surgery and code
> > reshuffling will be required IMO.
>
> BTW., when factoring out high level thermal management code it might make
> sense to increase the prominence of the cpufreq code within the scheduler
> and organize it a bit better, by introducing its own
> kernel/sched/cpufreq/ directory and renaming things the following way:
>
> kernel/sched/cpufreq.c => kernel/sched/cpufreq/core.c
> kernel/sched/cpufreq_schedutil.c => kernel/sched/cpufreq/metrics.c
> kernel/sched/thermal.c => kernel/sched/cpufreq/thermal.c
>
> ... or so?
>
> With no change to functionality, this is just a re-organization and
> expansion/preparation for the bright future. =B-)

No disagreement here. :-)

Cheers,
Rafael

2018-10-18 09:37:28

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

In the recent discussion about making the scheduler aware of the
capacity capping from the thermal framework, it was suggested to set
the scene for the PM components in the scheduler code organization.

Move the cpufreq files into their own directory, as suggested at:

https://lkml.org/lkml/2018/10/18/353
https://lkml.org/lkml/2018/10/18/408

Suggested-by: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
kernel/sched/Makefile | 3 +--
kernel/sched/{cpufreq.c => cpufreq/core.c} | 2 +-
kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} | 2 +-
3 files changed, 3 insertions(+), 4 deletions(-)
rename kernel/sched/{cpufreq.c => cpufreq/core.c} (99%)
rename kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} (99%)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe1834..bc6bce0 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,13 +19,12 @@ endif
obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o
+obj-y += cpufreq/

obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
-obj-$(CONFIG_CPU_FREQ) += cpufreq.o
-obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq/core.c
similarity index 99%
rename from kernel/sched/cpufreq.c
rename to kernel/sched/cpufreq/core.c
index 5e54cbc..8c17a63 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq/core.c
@@ -8,7 +8,7 @@
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
-#include "sched.h"
+#include "../sched.h"

DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq/metrics.c
similarity index 99%
rename from kernel/sched/cpufreq_schedutil.c
rename to kernel/sched/cpufreq/metrics.c
index 3fffad3..597df47 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -11,7 +11,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include "sched.h"
+#include "../sched.h"

#include <trace/events/power.h>

--
2.7.4


2018-10-18 09:38:51

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 2/2] sched/cpufreq: Add the SPDX tags

The SPDX tags are not present in core.c and metrics.c.

Replace the license description by the SPDX tags.

Signed-off-by: Daniel Lezcano <[email protected]>
---
kernel/sched/cpufreq/core.c | 4 +---
kernel/sched/cpufreq/metrics.c | 4 +---
2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq/core.c b/kernel/sched/cpufreq/core.c
index 8c17a63..7a1400c 100644
--- a/kernel/sched/cpufreq/core.c
+++ b/kernel/sched/cpufreq/core.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Scheduler code and data structures related to cpufreq.
*
* Copyright (C) 2016, Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
*/
#include "../sched.h"

diff --git a/kernel/sched/cpufreq/metrics.c b/kernel/sched/cpufreq/metrics.c
index 597df47..9083a70 100644
--- a/kernel/sched/cpufreq/metrics.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* CPUFreq governor based on scheduler-provided CPU utilization data.
*
* Copyright (C) 2016, Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
*/

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
--
2.7.4


2018-10-18 09:45:37

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
<[email protected]> wrote:
>
> It was suggested to set the scene for the PM components in the
> scheduler code organization in the recent discussion about making the
> scheduler aware of the capacity capping from the thermal framework.
>
> Move the cpufreq files into its own directory as suggested at:
>
> https://lkml.org/lkml/2018/10/18/353
> https://lkml.org/lkml/2018/10/18/408

Fair enough, but do we need to do that right now?

2018-10-18 09:45:51

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

On 18/10/2018 11:35, Daniel Lezcano wrote:
> It was suggested to set the scene for the PM components in the
> scheduler code organization in the recent discussion about making the
> scheduler aware of the capacity capping from the thermal framework.
>
> Move the cpufreq files into its own directory as suggested at:
>
> https://lkml.org/lkml/2018/10/18/353
> https://lkml.org/lkml/2018/10/18/408
>
> Suggested-by: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> kernel/sched/Makefile | 3 +--
> kernel/sched/{cpufreq.c => cpufreq/core.c} | 2 +-
> kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} | 2 +-


I forgot to git add the cpufreq/Makefile, resending.





2018-10-18 09:46:07

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH V2 1/2] sched/cpufreq: Reorganize the cpufreq files

In the recent discussion about making the scheduler aware of the
capacity capping from the thermal framework, it was suggested to set
the scene for the PM components in the scheduler code organization.

Move the cpufreq files into their own directory, as suggested at:

https://lkml.org/lkml/2018/10/18/353
https://lkml.org/lkml/2018/10/18/408

Suggested-by: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
Changelog:
* git added the Makefile in cpufreq/Makefile (V2)

Signed-off-by: Daniel Lezcano <[email protected]>
---
kernel/sched/Makefile | 3 +--
kernel/sched/cpufreq/Makefile | 3 +++
kernel/sched/{cpufreq.c => cpufreq/core.c} | 2 +-
kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} | 2 +-
4 files changed, 6 insertions(+), 4 deletions(-)
create mode 100644 kernel/sched/cpufreq/Makefile
rename kernel/sched/{cpufreq.c => cpufreq/core.c} (99%)
rename kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} (99%)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe1834..bc6bce0 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,13 +19,12 @@ endif
obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o
+obj-y += cpufreq/

obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
-obj-$(CONFIG_CPU_FREQ) += cpufreq.o
-obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
diff --git a/kernel/sched/cpufreq/Makefile b/kernel/sched/cpufreq/Makefile
new file mode 100644
index 0000000..4bf1087
--- /dev/null
+++ b/kernel/sched/cpufreq/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_CPU_FREQ) += core.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += metrics.o
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq/core.c
similarity index 99%
rename from kernel/sched/cpufreq.c
rename to kernel/sched/cpufreq/core.c
index 5e54cbc..8c17a63 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq/core.c
@@ -8,7 +8,7 @@
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
-#include "sched.h"
+#include "../sched.h"

DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq/metrics.c
similarity index 99%
rename from kernel/sched/cpufreq_schedutil.c
rename to kernel/sched/cpufreq/metrics.c
index 3fffad3..597df47 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -11,7 +11,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

-#include "sched.h"
+#include "../sched.h"

#include <trace/events/power.h>

--
2.7.4


2018-10-18 09:46:42

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH V2 2/2] sched/cpufreq: Add the SPDX tags

The SPDX tags are not present in core.c and metrics.c.

Replace the license description by the SPDX tags.

Signed-off-by: Daniel Lezcano <[email protected]>
---
kernel/sched/cpufreq/core.c | 4 +---
kernel/sched/cpufreq/metrics.c | 4 +---
2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq/core.c b/kernel/sched/cpufreq/core.c
index 8c17a63..7a1400c 100644
--- a/kernel/sched/cpufreq/core.c
+++ b/kernel/sched/cpufreq/core.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Scheduler code and data structures related to cpufreq.
*
* Copyright (C) 2016, Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
*/
#include "../sched.h"

diff --git a/kernel/sched/cpufreq/metrics.c b/kernel/sched/cpufreq/metrics.c
index 597df47..9083a70 100644
--- a/kernel/sched/cpufreq/metrics.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* CPUFreq governor based on scheduler-provided CPU utilization data.
*
* Copyright (C) 2016, Intel Corporation
* Author: Rafael J. Wysocki <[email protected]>
*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
*/

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
--
2.7.4


2018-10-18 09:56:14

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

On 18/10/2018 11:42, Rafael J. Wysocki wrote:
> On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
> <[email protected]> wrote:
>>
>> It was suggested to set the scene for the PM components in the
>> scheduler code organization in the recent discussion about making the
>> scheduler aware of the capacity capping from the thermal framework.
>>
>> Move the cpufreq files into its own directory as suggested at:
>>
>> https://lkml.org/lkml/2018/10/18/353
>> https://lkml.org/lkml/2018/10/18/408
>
> Fair enough, but do we need to do that right now?

I'm not planning to do more code reorg than this patch right now. Up to
you to decide if you are willing to take them.




2018-10-18 10:15:50

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

On 18/10/2018 12:06, Rafael J. Wysocki wrote:
> On Thu, Oct 18, 2018 at 11:54 AM Daniel Lezcano
> <[email protected]> wrote:
>>
>> On 18/10/2018 11:42, Rafael J. Wysocki wrote:
>>> On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
>>> <[email protected]> wrote:
>>>>
>>>> It was suggested to set the scene for the PM components in the
>>>> scheduler code organization in the recent discussion about making the
>>>> scheduler aware of the capacity capping from the thermal framework.
>>>>
>>>> Move the cpufreq files into its own directory as suggested at:
>>>>
>>>> https://lkml.org/lkml/2018/10/18/353
>>>> https://lkml.org/lkml/2018/10/18/408
>>>
>>> Fair enough, but do we need to do that right now?
>>
>> I'm not planning to do more code reorg than this patch right now. Up to
>> you to decide if you are willing to take them.
>
> The SPDX one certainly is applicable, but I'm not sure about the other one TBH.
>
> Why don't you add the SPDX IDs to those files as they are for now?

Yes, sure.




2018-10-18 10:20:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

On Thu, Oct 18, 2018 at 11:54 AM Daniel Lezcano
<[email protected]> wrote:
>
> On 18/10/2018 11:42, Rafael J. Wysocki wrote:
> > On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
> > <[email protected]> wrote:
> >>
> >> It was suggested to set the scene for the PM components in the
> >> scheduler code organization in the recent discussion about making the
> >> scheduler aware of the capacity capping from the thermal framework.
> >>
> >> Move the cpufreq files into its own directory as suggested at:
> >>
> >> https://lkml.org/lkml/2018/10/18/353
> >> https://lkml.org/lkml/2018/10/18/408
> >
> > Fair enough, but do we need to do that right now?
>
> I'm not planning to do more code reorg than this patch right now. Up to
> you to decide if you are willing to take them.

The SPDX one certainly is applicable, but I'm not sure about the other one TBH.

Why don't you add the SPDX IDs to those files as they are for now?

2018-10-18 16:18:41

by Thara Gopinath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

On 10/18/2018 02:48 AM, Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:
>
>> On 10/16/2018 03:33 AM, Ingo Molnar wrote:
>>>
>>> * Thara Gopinath <[email protected]> wrote:
>>>
>>>>>> Regarding testing, basic build, boot and sanity testing have been
>>>>>> performed on hikey960 mainline kernel with debian file system.
>>>>>> Further aobench (An occlusion renderer for benchmarking realworld
>>>>>> floating point performance) showed the following results on hikey960
>>>>>> with debain.
>>>>>>
>>>>>> Result Standard Standard
>>>>>> (Time secs) Error Deviation
>>>>>> Hikey 960 - no thermal pressure applied 138.67 6.52 11.52%
>>>>>> Hikey 960 - thermal pressure applied 122.37 5.78 11.57%
>>>>>
>>>>> Wow, +13% speedup, impressive! We definitely want this outcome.
>>>>>
>>>>> I'm wondering what happens if we do not track and decay the thermal
>>>>> load at all at the PELT level, but instantaneously decrease/increase
>>>>> effective CPU capacity in reaction to thermal events we receive from
>>>>> the CPU.
>>>>
>>>> The problem with instantaneous update is that sometimes thermal events
>>>> happen at a much faster pace than cpu_capacity is updated in the
>>>> scheduler. This means that at the moment when scheduler uses the
>>>> value, it might not be correct anymore.
>>>
>>> Let me offer a different interpretation: if we average throttling events
>>> then we create a 'smooth' average of 'true CPU capacity' that doesn't
>>> fluctuate much. This allows more stable yet asymmetric task placement if
>>> the thermal characteristics of the different cores is different
>>> (asymmetric). This, compared to instantaneous updates, would reduce
>>> unnecessary task migrations between cores.
>>>
>>> Is that accurate?
>>
>> Yes. I think it is accurate. I will also add that if we don't average
>> throttling events, we will miss the events that occur in between load
>> balancing(LB) period.
>
> Yeah, so I'd definitely suggest to not integrate this averaging into
> pelt.c in the fashion presented, because:
>
> - This couples your thermal throttling averaging to the PELT decay
> half-time AFAICS, which would break the other user every time the
> decay is changed/tuned.
Let me pose the question in this manner. Today rt utilization, dl
utilization etc. are tracked via PELT. The inherent idea is that a cpu
has some of its capacity stolen by, say, an rt task, so the capacity
utilized by the rt task is subtracted from the cpu when calculating the
remaining capacity for a CFS task. Now, the idea behind thermal pressure
is that the maximum available capacity of a cpu is limited due to a
thermal event, so take it out of the remaining capacity of the cpu for a
CFS task (at least to start with). If utilization for rt tasks, dl tasks
etc. is calculated via PELT and the capacity constraint due to a thermal
event is calculated by another averaging algorithm, there can be some
mismatch in the "capacity stolen" calculations, right? Isn't it better
to track all the events that can limit the capacity of a cpu via one
algorithm?
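
To make the "capacity stolen" arithmetic above concrete, here is a
minimal standalone C sketch. It is illustrative only - the flat
subtraction and the helper name are assumptions, not the code from this
patch set:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024UL

/*
 * Thermal pressure treated as one more "stolen capacity" term next to
 * the rt/dl utilization averages. Illustrative sketch, not kernel code.
 */
static unsigned long remaining_cfs_capacity(unsigned long max,
                                            unsigned long rt_avg,
                                            unsigned long dl_avg,
                                            unsigned long thermal_avg)
{
        unsigned long stolen = rt_avg + dl_avg + thermal_avg;

        /* never report less than a token capacity of 1 */
        return stolen >= max ? 1 : max - stolen;
}

int main(void)
{
        /* cpu capped to ~75% of fmax by a thermal event, light rt load */
        printf("%lu\n",
               remaining_cfs_capacity(SCHED_CAPACITY_SCALE, 80, 0, 256));
        return 0;
}

If the rt/dl terms come from PELT while the thermal term comes from a
differently tuned average, the terms in that sum decay at different
speeds, which is exactly the kind of mismatch described above.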

>
> - The boolean flag that changes behavior in pelt.c is not particularly
> clean either and complicates the code.

I agree. Part of the idea behind this RFC patch set was to brainstorm if
there is a better approach to this.
>
> - Instead maybe factor out a decaying average library into
> kernel/sched/avg.h perhaps (if this truly improves the code), and use
> those methods both in pelt.c and any future thermal.c - and maybe
> other places where we do decaying averages.
>
> - But simple decaying averages are not that complex either, so I think
> your original solution of open coding it is probably fine as well.
>
> Furthermore, any logic introduced by thermal.c and the resulting change
> to load-balancing behavior would have to be in perfect sync with cpufreq
> governor actions - one mechanism should not work against the other.
I agree. I will go one step further and argue that the changes here make
the most sense with the schedutil governor, as it picks up the signals
directly from the scheduler to choose an OPP. Any other governor will be
less effective.


>
> The only long term maintainable solution is to move all high level
> cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> which has been done to a fair degree already in the past ~2 years - but
> it's unclear to me to what extent this is true for thermal throttling
> policy currently: there might be more governor surgery and code
> reshuffling required?
>
> The short term goal would be to at minimum have all the bugs lined up in
> kernel/sched/* neatly, so that we have the chance to see and fix them in
> a single place. ;-)

I see that Daniel has already sent some patches for this!
Regards
Thara
>
> Thanks,
>
> Ingo
>


--
Regards
Thara

2018-10-19 05:25:40

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

Hi Daniel,

I love your patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.19-rc8 next-20181018]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Daniel-Lezcano/sched-cpufreq-Reorganize-the-cpufreq-files/20181019-103851
config: x86_64-randconfig-s1-10191209 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All errors (new ones prefixed by >>):

>> scripts/Makefile.build:45: kernel/sched/cpufreq/Makefile: No such file or directory
>> make[4]: *** No rule to make target 'kernel/sched/cpufreq/Makefile'.
make[4]: Failed to remake makefile 'kernel/sched/cpufreq/Makefile'.

vim +45 scripts/Makefile.build

0c53c8e6e Sam Ravnborg 2007-10-14 41
2a6914703 Sam Ravnborg 2005-07-25 42 # The filename Kbuild has precedence over Makefile
db8c1a7b2 Sam Ravnborg 2005-07-27 43 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src))
0c53c8e6e Sam Ravnborg 2007-10-14 44 kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
0c53c8e6e Sam Ravnborg 2007-10-14 @45 include $(kbuild-file)
^1da177e4 Linus Torvalds 2005-04-16 46

:::::: The code at line 45 was first introduced by commit
:::::: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of CFLAGS

:::::: TO: Sam Ravnborg <sam@neptun.(none)>
:::::: CC: Sam Ravnborg <sam@neptun.(none)>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-10-19 05:53:52

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

Hi Daniel,

I love your patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.19-rc8 next-20181018]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Daniel-Lezcano/sched-cpufreq-Reorganize-the-cpufreq-files/20181019-103851
config: x86_64-randconfig-s2-10191209 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All errors (new ones prefixed by >>):

>> scripts/Makefile.modbuiltin:26: kernel/sched/cpufreq/Makefile: No such file or directory
make[4]: *** No rule to make target 'kernel/sched/cpufreq/Makefile'.
make[4]: Failed to remake makefile 'kernel/sched/cpufreq/Makefile'.

vim +26 scripts/Makefile.modbuiltin

607b30fc Michal Marek 2010-06-10 22
bc081dd6 Michal Marek 2009-12-07 23 # The filename Kbuild has precedence over Makefile
bc081dd6 Michal Marek 2009-12-07 24 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src))
bc081dd6 Michal Marek 2009-12-07 25 kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
bc081dd6 Michal Marek 2009-12-07 @26 include $(kbuild-file)
bc081dd6 Michal Marek 2009-12-07 27

:::::: The code at line 26 was first introduced by commit
:::::: bc081dd6e9f622c73334dc465359168543ccaabf kbuild: generate modules.builtin

:::::: TO: Michal Marek <[email protected]>
:::::: CC: Michal Marek <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-10-19 08:02:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure


* Thara Gopinath <[email protected]> wrote:

> > Yeah, so I'd definitely suggest to not integrate this averaging into
> > pelt.c in the fashion presented, because:
> >
> > - This couples your thermal throttling averaging to the PELT decay
> > half-time AFAICS, which would break the other user every time the
> > decay is changed/tuned.
>
> Let me pose the question in this manner. Today rt utilization, dl
> utilization etc. are tracked via PELT. The inherent idea is that a cpu
> has some of its capacity stolen by, say, an rt task, so the capacity
> utilized by the rt task is subtracted from the cpu when calculating
> the remaining capacity for a CFS task. Now, the idea behind thermal
> pressure is that the maximum available capacity of a cpu is limited
> due to a thermal event, so take it out of the remaining capacity of
> the cpu for a CFS task (at least to start with). If utilization for
> rt tasks, dl tasks etc. is calculated via PELT and the capacity
> constraint due to a thermal event is calculated by another averaging
> algorithm, there can be some mismatch in the "capacity stolen"
> calculations, right? Isn't it better to track all the events that can
> limit the capacity of a cpu via one algorithm?

So what unifies RT and DL utilization is that those are all direct task
loads independent of external factors.

Thermal load is more of a complex physical property of the combination of
various internal and external factors: the whole system workload running
(not just that single task), the thermal topology of the hardware,
external temperatures, the hardware's and the governor's policy regarding
thermal loads, etc. etc.

So while all of these will obviously be subtracted from the maximum
capacity of the CPU when its effective capacity is calculated, I think
the thermal load metric and the averaging itself are probably
dissimilar enough not to be calculated via the PELT half-life, for
example.

For example, a reasonable future property would be to match the speed
of decay in the averaging to the observed speed of decay via
temperature sensors. Most temperature sensors do a certain amount of
averaging themselves as well - and some platforms might not expose
temperatures at all, only 'got thermally throttled' / 'running at full
speed' kind of feedback.

Anyway, this doesn't really impact the concept, it's an implementational
detail, and much of this could be resolved if the averaging code in
pelt.c was librarized a bit - and that's really what you did there in a
fashion, I just think it should probably be abstracted out more clearly.
(I have no clear implementational suggestions right now, other than 'try
and see how it works out - it might be a bad idea'.)

Thanks,

Ingo
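
As an aside, the librarized decaying average suggested above could be
as small as the following standalone sketch - the fixed-point format,
the struct name and the numbers are all assumptions, not a proposal for
an actual kernel/sched/avg.h interface:

#include <stdio.h>

#define FP_ONE  1024UL  /* fixed point: 1024 == 1.0 */

struct decay_avg {
        unsigned long avg;      /* running average, same units as samples */
        unsigned long decay;    /* per-period decay factor, fixed point */
};

/* decay ~= 2^(-1/half_life_periods) * FP_ONE, chosen per user/platform */
static void decay_avg_update(struct decay_avg *a, unsigned long sample)
{
        a->avg = (a->avg * a->decay + sample * (FP_ONE - a->decay)) / FP_ONE;
}

int main(void)
{
        /* half-life of ~8 periods: 2^(-1/8) * 1024 ~= 939 */
        struct decay_avg therm = { .avg = 0, .decay = 939 };
        int i;

        /* 16 periods of heavy frequency capping, then 16 periods of none */
        for (i = 0; i < 32; i++)
                decay_avg_update(&therm, i < 16 ? 256 : 0);

        printf("%lu\n", therm.avg);
        return 0;
}

The half-life then becomes a parameter of each user of the average -
potentially even per domain - instead of being tied to the PELT decay
that rt/dl utilization tracking depends on.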

2018-10-19 11:30:21

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Introduce thermal pressure

Hi,

On 19/10/2018 09:02, Ingo Molnar wrote:
>
> * Thara Gopinath <[email protected]> wrote:

[...]

> So what unifies RT and DL utilization is that those are all direct task
> loads independent of external factors.
>
> Thermal load is more of a complex physical property of the combination of
> various internal and external factors: the whole system workload running
> (not just that single task), the thermal topology of the hardware,
> external temperatures, the hardware's and the governor's policy regarding
> thermal loads, etc. etc.
>
> So while obviously when effective capacity of a CPU is calculated then
> these will all be subtracted from the maximum capacity of the CPU, but I
> think the thermal load metric and the averaging itself is probably
> dissimilar enough to not be calculated via the PELT half-life for
> example.
>
> For example, a reasonable future property would be to match the speed
> of decay in the averaging to the observed speed of decay via
> temperature sensors. Most temperature sensors do a certain amount of
> averaging themselves as well - and some platforms might not expose
> temperatures at all, only 'got thermally throttled' / 'running at full
> speed' kind of feedback.

That would also open the door to having different decay speeds on
different domains, if we have the temperature sensors for it - big and
LITTLE cores are not going to heat up in the same way (although there's
going to be some heat propagation).

Another thermal decay speed hint I'd see would be the energy model - it
does tell us after all how much energy is going through those cores, and
with a rough estimate of how much they can take before overheating
(sustainable-power entry in the devicetree) we might be able to deduce a
somewhat sane decay speed.
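
A back-of-the-envelope version of that energy-model hint, purely as a
sketch - the helper name, the units and the scaling rule are
assumptions, and the numbers in main() are invented for illustration:

#include <stdio.h>

/*
 * Guess a decay half-life for a CPU domain from how far its full-speed
 * power draw exceeds what the platform says it can sustain (the
 * sustainable-power figure). Illustrative heuristic only.
 */
static unsigned int thermal_decay_halflife_ms(unsigned int domain_max_power_mw,
                                              unsigned int sustainable_power_mw,
                                              unsigned int base_halflife_ms)
{
        if (!sustainable_power_mw ||
            domain_max_power_mw <= sustainable_power_mw)
                return base_halflife_ms;

        /* domains that overshoot the budget more keep their pressure longer */
        return base_halflife_ms * domain_max_power_mw / sustainable_power_mw;
}

int main(void)
{
        /* e.g. a big cluster drawing 4500 mW at fmax against a 2500 mW budget */
        printf("big:    %u ms\n", thermal_decay_halflife_ms(4500, 2500, 100));
        /* a LITTLE cluster staying within budget keeps the base half-life */
        printf("LITTLE: %u ms\n", thermal_decay_halflife_ms(1500, 2500, 100));
        return 0;
}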

>
> Anyway, this doesn't really impact the concept, it's an implementational
> detail, and much of this could be resolved if the averaging code in
> pelt.c was librarized a bit - and that's really what you did there in a
> fashion, I just think it should probably be abstracted out more clearly.
> (I have no clear implementational suggestions right now, other than 'try
> and see how it works out - it might be a bad idea'.)
>
> Thanks,
>
> Ingo
>
>
>

2018-12-04 15:46:33

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure

Hi Thara,

On Tue, 9 Oct 2018 at 18:25, Thara Gopinath <[email protected]> wrote:
>
> Introduce support in CFS periodic tick to trigger the process of
> computing average thermal pressure for a cpu.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
> kernel/sched/fair.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb59..7deb1d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -21,6 +21,7 @@
> * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
> */
> #include "sched.h"
> +#include "thermal.h"
>
> #include <trace/events/sched.h>
>
> @@ -9557,6 +9558,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
> +
> + update_periodic_maxcap(rq);

You have to call update_periodic_maxcap() in update_blocked_averages() too.
Otherwise, the thermal pressure will not always be updated correctly on
tickless systems; a rough sketch of what that could look like follows below.

> }
>
> /*
> --
> 2.1.4
>
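
A rough sketch of the suggested placement - schematic only, with the
body of update_blocked_averages() elided; update_periodic_maxcap() is
the helper introduced by this series, everything else here is just an
outline:

static void update_blocked_averages(int cpu)
{
        struct rq *rq = cpu_rq(cpu);

        /* ... existing decay of the blocked cfs/rt/dl/irq averages ... */

        /* also keep the thermal pressure signal decaying on NOHZ-idle cpus */
        update_periodic_maxcap(rq);
}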