Thermal governors can respond to an overheat event on a cpu by capping
the cpu's maximum possible frequency. This in turn means that the
maximum available compute capacity of the cpu is restricted. But today
in the kernel, the task scheduler is not notified when the maximum
frequency of a cpu is capped. In other words, the scheduler is unaware
of the maximum capacity restrictions placed on a cpu due to thermal
activity. This patch series attempts to address this issue. The
identified benefit is better task placement among the available cpus in
the event of overheating, which in turn leads to better performance
numbers.
The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous,
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. The solution is therefore to maintain
a weighted average per cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. The existing algorithm
in the kernel scheduler's PELT framework is re-used to calculate
the weighted average. This patch series also defines a kernel command
line interface to allow for a configurable decay period.
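To make the averaging concrete, below is a minimal standalone sketch
(illustrative only, not the kernel implementation; the per-window decay
factor and the sample values are made up) of a geometrically decayed
running average of the capacity delta, which is the shape of signal the
PELT framework produces:

	#include <stdio.h>

	int main(void)
	{
		/*
		 * Hypothetical capacity deltas sampled once per 32 ms
		 * window: the cpu is capped by 256 for three windows,
		 * then uncapped.
		 */
		unsigned long delta[] = { 256, 256, 256, 0, 0, 0, 0, 0 };
		double avg = 0.0, y = 0.5;	/* illustrative decay factor */
		unsigned int i;

		for (i = 0; i < sizeof(delta) / sizeof(delta[0]); i++) {
			avg = avg * y + (double)delta[i] * (1.0 - y);
			printf("t=%3u ms  avg thermal pressure = %6.1f\n",
			       (i + 1) * 32, avg);
		}
		return 0;
	}

The average ramps up while the cap is in effect and decays back towards
zero once the cap is lifted, instead of jumping between 0 and 256 the
way an instantaneous signal would.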
Regarding testing, basic build, boot and sanity testing have been
performed on a db845c platform with a Debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step-wise governor in dealing with big.LITTLE
systems, the trip point 0 temperature was made asymmetric between the
cpus in the little cluster and the big cluster; the idea being that
the big cores will heat up and the cpu cooling device will throttle
the frequency of the big cores faster, thereby limiting the maximum
available capacity, and the scheduler will spread out tasks to the
little cores as well.
Test Results

Hackbench: 1 group, 30000 loops, 10 runs
                                                Result     SD
                                                (Secs)     (% of mean)
No Thermal Pressure                             14.03      2.69%
Thermal Pressure PELT Algo. Decay : 32 ms       13.29      0.56%
Thermal Pressure PELT Algo. Decay : 64 ms       12.57      1.56%
Thermal Pressure PELT Algo. Decay : 128 ms      12.71      1.04%
Thermal Pressure PELT Algo. Decay : 256 ms      12.29      1.42%
Thermal Pressure PELT Algo. Decay : 512 ms      12.42      1.15%
Dhrystone Run Time: 20 threads, 3000 MLOOPS
                                                Result     SD
                                                (Secs)     (% of mean)
No Thermal Pressure                              9.452     4.49%
Thermal Pressure PELT Algo. Decay : 32 ms        8.793     5.30%
Thermal Pressure PELT Algo. Decay : 64 ms        8.981     5.29%
Thermal Pressure PELT Algo. Decay : 128 ms       8.647     6.62%
Thermal Pressure PELT Algo. Decay : 256 ms       8.774     6.45%
Thermal Pressure PELT Algo. Decay : 512 ms       8.603     5.41%
A Brief History

The first version of this patch series was posted reusing the PELT
algorithm to decay the thermal pressure signal. The discussions that
followed were around whether an instantaneous thermal pressure
solution is better and whether a stand-alone algorithm to accumulate
and decay thermal pressure is more appropriate than re-using the
PELT framework.

Tests on Hikey960 showed the stand-alone algorithm performing slightly
better than reusing the PELT algorithm, and V2 was posted with the
stand-alone algorithm. Test results were shared as part of that series.
Discussions were around re-using the PELT algorithm and running
further tests with a more granular decay period.

For some time after this, development was impeded due to hardware
unavailability and some other unforeseen and possibly unfortunate
events. For this version, the h/w was switched from Hikey960 to db845c.
Instantaneous thermal pressure was not tested as part of this cycle,
as it is clear that a weighted average is a better implementation.
In this round of testing, the non-PELT algorithm never gave any
conclusive results to prove that it is better than reusing the PELT
algorithm. Also, reusing the PELT algorithm means thermal pressure
tracks the other utilization signals in the scheduler.
v3->v4:
- "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
  is dropped as it is no longer needed following changes in the
  other patches.
- Rest of the change log is mentioned in the specific patches.
v5->v6:
- Added arch_ interface APIs to access and update thermal pressure.
  Moved the declaration of the per cpu thermal_pressure variable and
  the infrastructure to update the variable to the topology files.
v6->v7:
- Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
  update_thermal_load_avg in unsupported architectures as per
  review comments from Peter, Dietmar and Quentin.
- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
  as per review comments from Peter, Dietmar and Ionela.
- Changed the input argument in arch_set_thermal_pressure from
  capped capacity to delta capacity (thermal pressure) as per
  Ionela's review comments. Hence the calculation for delta
  capacity (thermal pressure) is moved to cpufreq_cooling.c.
- Fixed a bunch of spelling typos.
Thara Gopinath (7):
sched/pelt: Add support to track thermal pressure
sched/topology: Add hook to read per cpu thermal pressure.
arm,arm64,drivers: Add infrastructure to store and update instantaneous
thermal pressure
sched/fair: Enable periodic update of average thermal pressure
sched/fair: update cpu_capacity to reflect thermal pressure
thermal/cpu-cooling: Update thermal pressure in case of a maximum
frequency capping
sched/fair: Enable tuning of decay period
Documentation/admin-guide/kernel-parameters.txt | 5 +++
arch/arm/include/asm/topology.h | 3 ++
arch/arm64/include/asm/topology.h | 3 ++
drivers/base/arch_topology.c | 11 ++++++
drivers/thermal/cpufreq_cooling.c | 19 +++++++++--
include/linux/arch_topology.h | 10 ++++++
include/linux/sched/topology.h | 8 +++++
include/trace/events/sched.h | 4 +++
init/Kconfig | 4 +++
kernel/sched/fair.c | 45 +++++++++++++++++++++++++
kernel/sched/pelt.c | 31 +++++++++++++++++
kernel/sched/pelt.h | 16 +++++++++
kernel/sched/sched.h | 1 +
13 files changed, 158 insertions(+), 2 deletions(-)
--
2.1.4
Extrapolating on the existing framework to track rt/dl utilization
using PELT signals, add a similar mechanism to track thermal pressure.
The difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a cpu running a rt/dl task through util_avg, the
average thermal pressure is tracked through load_avg. This is because
the thermal pressure signal is weighted "delta" capacity and is not
binary (util_avg is binary). "Delta capacity" here means the delta
between the actual capacity of a cpu and the decreased capacity of a
cpu due to a thermal event.
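For illustration (this helper is not part of the patch; the name and
numbers are made up), the quantity being averaged is simply the
capacity lost to the cap:

	/*
	 * "Delta capacity": a cpu with a maximum capacity of 1024
	 * capped to 768 contributes a thermal pressure of 256.
	 */
	static unsigned long example_thermal_pressure(unsigned long max_cap,
						      unsigned long capped_cap)
	{
		return max_cap - capped_cap;
	}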
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_load_avg can be called
to do the periodic bookkeeping (accumulate, decay and average) of the
thermal pressure.
Signed-off-by: Thara Gopinath <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
---
v6->v7:
- Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
update_thermal_load_avg in unsupported architectures as per
review comments from Peter, Dietmar and Quentin.
- Updated comment for update_thermal_load_avg as per review
comments from Peter and Dietmar.
include/trace/events/sched.h | 4 ++++
init/Kconfig | 4 ++++
kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 16 ++++++++++++++++
kernel/sched/sched.h | 1 +
5 files changed, 56 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e..a8fb667 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -613,6 +613,10 @@ DECLARE_TRACE(pelt_dl_tp,
TP_PROTO(struct rq *rq),
TP_ARGS(rq));
+DECLARE_TRACE(pelt_thermal_tp,
+ TP_PROTO(struct rq *rq),
+ TP_ARGS(rq));
+
DECLARE_TRACE(pelt_irq_tp,
TP_PROTO(struct rq *rq),
TP_ARGS(rq));
diff --git a/init/Kconfig b/init/Kconfig
index f6a4a91..c16ea88 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -450,6 +450,10 @@ config HAVE_SCHED_AVG_IRQ
depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
depends on SMP
+config HAVE_SCHED_THERMAL_PRESSURE
+ bool "Enable periodic averaging of thermal pressure"
+ depends on SMP
+
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
depends on MULTIUSER
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index bd006b7..5d1fbf0 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -367,6 +367,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
return 0;
}
+#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
+/*
+ * thermal:
+ *
+ * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
+ *
+ * util_avg and runnable_load_avg are not supported and meaningless.
+ *
+ * Unlike rt/dl utilization tracking, which tracks time spent by a cpu
+ * running a rt/dl task through util_avg, the average thermal pressure is
+ * tracked through load_avg. This is because the thermal pressure signal
+ * is weighted "delta" capacity and is not binary (util_avg is binary).
+ * "Delta capacity" here means the delta between the actual capacity of
+ * a cpu and the decreased capacity of a cpu due to a thermal event.
+ */
+
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ if (___update_load_sum(now, &rq->avg_thermal,
+ capacity,
+ capacity,
+ capacity)) {
+ ___update_load_avg(&rq->avg_thermal, 1, 1);
+ trace_pelt_thermal_tp(rq);
+ return 1;
+ }
+
+ return 0;
+}
+#endif
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
/*
* irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644..cb20c80 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -7,6 +7,16 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
+#else
+static inline int
+update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
int update_irq_load_avg(struct rq *rq, u64 running);
#else
@@ -159,6 +169,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
}
static inline int
+update_thermal_rq_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ return 0;
+}
+
+static inline int
update_irq_load_avg(struct rq *rq, u64 running)
{
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 280a3c7..37bd7ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -944,6 +944,7 @@ struct rq {
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
struct sched_avg avg_irq;
#endif
+ struct sched_avg avg_thermal;
u64 idle_stamp;
u64 avg_idle;
--
2.1.4
Introduce arch_cpu_thermal_pressure to retrieve per cpu thermal
pressure.
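A minimal usage sketch (illustrative): an architecture overrides this
weak default, which returns 0, by defining the macro in its topology
header, as the arm/arm64 patch later in this series does:

	/* Replace the scheduler's default with a real per cpu value */
	#define arch_cpu_thermal_pressure topology_get_thermal_pressure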
Signed-off-by: Thara Gopinath <[email protected]>
---
v6->v7:
- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
as per review comments from Peter, Dietmar and Ionela.
include/linux/sched/topology.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index f341163..850b3bf 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -225,6 +225,14 @@ unsigned long arch_scale_cpu_capacity(int cpu)
}
#endif
+#ifndef arch_cpu_thermal_pressure
+static __always_inline
+unsigned long arch_cpu_thermal_pressure(int cpu)
+{
+ return 0;
+}
+#endif
+
static inline int task_node(const struct task_struct *p)
{
return cpu_to_node(task_cpu(p));
--
2.1.4
Introduce support in the CFS periodic tick and other bookkeeping APIs
to trigger the process of computing average thermal pressure for a
cpu. Also consider avg_thermal.load_avg in others_have_blocked, which
allows for decay of PELT signals.
Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8da0222..311bb0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7470,6 +7470,9 @@ static inline bool others_have_blocked(struct rq *rq)
if (READ_ONCE(rq->avg_dl.util_avg))
return true;
+ if (READ_ONCE(rq->avg_thermal.load_avg))
+ return true;
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
if (READ_ONCE(rq->avg_irq.util_avg))
return true;
@@ -7495,6 +7498,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
{
const struct sched_class *curr_class;
u64 now = rq_clock_pelt(rq);
+ unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
bool decayed;
/*
@@ -7505,6 +7509,8 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
+ update_thermal_load_avg(rq_clock_task(rq), rq,
+ thermal_pressure) |
update_irq_load_avg(rq, 0);
if (others_have_blocked(rq))
@@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+ update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
}
/*
--
2.1.4
cpu_capacity initially reflects the maximum possible capacity of a cpu.
Thermal pressure on a cpu means a part of this maximum possible
capacity is unavailable due to thermal events. This patch subtracts the
average thermal pressure for a cpu from its maximum possible capacity
so that cpu_capacity reflects the actual maximum currently available
capacity.
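As an illustration with made-up numbers, ignoring the irq scaling that
scale_rt_capacity also applies: with a maximum capacity of 1024,
avg_rt.util_avg of 100, avg_dl.util_avg of 0 and avg_thermal.load_avg
of 200, the capacity left for CFS tasks becomes
1024 - (100 + 0 + 200) = 724.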
Signed-off-by: Thara Gopinath <[email protected]>
---
v6->v7:
Rewrote the patch description as per Ionela's suggestion.
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 311bb0b..2b1fec3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7733,8 +7733,15 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
if (unlikely(irq >= max))
return 1;
+ /*
+ * avg_rt.util_avg and avg_dl.util_avg track binary signals
+ * (running and not running) with weights 0 and 1024 respectively.
+ * avg_thermal.load_avg tracks thermal pressure and the weighted
+ * average uses the actual "delta" max capacity (load).
+ */
used = READ_ONCE(rq->avg_rt.util_avg);
used += READ_ONCE(rq->avg_dl.util_avg);
+ used += READ_ONCE(rq->avg_thermal.load_avg);
if (unlikely(used >= max))
return 1;
--
2.1.4
Thermal pressure follows PELT signals, which means the decay period for
thermal pressure is the default PELT decay period. Depending on SoC
characteristics and thermal activity, it might be beneficial to decay
thermal pressure more slowly, but still in tune with the PELT signals.
One way to achieve this is to provide a command line parameter to set a
decay shift parameter to an integer between 0 and 10.
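For example (illustrative), booting with:

	sched_thermal_decay_shift=3

shifts the clock fed into the thermal pressure average right by 3,
stretching the effective decay period from the default 32 ms to
32 ms * 2^3 = 256 ms.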
Signed-off-by: Thara Gopinath <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 5 ++++
kernel/sched/fair.c | 34 +++++++++++++++++++++++--
2 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dd3df3d..34848e4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4370,6 +4370,11 @@
incurs a small amount of overhead in the scheduler
but is useful for debugging and performance tuning.
+ sched_thermal_decay_shift=
+ [KNL, SMP] Set decay shift for thermal pressure signal.
+ Format: integer between 0 and 10
+ Default is 0.
+
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b1fec3..8b2ee5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -86,6 +86,36 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+/*
+ * By default the decay is the default pelt decay period.
+ * Each increment of the decay shift doubles the decay
+ * period, starting from 32 ms:
+ * Decay shift Decay period(ms)
+ * 0 32
+ * 1 64
+ * 2 128
+ * 3 256
+ * 4 512
+ */
+static int sched_thermal_decay_shift;
+
+static inline u64 rq_clock_thermal(struct rq *rq)
+{
+ return rq_clock_task(rq) >> sched_thermal_decay_shift;
+}
+
+static int __init setup_sched_thermal_decay_shift(char *str)
+{
+ int _shift = 0;
+
+ if (kstrtoint(str, 0, &_shift))
+ pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");
+
+ sched_thermal_decay_shift = clamp(_shift, 0, 10);
+ return 1;
+}
+__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
+
#ifdef CONFIG_SMP
/*
* For asym packing, by default the lower numbered CPU has higher priority.
@@ -7509,7 +7539,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
- update_thermal_load_avg(rq_clock_task(rq), rq,
+ update_thermal_load_avg(rq_clock_thermal(rq), rq,
thermal_pressure) |
update_irq_load_avg(rq, 0);
@@ -10300,7 +10330,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
- update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
+ update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
}
/*
--
2.1.4
Add architecture specific APIs to update and track thermal pressure on
a per cpu basis. A per cpu variable thermal_pressure is introduced to
keep track of instantaneous per cpu thermal pressure. Thermal pressure
is the delta between the maximum capacity and the capped capacity due
to a thermal event.

topology_get_thermal_pressure can be hooked into the scheduler
specified arch_cpu_thermal_pressure to retrieve the instantaneous
thermal pressure of a cpu.

arch_set_thermal_pressure can be used to update the thermal pressure.

Considering topology_get_thermal_pressure reads thermal_pressure and
arch_set_thermal_pressure writes into thermal_pressure, one can argue
for some sort of locking mechanism to avoid a stale value. But
considering topology_get_thermal_pressure can be called from a system
critical path like the scheduler tick function, a locking mechanism is
not ideal. This means that it is possible the thermal_pressure value
used to calculate the average thermal pressure for a cpu can be stale
for up to 1 tick period.
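As a sketch of the intended pairing (illustrative fragments only; the
producer side mirrors the cpufreq cooling change later in this series):

	/* Producer, e.g. a cooling device, publishes the capacity delta: */
	arch_set_thermal_pressure(policy->cpus, max_capacity - capped_capacity);

	/* Consumer, scheduler side, samples it lock-free at the tick: */
	unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));

Both sides tolerate the briefly stale value described above.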
Signed-off-by: Thara Gopinath <[email protected]>
---
v6->v7:
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity(thermal pressure) as per
Ionela's review comments.
arch/arm/include/asm/topology.h | 3 +++
arch/arm64/include/asm/topology.h | 3 +++
drivers/base/arch_topology.c | 11 +++++++++++
include/linux/arch_topology.h | 10 ++++++++++
4 files changed, 27 insertions(+)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 8a0fae9..3a50a19 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -16,6 +16,9 @@
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology
+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_cpu_thermal_pressure topology_get_thermal_pressure
+
#else
static inline void init_cpu_topology(void) { }
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a4d945d..a70896f 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -25,6 +25,9 @@ int pcibus_to_node(struct pci_bus *bus);
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology
+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_cpu_thermal_pressure topology_get_thermal_pressure
+
#include <asm-generic/topology.h>
#endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1eb81f11..c2c5f1d 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -42,6 +42,17 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
per_cpu(cpu_scale, cpu) = capacity;
}
+DEFINE_PER_CPU(unsigned long, thermal_pressure);
+
+void arch_set_thermal_pressure(struct cpumask *cpus,
+ unsigned long th_pressure)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpus)
+ WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+}
+
static ssize_t cpu_capacity_show(struct device *dev,
struct device_attribute *attr,
char *buf)
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 3015ecb..88a115e 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -33,6 +33,16 @@ unsigned long topology_get_freq_scale(int cpu)
return per_cpu(freq_scale, cpu);
}
+DECLARE_PER_CPU(unsigned long, thermal_pressure);
+
+static inline unsigned long topology_get_thermal_pressure(int cpu)
+{
+ return per_cpu(thermal_pressure, cpu);
+}
+
+void arch_set_thermal_pressure(struct cpumask *cpus,
+ unsigned long th_pressure);
+
struct cpu_topology {
int thread_id;
int core_id;
--
2.1.4
Thermal governors can request that a cpu's maximum supported frequency
be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular cpu is
reduced. The delta between the original maximum capacity and the capped
maximum capacity is known as thermal pressure. Enable the cpufreq
cooling device to update the thermal pressure in the event of a capped
maximum frequency.
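As a worked example with hypothetical numbers: capping a cpu whose
cpuinfo.max_freq is 2000 MHz and whose maximum capacity is 1024 down to
1500 MHz gives a capped capacity of 1500 * 1024 / 2000 = 768, so
arch_set_thermal_pressure is called with a thermal pressure of
1024 - 768 = 256.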
Signed-off-by: Thara Gopinath <[email protected]>
---
v6->v7
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity(thermal pressure) as per
Ionela's review comments. Hence the calculation for delta
capacity(thermal pressure) is moved to cpufreq_cooling.c.
drivers/thermal/cpufreq_cooling.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index fe83d7a..4ae8c85 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -431,6 +431,10 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
unsigned long state)
{
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+ struct cpumask *cpus;
+ unsigned int frequency;
+ unsigned long max_capacity, capacity;
+ int ret;
/* Request state should be less than max_level */
if (WARN_ON(state > cpufreq_cdev->max_level))
@@ -442,8 +446,19 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
cpufreq_cdev->cpufreq_state = state;
- return freq_qos_update_request(&cpufreq_cdev->qos_req,
- get_state_freq(cpufreq_cdev, state));
+ frequency = get_state_freq(cpufreq_cdev, state);
+
+ ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
+
+ if (ret > 0) {
+ cpus = cpufreq_cdev->policy->cpus;
+ max_capacity = arch_scale_cpu_capacity(cpumask_first(cpus));
+ capacity = frequency * max_capacity;
+ capacity /= cpufreq_cdev->policy->cpuinfo.max_freq;
+ arch_set_thermal_pressure(cpus, max_capacity - capacity);
+ }
+
+ return ret;
}
/* Bind cpufreq callbacks to thermal cooling device ops */
--
2.1.4
Hi Thara,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on next-20200110]
[also build test ERROR on v5.5-rc5]
[cannot apply to tip/sched/core tip/perf/core arm/for-next arm64/for-next/core driver-core/driver-core-testing linus/master v5.5-rc5 v5.5-rc4 v5.5-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Thara-Gopinath/Introduce-Thermal-Pressure/20200112-000559
base: 6c09d7dbb7d366122d0218bc7487e0a1e6cca6ed
config: sh-randconfig-a001-20200112 (attached as .config)
compiler: sh4-linux-gcc (GCC) 7.5.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.5.0 make.cross ARCH=sh
If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>
All errors (new ones prefixed by >>):
kernel/sched/fair.c: In function 'task_tick_fair':
>> kernel/sched/fair.c:10308:2: error: implicit declaration of function 'update_thermal_load_avg'; did you mean 'update_thermal_rq_load_avg'? [-Werror=implicit-function-declaration]
update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
^~~~~~~~~~~~~~~~~~~~~~~
update_thermal_rq_load_avg
cc1: some warnings being treated as errors
vim +10308 kernel/sched/fair.c
10283
10284 /*
10285 * scheduler tick hitting a task of our scheduling class.
10286 *
10287 * NOTE: This function can be called remotely by the tick offload that
10288 * goes along full dynticks. Therefore no local assumption can be made
10289 * and everything must be accessed through the @rq and @curr passed in
10290 * parameters.
10291 */
10292 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
10293 {
10294 struct cfs_rq *cfs_rq;
10295 struct sched_entity *se = &curr->se;
10296 unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
10297
10298 for_each_sched_entity(se) {
10299 cfs_rq = cfs_rq_of(se);
10300 entity_tick(cfs_rq, se, queued);
10301 }
10302
10303 if (static_branch_unlikely(&sched_numa_balancing))
10304 task_tick_numa(rq, curr);
10305
10306 update_misfit_status(curr, rq);
10307 update_overutilized_status(task_rq(curr));
10308 update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
10309 }
10310
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Hi Thara,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on next-20200110]
[also build test ERROR on v5.5-rc5]
[cannot apply to tip/sched/core tip/perf/core arm/for-next arm64/for-next/core driver-core/driver-core-testing linus/master v5.5-rc5 v5.5-rc4 v5.5-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Thara-Gopinath/Introduce-Thermal-Pressure/20200112-000559
base: 6c09d7dbb7d366122d0218bc7487e0a1e6cca6ed
config: mips-fuloong2e_defconfig (attached as .config)
compiler: mips64el-linux-gcc (GCC) 5.5.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=5.5.0 make.cross ARCH=mips
If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>
All errors (new ones prefixed by >>):
kernel/sched/fair.c: In function 'task_tick_fair':
>> kernel/sched/fair.c:10308:2: error: implicit declaration of function 'update_thermal_load_avg' [-Werror=implicit-function-declaration]
update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
^
cc1: some warnings being treated as errors
vim +/update_thermal_load_avg +10308 kernel/sched/fair.c
10283
10284 /*
10285 * scheduler tick hitting a task of our scheduling class.
10286 *
10287 * NOTE: This function can be called remotely by the tick offload that
10288 * goes along full dynticks. Therefore no local assumption can be made
10289 * and everything must be accessed through the @rq and @curr passed in
10290 * parameters.
10291 */
10292 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
10293 {
10294 struct cfs_rq *cfs_rq;
10295 struct sched_entity *se = &curr->se;
10296 unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
10297
10298 for_each_sched_entity(se) {
10299 cfs_rq = cfs_rq_of(se);
10300 entity_tick(cfs_rq, se, queued);
10301 }
10302
10303 if (static_branch_unlikely(&sched_numa_balancing))
10304 task_tick_numa(rq, curr);
10305
10306 update_misfit_status(curr, rq);
10307 update_overutilized_status(task_rq(curr));
10308 update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
10309 }
10310
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation