2020-01-14 19:58:50

by Thara Gopinath

Subject: [Patch v8 0/7] Introduce Thermal Pressure

Thermal governors can respond to an overheat event of a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the kernel, the task scheduler is
not notified of capping of maximum frequency of a cpu.
In other words, the scheduler is unaware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefit identified is better task placement among available
cpus in the event of overheating, which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. The solution is thus to have a
weighted average per cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. The existing algorithm
in the kernel scheduler's PELT framework is re-used to calculate
the weighted average. This patch series also defines a command line
parameter to allow for a configurable decay period.

Regarding testing, basic build, boot and sanity testing have been
performed on the db845c platform with a Debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step-wise governor in dealing with big.LITTLE systems,
the trip point 0 temperature was made asymmetric between the cpus in the
little cluster and the big cluster; the idea being that the
big cores will heat up and the cpu cooling device will throttle the
frequency of the big cores faster, thereby limiting the maximum available
capacity so that the scheduler spreads out tasks to the little cores as well.

Test Results

Hackbench: 1 group, 30000 loops, 10 runs
                                            Result   SD
                                            (Secs)   (% of mean)
No Thermal Pressure                         14.03    2.69%
Thermal Pressure PELT Algo. Decay : 32 ms   13.29    0.56%
Thermal Pressure PELT Algo. Decay : 64 ms   12.57    1.56%
Thermal Pressure PELT Algo. Decay : 128 ms  12.71    1.04%
Thermal Pressure PELT Algo. Decay : 256 ms  12.29    1.42%
Thermal Pressure PELT Algo. Decay : 512 ms  12.42    1.15%

Dhrystone Run Time: 20 threads, 3000 MLOOPS
                                            Result   SD
                                            (Secs)   (% of mean)
No Thermal Pressure                         9.452    4.49%
Thermal Pressure PELT Algo. Decay : 32 ms   8.793    5.30%
Thermal Pressure PELT Algo. Decay : 64 ms   8.981    5.29%
Thermal Pressure PELT Algo. Decay : 128 ms  8.647    6.62%
Thermal Pressure PELT Algo. Decay : 256 ms  8.774    6.45%
Thermal Pressure PELT Algo. Decay : 512 ms  8.603    5.41%

A Brief History

The first version of this patch series was posted reusing the
PELT algorithm to decay the thermal pressure signal. The discussions
that followed were around whether an instantaneous thermal pressure
solution is better and whether a stand-alone algorithm to accumulate
and decay thermal pressure is more appropriate than re-using the
PELT framework.
Tests on Hikey960 showed the stand-alone algorithm performing slightly
better than reusing the PELT algorithm, and V2 was posted with the
stand-alone algorithm. Test results were shared as part of this series.
Discussions were around re-using the PELT algorithm and running
further tests with a more granular decay period.

For some time after this, development was impeded due to hardware
unavailability and some other unforeseen and possibly unfortunate events.
For this version, h/w was switched from Hikey960 to db845c.
Also, instantaneous thermal pressure was never tested as part of this
cycle as it is clear that a weighted average is a better implementation.
In this round of testing, the non-PELT algorithm never gave any
conclusive results to prove that it is better than reusing the PELT
algorithm. Also, reusing the PELT algorithm means thermal pressure
tracks the other utilization signals in the scheduler.

v3->v4:
- "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
is dropped as it is no longer needed following changes in the
other patches.
- rest of the change log mentioned in specific patches.

v5->v6:
- Added arch_ interface APIs to access and update thermal pressure.
Moved declaration of per cpu thermal_pressure variable and
infrastructure to update the variable to topology files.

v6->v7:
- Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
update_thermal_load_avg in unsupported architectures as per
review comments from Peter, Dietmar and Quentin.
- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
as per review comments from Peter, Dietmar and Ionela.
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity (thermal pressure) as per
Ionela's review comments. Hence the calculation for delta
capacity (thermal pressure) is moved to cpufreq_cooling.c.
- Fixed a bunch of spelling typos.

v7->v8:
- Fixed typo in defining update_thermal_load_avg which was
causing build errors (reported by kbuild test report)

Thara Gopinath (7):
sched/pelt: Add support to track thermal pressure
sched/topology: Add hook to read per cpu thermal pressure.
arm,arm64,drivers:Add infrastructure to store and update instantaneous
thermal pressure
sched/fair: Enable periodic update of average thermal pressure
sched/fair: update cpu_capacity to reflect thermal pressure
thermal/cpu-cooling: Update thermal pressure in case of a maximum
frequency capping
sched/fair: Enable tuning of decay period

Documentation/admin-guide/kernel-parameters.txt | 5 +++
arch/arm/include/asm/topology.h | 3 ++
arch/arm64/include/asm/topology.h | 3 ++
drivers/base/arch_topology.c | 11 ++++++
drivers/thermal/cpufreq_cooling.c | 19 +++++++++--
include/linux/arch_topology.h | 10 ++++++
include/linux/sched/topology.h | 8 +++++
include/trace/events/sched.h | 4 +++
init/Kconfig | 4 +++
kernel/sched/fair.c | 45 +++++++++++++++++++++++++
kernel/sched/pelt.c | 31 +++++++++++++++++
kernel/sched/pelt.h | 16 +++++++++
kernel/sched/sched.h | 1 +
13 files changed, 158 insertions(+), 2 deletions(-)

--
2.1.4


2020-01-14 19:58:57

by Thara Gopinath

Subject: [Patch v8 5/7] sched/fair: update cpu_capacity to reflect thermal pressure

cpu_capacity initially reflects the maximum possible capacity of a cpu.
Thermal pressure on a cpu means some of this maximum possible capacity
is unavailable due to thermal events. This patch subtracts the average
thermal pressure for a cpu from its maximum possible capacity so that
cpu_capacity reflects the actual maximum currently available capacity.

Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 311bb0b..2b1fec3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7733,8 +7733,15 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
if (unlikely(irq >= max))
return 1;

+ /*
+ * avg_rt.util avg and avg_dl.util track binary signals
+ * (running and not running) with weights 0 and 1024 respectively.
+ * avg_thermal.load_avg tracks thermal pressure and the weighted
+ * average uses the actual delta max capacity(load).
+ */
used = READ_ONCE(rq->avg_rt.util_avg);
used += READ_ONCE(rq->avg_dl.util_avg);
+ used += READ_ONCE(rq->avg_thermal.load_avg);

if (unlikely(used >= max))
return 1;
--
2.1.4

2020-01-14 19:59:43

by Thara Gopinath

Subject: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

Extrapolating from the existing framework that tracks rt/dl utilization
using PELT signals, add a similar mechanism to track thermal pressure. The
difference from rt/dl utilization tracking is that, instead of
tracking time spent by a cpu running a rt/dl task through util_avg, the
average thermal pressure is tracked through load_avg. This is because the
thermal pressure signal is a weighted "delta" capacity and is not
binary (util_avg is binary). "Delta capacity" here means the delta between
the actual capacity of a cpu and the decreased capacity of a cpu due to a
thermal event.

In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. The function update_thermal_load_avg can be
called to do the periodic bookkeeping (accumulate, decay and average) of
the thermal pressure.

Signed-off-by: Thara Gopinath <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
---

v6->v7:
- Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
update_thermal_load_avg in unsupported architectures as per
review comments from Peter, Dietmar and Quentin.
- Updated comment for update_thermal_load_avg as per review
comments from Peter and Dietmar.
v7->v8:
- Fixed typo in defining update_thermal_load_avg which was
causing build errors (reported by kbuild test report)

include/trace/events/sched.h | 4 ++++
init/Kconfig | 4 ++++
kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 16 ++++++++++++++++
kernel/sched/sched.h | 1 +
5 files changed, 56 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e..a8fb667 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -613,6 +613,10 @@ DECLARE_TRACE(pelt_dl_tp,
TP_PROTO(struct rq *rq),
TP_ARGS(rq));

+DECLARE_TRACE(pelt_thermal_tp,
+ TP_PROTO(struct rq *rq),
+ TP_ARGS(rq));
+
DECLARE_TRACE(pelt_irq_tp,
TP_PROTO(struct rq *rq),
TP_ARGS(rq));
diff --git a/init/Kconfig b/init/Kconfig
index f6a4a91..c16ea88 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -450,6 +450,10 @@ config HAVE_SCHED_AVG_IRQ
depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
depends on SMP

+config HAVE_SCHED_THERMAL_PRESSURE
+ bool "Enable periodic averaging of thermal pressure"
+ depends on SMP
+
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
depends on MULTIUSER
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index bd006b7..5d1fbf0 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -367,6 +367,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
return 0;
}

+#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
+/*
+ * thermal:
+ *
+ * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
+ *
+ * util_avg and runnable_load_avg are not supported and meaningless.
+ *
+ * Unlike rt/dl utilization tracking that track time spent by a cpu
+ * running a rt/dl task through util_avg, the average thermal pressure is
+ * tracked through load_avg. This is because thermal pressure signal is
+ * weighted "delta" capacity and is not binary(util_avg is binary). "delta
+ * capacity" here means delta between the actual capacity of a cpu and the
+ * decreased capacity a cpu due to a thermal event.
+ */
+
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ if (___update_load_sum(now, &rq->avg_thermal,
+ capacity,
+ capacity,
+ capacity)) {
+ ___update_load_avg(&rq->avg_thermal, 1, 1);
+ trace_pelt_thermal_tp(rq);
+ return 1;
+ }
+
+ return 0;
+}
+#endif
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
/*
* irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644..bf1e17b 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -7,6 +7,16 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

+#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
+#else
+static inline int
+update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
int update_irq_load_avg(struct rq *rq, u64 running);
#else
@@ -159,6 +169,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
}

static inline int
+update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ return 0;
+}
+
+static inline int
update_irq_load_avg(struct rq *rq, u64 running)
{
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 280a3c7..37bd7ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -944,6 +944,7 @@ struct rq {
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
struct sched_avg avg_irq;
#endif
+ struct sched_avg avg_thermal;
u64 idle_stamp;
u64 avg_idle;

--
2.1.4

2020-01-14 19:59:52

by Thara Gopinath

Subject: [Patch v8 6/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping

Thermal governors can request that a cpu's maximum supported frequency be
capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular cpu is
reduced. The delta between the original maximum capacity and the capped
maximum capacity is known as thermal pressure. Enable the cpufreq cooling
device to update the thermal pressure in the event of a capped maximum
frequency.

Signed-off-by: Thara Gopinath <[email protected]>
---

v6->v7
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity (thermal pressure) as per
Ionela's review comments. Hence the calculation for delta
capacity (thermal pressure) is moved to cpufreq_cooling.c.

drivers/thermal/cpufreq_cooling.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index fe83d7a..4ae8c85 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -431,6 +431,10 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
unsigned long state)
{
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+ struct cpumask *cpus;
+ unsigned int frequency;
+ unsigned long max_capacity, capacity;
+ int ret;

/* Request state should be less than max_level */
if (WARN_ON(state > cpufreq_cdev->max_level))
@@ -442,8 +446,19 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,

cpufreq_cdev->cpufreq_state = state;

- return freq_qos_update_request(&cpufreq_cdev->qos_req,
- get_state_freq(cpufreq_cdev, state));
+ frequency = get_state_freq(cpufreq_cdev, state);
+
+ ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
+
+ if (ret > 0) {
+ cpus = cpufreq_cdev->policy->cpus;
+ max_capacity = arch_scale_cpu_capacity(cpumask_first(cpus));
+ capacity = frequency * max_capacity;
+ capacity /= cpufreq_cdev->policy->cpuinfo.max_freq;
+ arch_set_thermal_pressure(cpus, max_capacity - capacity);
+ }
+
+ return ret;
}

/* Bind cpufreq callbacks to thermal cooling device ops */
--
2.1.4

2020-01-14 19:59:55

by Thara Gopinath

Subject: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

Introduce support in the CFS periodic tick and other bookkeeping APIs
to trigger the process of computing the average thermal pressure for a
cpu. Also consider avg_thermal.load_avg in others_have_blocked,
which allows for the decay of PELT signals.

Signed-off-by: Thara Gopinath <[email protected]>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8da0222..311bb0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7470,6 +7470,9 @@ static inline bool others_have_blocked(struct rq *rq)
if (READ_ONCE(rq->avg_dl.util_avg))
return true;

+ if (READ_ONCE(rq->avg_thermal.load_avg))
+ return true;
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
if (READ_ONCE(rq->avg_irq.util_avg))
return true;
@@ -7495,6 +7498,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
{
const struct sched_class *curr_class;
u64 now = rq_clock_pelt(rq);
+ unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
bool decayed;

/*
@@ -7505,6 +7509,8 @@ static bool __update_blocked_others(struct rq *rq, bool *done)

decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
+ update_thermal_load_avg(rq_clock_task(rq), rq,
+ thermal_pressure) |
update_irq_load_avg(rq, 0);

if (others_have_blocked(rq))
@@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
+ update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
}

/*
--
2.1.4

2020-01-14 19:59:59

by Thara Gopinath

Subject: [Patch v8 7/7] sched/fair: Enable tuning of decay period

Thermal pressure follows PELT signals, which means the decay period for
thermal pressure is the default PELT decay period. Depending on SoC
characteristics and thermal activity, it might be beneficial to decay
thermal pressure slower, but still in tune with the PELT signals. One way
to achieve this is to provide a command line parameter to set a decay
shift parameter to an integer between 0 and 10.

Signed-off-by: Thara Gopinath <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 5 ++++
kernel/sched/fair.c | 34 +++++++++++++++++++++++--
2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dd3df3d..34848e4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4370,6 +4370,11 @@
incurs a small amount of overhead in the scheduler
but is useful for debugging and performance tuning.

+ sched_thermal_decay_shift=
+ [KNL, SMP] Set decay shift for thermal pressure signal.
+ Format: integer between 0 and 10
+ Default is 0.
+
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2b1fec3..8b2ee5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -86,6 +86,36 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;

const_debug unsigned int sysctl_sched_migration_cost = 500000UL;

+/**
+ * By default the decay is the default pelt decay period.
+ * The decay shift can change the decay period in
+ * multiples of 32.
+ * Decay shift Decay period(ms)
+ * 0 32
+ * 1 64
+ * 2 128
+ * 3 256
+ * 4 512
+ */
+static int sched_thermal_decay_shift;
+
+static inline u64 rq_clock_thermal(struct rq *rq)
+{
+ return rq_clock_task(rq) >> sched_thermal_decay_shift;
+}
+
+static int __init setup_sched_thermal_decay_shift(char *str)
+{
+ int _shift;
+
+ if (kstrtoint(str, 0, &_shift))
+ pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");
+
+ sched_thermal_decay_shift = clamp(_shift, 0, 10);
+ return 1;
+}
+__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
+
#ifdef CONFIG_SMP
/*
* For asym packing, by default the lower numbered CPU has higher priority.
@@ -7509,7 +7539,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)

decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
- update_thermal_load_avg(rq_clock_task(rq), rq,
+ update_thermal_load_avg(rq_clock_thermal(rq), rq,
thermal_pressure) |
update_irq_load_avg(rq, 0);

@@ -10300,7 +10330,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
- update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
+ update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
}

/*
--
2.1.4

2020-01-14 20:00:17

by Thara Gopinath

Subject: [Patch v8 3/7] arm,arm64,drivers:Add infrastructure to store and update instantaneous thermal pressure

Add architecture specific APIs to update and track thermal pressure on a
per cpu basis. A per cpu variable thermal_pressure is introduced to keep
track of instantaneous per cpu thermal pressure. Thermal pressure is the
delta between maximum capacity and capped capacity due to a thermal event.

topology_get_thermal_pressure can be hooked into the scheduler-specified
arch_cpu_thermal_pressure to retrieve the instantaneous thermal pressure
of a cpu.

arch_set_thermal_pressure can be used to update the thermal pressure.

Considering that topology_get_thermal_pressure reads thermal_pressure and
arch_set_thermal_pressure writes into it, one can argue for
some sort of locking mechanism to avoid stale values. But since
topology_get_thermal_pressure can be called from a system critical path
like the scheduler tick function, a locking mechanism is not ideal. This
means that it is possible for the thermal_pressure value used to calculate
the average thermal pressure for a cpu to be stale for up to one tick
period.

Signed-off-by: Thara Gopinath <[email protected]>
---

v6->v7:
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity (thermal pressure) as per
Ionela's review comments.

arch/arm/include/asm/topology.h | 3 +++
arch/arm64/include/asm/topology.h | 3 +++
drivers/base/arch_topology.c | 11 +++++++++++
include/linux/arch_topology.h | 10 ++++++++++
4 files changed, 27 insertions(+)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 8a0fae9..3a50a19 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -16,6 +16,9 @@
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology

+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_cpu_thermal_pressure topology_get_thermal_pressure
+
#else

static inline void init_cpu_topology(void) { }
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a4d945d..a70896f 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -25,6 +25,9 @@ int pcibus_to_node(struct pci_bus *bus);
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology

+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_cpu_thermal_pressure topology_get_thermal_pressure
+
#include <asm-generic/topology.h>

#endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1eb81f11..c2c5f1d 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -42,6 +42,17 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
per_cpu(cpu_scale, cpu) = capacity;
}

+DEFINE_PER_CPU(unsigned long, thermal_pressure);
+
+void arch_set_thermal_pressure(struct cpumask *cpus,
+ unsigned long th_pressure)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpus)
+ WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+}
+
static ssize_t cpu_capacity_show(struct device *dev,
struct device_attribute *attr,
char *buf)
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 3015ecb..88a115e 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -33,6 +33,16 @@ unsigned long topology_get_freq_scale(int cpu)
return per_cpu(freq_scale, cpu);
}

+DECLARE_PER_CPU(unsigned long, thermal_pressure);
+
+static inline unsigned long topology_get_thermal_pressure(int cpu)
+{
+ return per_cpu(thermal_pressure, cpu);
+}
+
+void arch_set_thermal_pressure(struct cpumask *cpus,
+ unsigned long th_pressure);
+
struct cpu_topology {
int thread_id;
int core_id;
--
2.1.4

2020-01-14 20:00:41

by Thara Gopinath

Subject: [Patch v8 2/7] sched/topology: Add hook to read per cpu thermal pressure.

Introduce arch_cpu_thermal_pressure to retrieve per cpu thermal
pressure.

Signed-off-by: Thara Gopinath <[email protected]>
---
include/linux/sched/topology.h | 8 ++++++++
1 file changed, 8 insertions(+)

v6->v7:
- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
as per review comments from Peter, Dietmar and Ionela.

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index f341163..850b3bf 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -225,6 +225,14 @@ unsigned long arch_scale_cpu_capacity(int cpu)
}
#endif

+#ifndef arch_cpu_thermal_pressure
+static __always_inline
+unsigned long arch_cpu_thermal_pressure(int cpu)
+{
+ return 0;
+}
+#endif
+
static inline int task_node(const struct task_struct *p)
{
return cpu_to_node(task_cpu(p));
--
2.1.4

2020-01-16 15:16:52

by Peter Zijlstra

Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Tue, Jan 14, 2020 at 02:57:36PM -0500, Thara Gopinath wrote:
> Introduce support in CFS periodic tick and other bookkeeping apis
> to trigger the process of computing average thermal pressure for a
> cpu. Also consider avg_thermal.load_avg in others_have_blocked
> which allows for decay of pelt signals.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
> kernel/sched/fair.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8da0222..311bb0b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7470,6 +7470,9 @@ static inline bool others_have_blocked(struct rq *rq)
> if (READ_ONCE(rq->avg_dl.util_avg))
> return true;
>
> + if (READ_ONCE(rq->avg_thermal.load_avg))
> + return true;
> +

Given that struct sched_avg is 1 cacheline, the above is a pointless
guaranteed cacheline miss if the arch doesn't
CONFIG_HAVE_SCHED_THERMAL_PRESSURE.

> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> if (READ_ONCE(rq->avg_irq.util_avg))
> return true;
> @@ -7495,6 +7498,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> {
> const struct sched_class *curr_class;
> u64 now = rq_clock_pelt(rq);
> + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
> bool decayed;
>
> /*
> @@ -7505,6 +7509,8 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>
> decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> + update_thermal_load_avg(rq_clock_task(rq), rq,
> + thermal_pressure) |
> update_irq_load_avg(rq, 0);
>
> if (others_have_blocked(rq))

That there indentation trainwreck is a reason to rename the function.

decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
update_irq_load_avg(rq, 0);

Is much better.

But now that you made me look at that, I noticed it's using a different
clock -- it is _NOT_ using now/rq_clock_pelt(), which means it'll not be
in sync with the other averages.

Is there a good reason for that?

> @@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> {
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &curr->se;
> + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> @@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> update_misfit_status(curr, rq);
> update_overutilized_status(task_rq(curr));
> + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
> }

I'm thinking this is the wrong place; should this not be in
scheduler_tick(), right before calling sched_class::task_tick() ? Surely
any execution will affect thermals, not only fair class execution.

2020-01-16 15:19:05

by Peter Zijlstra

Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On Tue, Jan 14, 2020 at 02:57:33PM -0500, Thara Gopinath wrote:

> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..bf1e17b 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -7,6 +7,16 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>
> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
static inline u64 thermal_load_avg(struct rq *rq)
{
return READ_ONCE(rq->avg_thermal.load_avg);
}
> +#else
> +static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> + return 0;
> +}

static inline u64 thermal_load_avg(struct rq *rq)
{
return 0;
}
> +#endif

Would help with patch 4 and 5.

2020-01-17 11:42:00

by Quentin Perret

Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Thursday 16 Jan 2020 at 16:15:02 (+0100), Peter Zijlstra wrote:
> > @@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > {
> > struct cfs_rq *cfs_rq;
> > struct sched_entity *se = &curr->se;
> > + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
> >
> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> > @@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >
> > update_misfit_status(curr, rq);
> > update_overutilized_status(task_rq(curr));
> > + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
> > }
>
> I'm thinking this is the wrong place; should this not be in
> scheduler_tick(), right before calling sched_class::task_tick() ? Surely
> any execution will affect thermals, not only fair class execution.

Right, but right now only CFS takes action when we overheat. That is,
only CFS uses capacity_of() which is where the thermal signal gets
reflected.

We definitely could (and maybe should) make RT and DL react to thermal
pressure as well when they're both capacity-aware. But perhaps that's
for later ? Thoughts ?

Thanks,
Quentin

2020-01-17 11:49:10

by Quentin Perret

Subject: Re: [Patch v8 7/7] sched/fair: Enable tuning of decay period

On Tuesday 14 Jan 2020 at 14:57:39 (-0500), Thara Gopinath wrote:
> +static int __init setup_sched_thermal_decay_shift(char *str)
> +{
> + int _shift;
> +
> + if (kstrtoint(str, 0, &_shift))
> + pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");

Nit: looking at kstrtoint() it seems that _shift will be left unmodified
upon failure. To avoid feeding a random value to clamp() below, perhaps
initialize _shift to 0 ?

> + sched_thermal_decay_shift = clamp(_shift, 0, 10);
> + return 1;
> +}

Thanks,
Quentin

2020-01-17 12:32:31

by Peter Zijlstra

Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Fri, Jan 17, 2020 at 11:40:45AM +0000, Quentin Perret wrote:
> On Thursday 16 Jan 2020 at 16:15:02 (+0100), Peter Zijlstra wrote:
> > > @@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > > {
> > > struct cfs_rq *cfs_rq;
> > > struct sched_entity *se = &curr->se;
> > > + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
> > >
> > > for_each_sched_entity(se) {
> > > cfs_rq = cfs_rq_of(se);
> > > @@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > >
> > > update_misfit_status(curr, rq);
> > > update_overutilized_status(task_rq(curr));
> > > + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
> > > }
> >
> > I'm thinking this is the wrong place; should this not be in
> > scheduler_tick(), right before calling sched_class::task_tick() ? Surely
> > any execution will affect thermals, not only fair class execution.
>
> Right, but right now only CFS takes action when we overheat. That is,
> only CFS uses capacity_of() which is where the thermal signal gets
> reflected.

Sure, but we should still track the thermals unconditionally, even if
only CFS consumes it.

> We definitely could (and maybe should) make RT and DL react to thermal
> pressure as well when they're both capacity-aware. But perhaps that's
> for later ? Thoughts ?

Yeah, that's later head-aches. Even determining what to do there, except
panic() is going to be 'interesting'.

2020-01-17 13:18:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Fri, 17 Jan 2020 at 13:31, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jan 17, 2020 at 11:40:45AM +0000, Quentin Perret wrote:
> > On Thursday 16 Jan 2020 at 16:15:02 (+0100), Peter Zijlstra wrote:
> > > > @@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > > > {
> > > > struct cfs_rq *cfs_rq;
> > > > struct sched_entity *se = &curr->se;
> > > > + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
> > > >
> > > > for_each_sched_entity(se) {
> > > > cfs_rq = cfs_rq_of(se);
> > > > @@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > > >
> > > > update_misfit_status(curr, rq);
> > > > update_overutilized_status(task_rq(curr));
> > > > + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
> > > > }
> > >
> > > I'm thinking this is the wrong place; should this not be in
> > > scheduler_tick(), right before calling sched_class::task_tick() ? Surely
> > > any execution will affect thermals, not only fair class execution.
> >
> > Right, but right now only CFS takes action when we overheat. That is,
> > only CFS uses capacity_of() which is where the thermal signal gets
> > reflected.
>
> Sure, but we should still track the thermals unconditionally, even if
> only CFS consumes it.

I agree, tracking thermal pressure should happen even if no CFS tasks are running.

>
> > We definitely could (and maybe should) make RT and DL react to thermal
> > pressure as well when they're both capacity-aware. But perhaps that's
> > for later ? Thoughts ?
>
> Yeah, that's later head-aches. Even determining what to do there, except
> panic() is going to be 'interesting'.

2020-01-17 13:24:18

by Vincent Guittot

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jan 14, 2020 at 02:57:36PM -0500, Thara Gopinath wrote:
> > Introduce support in CFS periodic tick and other bookkeeping apis
> > to trigger the process of computing average thermal pressure for a
> > cpu. Also consider avg_thermal.load_avg in others_have_blocked
> > which allows for decay of pelt signals.
> >
> > Signed-off-by: Thara Gopinath <[email protected]>
> > ---
> > kernel/sched/fair.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8da0222..311bb0b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7470,6 +7470,9 @@ static inline bool others_have_blocked(struct rq *rq)
> > if (READ_ONCE(rq->avg_dl.util_avg))
> > return true;
> >
> > + if (READ_ONCE(rq->avg_thermal.load_avg))
> > + return true;
> > +
>
> Given that struct sched_avg is 1 cacheline, the above is a pointless
> guaranteed cacheline miss if the arch doesn't
> CONFIG_HAVE_SCHED_THERMAL_PRESSURE.
>
> > #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> > if (READ_ONCE(rq->avg_irq.util_avg))
> > return true;
> > @@ -7495,6 +7498,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> > {
> > const struct sched_class *curr_class;
> > u64 now = rq_clock_pelt(rq);
> > + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
> > bool decayed;
> >
> > /*
> > @@ -7505,6 +7509,8 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> >
> > decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> > update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > + update_thermal_load_avg(rq_clock_task(rq), rq,
> > + thermal_pressure) |
> > update_irq_load_avg(rq, 0);
> >
> > if (others_have_blocked(rq))
>
> That there indentation trainwreck is a reason to rename the function.
>
> decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
> update_irq_load_avg(rq, 0);
>
> Is much better.
>
> But now that you made me look at that, I noticed it's using a different
> clock -- it is _NOT_ using now/rq_clock_pelt(), which means it'll not be
> in sync with the other averages.
>
> Is there a good reason for that?

We don't need to apply frequency and CPU capacity invariance (which is
what rq_clock_pelt() does) to the thermal capping signal.

>
> > @@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> > {
> > struct cfs_rq *cfs_rq;
> > struct sched_entity *se = &curr->se;
> > + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
> >
> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> > @@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >
> > update_misfit_status(curr, rq);
> > update_overutilized_status(task_rq(curr));
> > + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
> > }
>
> I'm thinking this is the wrong place; should this not be in
> scheduler_tick(), right before calling sched_class::task_tick() ? Surely
> any execution will affect thermals, not only fair class execution.

2020-01-17 13:46:17

by Quentin Perret

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Friday 17 Jan 2020 at 14:17:36 (+0100), Vincent Guittot wrote:
> On Fri, 17 Jan 2020 at 13:31, Peter Zijlstra <[email protected]> wrote:
> > Sure, but we should still track the thermals unconditionally, even if
> > only CFS consumes it.
>
> I agree, tracking thermal pressure should happen even if no cfs task are running

Bah, I got confused with the cpu_capacity update stuff. This is for the
PELT signal, so +1, this should be updated unconditionally.

Thanks,
Quentin

2020-01-17 14:57:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
> On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:

> >
> > That there indentation trainwreck is a reason to rename the function.
> >
> > decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> > update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
> > update_irq_load_avg(rq, 0);
> >
> > Is much better.
> >
> > But now that you made me look at that, I noticed it's using a different
> > clock -- it is _NOT_ using now/rq_clock_pelt(), which means it'll not be
> > in sync with the other averages.
> >
> > Is there a good reason for that?
>
> We don't need to apply frequency and cpu capacity invariance on the
> thermal capping signal which is what rq_clock_pelt does

Hmm, I suppose that is true, and that really could've done with a
comment. Now clock_pelt is sort-of in sync with clock_task, but won't it
still give weird artifacts by having it on a slightly different basis?

Anyway, looking at this, would it make sense to remove the @now argument
from update_*_load_avg()? All those functions already take @rq, and
rq_clock_*() are fairly trivial inlines.

2020-01-17 15:42:04

by Vincent Guittot

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Fri, 17 Jan 2020 at 15:55, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
> > On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:
>
> > >
> > > That there indentation trainwreck is a reason to rename the function.
> > >
> > > decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> > > update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > > update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
> > > update_irq_load_avg(rq, 0);
> > >
> > > Is much better.
> > >
> > > But now that you made me look at that, I noticed it's using a different
> > > clock -- it is _NOT_ using now/rq_clock_pelt(), which means it'll not be
> > > in sync with the other averages.
> > >
> > > Is there a good reason for that?
> >
> > We don't need to apply frequency and cpu capacity invariance on the
> > thermal capping signal which is what rq_clock_pelt does
>
> Hmm, I suppose that is true, and that really could've done with a
> comment. Now clock_pelt is sort-of in sync with clock_task, but won't it
> still give weird artifacts by having it on a slightly different basis?

No, we should not. Weird artifacts happen when we add/subtract/propagate
signals between each other and then apply the PELT algorithm to the
results. In the case of the thermal signal, we only add it to the others
to update cpu_capacity, but the PELT algorithm is not then applied to
it. The error from some signals being at segment boundaries while others
are not is limited to 2% and doesn't accumulate over time.

>
> Anyway, looking at this, would it make sense to remove the @now argument
> from update_*_load_avg()? All those functions already take @rq, and
> rq_clock_*() are fairly trivial inlines.

TBH I was thinking of doing the opposite for update_irq_load_avg(), which
hides the clock that is used for irq_avg. This helps to easily identify
which signals use the exact same clock and can be mixed to create a new
PELT signal, and which can't.

2020-01-17 15:47:41

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v8 7/7] sched/fair: Enable tuning of decay period

On 01/17/2020 06:47 AM, Quentin Perret wrote:
> On Tuesday 14 Jan 2020 at 14:57:39 (-0500), Thara Gopinath wrote:
>> +static int __init setup_sched_thermal_decay_shift(char *str)
>> +{
>> + int _shift;
>> +
>> + if (kstrtoint(str, 0, &_shift))
>> + pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");
>
> Nit: looking at kstrtoint() it seems that _shift will be left unmodified
> upon failure. To avoid feeding a random value to clamp() below, perhaps
> initialize _shift to 0 ?

You are right. I will fix it. Thanks!
>
>> + sched_thermal_decay_shift = clamp(_shift, 0, 10);
>> + return 1;
>> +}
>
> Thanks,
> Quentin
>


--
Warm Regards
Thara

2020-01-17 20:21:35

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On 01/16/2020 10:15 AM, Peter Zijlstra wrote:
> On Tue, Jan 14, 2020 at 02:57:36PM -0500, Thara Gopinath wrote:
>> Introduce support in CFS periodic tick and other bookkeeping apis
>> to trigger the process of computing average thermal pressure for a
>> cpu. Also consider avg_thermal.load_avg in others_have_blocked
>> which allows for decay of pelt signals.
>>
>> Signed-off-by: Thara Gopinath <[email protected]>
>> ---
>> kernel/sched/fair.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8da0222..311bb0b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7470,6 +7470,9 @@ static inline bool others_have_blocked(struct rq *rq)
>> if (READ_ONCE(rq->avg_dl.util_avg))
>> return true;
>>
>> + if (READ_ONCE(rq->avg_thermal.load_avg))
>> + return true;
>> +
>
> Given that struct sched_avg is 1 cacheline, the above is a pointless
> guaranteed cacheline miss if the arch doesn't
> CONFIG_HAVE_SCHED_THERMAL_PRESSURE.
Thanks for the review, Peter. I see your suggestion in Patch 1 to fix
this issue. I will send out the next version implementing it.

>
>> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>> if (READ_ONCE(rq->avg_irq.util_avg))
>> return true;
>> @@ -7495,6 +7498,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>> {
>> const struct sched_class *curr_class;
>> u64 now = rq_clock_pelt(rq);
>> + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
>> bool decayed;
>>
>> /*
>> @@ -7505,6 +7509,8 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>>
>> decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
>> update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
>> + update_thermal_load_avg(rq_clock_task(rq), rq,
>> + thermal_pressure) |
>> update_irq_load_avg(rq, 0);
>>
>> if (others_have_blocked(rq))
>
> That there indentation trainwreck is a reason to rename the function.
>
> decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
> update_irq_load_avg(rq, 0);
>
> Is much better.

Did you mean here that update_thermal_load_avg() should be renamed to
something else?
>
> But now that you made me look at that, I noticed it's using a different
> clock -- it is _NOT_ using now/rq_clock_pelt(), which means it'll not be
> in sync with the other averages.
>
> Is there a good reason for that?

So I guess, as Vincent replied in his email, rq_clock_pelt() adjusts the
clock for frequency and CPU capacity invariance. The thermal pressure
signal already accounts for this when it is updated from the thermal
framework.
>
>> @@ -10275,6 +10281,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>> {
>> struct cfs_rq *cfs_rq;
>> struct sched_entity *se = &curr->se;
>> + unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
>>
>> for_each_sched_entity(se) {
>> cfs_rq = cfs_rq_of(se);
>> @@ -10286,6 +10293,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>>
>> update_misfit_status(curr, rq);
>> update_overutilized_status(task_rq(curr));
>> + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
>> }
>
> I'm thinking this is the wrong place; should this not be in
> scheduler_tick(), right before calling sched_class::task_tick() ? Surely
> any execution will affect thermals, not only fair class execution.

I read all the other comments on this. I agree; I will move this to
scheduler_tick().

>


--
Warm Regards
Thara

2020-01-23 19:17:04

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On 16/01/2020 16:17, Peter Zijlstra wrote:
> On Tue, Jan 14, 2020 at 02:57:33PM -0500, Thara Gopinath wrote:
>
>> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
>> index afff644..bf1e17b 100644
>> --- a/kernel/sched/pelt.h
>> +++ b/kernel/sched/pelt.h
>> @@ -7,6 +7,16 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>>
>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE

I assume your plan is to enable this for Arm and Arm64? Otherwise the
code in 3/7 should also be guarded by this.

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e688dfad0b72..9eb414b2c8b9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -164,6 +164,7 @@ config ARM64
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_RCU_TABLE_FREE
select HAVE_RSEQ
+ select HAVE_SCHED_THERMAL_PRESSURE
select HAVE_STACKPROTECTOR
select HAVE_SYSCALL_TRACEPOINTS

Currently it lives in the 'CPU/Task time and stats accounting' section
of .config, which doesn't feel right to me.

[...]

2020-01-24 20:54:07

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On 17/01/2020 16:39, Vincent Guittot wrote:
> On Fri, 17 Jan 2020 at 15:55, Peter Zijlstra <[email protected]> wrote:
>>
>> On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
>>> On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:
>>
>>>>
>>>> That there indentation trainwreck is a reason to rename the function.
>>>>
>>>> decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
>>>> update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
>>>> update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
>>>> update_irq_load_avg(rq, 0);
>>>>
>>>> Is much better.
>>>>
>>>> But now that you made me look at that, I noticed it's using a different
>>>> clock -- it is _NOT_ using now/rq_clock_pelt(), which means it'll not be
>>>> in sync with the other averages.
>>>>
>>>> Is there a good reason for that?
>>>
>>> We don't need to apply frequency and cpu capacity invariance on the
>>> thermal capping signal which is what rq_clock_pelt does
>>
>> Hmm, I suppose that is true, and that really could've done with a
>> comment. Now clock_pelt is sort-of in sync with clock_task, but won't it
>> still give weird artifacts by having it on a slightly different basis?
>
> No we should not. Weird artifacts happens when we
> add/subtract/propagate signals between each other and then apply pelt
> algorithm on the results. In the case of thermal signal, we only add
> it to others to update cpu_capacity but pelt algo is then not applied
> on it. The error because of some signals being at segment boundaries
> whereas others are not, is limited to 2% and doesn't accumulate over
> time.
>
>>
>> Anyway, looking at this, would it make sense to remove the @now argument
>> from update_*_load_avg()? All those functions already take @rq, and
>> rq_clock_*() are fairly trivial inlines.
>
> TBH I was thinking of doing the opposite for update_irq_load_avg which
> hides the clock that is used for irq_avg. This helps to easily
> identify which signals use the exact same clock and can be mixed to
> create a new pelt signal and which can't

The 'now' argument is one thing but why not:

-int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+int update_thermal_load_avg(u64 now, struct rq *rq)
{
+ u64 capacity = arch_cpu_thermal_pressure(cpu_of(rq));
+
if (___update_load_sum(now, &rq->avg_thermal,

This would make the call-sites __update_blocked_others() and
task_tick(_fair)() cleaner.

I guess the argument is not to pollute pelt.c. But it already contains
arch_scale_[freq|cpu]_capacity() for irq.


2020-01-24 20:54:24

by Vincent Guittot

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Fri, 24 Jan 2020 at 16:37, Dietmar Eggemann <[email protected]> wrote:
>
> On 17/01/2020 16:39, Vincent Guittot wrote:
> > On Fri, 17 Jan 2020 at 15:55, Peter Zijlstra <[email protected]> wrote:
> >>
> >> On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
> >>> On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:
> >>
> >>>>
> >>>> That there indentation trainwreck is a reason to rename the function.
> >>>>
> >>>> decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> >>>> update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> >>>> update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
> >>>> update_irq_load_avg(rq, 0);
> >>>>
> >>>> Is much better.
> >>>>
> >>>> But now that you made me look at that, I noticed it's using a different
> >>>> clock -- it is _NOT_ using now/rq_clock_pelt(), which means it'll not be
> >>>> in sync with the other averages.
> >>>>
> >>>> Is there a good reason for that?
> >>>
> >>> We don't need to apply frequency and cpu capacity invariance on the
> >>> thermal capping signal which is what rq_clock_pelt does
> >>
> >> Hmm, I suppose that is true, and that really could've done with a
> >> comment. Now clock_pelt is sort-of in sync with clock_task, but won't it
> >> still give weird artifacts by having it on a slightly different basis?
> >
> > No we should not. Weird artifacts happens when we
> > add/subtract/propagate signals between each other and then apply pelt
> > algorithm on the results. In the case of thermal signal, we only add
> > it to others to update cpu_capacity but pelt algo is then not applied
> > on it. The error because of some signals being at segment boundaries
> > whereas others are not, is limited to 2% and doesn't accumulate over
> > time.
> >
> >>
> >> Anyway, looking at this, would it make sense to remove the @now argument
> >> from update_*_load_avg()? All those functions already take @rq, and
> >> rq_clock_*() are fairly trivial inlines.
> >
> > TBH I was thinking of doing the opposite for update_irq_load_avg which
> > hides the clock that is used for irq_avg. This helps to easily
> > identify which signals use the exact same clock and can be mixed to
> > create a new pelt signal and which can't
>
> The 'now' argument is one thing but why not:
>
> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +int update_thermal_load_avg(u64 now, struct rq *rq)
> {
> + u64 capacity = arch_cpu_thermal_pressure(cpu_of(rq));
> +
> if (___update_load_sum(now, &rq->avg_thermal,
>
> This would make the call-sites __update_blocked_others() and
> task_tick(_fair)() cleaner.

I prefer to keep the capacity as an argument. This is more aligned with
the others, which provide the value of the signal to apply.

>
> I guess the argument is not to pollute pelt.c. But it already contains

You've got it. I don't want to pollute pelt.c with things not related
to PELT, thermal being one example.

> arch_scale_[freq|cpu]_capacity() for irq.
>
>

2020-01-24 21:01:50

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On 01/23/2020 02:15 PM, Dietmar Eggemann wrote:
> On 16/01/2020 16:17, Peter Zijlstra wrote:
>> On Tue, Jan 14, 2020 at 02:57:33PM -0500, Thara Gopinath wrote:
>>
>>> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
>>> index afff644..bf1e17b 100644
>>> --- a/kernel/sched/pelt.h
>>> +++ b/kernel/sched/pelt.h
>>> @@ -7,6 +7,16 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>>> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>>> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>>>
>>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>
> I assume your plan is to enable this for Arm and Arm64? Otherwise the
> code in 3/7 should also be guarded by this.

Yes, I think it should be enabled for arm and arm64. I can submit a
patch enabling it after this series is accepted.
Nevertheless, I don't understand why patch 3/7 is tied to this.
This portion is the averaging of thermal pressure; patch 3/7 stores
and retrieves the instantaneous value.
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e688dfad0b72..9eb414b2c8b9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -164,6 +164,7 @@ config ARM64
> select HAVE_FUNCTION_ARG_ACCESS_API
> select HAVE_RCU_TABLE_FREE
> select HAVE_RSEQ
> + select HAVE_SCHED_THERMAL_PRESSURE
> select HAVE_STACKPROTECTOR
> select HAVE_SYSCALL_TRACEPOINTS
>
> Currently it lives in the 'CPU/Task time and stats accounting' of
> .config which doesn't feel right to me.

It is CPU statistics if you think about it. It is also the same .config
section where CONFIG_HAVE_SCHED_AVG_IRQ is defined.
>
> [...]
>


--
Warm Regards
Thara

2020-01-27 10:07:55

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On 24/01/2020 20:07, Thara Gopinath wrote:
> On 01/23/2020 02:15 PM, Dietmar Eggemann wrote:
>> On 16/01/2020 16:17, Peter Zijlstra wrote:
>>> On Tue, Jan 14, 2020 at 02:57:33PM -0500, Thara Gopinath wrote:

[...]

>>>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>>
>> I assume your plan is to enable this for Arm and Arm64? Otherwise the
>> code in 3/7 should also be guarded by this.
>
> Yes. I think it should be enabled for arm and arm64. I can submit a
> patch after this series is accepted to enable it.
> Nevertheless , I don't understand why is patch 3/7 tied with this.
> This portion is the averaging of thermal pressure. Patch 3/7 is to store
> and retrieve the instantaneous value.

3/7 is the code which overrides the scheduler's default
arch_cpu_thermal_pressure() [include/linux/sched/topology.h]. I see it
more as the engine driving thermal pressure tracking in the scheduler.

So all the code in 3/7 only makes sense if HAVE_SCHED_THERMAL_PRESSURE
is selected by the arch. IMHO, 3/7 and enabling it for Arm/Arm64 should
go in together.

>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index e688dfad0b72..9eb414b2c8b9 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -164,6 +164,7 @@ config ARM64
>> select HAVE_FUNCTION_ARG_ACCESS_API
>> select HAVE_RCU_TABLE_FREE
>> select HAVE_RSEQ
>> + select HAVE_SCHED_THERMAL_PRESSURE
>> select HAVE_STACKPROTECTOR
>> select HAVE_SYSCALL_TRACEPOINTS
>>
>> Currently it lives in the 'CPU/Task time and stats accounting' of
>> .config which doesn't feel right to me.
>
> It is cpu statistics if you think about it. It is also the same .config
> where CONFIG_HAVE_SCHED_AVG_IRQ is defined.

OK, makes sense.

2020-01-27 12:39:34

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On 24/01/2020 16:45, Vincent Guittot wrote:
> On Fri, 24 Jan 2020 at 16:37, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 17/01/2020 16:39, Vincent Guittot wrote:
>>> On Fri, 17 Jan 2020 at 15:55, Peter Zijlstra <[email protected]> wrote:
>>>>
>>>> On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
>>>>> On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:

[...]

>> The 'now' argument is one thing but why not:
>>
>> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
>> +int update_thermal_load_avg(u64 now, struct rq *rq)
>> {
>> + u64 capacity = arch_cpu_thermal_pressure(cpu_of(rq));
>> +
>> if (___update_load_sum(now, &rq->avg_thermal,
>>
>> This would make the call-sites __update_blocked_others() and
>> task_tick(_fair)() cleaner.
>
> I prefer to keep the capacity as argument. This is more aligned with
> others that provides the value of the signal to apply
>
>>
>> I guess the argument is not to pollute pelt.c. But it already contains
>
> you've got it. I don't want to pollute the pelt.c file with things not
> related to pelt but thermal as an example.
>
>> arch_scale_[freq|cpu]_capacity() for irq.

But isn't arch_cpu_thermal_pressure() exactly the same kind of thing as
arch_scale_cpu_capacity() and arch_scale_freq_capacity()?

All of them are defined by default within the scheduler code
[include/linux/sched/topology.h or kernel/sched/sched.h] and can be
overwritten by arch code with a fast implementation (e.g. returning a
per-cpu variable).

So why is using arch_scale_freq_capacity() and arch_scale_cpu_capacity()
in update_irq_load_avg [kernel/sched/pelt.c] and update_rq_clock_pelt()
[kernel/sched/pelt.h] OK but arch_cpu_thermal_pressure() in
update_thermal_load_avg() [kernel/sched/pelt.c] not?

Shouldn't arch_cpu_thermal_pressure() be called
arch_scale_thermal_capacity() to highlight the fact that those three
functions are doing the same thing: scaling capacity by something (cpu,
frequency or thermal)?

2020-01-27 15:16:26

by Vincent Guittot

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Mon, 27 Jan 2020 at 13:09, Dietmar Eggemann <[email protected]> wrote:
>
> On 24/01/2020 16:45, Vincent Guittot wrote:
> > On Fri, 24 Jan 2020 at 16:37, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 17/01/2020 16:39, Vincent Guittot wrote:
> >>> On Fri, 17 Jan 2020 at 15:55, Peter Zijlstra <[email protected]> wrote:
> >>>>
> >>>> On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
> >>>>> On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:
>
> [...]
>
> >> The 'now' argument is one thing but why not:
> >>
> >> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> >> +int update_thermal_load_avg(u64 now, struct rq *rq)
> >> {
> >> + u64 capacity = arch_cpu_thermal_pressure(cpu_of(rq));
> >> +
> >> if (___update_load_sum(now, &rq->avg_thermal,
> >>
> >> This would make the call-sites __update_blocked_others() and
> >> task_tick(_fair)() cleaner.
> >
> > I prefer to keep the capacity as argument. This is more aligned with
> > others that provides the value of the signal to apply
> >
> >>
> >> I guess the argument is not to pollute pelt.c. But it already contains
> >
> > you've got it. I don't want to pollute the pelt.c file with things not
> > related to pelt but thermal as an example.
> >
> >> arch_scale_[freq|cpu]_capacity() for irq.
>
> But isn't arch_cpu_thermal_pressure() not exactly the same as
> arch_scale_cpu_capacity() and arch_scale_freq_capacity()?
>
> All of them are defined by default within the scheduler code
> [include/linux/sched/topology.h or kernel/sched/sched.h] and can be
> overwritten by arch code with a fast implementation (e.g. returning a
> per-cpu variable).
>
> So why is using arch_scale_freq_capacity() and arch_scale_cpu_capacity()
> in update_irq_load_avg [kernel/sched/pelt.c] and update_rq_clock_pelt()

As explained previously, update_irq_load_avg() is an exception and not
the example to follow. update_rt/dl_rq_load_avg() are the examples to
follow, and fixing the update_irq_load_avg() exception is on my todo
list.

> [kernel/sched/pelt.h] OK but arch_cpu_thermal_pressure() in
> update_thermal_load_avg() [kernel/sched/pelt.c] not?
>
> Shouldn't arch_cpu_thermal_pressure() not be called
> arch_scale_thermal_capacity() to highlight the fact that those three

Quoted from cover letter https://lkml.org/lkml/2020/1/14/1164:
"
v6->v7:
- ...
- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
as per review comments from Peter, Dietmar and Ionela.
-...

"

> functions are doing the same thing, scaling capacity by something (cpu,
> frequency or thermal)?

2020-01-28 13:33:47

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On 01/27/2020 04:28 AM, Dietmar Eggemann wrote:
> On 24/01/2020 20:07, Thara Gopinath wrote:
>> On 01/23/2020 02:15 PM, Dietmar Eggemann wrote:
>>> On 16/01/2020 16:17, Peter Zijlstra wrote:
>>>> On Tue, Jan 14, 2020 at 02:57:33PM -0500, Thara Gopinath wrote:
>
> [...]
>
>>>>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>>>
>>> I assume your plan is to enable this for Arm and Arm64? Otherwise the
>>> code in 3/7 should also be guarded by this.
>>
>> Yes. I think it should be enabled for arm and arm64. I can submit a
>> patch after this series is accepted to enable it.
>> Nevertheless , I don't understand why is patch 3/7 tied with this.
>> This portion is the averaging of thermal pressure. Patch 3/7 is to store
>> and retrieve the instantaneous value.
>
> 3/7 is the code which overwrites the scheduler default
> arch_cpu_thermal_pressure() [include/linux/sched/topology.h]. I see it
> more of the engine to drive thermal pressure tracking in the scheduler.
>
> So all the code in 3/7 only makes sense if HAVE_SCHED_THERMAL_PRESSURE
> is selected by the arch. IMHO, 3/7 and enabling it for Arm/Arm64 should
> go in together.
Hi Dietmar,
I will have to respectfully disagree here. We explicitly separated this
stuff (updating and reading of the instantaneous thermal pressure) from
the scheduler. To me, putting all of this under
HAVE_SCHED_THERMAL_PRESSURE is equivalent to keeping it in
scheduler-specific code. But I will provide a patch enabling the option
HAVE_SCHED_THERMAL_PRESSURE in the arm64 defconfig. arm is trickier,
though, as it has a bunch of SoC defconfigs; I will leave it out for now.



--
Warm Regards
Thara

2020-01-28 16:16:56

by Dietmar Eggemann

Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On 28/01/2020 14:32, Thara Gopinath wrote:
> On 01/27/2020 04:28 AM, Dietmar Eggemann wrote:
>> On 24/01/2020 20:07, Thara Gopinath wrote:
>>> On 01/23/2020 02:15 PM, Dietmar Eggemann wrote:
>>>> On 16/01/2020 16:17, Peter Zijlstra wrote:
>>>>> On Tue, Jan 14, 2020 at 02:57:33PM -0500, Thara Gopinath wrote:
>>
>> [...]
>>
>>>>>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>>>>
>>>> I assume your plan is to enable this for Arm and Arm64? Otherwise the
>>>> code in 3/7 should also be guarded by this.
>>>
>>> Yes. I think it should be enabled for arm and arm64. I can submit a
>>> patch after this series is accepted to enable it.
>>> Nevertheless, I don't understand why patch 3/7 is tied to this.
>>> This portion is the averaging of thermal pressure. Patch 3/7 is to store
>>> and retrieve the instantaneous value.
>>
>> 3/7 is the code which overwrites the scheduler default
>> arch_cpu_thermal_pressure() [include/linux/sched/topology.h]. I see it
>> more as the engine to drive thermal pressure tracking in the scheduler.
>>
>> So all the code in 3/7 only makes sense if HAVE_SCHED_THERMAL_PRESSURE
>> is selected by the arch. IMHO, 3/7 and enabling it for Arm/Arm64 should
>> go in together.
> Hi Dietmar,
> I will have to respectfully disagree here. We explicitly separated out
> this stuff (updating and reading of instantaneous thermal pressure) from
> the scheduler. To me, putting all this under HAVE_SCHED_THERMAL_PRESSURE is
> equivalent to keeping this stuff in scheduler-specific code. But I will
> provide a patch enabling the option HAVE_SCHED_THERMAL_PRESSURE in
> arm64/defconfig. arm is trickier, though, as it has a bunch of SoC
> defconfigs. I will leave it out for now.

If you enable HAVE_SCHED_THERMAL_PRESSURE for Arm64 w/ this patch-set
then I don't see any disagreement here.

2020-01-29 15:51:58

by Dietmar Eggemann

Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On 27/01/2020 16:15, Vincent Guittot wrote:
> On Mon, 27 Jan 2020 at 13:09, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 24/01/2020 16:45, Vincent Guittot wrote:
>>> On Fri, 24 Jan 2020 at 16:37, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 17/01/2020 16:39, Vincent Guittot wrote:
>>>>> On Fri, 17 Jan 2020 at 15:55, Peter Zijlstra <[email protected]> wrote:
>>>>>>
>>>>>> On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
>>>>>>> On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:
>>
>> [...]
>>
>>>> The 'now' argument is one thing but why not:
>>>>
>>>> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
>>>> +int update_thermal_load_avg(u64 now, struct rq *rq)
>>>> {
>>>> + u64 capacity = arch_cpu_thermal_pressure(cpu_of(rq));
>>>> +
>>>> if (___update_load_sum(now, &rq->avg_thermal,
>>>>
>>>> This would make the call-sites __update_blocked_others() and
>>>> task_tick(_fair)() cleaner.
>>>
>>> I prefer to keep the capacity as argument. This is more aligned with
>>> others that provide the value of the signal to apply
>>>
>>>>
>>>> I guess the argument is not to pollute pelt.c. But it already contains
>>>
>>> you've got it. I don't want to pollute the pelt.c file with things not
>>> related to pelt, thermal being an example.
>>>
>>>> arch_scale_[freq|cpu]_capacity() for irq.
>>
>> But isn't arch_cpu_thermal_pressure() exactly the same as
>> arch_scale_cpu_capacity() and arch_scale_freq_capacity()?
>>
>> All of them are defined by default within the scheduler code
>> [include/linux/sched/topology.h or kernel/sched/sched.h] and can be
>> overwritten by arch code with a fast implementation (e.g. returning a
>> per-cpu variable).
>>
>> So why is using arch_scale_freq_capacity() and arch_scale_cpu_capacity()
>> in update_irq_load_avg [kernel/sched/pelt.c] and update_rq_clock_pelt()
>
> As explained previously, update_irq_load_avg is an exception and not
> the example to follow. update_rt/dl_rq_load_avg are the examples to
> follow, and fixing the update_irq_load_avg exception is on my todo list.

There is already a v9 but I comment here so the thread stays intact.

I guess this thread leads nowhere. For me the question is: do we
review against existing code or against possible future changes? The
arguments haven't convinced me so far.
But we're not talking about functional issues here, so I won't continue
to push for a change on this one.

>> [kernel/sched/pelt.h] OK but arch_cpu_thermal_pressure() in
>> update_thermal_load_avg() [kernel/sched/pelt.c] not?
>>
>> Shouldn't arch_cpu_thermal_pressure() be called
>> arch_scale_thermal_capacity() to highlight the fact that those three
>
> Quoted from cover letter https://lkml.org/lkml/2020/1/14/1164:
> "
> v6->v7:
> - ...
> - Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
> as per review comments from Peter, Dietmar and Ionela.
> -...
>
> "

I went back to the v6 review. Peter originally asked for a better name
(or an additional comment) for arch_scale_thermal_capacity() because the
return value is not capacity.

So IMHO arch_scale_thermal_pressure() is a good name for keeping this
aligned w/ the other arch_scale_* functions and to address this review
comment.

arch_scale_cpu_capacity() - scale capacity by cpu
arch_scale_freq_capacity() - scale capacity by frequency
arch_scale_thermal_pressure() - scale pressure (1 - capacity) by thermal

[...]

2020-01-30 09:50:44

by Vincent Guittot

[permalink] [raw]
Subject: Re: [Patch v8 4/7] sched/fair: Enable periodic update of average thermal pressure

On Wed, 29 Jan 2020 at 16:41, Dietmar Eggemann <[email protected]> wrote:
>
> On 27/01/2020 16:15, Vincent Guittot wrote:
> > On Mon, 27 Jan 2020 at 13:09, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 24/01/2020 16:45, Vincent Guittot wrote:
> >>> On Fri, 24 Jan 2020 at 16:37, Dietmar Eggemann <[email protected]> wrote:
> >>>>
> >>>> On 17/01/2020 16:39, Vincent Guittot wrote:
> >>>>> On Fri, 17 Jan 2020 at 15:55, Peter Zijlstra <[email protected]> wrote:
> >>>>>>
> >>>>>> On Fri, Jan 17, 2020 at 02:22:51PM +0100, Vincent Guittot wrote:
> >>>>>>> On Thu, 16 Jan 2020 at 16:15, Peter Zijlstra <[email protected]> wrote:
> >>
> >> [...]
> >>
> >>>> The 'now' argument is one thing but why not:
> >>>>
> >>>> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> >>>> +int update_thermal_load_avg(u64 now, struct rq *rq)
> >>>> {
> >>>> + u64 capacity = arch_cpu_thermal_pressure(cpu_of(rq));
> >>>> +
> >>>> if (___update_load_sum(now, &rq->avg_thermal,
> >>>>
> >>>> This would make the call-sites __update_blocked_others() and
> >>>> task_tick(_fair)() cleaner.
> >>>
> >>> I prefer to keep the capacity as argument. This is more aligned with
> >>> others that provide the value of the signal to apply
> >>>
> >>>>
> >>>> I guess the argument is not to pollute pelt.c. But it already contains
> >>>
> >>> you've got it. I don't want to pollute the pelt.c file with things not
> >>> related to pelt, thermal being an example.
> >>>
> >>>> arch_scale_[freq|cpu]_capacity() for irq.
> >>
> >> But isn't arch_cpu_thermal_pressure() exactly the same as
> >> arch_scale_cpu_capacity() and arch_scale_freq_capacity()?
> >>
> >> All of them are defined by default within the scheduler code
> >> [include/linux/sched/topology.h or kernel/sched/sched.h] and can be
> >> overwritten by arch code with a fast implementation (e.g. returning a
> >> per-cpu variable).
> >>
> >> So why is using arch_scale_freq_capacity() and arch_scale_cpu_capacity()
> >> in update_irq_load_avg [kernel/sched/pelt.c] and update_rq_clock_pelt()
> >
> > As explained previously, update_irq_load_avg is an exception and not
> > the example to follow. update_rt/dl_rq_load_avg are the examples to
> > follow, and fixing the update_irq_load_avg exception is on my todo list.
>
> There is already a v9 but I comment here so the thread stays intact.
>
> I guess this thread leads nowhere. For me the question is: do we
> review against existing code or against possible future changes? The
> arguments haven't convinced me so far.
> But we're not talking about functional issues here, so I won't continue
> to push for a change on this one.
>
> >> [kernel/sched/pelt.h] OK but arch_cpu_thermal_pressure() in
> >> update_thermal_load_avg() [kernel/sched/pelt.c] not?
> >>
> >> Shouldn't arch_cpu_thermal_pressure() be called
> >> arch_scale_thermal_capacity() to highlight the fact that those three
> >
> > Quoted from cover letter https://lkml.org/lkml/2020/1/14/1164:
> > "
> > v6->v7:
> > - ...
> > - Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
> > as per review comments from Peter, Dietmar and Ionela.
> > -...
> >
> > "
>
> I went back to the v6 review. Peter originally asked for a better name
> (or an additional comment) for arch_scale_thermal_capacity() because the
> return value is not capacity.
>
> So IMHO arch_scale_thermal_pressure() is a good name for keeping this
> aligned w/ the other arch_scale_* functions and to address this review
> comment.
>
> arch_scale_cpu_capacity() - scale capacity by cpu
> arch_scale_freq_capacity() - scale capacity by frequency
> arch_scale_thermal_pressure() - scale pressure (1 - capacity) by thermal

If arch_scale_thermal_pressure() is ok for everybody, I'm fine too.

I don't have a strong opinion about the name of the function as long
as we don't go back and forth.

>
> [...]

2020-02-13 12:04:32

by Amit Kucheria

Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On Wed, Jan 15, 2020 at 1:27 AM Thara Gopinath
<[email protected]> wrote:
>
> Extrapolating on the existing framework to track rt/dl utilization using
> pelt signals, add a similar mechanism to track thermal pressure. The
> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg, the
> average thermal pressure is tracked through load_avg. This is because
> thermal pressure signal is weighted "delta" capacity and is not
> binary(util_avg is binary). "delta capacity" here means delta between the
> actual capacity of a cpu and the decreased capacity a cpu due to a thermal
> event.
>
> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_load_avg can be called
> to do the periodic bookkeeping (accumulate, decay and average) of the
> thermal pressure.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> Reviewed-by: Vincent Guittot <[email protected]>
> ---
>
> v6->v7:
> - Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
> update_thermal_load_avg in unsupported architectures as per
> review comments from Peter, Dietmar and Quentin.
> - Updated comment for update_thermal_load_avg as per review
> comments from Peter and Dietmar.
> v7->v8:
> - Fixed typo in defining update_thermal_load_avg which was
> causing build errors (reported by kbuild test report)
>
> include/trace/events/sched.h | 4 ++++
> init/Kconfig | 4 ++++
> kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++
> kernel/sched/pelt.h | 16 ++++++++++++++++
> kernel/sched/sched.h | 1 +
> 5 files changed, 56 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 420e80e..a8fb667 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -613,6 +613,10 @@ DECLARE_TRACE(pelt_dl_tp,
> TP_PROTO(struct rq *rq),
> TP_ARGS(rq));
>
> +DECLARE_TRACE(pelt_thermal_tp,
> + TP_PROTO(struct rq *rq),
> + TP_ARGS(rq));
> +
> DECLARE_TRACE(pelt_irq_tp,
> TP_PROTO(struct rq *rq),
> TP_ARGS(rq));
> diff --git a/init/Kconfig b/init/Kconfig
> index f6a4a91..c16ea88 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -450,6 +450,10 @@ config HAVE_SCHED_AVG_IRQ
> depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
> depends on SMP
>
> +config HAVE_SCHED_THERMAL_PRESSURE
> + bool "Enable periodic averaging of thermal pressure"
> + depends on SMP
> +
> config BSD_PROCESS_ACCT
> bool "BSD Process Accounting"
> depends on MULTIUSER
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index bd006b7..5d1fbf0 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -367,6 +367,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> return 0;
> }
>
> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> +/*
> + * thermal:
> + *
> + * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
> + *
> + * util_avg and runnable_load_avg are not supported and meaningless.
> + *
> + * Unlike rt/dl utilization tracking that track time spent by a cpu
> + * running a rt/dl task through util_avg, the average thermal pressure is
> + * tracked through load_avg. This is because thermal pressure signal is
> + * weighted "delta" capacity and is not binary(util_avg is binary). "delta

May I suggest a slight rewording here and in the commit description,

This is because the thermal pressure signal is weighted "delta"
capacity unlike util_avg which is binary.

It would also help if you expanded on what you mean by binary in the
commit description and on how the delta capacity is weighted.

> + * capacity" here means delta between the actual capacity of a cpu and the
> + * decreased capacity a cpu due to a thermal event.

This could be shortened to:

Delta capacity of cpu = Actual capacity - Decreased capacity due to
thermal event

> + */
> +
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> + if (___update_load_sum(now, &rq->avg_thermal,
> + capacity,
> + capacity,
> + capacity)) {
> + ___update_load_avg(&rq->avg_thermal, 1, 1);
> + trace_pelt_thermal_tp(rq);
> + return 1;
> + }
> +
> + return 0;
> +}
> +#endif
> +
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> /*
> * irq:
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..bf1e17b 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -7,6 +7,16 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>
> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
> +#else
> +static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> int update_irq_load_avg(struct rq *rq, u64 running);
> #else
> @@ -159,6 +169,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> }
>
> static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> + return 0;
> +}
> +
> +static inline int
> update_irq_load_avg(struct rq *rq, u64 running)
> {
> return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 280a3c7..37bd7ef 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -944,6 +944,7 @@ struct rq {
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> struct sched_avg avg_irq;
> #endif
> + struct sched_avg avg_thermal;
> u64 idle_stamp;
> u64 avg_idle;
>
> --
> 2.1.4
>

2020-02-13 12:32:15

by Amit Kucheria

Subject: Re: [Patch v8 1/7] sched/pelt: Add support to track thermal pressure

On Thu, Feb 13, 2020 at 5:33 PM Amit Kucheria <[email protected]> wrote:
>
> On Wed, Jan 15, 2020 at 1:27 AM Thara Gopinath
> <[email protected]> wrote:
> >
> > Extrapolating on the existing framework to track rt/dl utilization using
> > pelt signals, add a similar mechanism to track thermal pressure. The
> > difference here from rt/dl utilization tracking is that, instead of
> > tracking time spent by a cpu running a rt/dl task through util_avg, the
> > average thermal pressure is tracked through load_avg. This is because
> > thermal pressure signal is weighted "delta" capacity and is not
> > binary(util_avg is binary). "delta capacity" here means delta between the
> > actual capacity of a cpu and the decreased capacity a cpu due to a thermal
> > event.
> >
> > In order to track average thermal pressure, a new sched_avg variable
> > avg_thermal is introduced. Function update_thermal_load_avg can be called
> > to do the periodic bookkeeping (accumulate, decay and average) of the
> > thermal pressure.
> >
> > Signed-off-by: Thara Gopinath <[email protected]>
> > Reviewed-by: Vincent Guittot <[email protected]>
> > ---
> >
> > v6->v7:
> > - Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
> > update_thermal_load_avg in unsupported architectures as per
> > review comments from Peter, Dietmar and Quentin.
> > - Updated comment for update_thermal_load_avg as per review
> > comments from Peter and Dietmar.
> > v7->v8:
> > - Fixed typo in defining update_thermal_load_avg which was
> > causing build errors (reported by kbuild test report)
> >
> > include/trace/events/sched.h | 4 ++++
> > init/Kconfig | 4 ++++
> > kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++
> > kernel/sched/pelt.h | 16 ++++++++++++++++
> > kernel/sched/sched.h | 1 +
> > 5 files changed, 56 insertions(+)
> >
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index 420e80e..a8fb667 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -613,6 +613,10 @@ DECLARE_TRACE(pelt_dl_tp,
> > TP_PROTO(struct rq *rq),
> > TP_ARGS(rq));
> >
> > +DECLARE_TRACE(pelt_thermal_tp,
> > + TP_PROTO(struct rq *rq),
> > + TP_ARGS(rq));
> > +
> > DECLARE_TRACE(pelt_irq_tp,
> > TP_PROTO(struct rq *rq),
> > TP_ARGS(rq));
> > diff --git a/init/Kconfig b/init/Kconfig
> > index f6a4a91..c16ea88 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -450,6 +450,10 @@ config HAVE_SCHED_AVG_IRQ
> > depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
> > depends on SMP
> >
> > +config HAVE_SCHED_THERMAL_PRESSURE
> > + bool "Enable periodic averaging of thermal pressure"
> > + depends on SMP
> > +
> > config BSD_PROCESS_ACCT
> > bool "BSD Process Accounting"
> > depends on MULTIUSER
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index bd006b7..5d1fbf0 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -367,6 +367,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> > +/*
> > + * thermal:
> > + *
> > + * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
> > + *
> > + * util_avg and runnable_load_avg are not supported and meaningless.
> > + *
> > + * Unlike rt/dl utilization tracking that track time spent by a cpu
> > + * running a rt/dl task through util_avg, the average thermal pressure is
> > + * tracked through load_avg. This is because thermal pressure signal is
> > + * weighted "delta" capacity and is not binary(util_avg is binary). "delta
>
> May I suggest a slight rewording here and in the commit description,
>
> This is because the thermal pressure signal is weighted "delta"
> capacity unlike util_avg which is binary.
>
> It would also help if you expanded on what you mean by binary in the
> commit description and on how the delta capacity is weighted.
>
> > + * capacity" here means delta between the actual capacity of a cpu and the
> > + * decreased capacity a cpu due to a thermal event.
>
> This could be shortened to:
>
> Delta capacity of cpu = Actual capacity - Decreased capacity due to
> thermal event
>

Please ignore. I should have sent this against v9; it was languishing
in my mailbox since v8 and I sent it out by mistake.