2020-01-28 22:37:15

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 0/8] Introduce Thermal Pressure

Thermal governors can respond to an overheat event of a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the kernel, task scheduler is
not notified of capping of maximum frequency of a cpu.
In other words, scheduler is unaware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefits identified are better task placement among available
cpus in event of overheating which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometime be erroneous
as there can be mismatch between the actual capping of capacity
and scheduler recording it. Thus solution is to have a weighted
average per cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. Exisiting algorithm
in the kernel scheduler pelt framework is re-used to calculate
the weighted average. This patch series also defines a sysctl
inerface to allow for a configurable decay period.

Regarding testing, basic build, boot and sanity testing have been
performed on db845c platform with debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of step wise governor in dealing with big little systems,
trip point 0 temperature was made assymetric between cpus in little
cluster and big cluster; the idea being that
big core will heat up and cpu cooling device will throttle the
frequency of the big cores faster, there by limiting the maximum available
capacity and the scheduler will spread out tasks to little cores as well.

Test Results

Hackbench: 1 group , 30000 loops, 10 runs
Result SD
(Secs) (% of mean)
No Thermal Pressure 14.03 2.69%
Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%

Dhrystone Run Time : 20 threads, 3000 MLOOPS
Result SD
(Secs) (% of mean)
No Thermal Pressure 9.452 4.49%
Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%

A Brief History

The first version of this patch-series was posted with resuing
PELT algorithm to decay thermal pressure signal. The discussions
that followed were around whether intanteneous thermal pressure
solution is better and whether a stand-alone algortihm to accumulate
and decay thermal pressure is more appropriate than re-using the
PELT framework.
Tests on Hikey960 showed the stand-alone algorithm performing slightly
better than resuing PELT algorithm and V2 was posted with the stand
alone algorithm. Test results were shared as part of this series.
Discussions were around re-using PELT algorithm and running
further tests with more granular decay period.

For some time after this development was impeded due to hardware
unavailability, some other unforseen and possibly unfortunate events.
For this version, h/w was switched from hikey960 to db845c.
Also Instantaneous thermal pressure was never tested as part of this
cycle as it is clear that weighted average is a better implementation.
The non-PELT algorithm never gave any conclusive results to prove that it
is better than reusing PELT algorithm, in this round of testing.
Also reusing PELT algorithm means thermal pressure tracks the
other utilization signals in the scheduler.

v3->v4:
- "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
is dropped as it is no longer needed following changes in other
other patches.
- rest of the change log mentioned in specific patches.

v5->v6:
- "Added arch_ interface APIs to access and update thermal pressure.
Moved declaration of per cpu thermal_pressure valriable and
infrastructure to update the variable to topology files.

v6->v7:
- Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
update_thermal_load_avg in unsupported architectures as per
review comments from Peter, Dietmar and Quentin.
- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
as per review comments from Peter, Dietmar and Ionela.
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity(thermal pressure) as per
Ionela's review comments. Hence the calculation for delta
capacity(thermal pressure) is moved to cpufreq_cooling.c.
- Fixed a bunch of spelling typos.

v7->v8:
- Fixed typo in defining update_thermal_load_avg which was
causing build errors (reported by kbuild test report)

v8->v9:
- Defined thermal_load_avg to read rq->avg_thermal.load_avg and
avoid cacheline miss in unsupported cases as per Peter's
suggestion.
- Moved periodic triggering of thermal pressure averaging from CFS
tick function to generic scheduler core tick function.
- Moved rq_clock_thermal from fair.c to sched.h to enable using
the function from multiple files.
- Initialized the __shift to 0 in setup_sched_thermal_decay_shift
as per Quentin's suggestion
- Added an extra patch enabling CONFIG_HAVE_SCHED_THERMAL_PRESSURE
as per Dietmar's request.

Thara Gopinath (8):
sched/pelt: Add support to track thermal pressure
sched/topology: Add hook to read per cpu thermal pressure.
arm,arm64,drivers:Add infrastructure to store and update instantaneous
thermal pressure
sched/fair: Enable periodic update of average thermal pressure
sched/fair: update cpu_capacity to reflect thermal pressure
thermal/cpu-cooling: Update thermal pressure in case of a maximum
frequency capping
sched/fair: Enable tuning of decay period
arm64: Enable averaging of thermal pressure for arm64 based SoCs

Documentation/admin-guide/kernel-parameters.txt | 5 ++++
arch/arm/include/asm/topology.h | 3 +++
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/topology.h | 3 +++
drivers/base/arch_topology.c | 11 +++++++++
drivers/thermal/cpufreq_cooling.c | 19 +++++++++++++--
include/linux/arch_topology.h | 10 ++++++++
include/linux/sched/topology.h | 8 +++++++
include/trace/events/sched.h | 4 ++++
init/Kconfig | 4 ++++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 25 ++++++++++++++++++++
kernel/sched/pelt.c | 31 +++++++++++++++++++++++++
kernel/sched/pelt.h | 31 +++++++++++++++++++++++++
kernel/sched/sched.h | 21 +++++++++++++++++
15 files changed, 177 insertions(+), 2 deletions(-)

--
2.1.4


2020-01-28 22:37:17

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 1/8] sched/pelt: Add support to track thermal pressure

Extrapolating on the existing framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressure. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a cpu running a rt/dl task through util_avg, the
average thermal pressure is tracked through load_avg. This is because
thermal pressure signal is weighted "delta" capacity and is not
binary(util_avg is binary). "delta capacity" here means delta between the
actual capacity of a cpu and the decreased capacity a cpu due to a thermal
event.

In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_load_avg can be called
to do the periodic bookkeeping (accumulate, decay and average) of the
thermal pressure.

Signed-off-by: Thara Gopinath <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
---

v6->v7:
- Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
update_thermal_load_avg in unsupported architectures as per
review comments from Peter, Dietmar and Quentin.
- Updated comment for update_thermal_load_avg as per review
comments from Peter and Dietmar.
v7->v8:
- Fixed typo in defining update_thermal_load_avg which was
causing build errors (reported by kbuild test report)
v8->v9:
- Defined thermal_load_avg to read rq->avg_thermal.load_avg and
avoid cacheline miss in unsupported cases as per Peter's
suggestion.

include/trace/events/sched.h | 4 ++++
init/Kconfig | 4 ++++
kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 31 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +++
5 files changed, 73 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e..a8fb667 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -613,6 +613,10 @@ DECLARE_TRACE(pelt_dl_tp,
TP_PROTO(struct rq *rq),
TP_ARGS(rq));

+DECLARE_TRACE(pelt_thermal_tp,
+ TP_PROTO(struct rq *rq),
+ TP_ARGS(rq));
+
DECLARE_TRACE(pelt_irq_tp,
TP_PROTO(struct rq *rq),
TP_ARGS(rq));
diff --git a/init/Kconfig b/init/Kconfig
index bd9f1fd..055c3bf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -463,6 +463,10 @@ config HAVE_SCHED_AVG_IRQ
depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
depends on SMP

+config HAVE_SCHED_THERMAL_PRESSURE
+ bool "Enable periodic averaging of thermal pressure"
+ depends on SMP
+
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
depends on MULTIUSER
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index bd006b7..5d1fbf0 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -367,6 +367,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
return 0;
}

+#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
+/*
+ * thermal:
+ *
+ * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
+ *
+ * util_avg and runnable_load_avg are not supported and meaningless.
+ *
+ * Unlike rt/dl utilization tracking that track time spent by a cpu
+ * running a rt/dl task through util_avg, the average thermal pressure is
+ * tracked through load_avg. This is because thermal pressure signal is
+ * weighted "delta" capacity and is not binary(util_avg is binary). "delta
+ * capacity" here means delta between the actual capacity of a cpu and the
+ * decreased capacity a cpu due to a thermal event.
+ */
+
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ if (___update_load_sum(now, &rq->avg_thermal,
+ capacity,
+ capacity,
+ capacity)) {
+ ___update_load_avg(&rq->avg_thermal, 1, 1);
+ trace_pelt_thermal_tp(rq);
+ return 1;
+ }
+
+ return 0;
+}
+#endif
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
/*
* irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644..916979a 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -7,6 +7,26 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

+#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
+
+static inline u64 thermal_load_avg(struct rq *rq)
+{
+ return READ_ONCE(rq->avg_thermal.load_avg);
+}
+#else
+static inline int
+update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ return 0;
+}
+
+static inline u64 thermal_load_avg(struct rq *rq)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
int update_irq_load_avg(struct rq *rq, u64 running);
#else
@@ -159,6 +179,17 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
}

static inline int
+update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+ return 0;
+}
+
+static inline u64 thermal_load_avg(struct rq *rq)
+{
+ return 0;
+}
+
+static inline int
update_irq_load_avg(struct rq *rq, u64 running)
{
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8..1f256cb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -944,6 +944,9 @@ struct rq {
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
struct sched_avg avg_irq;
#endif
+#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
+ struct sched_avg avg_thermal;
+#endif
u64 idle_stamp;
u64 avg_idle;

--
2.1.4

2020-01-28 22:37:22

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 2/8] sched/topology: Add hook to read per cpu thermal pressure.

Introduce arch_cpu_thermal_pressure to retrieve per cpu thermal
pressure.

Signed-off-by: Thara Gopinath <[email protected]>
---

v6->v7:
- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
as per review comments from Peter, Dietmar and Ionela.

include/linux/sched/topology.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index f341163..850b3bf 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -225,6 +225,14 @@ unsigned long arch_scale_cpu_capacity(int cpu)
}
#endif

+#ifndef arch_cpu_thermal_pressure
+static __always_inline
+unsigned long arch_cpu_thermal_pressure(int cpu)
+{
+ return 0;
+}
+#endif
+
static inline int task_node(const struct task_struct *p)
{
return cpu_to_node(task_cpu(p));
--
2.1.4

2020-01-28 22:37:29

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 5/8] sched/fair: update cpu_capacity to reflect thermal pressure

cpu_capacity initially reflects the maximum possible capacity of a cpu.
Thermal pressure on a cpu means this maximum possible capacity is
unavailable due to thermal events. This patch subtracts the average thermal
pressure for a cpu from its maximum possible capacity so that cpu_capacity
reflects the actual maximum currently available capacity.

Signed-off-by: Thara Gopinath <[email protected]>
---

v8->v9:
- Use thermal_load_avg to read rq->avg_thermal.load_avg.

kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f58c03..d879077 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7753,8 +7753,15 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
if (unlikely(irq >= max))
return 1;

+ /*
+ * avg_rt.util avg and avg_dl.util track binary signals
+ * (running and not running) with weights 0 and 1024 respectively.
+ * avg_thermal.load_avg tracks thermal pressure and the weighted
+ * average uses the actual delta max capacity(load).
+ */
used = READ_ONCE(rq->avg_rt.util_avg);
used += READ_ONCE(rq->avg_dl.util_avg);
+ used += thermal_load_avg(rq);

if (unlikely(used >= max))
return 1;
--
2.1.4

2020-01-28 22:37:32

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 4/8] sched/fair: Enable periodic update of average thermal pressure

Introduce support in scheduler periodic tick and other CFS bookkeeping
apis to trigger the process of computing average thermal pressure for a
cpu. Also consider avg_thermal.load_avg in others_have_blocked which
allows for decay of pelt signals.

Signed-off-by: Thara Gopinath <[email protected]>
---

v8->v9:
- Moved periodic triggering of thermal pressure averaging from CFS
tick function to generic scheduler core tick function.

kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 5 +++++
2 files changed, 8 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc1dfc0..b921795 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3594,12 +3594,15 @@ void scheduler_tick(void)
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
struct rq_flags rf;
+ unsigned long thermal_pressure;

sched_clock_tick();

rq_lock(rq, &rf);

update_rq_clock(rq);
+ thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
+ update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
calc_global_load_tick(rq);
psi_task_tick(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebf5095..5f58c03 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7491,6 +7491,9 @@ static inline bool others_have_blocked(struct rq *rq)
if (READ_ONCE(rq->avg_dl.util_avg))
return true;

+ if (thermal_load_avg(rq))
+ return true;
+
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
if (READ_ONCE(rq->avg_irq.util_avg))
return true;
@@ -7516,6 +7519,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
{
const struct sched_class *curr_class;
u64 now = rq_clock_pelt(rq);
+ unsigned long thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
bool decayed;

/*
@@ -7526,6 +7530,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)

decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
+ update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
update_irq_load_avg(rq, 0);

if (others_have_blocked(rq))
--
2.1.4

2020-01-28 22:37:51

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 6/8] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping

Thermal governors can request for a cpu's maximum supported frequency to
be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular cpu is
reduced. Delta between the original maximum capacity and capped maximum
capacity is known as thermal pressure. Enable cpufreq cooling device to
update the thermal pressure in event of a capped maximum frequency.

Signed-off-by: Thara Gopinath <[email protected]>
---

v6->v7
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity(thermal pressure) as per
Ionela's review comments. Hence the calculation for delta
capacity(thermal pressure) is moved to cpufreq_cooling.c.

drivers/thermal/cpufreq_cooling.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index fe83d7a..4ae8c85 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -431,6 +431,10 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
unsigned long state)
{
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+ struct cpumask *cpus;
+ unsigned int frequency;
+ unsigned long max_capacity, capacity;
+ int ret;

/* Request state should be less than max_level */
if (WARN_ON(state > cpufreq_cdev->max_level))
@@ -442,8 +446,19 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,

cpufreq_cdev->cpufreq_state = state;

- return freq_qos_update_request(&cpufreq_cdev->qos_req,
- get_state_freq(cpufreq_cdev, state));
+ frequency = get_state_freq(cpufreq_cdev, state);
+
+ ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
+
+ if (ret > 0) {
+ cpus = cpufreq_cdev->policy->cpus;
+ max_capacity = arch_scale_cpu_capacity(cpumask_first(cpus));
+ capacity = frequency * max_capacity;
+ capacity /= cpufreq_cdev->policy->cpuinfo.max_freq;
+ arch_set_thermal_pressure(cpus, max_capacity - capacity);
+ }
+
+ return ret;
}

/* Bind cpufreq callbacks to thermal cooling device ops */
--
2.1.4

2020-01-28 22:37:52

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 7/8] sched/fair: Enable tuning of decay period

Thermal pressure follows pelt signals which means the decay period for
thermal pressure is the default pelt decay period. Depending on soc
characteristics and thermal activity, it might be beneficial to decay
thermal pressure slower, but still in-tune with the pelt signals. One way
to achieve this is to provide a command line parameter to set a decay
shift parameter to an integer between 0 and 10.

Signed-off-by: Thara Gopinath <[email protected]>
---

v8->v9:
- Initialized the __shift to 0 in setup_sched_thermal_decay_shift
as per Quentin's suggestion.

Documentation/admin-guide/kernel-parameters.txt | 5 +++++
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 15 ++++++++++++++-
kernel/sched/sched.h | 18 ++++++++++++++++++
4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e35b28e..be4147b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4376,6 +4376,11 @@
incurs a small amount of overhead in the scheduler
but is useful for debugging and performance tuning.

+ sched_thermal_decay_shift=
+ [KNL, SMP] Set decay shift for thermal pressure signal.
+ Format: integer between 0 and 10
+ Default is 0.
+
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b921795..508e64b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3602,7 +3602,7 @@ void scheduler_tick(void)

update_rq_clock(rq);
thermal_pressure = arch_cpu_thermal_pressure(cpu_of(rq));
- update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure);
+ update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
curr->sched_class->task_tick(rq, curr, 0);
calc_global_load_tick(rq);
psi_task_tick(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d879077..df23564 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -86,6 +86,19 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;

const_debug unsigned int sysctl_sched_migration_cost = 500000UL;

+int sched_thermal_decay_shift;
+static int __init setup_sched_thermal_decay_shift(char *str)
+{
+ int _shift = 0;
+
+ if (kstrtoint(str, 0, &_shift))
+ pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");
+
+ sched_thermal_decay_shift = clamp(_shift, 0, 10);
+ return 1;
+}
+__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
+
#ifdef CONFIG_SMP
/*
* For asym packing, by default the lower numbered CPU has higher priority.
@@ -7530,7 +7543,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)

decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
- update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) |
+ update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
update_irq_load_avg(rq, 0);

if (others_have_blocked(rq))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f256cb..acd32bf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1110,6 +1110,24 @@ static inline u64 rq_clock_task(struct rq *rq)
return rq->clock_task;
}

+/**
+ * By default the decay is the default pelt decay period.
+ * The decay shift can change the decay period in
+ * multiples of 32.
+ * Decay shift Decay period(ms)
+ * 0 32
+ * 1 64
+ * 2 128
+ * 3 256
+ * 4 512
+ */
+extern int sched_thermal_decay_shift;
+
+static inline u64 rq_clock_thermal(struct rq *rq)
+{
+ return rq_clock_task(rq) >> sched_thermal_decay_shift;
+}
+
static inline void rq_clock_skip_update(struct rq *rq)
{
lockdep_assert_held(&rq->lock);
--
2.1.4

2020-01-28 22:38:20

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 3/8] arm,arm64,drivers:Add infrastructure to store and update instantaneous thermal pressure

Add architecture specific APIs to update and track thermal pressure on a
per cpu basis. A per cpu variable thermal_pressure is introduced to keep
track of instantaneous per cpu thermal pressure. Thermal pressure is the
delta between maximum capacity and capped capacity due to a thermal event.

topology_get_thermal_pressure can be hooked into the scheduler specified
arch_cpu_thermal_capacity to retrieve instantaneous thermal pressure of a
cpu.

arch_set_thermal_pressure can be used to update the thermal pressure.

Considering topology_get_thermal_pressure reads thermal_pressure and
arch_set_thermal_pressure writes into thermal_pressure, one can argue for
some sort of locking mechanism to avoid a stale value. But considering
topology_get_thermal_pressure can be called from a system critical path
like scheduler tick function, a locking mechanism is not ideal. This means
that it is possible the thermal_pressure value used to calculate average
thermal pressure for a cpu can be stale for upto 1 tick period.

Signed-off-by: Thara Gopinath <[email protected]>
---

v6->v7:
- Changed the input argument in arch_set_thermal_pressure from
capped capacity to delta capacity(thermal pressure) as per
Ionela's review comments.

arch/arm/include/asm/topology.h | 3 +++
arch/arm64/include/asm/topology.h | 3 +++
drivers/base/arch_topology.c | 11 +++++++++++
include/linux/arch_topology.h | 10 ++++++++++
4 files changed, 27 insertions(+)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 8a0fae9..3a50a19 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -16,6 +16,9 @@
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology

+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_cpu_thermal_pressure topology_get_thermal_pressure
+
#else

static inline void init_cpu_topology(void) { }
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a4d945d..a70896f 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -25,6 +25,9 @@ int pcibus_to_node(struct pci_bus *bus);
/* Enable topology flag updates */
#define arch_update_cpu_topology topology_update_cpu_topology

+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_cpu_thermal_pressure topology_get_thermal_pressure
+
#include <asm-generic/topology.h>

#endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 6119e11..68dfa49 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -42,6 +42,17 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
per_cpu(cpu_scale, cpu) = capacity;
}

+DEFINE_PER_CPU(unsigned long, thermal_pressure);
+
+void arch_set_thermal_pressure(struct cpumask *cpus,
+ unsigned long th_pressure)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpus)
+ WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+}
+
static ssize_t cpu_capacity_show(struct device *dev,
struct device_attribute *attr,
char *buf)
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 3015ecb..88a115e 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -33,6 +33,16 @@ unsigned long topology_get_freq_scale(int cpu)
return per_cpu(freq_scale, cpu);
}

+DECLARE_PER_CPU(unsigned long, thermal_pressure);
+
+static inline unsigned long topology_get_thermal_pressure(int cpu)
+{
+ return per_cpu(thermal_pressure, cpu);
+}
+
+void arch_set_thermal_pressure(struct cpumask *cpus,
+ unsigned long th_pressure);
+
struct cpu_topology {
int thread_id;
int core_id;
--
2.1.4

2020-01-28 22:39:15

by Thara Gopinath

[permalink] [raw]
Subject: [Patch v9 8/8] arm64: Enable averaging of thermal pressure for arm64 based SoCs

Enable CONFIG_HAVE_SCHED_THERMAL_PRESSURE in arm64 defconfig.

Signed-off-by: Thara Gopinath <[email protected]>
---
arch/arm64/configs/defconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 0565a61..7a8145b 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -5,6 +5,7 @@ CONFIG_NO_HZ_IDLE=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_PREEMPT=y
CONFIG_IRQ_TIME_ACCOUNTING=y
+CONFIG_HAVE_SCHED_THERMAL_PRESSURE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASK_XACCT=y
--
2.1.4

2020-01-28 23:58:10

by Randy Dunlap

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

Hi,

On 1/28/20 2:36 PM, Thara Gopinath wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e35b28e..be4147b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4376,6 +4376,11 @@
> incurs a small amount of overhead in the scheduler
> but is useful for debugging and performance tuning.
>
> + sched_thermal_decay_shift=
> + [KNL, SMP] Set decay shift for thermal pressure signal.
> + Format: integer between 0 and 10
> + Default is 0.
> +

That tells an admin [or any reader] almost nothing about this kernel parameter
or what it does. And nothing about what unit the value is in.
Does the value 0 disable this feature?

> skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
> xtime_lock contention on larger systems, and/or RCU lock
> contention on all systems with CONFIG_MAXSMP set.


thanks.
--
~Randy

2020-02-03 10:04:35

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v9 8/8] arm64: Enable averaging of thermal pressure for arm64 based SoCs

On 28.01.20 23:36, Thara Gopinath wrote:
> Enable CONFIG_HAVE_SCHED_THERMAL_PRESSURE in arm64 defconfig.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
> arch/arm64/configs/defconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> index 0565a61..7a8145b 100644
> --- a/arch/arm64/configs/defconfig
> +++ b/arch/arm64/configs/defconfig
> @@ -5,6 +5,7 @@ CONFIG_NO_HZ_IDLE=y
> CONFIG_HIGH_RES_TIMERS=y
> CONFIG_PREEMPT=y
> CONFIG_IRQ_TIME_ACCOUNTING=y
> +CONFIG_HAVE_SCHED_THERMAL_PRESSURE=y

I thought about this a bit more and maybe it's not a good idea to enable
this by default. An erroneous thermal setup could have bad influence on
the CPU capacities and hence on the performance without people
understanding the cause of it.
If they have to actively enable it, chances are higher that they also
try to understand how higher thermal-cpufreq-X cooling states lead to
CPU capacity reduction and possibly inversion (big-LITTLE swap).

So if the thermal pressure feature is guarded by
CONFIG_HAVE_SCHED_THERMAL_PRESSURE the thermal pressure related code in
arch_topology.c (thermal_pressure, arch_set_thermal_pressure,
topology_get_thermal_pressure) should too. Saves some text and data of
arch_topology.o. Moreover Arm32 and Arm64 will be handled equally.

2020-02-03 14:04:05

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On 01/28/2020 06:56 PM, Randy Dunlap wrote:
> Hi,
>
> On 1/28/20 2:36 PM, Thara Gopinath wrote:
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index e35b28e..be4147b 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -4376,6 +4376,11 @@
>> incurs a small amount of overhead in the scheduler
>> but is useful for debugging and performance tuning.
>>
>> + sched_thermal_decay_shift=
>> + [KNL, SMP] Set decay shift for thermal pressure signal.
>> + Format: integer between 0 and 10
>> + Default is 0.
>> +
>
> That tells an admin [or any reader] almost nothing about this kernel parameter
> or what it does. And nothing about what unit the value is in.
> Does the value 0 disable this feature?

Thanks for the review. 0 does not disable "thermal pressure" feature. 0
means the default decay period for averaging PELT signals (which is
usually 32 but configurable) will also be applied for thermal pressure
signal. A shift will shift the default decay period.

You are right. It needs more explanation here. I will fix it and send v10.

>
>> skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
>> xtime_lock contention on larger systems, and/or RCU lock
>> contention on all systems with CONFIG_MAXSMP set.
>
>
> thanks.
>


--
Warm Regards
Thara

2020-02-03 17:53:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
> > Hi,
> >
> > On 1/28/20 2:36 PM, Thara Gopinath wrote:
> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >> index e35b28e..be4147b 100644
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -4376,6 +4376,11 @@
> >> incurs a small amount of overhead in the scheduler
> >> but is useful for debugging and performance tuning.
> >>
> >> + sched_thermal_decay_shift=
> >> + [KNL, SMP] Set decay shift for thermal pressure signal.
> >> + Format: integer between 0 and 10
> >> + Default is 0.
> >> +
> >
> > That tells an admin [or any reader] almost nothing about this kernel parameter
> > or what it does. And nothing about what unit the value is in.
> > Does the value 0 disable this feature?
>
> Thanks for the review. 0 does not disable "thermal pressure" feature. 0
> means the default decay period for averaging PELT signals (which is
> usually 32 but configurable) will also be applied for thermal pressure
> signal. A shift will shift the default decay period.
>
> You are right. It needs more explanation here. I will fix it and send v10.

Or just send an update for this patch? I'm thinking most of this is
looking good.

2020-02-04 08:41:50

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On 03/02/2020 16:55, Peter Zijlstra wrote:
> On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
>> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
>>> Hi,
>>>
>>> On 1/28/20 2:36 PM, Thara Gopinath wrote:
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index e35b28e..be4147b 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -4376,6 +4376,11 @@
>>>> incurs a small amount of overhead in the scheduler
>>>> but is useful for debugging and performance tuning.
>>>>
>>>> + sched_thermal_decay_shift=
>>>> + [KNL, SMP] Set decay shift for thermal pressure signal.
>>>> + Format: integer between 0 and 10
>>>> + Default is 0.
>>>> +
>>>
>>> That tells an admin [or any reader] almost nothing about this kernel parameter
>>> or what it does. And nothing about what unit the value is in.
>>> Does the value 0 disable this feature?
>>
>> Thanks for the review. 0 does not disable "thermal pressure" feature. 0
>> means the default decay period for averaging PELT signals (which is
>> usually 32 but configurable) will also be applied for thermal pressure
>> signal. A shift will shift the default decay period.
>>
>> You are right. It needs more explanation here. I will fix it and send v10.
>
> Or just send an update for this patch? I'm thinking most of this is
> looking good.

I do agree. IMHO, there are just two little things outstanding:

(1) arch_scale_thermal_pressure() instead of
arch_cpu_thermal_pressure() in v8 4/7

(2) guarding of thermal pressure code in Arm's arch_topology driver w/
CONFIG_HAVE_SCHED_THERMAL_PRESSURE plus disabling it by default for
Arm64.

2020-02-07 22:43:37

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On 02/04/2020 03:39 AM, Dietmar Eggemann wrote:
> On 03/02/2020 16:55, Peter Zijlstra wrote:
>> On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
>>> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
>>>> Hi,
>>>>
>>>> On 1/28/20 2:36 PM, Thara Gopinath wrote:
>>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>>> index e35b28e..be4147b 100644
>>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>>> @@ -4376,6 +4376,11 @@
>>>>> incurs a small amount of overhead in the scheduler
>>>>> but is useful for debugging and performance tuning.
>>>>>
>>>>> + sched_thermal_decay_shift=
>>>>> + [KNL, SMP] Set decay shift for thermal pressure signal.
>>>>> + Format: integer between 0 and 10
>>>>> + Default is 0.
>>>>> +
>>>>
>>>> That tells an admin [or any reader] almost nothing about this kernel parameter
>>>> or what it does. And nothing about what unit the value is in.
>>>> Does the value 0 disable this feature?
>>>
>>> Thanks for the review. 0 does not disable "thermal pressure" feature. 0
>>> means the default decay period for averaging PELT signals (which is
>>> usually 32 but configurable) will also be applied for thermal pressure
>>> signal. A shift will shift the default decay period.
>>>
>>> You are right. It needs more explanation here. I will fix it and send v10.
>>
>> Or just send an update for this patch? I'm thinking most of this is
>> looking good.
>
> I do agree. IMHO, there are just two little things outstanding:
>
> (1) arch_scale_thermal_pressure() instead of
> arch_cpu_thermal_pressure() in v8 4/7

The "scale_" part was discussed in v6. Ionela had suggested that having
"scale" is not suited for this function because "thermal pressure" is
not exactly scaled but subtracted. I actually agree with that.

https://lore.kernel.org/lkml/[email protected]/

Having said that if everyone feel the same about naming of this
function, I can change it one last time.

>
> (2) guarding of thermal pressure code in Arm's arch_topology driver w/
> CONFIG_HAVE_SCHED_THERMAL_PRESSURE plus disabling it by default for
> Arm64.
It was enabled by default as per your suggestion in v9.

The patch can be dropped.

I don't understand the need to guard arch_topology with
CONFIG_HAVE_SCHED_THERMAL_PRESSURE. CONFIG_HAVE_SCHED_THERMAL_PRESSURE
is for scheduler to enable/disable averaging of thermal pressure. We
wanted to separate updating and retrieving of instantaneous thermal
pressure from scheduler. Guarding it with
CONFIG_HAVE_SCHED_THERMAL_PRESSURE is to me equivalent to putting back
this whole code in the scheduler framework. I am against it. I also do
not see other arch_ functions guarded similarly.

>


--
Warm Regards
Thara

2020-02-10 12:00:20

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On 07/02/2020 23:42, Thara Gopinath wrote:
> On 02/04/2020 03:39 AM, Dietmar Eggemann wrote:
>> On 03/02/2020 16:55, Peter Zijlstra wrote:
>>> On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
>>>> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
>>>>> Hi,
>>>>>
>>>>> On 1/28/20 2:36 PM, Thara Gopinath wrote:

[...]

>> I do agree. IMHO, there are just two little things outstanding:
>>
>> (1) arch_scale_thermal_pressure() instead of
>> arch_cpu_thermal_pressure() in v8 4/7
>
> The "scale_" part was discussed in v6. Ionela had suggested that having
> "scale" is not suited for this function because "thermal pressure" is
> not exactly scaled but subtracted. I actually agree with that.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Having said that if everyone feel the same about naming of this
> function, I can change it one last time.

I'm still in favor for this. And Vincent seems to be OK as well:

https://lore.kernel.org/r/CAKfTPtBzoLnvAJ7sjPogMYS=WwBbdzWO07Kj=KDFVpO4=Su5ow@mail.gmail.com

The 'v6->v7' change note:

- Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
as per review comments from Peter, Dietmar and Ionela.

is really not saying from which review comment the individual changes in
the function name are coming from. And I don't see an answer to Ionela's
email saying that her proposal will manifest in a particular part of
this change.

>> (2) guarding of thermal pressure code in Arm's arch_topology driver w/
>> CONFIG_HAVE_SCHED_THERMAL_PRESSURE plus disabling it by default for
>> Arm64.
> It was enabled by default as per your suggestion in v9.
>
> The patch can be dropped.
>
> I don't understand the need to guard arch_topology with
> CONFIG_HAVE_SCHED_THERMAL_PRESSURE. CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> is for scheduler to enable/disable averaging of thermal pressure. We
> wanted to separate updating and retrieving of instantaneous thermal
> pressure from scheduler. Guarding it with
> CONFIG_HAVE_SCHED_THERMAL_PRESSURE is to me equivalent to putting back
> this whole code in the scheduler framework. I am against it. I also do
> not see other arch_ functions guarded similarly.

Cpu-invariant accounting can't be guarded with a kernel CONFIG switch.
Frequency-invariant accounting could be with CONFIG_CPU_FREQ but this is
enabled by default by Arm64 defconfig.
Thermal pressure (accounting) (CONFIG_HAVE_SCHED_THERMAL_PRESSURE) is
disabled by default so why should a per-cpu thermal_pressure be
maintained on such a system (CONFIG_CPU_THERMAL=y by default)?

2020-02-10 12:11:03

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v9 0/8] Introduce Thermal Pressure

On 28/01/2020 23:35, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, task scheduler is
> not notified of capping of maximum frequency of a cpu.
> In other words, scheduler is unaware of maximum capacity
> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The reduction in the maximum possible capacity of a cpu due to a
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometime be erroneous
> as there can be mismatch between the actual capping of capacity
> and scheduler recording it. Thus solution is to have a weighted
> average per cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. Exisiting algorithm
> in the kernel scheduler pelt framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> inerface to allow for a configurable decay period.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on db845c platform with debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,
> trip point 0 temperature was made assymetric between cpus in little
> cluster and big cluster; the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores faster, there by limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.
>
> Test Results
>
> Hackbench: 1 group , 30000 loops, 10 runs
> Result SD
> (Secs) (% of mean)
> No Thermal Pressure 14.03 2.69%
> Thermal Pressure PELT Algo. Decay : 32 ms 13.29 0.56%
> Thermal Pressure PELT Algo. Decay : 64 ms 12.57 1.56%
> Thermal Pressure PELT Algo. Decay : 128 ms 12.71 1.04%
> Thermal Pressure PELT Algo. Decay : 256 ms 12.29 1.42%
> Thermal Pressure PELT Algo. Decay : 512 ms 12.42 1.15%
>
> Dhrystone Run Time : 20 threads, 3000 MLOOPS
> Result SD
> (Secs) (% of mean)
> No Thermal Pressure 9.452 4.49%
> Thermal Pressure PELT Algo. Decay : 32 ms 8.793 5.30%
> Thermal Pressure PELT Algo. Decay : 64 ms 8.981 5.29%
> Thermal Pressure PELT Algo. Decay : 128 ms 8.647 6.62%
> Thermal Pressure PELT Algo. Decay : 256 ms 8.774 6.45%
> Thermal Pressure PELT Algo. Decay : 512 ms 8.603 5.41%

What do we do on systems on which one Frequency domain spawns all the
CPUs (e.g. Hikey620)?

perf stat --null --repeat 10 -- perf bench sched messaging -g 10 -l 1000

# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

Total time: 4.697 [sec]
# Running 'sched/messaging' benchmark:
[ 8082.882751] hisi_thermal f7030700.tsensor: sensor <2> THERMAL ALARM: 66385 > 65000
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

Total time: 4.910 [sec]
# Running 'sched/messaging' benchmark:
[ 8091.070386] CPU3 cpus=0-7 th_pressure=205
[ 8091.178390] CPU3 cpus=0-7 th_pressure=0
[ 8091.286389] CPU3 cpus=0-7 th_pressure=205
[ 8091.398397] CPU3 cpus=0-7 th_pressure=0

2020-02-13 12:26:48

by Amit Kucheria

[permalink] [raw]
Subject: Re: [Patch v9 3/8] arm,arm64,drivers:Add infrastructure to store and update instantaneous thermal pressure

On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
<[email protected]> wrote:
>
> Add architecture specific APIs to update and track thermal pressure on a
> per cpu basis. A per cpu variable thermal_pressure is introduced to keep
> track of instantaneous per cpu thermal pressure. Thermal pressure is the
> delta between maximum capacity and capped capacity due to a thermal event.

s/capped/decreased to have consistent use throughout the series e.g. in patch 1.

Though personally, I like "capped capacity" in which case
s/decreased/capped in patch 1 and elsewhere.

>
> topology_get_thermal_pressure can be hooked into the scheduler specified
> arch_cpu_thermal_capacity to retrieve instantaneous thermal pressure of a
> cpu.
>
> arch_set_thermal_pressure can be used to update the thermal pressure.
>
> Considering topology_get_thermal_pressure reads thermal_pressure and
> arch_set_thermal_pressure writes into thermal_pressure, one can argue for
> some sort of locking mechanism to avoid a stale value. But considering
> topology_get_thermal_pressure can be called from a system critical path
> like scheduler tick function, a locking mechanism is not ideal. This means
> that it is possible the thermal_pressure value used to calculate average
> thermal pressure for a cpu can be stale for upto 1 tick period.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
>
> v6->v7:
> - Changed the input argument in arch_set_thermal_pressure from
> capped capacity to delta capacity(thermal pressure) as per
> Ionela's review comments.
>
> arch/arm/include/asm/topology.h | 3 +++
> arch/arm64/include/asm/topology.h | 3 +++

Any particular reason to enable this for arm/arm64 in this patch
itself? I'd have enabled them in two separate patches after this one.

> drivers/base/arch_topology.c | 11 +++++++++++
> include/linux/arch_topology.h | 10 ++++++++++
> 4 files changed, 27 insertions(+)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 8a0fae9..3a50a19 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -16,6 +16,9 @@
> /* Enable topology flag updates */
> #define arch_update_cpu_topology topology_update_cpu_topology
>
> +/* Replace task scheduler's default thermal pressure retrieve API */
> +#define arch_cpu_thermal_pressure topology_get_thermal_pressure
> +
> #else
>
> static inline void init_cpu_topology(void) { }
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index a4d945d..a70896f 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -25,6 +25,9 @@ int pcibus_to_node(struct pci_bus *bus);
> /* Enable topology flag updates */
> #define arch_update_cpu_topology topology_update_cpu_topology
>
> +/* Replace task scheduler's default thermal pressure retrieve API */
> +#define arch_cpu_thermal_pressure topology_get_thermal_pressure
> +
> #include <asm-generic/topology.h>
>
> #endif /* _ASM_ARM_TOPOLOGY_H */
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 6119e11..68dfa49 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -42,6 +42,17 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
> per_cpu(cpu_scale, cpu) = capacity;
> }
>
> +DEFINE_PER_CPU(unsigned long, thermal_pressure);
> +
> +void arch_set_thermal_pressure(struct cpumask *cpus,
> + unsigned long th_pressure)
> +{
> + int cpu;

<snip>

2020-02-13 12:31:58

by Amit Kucheria

[permalink] [raw]
Subject: Re: [Patch v9 1/8] sched/pelt: Add support to track thermal pressure

On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
<[email protected]> wrote:
>
> Extrapolating on the existing framework to track rt/dl utilization using
> pelt signals, add a similar mechanism to track thermal pressure. The
> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg, the
> average thermal pressure is tracked through load_avg. This is because
> thermal pressure signal is weighted "delta" capacity and is not
> binary(util_avg is binary). "delta capacity" here means delta between the
> actual capacity of a cpu and the decreased capacity a cpu due to a thermal
> event.
>
> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_load_avg can be called
> to do the periodic bookkeeping (accumulate, decay and average) of the
> thermal pressure.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> Reviewed-by: Vincent Guittot <[email protected]>
> ---
>
> v6->v7:
> - Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
> update_thermal_load_avg in unsupported architectures as per
> review comments from Peter, Dietmar and Quentin.
> - Updated comment for update_thermal_load_avg as per review
> comments from Peter and Dietmar.
> v7->v8:
> - Fixed typo in defining update_thermal_load_avg which was
> causing build errors (reported by kbuild test report)
> v8->v9:
> - Defined thermal_load_avg to read rq->avg_thermal.load_avg and
> avoid cacheline miss in unsupported cases as per Peter's
> suggestion.
>
> include/trace/events/sched.h | 4 ++++
> init/Kconfig | 4 ++++
> kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++
> kernel/sched/pelt.h | 31 +++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 3 +++
> 5 files changed, 73 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 420e80e..a8fb667 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -613,6 +613,10 @@ DECLARE_TRACE(pelt_dl_tp,
> TP_PROTO(struct rq *rq),
> TP_ARGS(rq));
>
> +DECLARE_TRACE(pelt_thermal_tp,
> + TP_PROTO(struct rq *rq),
> + TP_ARGS(rq));
> +
> DECLARE_TRACE(pelt_irq_tp,
> TP_PROTO(struct rq *rq),
> TP_ARGS(rq));
> diff --git a/init/Kconfig b/init/Kconfig
> index bd9f1fd..055c3bf 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -463,6 +463,10 @@ config HAVE_SCHED_AVG_IRQ
> depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
> depends on SMP
>
> +config HAVE_SCHED_THERMAL_PRESSURE
> + bool "Enable periodic averaging of thermal pressure"
> + depends on SMP
> +
> config BSD_PROCESS_ACCT
> bool "BSD Process Accounting"
> depends on MULTIUSER
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index bd006b7..5d1fbf0 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -367,6 +367,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> return 0;
> }
>
> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> +/*
> + * thermal:
> + *
> + * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
> + *
> + * util_avg and runnable_load_avg are not supported and meaningless.
> + *
> + * Unlike rt/dl utilization tracking that track time spent by a cpu
> + * running a rt/dl task through util_avg, the average thermal pressure is
> + * tracked through load_avg. This is because thermal pressure signal is
> + * weighted "delta" capacity and is not binary(util_avg is binary). "delta

May I suggest a slight rewording here and in the commit description,

This is because the thermal pressure signal is weighted "delta"
capacity unlike util_avg which is binary.

It would also help, if you expanded on what you mean by binary in the
commit description and how the delta capacity is weighted.

> + * capacity" here means delta between the actual capacity of a cpu and the
> + * decreased capacity a cpu due to a thermal event.

Use consistent wording throughout the series - either capped or
decreased capacity.

> + */

This could be shortened to:

Delta capacity of cpu = Actual capacity - Capped capacity due to thermal event

> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> + if (___update_load_sum(now, &rq->avg_thermal,
> + capacity,
> + capacity,
> + capacity)) {
> + ___update_load_avg(&rq->avg_thermal, 1, 1);
> + trace_pelt_thermal_tp(rq);
> + return 1;
> + }
> +
> + return 0;
> +}
> +#endif
> +
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> /*
> * irq:
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..916979a 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -7,6 +7,26 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>
> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
> +
> +static inline u64 thermal_load_avg(struct rq *rq)
> +{
> + return READ_ONCE(rq->avg_thermal.load_avg);
> +}
> +#else
> +static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> + return 0;
> +}
> +
> +static inline u64 thermal_load_avg(struct rq *rq)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> int update_irq_load_avg(struct rq *rq, u64 running);
> #else
> @@ -159,6 +179,17 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> }
>
> static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> + return 0;
> +}
> +
> +static inline u64 thermal_load_avg(struct rq *rq)
> +{
> + return 0;
> +}
> +
> +static inline int
> update_irq_load_avg(struct rq *rq, u64 running)
> {
> return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1a88dc8..1f256cb 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -944,6 +944,9 @@ struct rq {
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> struct sched_avg avg_irq;
> #endif
> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
> + struct sched_avg avg_thermal;
> +#endif
> u64 idle_stamp;
> u64 avg_idle;
>
> --
> 2.1.4
>

2020-02-13 12:48:20

by Amit Kucheria

[permalink] [raw]
Subject: Re: [Patch v9 5/8] sched/fair: update cpu_capacity to reflect thermal pressure

On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
<[email protected]> wrote:
>
> cpu_capacity initially reflects the maximum possible capacity of a cpu.
> Thermal pressure on a cpu means this maximum possible capacity is
> unavailable due to thermal events. This patch subtracts the average thermal
> pressure for a cpu from its maximum possible capacity so that cpu_capacity
> reflects the actual maximum currently available capacity.
>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
>
> v8->v9:
> - Use thermal_load_avg to read rq->avg_thermal.load_avg.
>
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5f58c03..d879077 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7753,8 +7753,15 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> if (unlikely(irq >= max))
> return 1;
>
> + /*
> + * avg_rt.util avg and avg_dl.util track binary signals

For avg_rt, s/util avg/util_avg/
For avg_dl, s/util/util_avg/ ?

> + * (running and not running) with weights 0 and 1024 respectively.
> + * avg_thermal.load_avg tracks thermal pressure and the weighted
> + * average uses the actual delta max capacity(load).
> + */
> used = READ_ONCE(rq->avg_rt.util_avg);
> used += READ_ONCE(rq->avg_dl.util_avg);
> + used += thermal_load_avg(rq);
>
> if (unlikely(used >= max))
> return 1;
> --
> 2.1.4
>

2020-02-13 13:41:14

by Amit Kucheria

[permalink] [raw]
Subject: Re: [Patch v9 5/8] sched/fair: update cpu_capacity to reflect thermal pressure

On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
<[email protected]> wrote:
>
> cpu_capacity initially reflects the maximum possible capacity of a cpu.
> Thermal pressure on a cpu means this maximum possible capacity is
> unavailable due to thermal events. This patch subtracts the average thermal
> pressure for a cpu from its maximum possible capacity so that cpu_capacity
> reflects the actual maximum currently available capacity.

"actual maximum currently available capacity" is quite a mouthful. :-)

"Remaining capacity" or "Effective capacity" anyone?

IIUC, this remaining capacity is NOT the same as the capped/decreased
capacity referred to in patches 1 and 3. The delta capacity (aka
thermal pressure) there refers to the difference between HW max
capacity and thermally throttled capacity.
Here, we also subtract RT/DL utilisation. Is that accurate?




>
> Signed-off-by: Thara Gopinath <[email protected]>
> ---
>
> v8->v9:
> - Use thermal_load_avg to read rq->avg_thermal.load_avg.
>
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5f58c03..d879077 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7753,8 +7753,15 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> if (unlikely(irq >= max))
> return 1;
>
> + /*
> + * avg_rt.util avg and avg_dl.util track binary signals
> + * (running and not running) with weights 0 and 1024 respectively.
> + * avg_thermal.load_avg tracks thermal pressure and the weighted
> + * average uses the actual delta max capacity(load).
> + */
> used = READ_ONCE(rq->avg_rt.util_avg);
> used += READ_ONCE(rq->avg_dl.util_avg);
> + used += thermal_load_avg(rq);
>
> if (unlikely(used >= max))
> return 1;
> --
> 2.1.4
>

2020-02-13 13:56:34

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On 02/10/2020 06:59 AM, Dietmar Eggemann wrote:
> On 07/02/2020 23:42, Thara Gopinath wrote:
>> On 02/04/2020 03:39 AM, Dietmar Eggemann wrote:
>>> On 03/02/2020 16:55, Peter Zijlstra wrote:
>>>> On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
>>>>> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 1/28/20 2:36 PM, Thara Gopinath wrote:
>
> [...]
>
>>> I do agree. IMHO, there are just two little things outstanding:
>>>
>>> (1) arch_scale_thermal_pressure() instead of
>>> arch_cpu_thermal_pressure() in v8 4/7
>>
>> The "scale_" part was discussed in v6. Ionela had suggested that having
>> "scale" is not suited for this function because "thermal pressure" is
>> not exactly scaled but subtracted. I actually agree with that.
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Having said that if everyone feel the same about naming of this
>> function, I can change it one last time.
>
> I'm still in favor for this. And Vincent seems to be OK as well:
>
> https://lore.kernel.org/r/CAKfTPtBzoLnvAJ7sjPogMYS=WwBbdzWO07Kj=KDFVpO4=Su5ow@mail.gmail.com
>
> The 'v6->v7' change note:
>
> - Renamed arch_scale_thermal_capacity to arch_cpu_thermal_pressure
> as per review comments from Peter, Dietmar and Ionela.
>
> is really not saying from which review comment the individual changes in
> the function name are coming from. And I don't see an answer to Ionela's
> email saying that her proposal will manifest in a particular part of
> this change.
Hi Dietmar,

Like I said, don't want to argue on name. It is trivial for me. I have
v10 prepped with the name change. Will send it out shortly.

>
>>> (2) guarding of thermal pressure code in Arm's arch_topology driver w/
>>> CONFIG_HAVE_SCHED_THERMAL_PRESSURE plus disabling it by default for
>>> Arm64.
>> It was enabled by default as per your suggestion in v9.
>>
>> The patch can be dropped.
>>
>> I don't understand the need to guard arch_topology with
>> CONFIG_HAVE_SCHED_THERMAL_PRESSURE. CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>> is for scheduler to enable/disable averaging of thermal pressure. We
>> wanted to separate updating and retrieving of instantaneous thermal
>> pressure from scheduler. Guarding it with
>> CONFIG_HAVE_SCHED_THERMAL_PRESSURE is to me equivalent to putting back
>> this whole code in the scheduler framework. I am against it. I also do
>> not see other arch_ functions guarded similarly.
>
> Cpu-invariant accounting can't be guarded with a kernel CONFIG switch.
> Frequency-invariant accounting could be with CONFIG_CPU_FREQ but this is
> enabled by default by Arm64 defconfig.
> Thermal pressure (accounting) (CONFIG_HAVE_SCHED_THERMAL_PRESSURE) is
> disabled by default so why should a per-cpu thermal_pressure be
> maintained on such a system (CONFIG_CPU_THERMAL=y by default)?

I agree that there is no need for per-cpu thermal pressure to be
maintained if no averaging is happening in the scheduler, today. I don't
know if there will ever be an use for it.
My issue has to do with using a config option meant for internal
scheduler code being used else where. To me, once this happens, the
entire work done to separate out reading and writing of instantaneous
thermal pressure to arch_topology makes no sense. We could have kept it
in scheduler itself.
Another way I think about this whole thermal pressure framework is that
it is the job of cooling device or cpufreq or any other entity to update
a throttle in maximum pressure to the scheduler. It should be
independent of what scheduler does with it. Scheduler can choose to
ignore it.

>


--
Warm Regards
Thara

2020-02-13 14:08:37

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 3/8] arm,arm64,drivers:Add infrastructure to store and update instantaneous thermal pressure

On 02/13/2020 07:25 AM, Amit Kucheria wrote:
> On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
> <[email protected]> wrote:
>>
>> Add architecture specific APIs to update and track thermal pressure on a
>> per cpu basis. A per cpu variable thermal_pressure is introduced to keep
>> track of instantaneous per cpu thermal pressure. Thermal pressure is the
>> delta between maximum capacity and capped capacity due to a thermal event.
>
> s/capped/decreased to have consistent use throughout the series e.g. in patch 1.
>
> Though personally, I like "capped capacity" in which case
> s/decreased/capped in patch 1 and elsewhere.

I will fix this
>
>>
>> topology_get_thermal_pressure can be hooked into the scheduler specified
>> arch_cpu_thermal_capacity to retrieve instantaneous thermal pressure of a
>> cpu.
>>
>> arch_set_thermal_pressure can be used to update the thermal pressure.
>>
>> Considering topology_get_thermal_pressure reads thermal_pressure and
>> arch_set_thermal_pressure writes into thermal_pressure, one can argue for
>> some sort of locking mechanism to avoid a stale value. But considering
>> topology_get_thermal_pressure can be called from a system critical path
>> like scheduler tick function, a locking mechanism is not ideal. This means
>> that it is possible the thermal_pressure value used to calculate average
>> thermal pressure for a cpu can be stale for upto 1 tick period.
>>
>> Signed-off-by: Thara Gopinath <[email protected]>
>> ---
>>
>> v6->v7:
>> - Changed the input argument in arch_set_thermal_pressure from
>> capped capacity to delta capacity(thermal pressure) as per
>> Ionela's review comments.
>>
>> arch/arm/include/asm/topology.h | 3 +++
>> arch/arm64/include/asm/topology.h | 3 +++
>
> Any particular reason to enable this for arm/arm64 in this patch
> itself? I'd have enabled them in two separate patches after this one.

No reason. No reason not to as well as arch_topology is "Arm specific
cpu topology file" and changes are one-liners.

>
>> drivers/base/arch_topology.c | 11 +++++++++++
>> include/linux/arch_topology.h | 10 ++++++++++
>> 4 files changed, 27 insertions(+)
>>
>> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
>> index 8a0fae9..3a50a19 100644
>> --- a/arch/arm/include/asm/topology.h
>> +++ b/arch/arm/include/asm/topology.h
>> @@ -16,6 +16,9 @@
>> /* Enable topology flag updates */
>> #define arch_update_cpu_topology topology_update_cpu_topology
>>
>> +/* Replace task scheduler's default thermal pressure retrieve API */
>> +#define arch_cpu_thermal_pressure topology_get_thermal_pressure
>> +
>> #else
>>
>> static inline void init_cpu_topology(void) { }
>> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
>> index a4d945d..a70896f 100644
>> --- a/arch/arm64/include/asm/topology.h
>> +++ b/arch/arm64/include/asm/topology.h
>> @@ -25,6 +25,9 @@ int pcibus_to_node(struct pci_bus *bus);
>> /* Enable topology flag updates */
>> #define arch_update_cpu_topology topology_update_cpu_topology
>>
>> +/* Replace task scheduler's default thermal pressure retrieve API */
>> +#define arch_cpu_thermal_pressure topology_get_thermal_pressure
>> +
>> #include <asm-generic/topology.h>
>>
>> #endif /* _ASM_ARM_TOPOLOGY_H */
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 6119e11..68dfa49 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -42,6 +42,17 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
>> per_cpu(cpu_scale, cpu) = capacity;
>> }
>>
>> +DEFINE_PER_CPU(unsigned long, thermal_pressure);
>> +
>> +void arch_set_thermal_pressure(struct cpumask *cpus,
>> + unsigned long th_pressure)
>> +{
>> + int cpu;
>
> <snip>
>


--
Warm Regards
Thara

2020-02-13 14:12:33

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 1/8] sched/pelt: Add support to track thermal pressure

On 02/13/2020 07:29 AM, Amit Kucheria wrote:
> On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
> <[email protected]> wrote:
>>
>> Extrapolating on the existing framework to track rt/dl utilization using
>> pelt signals, add a similar mechanism to track thermal pressure. The
>> difference here from rt/dl utilization tracking is that, instead of
>> tracking time spent by a cpu running a rt/dl task through util_avg, the
>> average thermal pressure is tracked through load_avg. This is because
>> thermal pressure signal is weighted "delta" capacity and is not
>> binary(util_avg is binary). "delta capacity" here means delta between the
>> actual capacity of a cpu and the decreased capacity a cpu due to a thermal
>> event.
>>
>> In order to track average thermal pressure, a new sched_avg variable
>> avg_thermal is introduced. Function update_thermal_load_avg can be called
>> to do the periodic bookkeeping (accumulate, decay and average) of the
>> thermal pressure.
>>
>> Signed-off-by: Thara Gopinath <[email protected]>
>> Reviewed-by: Vincent Guittot <[email protected]>
>> ---
>>
>> v6->v7:
>> - Added CONFIG_HAVE_SCHED_THERMAL_PRESSURE to stub out
>> update_thermal_load_avg in unsupported architectures as per
>> review comments from Peter, Dietmar and Quentin.
>> - Updated comment for update_thermal_load_avg as per review
>> comments from Peter and Dietmar.
>> v7->v8:
>> - Fixed typo in defining update_thermal_load_avg which was
>> causing build errors (reported by kbuild test report)
>> v8->v9:
>> - Defined thermal_load_avg to read rq->avg_thermal.load_avg and
>> avoid cacheline miss in unsupported cases as per Peter's
>> suggestion.
>>
>> include/trace/events/sched.h | 4 ++++
>> init/Kconfig | 4 ++++
>> kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++
>> kernel/sched/pelt.h | 31 +++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 3 +++
>> 5 files changed, 73 insertions(+)
>>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 420e80e..a8fb667 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -613,6 +613,10 @@ DECLARE_TRACE(pelt_dl_tp,
>> TP_PROTO(struct rq *rq),
>> TP_ARGS(rq));
>>
>> +DECLARE_TRACE(pelt_thermal_tp,
>> + TP_PROTO(struct rq *rq),
>> + TP_ARGS(rq));
>> +
>> DECLARE_TRACE(pelt_irq_tp,
>> TP_PROTO(struct rq *rq),
>> TP_ARGS(rq));
>> diff --git a/init/Kconfig b/init/Kconfig
>> index bd9f1fd..055c3bf 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -463,6 +463,10 @@ config HAVE_SCHED_AVG_IRQ
>> depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING
>> depends on SMP
>>
>> +config HAVE_SCHED_THERMAL_PRESSURE
>> + bool "Enable periodic averaging of thermal pressure"
>> + depends on SMP
>> +
>> config BSD_PROCESS_ACCT
>> bool "BSD Process Accounting"
>> depends on MULTIUSER
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index bd006b7..5d1fbf0 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -367,6 +367,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>> return 0;
>> }
>>
>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>> +/*
>> + * thermal:
>> + *
>> + * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked
>> + *
>> + * util_avg and runnable_load_avg are not supported and meaningless.
>> + *
>> + * Unlike rt/dl utilization tracking that track time spent by a cpu
>> + * running a rt/dl task through util_avg, the average thermal pressure is
>> + * tracked through load_avg. This is because thermal pressure signal is
>> + * weighted "delta" capacity and is not binary(util_avg is binary). "delta
>
> May I suggest a slight rewording here and in the commit description,
>
> This is because the thermal pressure signal is weighted "delta"
> capacity unlike util_avg which is binary.

Sure! Will fix it.

>
> It would also help, if you expanded on what you mean by binary in the
> commit description and how the delta capacity is weighted.

I don't understand this. Binary means 0 or 1. delta capacity is time
weighted, i will update this.

>
>> + * capacity" here means delta between the actual capacity of a cpu and the
>> + * decreased capacity a cpu due to a thermal event.
>
> Use consistent wording throughout the series - either capped or
> decreased capacity.
>
>> + */
>
> This could be shortened to:
>
> Delta capacity of cpu = Actual capacity - Capped capacity due to thermal event

Will fix this.

>
>> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
>> +{
>> + if (___update_load_sum(now, &rq->avg_thermal,
>> + capacity,
>> + capacity,
>> + capacity)) {
>> + ___update_load_avg(&rq->avg_thermal, 1, 1);
>> + trace_pelt_thermal_tp(rq);
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>> +#endif
>> +
>> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>> /*
>> * irq:
>> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
>> index afff644..916979a 100644
>> --- a/kernel/sched/pelt.h
>> +++ b/kernel/sched/pelt.h
>> @@ -7,6 +7,26 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>>
>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
>> +
>> +static inline u64 thermal_load_avg(struct rq *rq)
>> +{
>> + return READ_ONCE(rq->avg_thermal.load_avg);
>> +}
>> +#else
>> +static inline int
>> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline u64 thermal_load_avg(struct rq *rq)
>> +{
>> + return 0;
>> +}
>> +#endif
>> +
>> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>> int update_irq_load_avg(struct rq *rq, u64 running);
>> #else
>> @@ -159,6 +179,17 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>> }
>>
>> static inline int
>> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline u64 thermal_load_avg(struct rq *rq)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline int
>> update_irq_load_avg(struct rq *rq, u64 running)
>> {
>> return 0;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 1a88dc8..1f256cb 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -944,6 +944,9 @@ struct rq {
>> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>> struct sched_avg avg_irq;
>> #endif
>> +#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
>> + struct sched_avg avg_thermal;
>> +#endif
>> u64 idle_stamp;
>> u64 avg_idle;
>>
>> --
>> 2.1.4
>>


--
Warm Regards
Thara

2020-02-13 14:14:11

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 5/8] sched/fair: update cpu_capacity to reflect thermal pressure

On 02/13/2020 07:47 AM, Amit Kucheria wrote:
> On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
> <[email protected]> wrote:
>>
>> cpu_capacity initially reflects the maximum possible capacity of a cpu.
>> Thermal pressure on a cpu means this maximum possible capacity is
>> unavailable due to thermal events. This patch subtracts the average thermal
>> pressure for a cpu from its maximum possible capacity so that cpu_capacity
>> reflects the actual maximum currently available capacity.
>>
>> Signed-off-by: Thara Gopinath <[email protected]>
>> ---
>>
>> v8->v9:
>> - Use thermal_load_avg to read rq->avg_thermal.load_avg.
>>
>> kernel/sched/fair.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5f58c03..d879077 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7753,8 +7753,15 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>> if (unlikely(irq >= max))
>> return 1;
>>
>> + /*
>> + * avg_rt.util avg and avg_dl.util track binary signals
>
> For avg_rt, s/util avg/util_avg/
> For avg_dl, s/util/util_avg/ ?

Yep. Will fix it.

>
>> + * (running and not running) with weights 0 and 1024 respectively.
>> + * avg_thermal.load_avg tracks thermal pressure and the weighted
>> + * average uses the actual delta max capacity(load).
>> + */
>> used = READ_ONCE(rq->avg_rt.util_avg);
>> used += READ_ONCE(rq->avg_dl.util_avg);
>> + used += thermal_load_avg(rq);
>>
>> if (unlikely(used >= max))
>> return 1;
>> --
>> 2.1.4
>>


--
Warm Regards
Thara

2020-02-13 14:40:01

by Amit Kucheria

[permalink] [raw]
Subject: Re: [Patch v9 3/8] arm,arm64,drivers:Add infrastructure to store and update instantaneous thermal pressure

On Thu, Feb 13, 2020 at 7:35 PM Thara Gopinath
<[email protected]> wrote:
>
> On 02/13/2020 07:25 AM, Amit Kucheria wrote:
> > On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
> > <[email protected]> wrote:
> >>
> >> Add architecture specific APIs to update and track thermal pressure on a
> >> per cpu basis. A per cpu variable thermal_pressure is introduced to keep
> >> track of instantaneous per cpu thermal pressure. Thermal pressure is the
> >> delta between maximum capacity and capped capacity due to a thermal event.
> >
> > s/capped/decreased to have consistent use throughout the series e.g. in patch 1.
> >
> > Though personally, I like "capped capacity" in which case
> > s/decreased/capped in patch 1 and elsewhere.
>
> I will fix this
> >
> >>
> >> topology_get_thermal_pressure can be hooked into the scheduler specified
> >> arch_cpu_thermal_capacity to retrieve instantaneous thermal pressure of a
> >> cpu.
> >>
> >> arch_set_thermal_pressure can be used to update the thermal pressure.
> >>
> >> Considering topology_get_thermal_pressure reads thermal_pressure and
> >> arch_set_thermal_pressure writes into thermal_pressure, one can argue for
> >> some sort of locking mechanism to avoid a stale value. But considering
> >> topology_get_thermal_pressure can be called from a system critical path
> >> like scheduler tick function, a locking mechanism is not ideal. This means
> >> that it is possible the thermal_pressure value used to calculate average
> >> thermal pressure for a cpu can be stale for upto 1 tick period.
> >>
> >> Signed-off-by: Thara Gopinath <[email protected]>
> >> ---
> >>
> >> v6->v7:
> >> - Changed the input argument in arch_set_thermal_pressure from
> >> capped capacity to delta capacity(thermal pressure) as per
> >> Ionela's review comments.
> >>
> >> arch/arm/include/asm/topology.h | 3 +++
> >> arch/arm64/include/asm/topology.h | 3 +++
> >
> > Any particular reason to enable this for arm/arm64 in this patch
> > itself? I'd have enabled them in two separate patches after this one.
>
> No reason. No reason not to as well as arch_topology is "Arm specific
> cpu topology file" and changes are one-liners.

One reason to do this, IMHO, is to keep platform conversions separate
from the core infrastructure in a series, so the core can get merged
while platform maintainers can take their time to decide if, when, how
to merge this.

2020-02-13 14:43:03

by Amit Kucheria

[permalink] [raw]
Subject: Re: [Patch v9 1/8] sched/pelt: Add support to track thermal pressure

On Thu, Feb 13, 2020 at 7:41 PM Thara Gopinath
<[email protected]> wrote:
>
> On 02/13/2020 07:29 AM, Amit Kucheria wrote:
> > On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
> > <[email protected]> wrote:
> >>

<snip>

> >> + * Unlike rt/dl utilization tracking that track time spent by a cpu
> >> + * running a rt/dl task through util_avg, the average thermal pressure is
> >> + * tracked through load_avg. This is because thermal pressure signal is
> >> + * weighted "delta" capacity and is not binary(util_avg is binary). "delta
> >
> > May I suggest a slight rewording here and in the commit description,
> >
> > This is because the thermal pressure signal is weighted "delta"
> > capacity unlike util_avg which is binary.
>
> Sure! Will fix it.
>
> >
> > It would also help, if you expanded on what you mean by binary in the
> > commit description and how the delta capacity is weighted.
>
> I don't understand this. Binary means 0 or 1.

So the value returned by util_avg is literally 0 or 1? In which case
ignore my comment. I'll read the math again later.

> delta capacity is time
> weighted, i will update this.

Thanks.

2020-02-14 10:27:30

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On 13/02/2020 14:54, Thara Gopinath wrote:
> On 02/10/2020 06:59 AM, Dietmar Eggemann wrote:
>> On 07/02/2020 23:42, Thara Gopinath wrote:
>>> On 02/04/2020 03:39 AM, Dietmar Eggemann wrote:
>>>> On 03/02/2020 16:55, Peter Zijlstra wrote:
>>>>> On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
>>>>>> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 1/28/20 2:36 PM, Thara Gopinath wrote:

[...]

>> is really not saying from which review comment the individual changes in
>> the function name are coming from. And I don't see an answer to Ionela's
>> email saying that her proposal will manifest in a particular part of
>> this change.
> Hi Dietmar,
>
> Like I said, don't want to argue on name. It is trivial for me. I have
> v10 prepped with the name change. Will send it out shortly.

Thanks.

[...]

>> Cpu-invariant accounting can't be guarded with a kernel CONFIG switch.
>> Frequency-invariant accounting could be with CONFIG_CPU_FREQ but this is
>> enabled by default by Arm64 defconfig.
>> Thermal pressure (accounting) (CONFIG_HAVE_SCHED_THERMAL_PRESSURE) is
>> disabled by default so why should a per-cpu thermal_pressure be
>> maintained on such a system (CONFIG_CPU_THERMAL=y by default)?
>
> I agree that there is no need for per-cpu thermal pressure to be
> maintained if no averaging is happening in the scheduler, today. I don't
> know if there will ever be an use for it.

All arch_scale_FOO() functions follow the approach to force the arch
(currently x86, arm, arm64) to do

#define arch_scale_FOO BAR

to enable the FOO functionality.

There is no direct link between consumer and provider here.

consumer (sched) -> arch <- provider (arch, counters, CPUfreq, CPU
cooling, etc.)

So IMHO, FOO=thermal_pressure should follow this design pattern too.

'thermal_pressure' would be the only one which can be disabled by a
kernel config switch at the consumer side.
IMHO, it doesn't make sense to have the provider operating in this case.

> My issue has to do with using a config option meant for internal
> scheduler code being used else where. To me, once this happens, the
> entire work done to separate out reading and writing of instantaneous
> thermal pressure to arch_topology makes no sense. We could have kept it
> in scheduler itself.

You might see thermal_pressure more on the level of irq_load or
[rt/dl]_rq_load and that could be why we have a different opinion here?

Now rt_rq_load and dl_rq_load are scheduler internal providers and
irq_load is driven by 'irq_delta + steal' time (which is much closer to
the scheduler than thermal for instance).

My assumption is that we don't want a direct link between the scheduler
and e.g. a provider 'thermal'.

> Another way I think about this whole thermal pressure framework is that
> it is the job of cooling device or cpufreq or any other entity to update
> a throttle in maximum pressure to the scheduler. It should be
> independent of what scheduler does with it. Scheduler can choose to
> ignore it

2020-02-14 14:53:37

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 5/8] sched/fair: update cpu_capacity to reflect thermal pressure

On 02/13/2020 08:39 AM, Amit Kucheria wrote:
> On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
> <[email protected]> wrote:
>>
>> cpu_capacity initially reflects the maximum possible capacity of a cpu.
>> Thermal pressure on a cpu means this maximum possible capacity is
>> unavailable due to thermal events. This patch subtracts the average thermal
>> pressure for a cpu from its maximum possible capacity so that cpu_capacity
>> reflects the actual maximum currently available capacity.
>
> "actual maximum currently available capacity" is quite a mouthful. :-)
>
> "Remaining capacity" or "Effective capacity" anyone?

"Remaining maximum capacity"?

>
> IIUC, this remaining capacity is NOT the same as the capped/decreased
> capacity referred to in patches 1 and 3. The delta capacity (aka
> thermal pressure) there refers to the difference between HW max
> capacity and thermally throttled capacity.
> Here, we also subtract RT/DL utilisation. Is that accurate?

Yes, here we do subtract RT/DL utilization as well. But from the thermal
pressure point of view, it is immaterial. I am not touching the code
that subtracts RT/DL utilization , I am just adding thermal pressure to
the variables that has to be deducted from the stated max capacity to
reflect the actual capacity. So as far as this patch series is
concerned, capped/decreased capacity is the same across all
patches.(Though there is the instantaneous capped capacity and the
time-averaged capped capacity)


--
Warm Regards
Thara

2020-02-14 15:01:26

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 3/8] arm,arm64,drivers:Add infrastructure to store and update instantaneous thermal pressure

On 02/13/2020 09:38 AM, Amit Kucheria wrote:
> On Thu, Feb 13, 2020 at 7:35 PM Thara Gopinath
> <[email protected]> wrote:
>>
>> On 02/13/2020 07:25 AM, Amit Kucheria wrote:
>>> On Wed, Jan 29, 2020 at 4:06 AM Thara Gopinath
>>> <[email protected]> wrote:
>>>>
>>>> Add architecture specific APIs to update and track thermal pressure on a
>>>> per cpu basis. A per cpu variable thermal_pressure is introduced to keep
>>>> track of instantaneous per cpu thermal pressure. Thermal pressure is the
>>>> delta between maximum capacity and capped capacity due to a thermal event.
>>>
>>> s/capped/decreased to have consistent use throughout the series e.g. in patch 1.
>>>
>>> Though personally, I like "capped capacity" in which case
>>> s/decreased/capped in patch 1 and elsewhere.
>>
>> I will fix this
>>>
>>>>
>>>> topology_get_thermal_pressure can be hooked into the scheduler specified
>>>> arch_cpu_thermal_capacity to retrieve instantaneous thermal pressure of a
>>>> cpu.
>>>>
>>>> arch_set_thermal_pressure can be used to update the thermal pressure.
>>>>
>>>> Considering topology_get_thermal_pressure reads thermal_pressure and
>>>> arch_set_thermal_pressure writes into thermal_pressure, one can argue for
>>>> some sort of locking mechanism to avoid a stale value. But considering
>>>> topology_get_thermal_pressure can be called from a system critical path
>>>> like scheduler tick function, a locking mechanism is not ideal. This means
>>>> that it is possible the thermal_pressure value used to calculate average
>>>> thermal pressure for a cpu can be stale for upto 1 tick period.
>>>>
>>>> Signed-off-by: Thara Gopinath <[email protected]>
>>>> ---
>>>>
>>>> v6->v7:
>>>> - Changed the input argument in arch_set_thermal_pressure from
>>>> capped capacity to delta capacity(thermal pressure) as per
>>>> Ionela's review comments.
>>>>
>>>> arch/arm/include/asm/topology.h | 3 +++
>>>> arch/arm64/include/asm/topology.h | 3 +++
>>>
>>> Any particular reason to enable this for arm/arm64 in this patch
>>> itself? I'd have enabled them in two separate patches after this one.
>>
>> No reason. No reason not to as well as arch_topology is "Arm specific
>> cpu topology file" and changes are one-liners.
>
> One reason to do this, IMHO, is to keep platform conversions separate
> from the core infrastructure in a series, so the core can get merged
> while platform maintainers can take their time to decide if, when, how
> to merge this.

That makes sense. I will split it out. I did not think of it from a
merging point of view.

>


--
Warm Regards
Thara

2020-02-18 14:58:14

by Thara Gopinath

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period



On 2/14/20 5:26 AM, Dietmar Eggemann wrote:
> On 13/02/2020 14:54, Thara Gopinath wrote:
>> On 02/10/2020 06:59 AM, Dietmar Eggemann wrote:
>>> On 07/02/2020 23:42, Thara Gopinath wrote:
>>>> On 02/04/2020 03:39 AM, Dietmar Eggemann wrote:
>>>>> On 03/02/2020 16:55, Peter Zijlstra wrote:
>>>>>> On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
>>>>>>> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 1/28/20 2:36 PM, Thara Gopinath wrote:
>
> [...]
>
>>> is really not saying from which review comment the individual changes in
>>> the function name are coming from. And I don't see an answer to Ionela's
>>> email saying that her proposal will manifest in a particular part of
>>> this change.
>> Hi Dietmar,
>>
>> Like I said, don't want to argue on name. It is trivial for me. I have
>> v10 prepped with the name change. Will send it out shortly.
>
> Thanks.
>
> [...]
>
>>> Cpu-invariant accounting can't be guarded with a kernel CONFIG switch.
>>> Frequency-invariant accounting could be with CONFIG_CPU_FREQ but this is
>>> enabled by default by Arm64 defconfig.
>>> Thermal pressure (accounting) (CONFIG_HAVE_SCHED_THERMAL_PRESSURE) is
>>> disabled by default so why should a per-cpu thermal_pressure be
>>> maintained on such a system (CONFIG_CPU_THERMAL=y by default)?
>>
>> I agree that there is no need for per-cpu thermal pressure to be
>> maintained if no averaging is happening in the scheduler, today. I don't
>> know if there will ever be an use for it.
>
> All arch_scale_FOO() functions follow the approach to force the arch
> (currently x86, arm, arm64) to do
>
> #define arch_scale_FOO BAR
>
> to enable the FOO functionality.
>
> There is no direct link between consumer and provider here.
>
> consumer (sched) -> arch <- provider (arch, counters, CPUfreq, CPU
> cooling, etc.)
>
> So IMHO, FOO=thermal_pressure should follow this design pattern too.
>
> 'thermal_pressure' would be the only one which can be disabled by a
> kernel config switch at the consumer side.
> IMHO, it doesn't make sense to have the provider operating in this case.
>
>> My issue has to do with using a config option meant for internal
>> scheduler code being used else where. To me, once this happens, the
>> entire work done to separate out reading and writing of instantaneous
>> thermal pressure to arch_topology makes no sense. We could have kept it
>> in scheduler itself.
>
> You might see thermal_pressure more on the level of irq_load or
> [rt/dl]_rq_load and that could be why we have a different opinion here?
>
> Now rt_rq_load and dl_rq_load are scheduler internal providers and
> irq_load is driven by 'irq_delta + steal' time (which is much closer to
> the scheduler than thermal for instance).

In this case, thermal pressure is quite close to scheduler as it reduces
the maximum capacity available per cpu and hence affects scheduler
placement of tasks

>
> My assumption is that we don't want a direct link between the scheduler
> and e.g. a provider 'thermal'.

Exactly. Which is why the same CONFIG option should not be used between
the provider and consumer.

>
>> Another way I think about this whole thermal pressure framework is that
>> it is the job of cooling device or cpufreq or any other entity to update
>> a throttle in maximum pressure to the scheduler. It should be
>> independent of what scheduler does with it. Scheduler can choose to
>> ignore it

--
Warm Regards
Thara

2020-02-19 09:16:22

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [Patch v9 7/8] sched/fair: Enable tuning of decay period

On 18/02/2020 15:57, Thara Gopinath wrote:
>
>
> On 2/14/20 5:26 AM, Dietmar Eggemann wrote:
>> On 13/02/2020 14:54, Thara Gopinath wrote:
>>> On 02/10/2020 06:59 AM, Dietmar Eggemann wrote:
>>>> On 07/02/2020 23:42, Thara Gopinath wrote:
>>>>> On 02/04/2020 03:39 AM, Dietmar Eggemann wrote:
>>>>>> On 03/02/2020 16:55, Peter Zijlstra wrote:
>>>>>>> On Mon, Feb 03, 2020 at 07:07:57AM -0500, Thara Gopinath wrote:
>>>>>>>> On 01/28/2020 06:56 PM, Randy Dunlap wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 1/28/20 2:36 PM, Thara Gopinath wrote:

[...]

>>>> Cpu-invariant accounting can't be guarded with a kernel CONFIG switch.
>>>> Frequency-invariant accounting could be with CONFIG_CPU_FREQ but
>>>> this is
>>>> enabled by default by Arm64 defconfig.
>>>> Thermal pressure (accounting) (CONFIG_HAVE_SCHED_THERMAL_PRESSURE) is
>>>> disabled by default so why should a per-cpu thermal_pressure be
>>>> maintained on such a system (CONFIG_CPU_THERMAL=y by default)?
>>>
>>> I agree that there is no need for per-cpu thermal pressure to be
>>> maintained if no averaging is happening in the scheduler, today. I don't
>>> know if there will ever be an use for it.
>>
>> All arch_scale_FOO() functions follow the approach to force the arch
>> (currently x86, arm, arm64) to do
>>
>> #define arch_scale_FOO BAR
>>
>> to enable the FOO functionality.
>>
>> There is no direct link between consumer and provider here.
>>
>>   consumer (sched) -> arch <- provider (arch, counters, CPUfreq, CPU
>>                                         cooling, etc.)
>>
>> So IMHO, FOO=thermal_pressure should follow this design pattern too.
>>
>> 'thermal_pressure' would be the only one which can be disabled by a
>> kernel config switch at the consumer side.
>> IMHO, it doesn't make sense to have the provider operating in this case.
>>
>>> My issue has to do with using a config option meant for internal
>>> scheduler code being used else where. To me, once this happens, the
>>> entire work done to separate out reading and writing of instantaneous
>>> thermal pressure to arch_topology makes no sense. We could have kept it
>>> in scheduler itself.
>>
>> You might see thermal_pressure more on the level of irq_load or
>> [rt/dl]_rq_load and that could be why we have a different opinion here?
>>
>> Now rt_rq_load and dl_rq_load are scheduler internal providers and
>> irq_load is driven by 'irq_delta + steal' time (which is much closer to
>> the scheduler than thermal for instance).
>
> In this case, thermal pressure is quite close to scheduler as it reduces
> the maximum capacity available per cpu and hence affects scheduler
> placement of tasks
>
>>
>> My assumption is that we don't want a direct link between the scheduler
>> and e.g. a provider 'thermal'.
>
> Exactly. Which is why the same CONFIG option should not be used between
> the provider and consumer.

I think there is a little misunderstanding here. By being close to the
scheduler I was referring to rt, dl, irq which are not driven via an
arch_scale_FOO function.

But I guess we agree that FOO=thermal_pressure should use this
arch_scale_FOO function so we don't have a direct link between scheduler
and thermal subsystem.

We disagree in the point whether the provider should be present and
working when the only consumer is disabled by the kernel config.

I guess we can't discuss the technical angle of this issue any further
so maybe the maintainer of drivers/base/arch_topology.c should make a
decision (the actual code is in 3/8 of this patch-set).

>>> Another way I think about this whole thermal pressure framework  is that
>>> it is the job of cooling device or cpufreq or any other entity to update
>>> a throttle in maximum pressure to the scheduler. It should be
>>> independent of what scheduler does with it. Scheduler can choose to
>>> ignore it