Provide some documentation for the idle injection cooling effect in
order to let people to understand the rational of the approach for the
idle injection CPU cooling device.
Signed-off-by: Daniel Lezcano <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
V4:
- Fixed typos, replaced 'period' per 'duty cycles', clarified some
wording (Amit Kucheria)
---
.../driver-api/thermal/cpu-idle-cooling.rst | 189 ++++++++++++++++++
1 file changed, 189 insertions(+)
create mode 100644 Documentation/driver-api/thermal/cpu-idle-cooling.rst
diff --git a/Documentation/driver-api/thermal/cpu-idle-cooling.rst b/Documentation/driver-api/thermal/cpu-idle-cooling.rst
new file mode 100644
index 000000000000..13d7fe4e8de8
--- /dev/null
+++ b/Documentation/driver-api/thermal/cpu-idle-cooling.rst
@@ -0,0 +1,189 @@
+
+Situation:
+----------
+
+Under certain circumstances a SoC can reach a critical temperature
+limit and is unable to stabilize the temperature around a temperature
+control. When the SoC has to stabilize the temperature, the kernel can
+act on a cooling device to mitigate the dissipated power. When the
+critical temperature is reached, a decision must be taken to reduce
+the temperature, that, in turn impacts performance.
+
+Another situation is when the silicon temperature continues to
+increase even after the dynamic leakage is reduced to its minimum by
+clock gating the component. This runaway phenomenon can continue due
+to the static leakage. The only solution is to power down the
+component, thus dropping the dynamic and static leakage that will
+allow the component to cool down.
+
+Last but not least, the system can ask for a specific power budget but
+because of the OPP density, we can only choose an OPP with a power
+budget lower than the requested one and under-utilize the CPU, thus
+losing performance. In other words, one OPP under-utilizes the CPU
+with a power less than the requested power budget and the next OPP
+exceeds the power budget. An intermediate OPP could have been used if
+it were present.
+
+Solutions:
+----------
+
+If we can remove the static and the dynamic leakage for a specific
+duration in a controlled period, the SoC temperature will
+decrease. Acting on the idle state duration or the idle cycle
+injection period, we can mitigate the temperature by modulating the
+power budget.
+
+The Operating Performance Point (OPP) density has a great influence on
+the control precision of cpufreq, however different vendors have a
+plethora of OPP density, and some have large power gap between OPPs,
+that will result in loss of performance during thermal control and
+loss of power in other scenarios.
+
+At a specific OPP, we can assume that injecting idle cycle on all CPUs
+belong to the same cluster, with a duration greater than the cluster
+idle state target residency, we lead to dropping the static and the
+dynamic leakage for this period (modulo the energy needed to enter
+this state). So the sustainable power with idle cycles has a linear
+relation with the OPP’s sustainable power and can be computed with a
+coefficient similar to:
+
+ Power(IdleCycle) = Coef x Power(OPP)
+
+Idle Injection:
+---------------
+
+The base concept of the idle injection is to force the CPU to go to an
+idle state for a specified time each control cycle, it provides
+another way to control CPU power and heat in addition to
+cpufreq. Ideally, if all CPUs belonging to the same cluster, inject
+their idle cycles synchronously, the cluster can reach its power down
+state with a minimum power consumption and reduce the static leakage
+to almost zero. However, these idle cycles injection will add extra
+latencies as the CPUs will have to wakeup from a deep sleep state.
+
+We use a fixed duration of idle injection that gives an acceptable
+performance penalty and a fixed latency. Mitigation can be increased
+or decreased by modulating the duty cycle of the idle injection.
+
+ ^
+ |
+ |
+ |------- -------
+ |_______|_______________________|_______|___________
+
+ <------>
+ idle <---------------------->
+ running
+
+ <----------------------------->
+ duty cycle 25%
+
+
+The implementation of the cooling device bases the number of states on
+the duty cycle percentage. When no mitigation is happening the cooling
+device state is zero, meaning the duty cycle is 0%.
+
+When the mitigation begins, depending on the governor's policy, a
+starting state is selected. With a fixed idle duration and the duty
+cycle (aka the cooling device state), the running duration can be
+computed.
+
+The governor will change the cooling device state thus the duty cycle
+and this variation will modulate the cooling effect.
+
+ ^
+ |
+ |
+ |------- -------
+ |_______|_______________|_______|___________
+
+ <------>
+ idle <-------------->
+ running
+
+ <----------------------------->
+ duty cycle 33%
+
+
+ ^
+ |
+ |
+ |------- -------
+ |_______|_______|_______|___________
+
+ <------>
+ idle <------>
+ running
+
+ <------------->
+ duty cycle 50%
+
+The idle injection duration value must comply with the constraints:
+
+- It is less than or equal to the latency we tolerate when the
+ mitigation begins. It is platform dependent and will depend on the
+ user experience, reactivity vs performance trade off we want. This
+ value should be specified.
+
+- It is greater than the idle state’s target residency we want to go
+ for thermal mitigation, otherwise we end up consuming more energy.
+
+Power considerations
+--------------------
+
+When we reach the thermal trip point, we have to sustain a specified
+power for a specific temperature but at this time we consume:
+
+ Power = Capacitance x Voltage^2 x Frequency x Utilisation
+
+... which is more than the sustainable power (or there is something
+wrong in the system setup). The ‘Capacitance’ and ‘Utilisation’ are a
+fixed value, ‘Voltage’ and the ‘Frequency’ are fixed artificially
+because we don’t want to change the OPP. We can group the
+‘Capacitance’ and the ‘Utilisation’ into a single term which is the
+‘Dynamic Power Coefficient (Cdyn)’ Simplifying the above, we have:
+
+ Pdyn = Cdyn x Voltage^2 x Frequency
+
+The power allocator governor will ask us somehow to reduce our power
+in order to target the sustainable power defined in the device
+tree. So with the idle injection mechanism, we want an average power
+(Ptarget) resulting in an amount of time running at full power on a
+specific OPP and idle another amount of time. That could be put in a
+equation:
+
+ P(opp)target = ((Trunning x (P(opp)running) + (Tidle x P(opp)idle)) /
+ (Trunning + Tidle)
+ ...
+
+ Tidle = Trunning x ((P(opp)running / P(opp)target) - 1)
+
+At this point if we know the running period for the CPU, that gives us
+the idle injection we need. Alternatively if we have the idle
+injection duration, we can compute the running duration with:
+
+ Trunning = Tidle / ((P(opp)running / P(opp)target) - 1)
+
+Practically, if the running power is less than the targeted power, we
+end up with a negative time value, so obviously the equation usage is
+bound to a power reduction, hence a higher OPP is needed to have the
+running power greater than the targeted power.
+
+However, in this demonstration we ignore three aspects:
+
+ * The static leakage is not defined here, we can introduce it in the
+ equation but assuming it will be zero most of the time as it is
+ difficult to get the values from the SoC vendors
+
+ * The idle state wake up latency (or entry + exit latency) is not
+ taken into account, it must be added in the equation in order to
+ rigorously compute the idle injection
+
+ * The injected idle duration must be greater than the idle state
+ target residency, otherwise we end up consuming more energy and
+ potentially invert the mitigation effect
+
+So the final equation is:
+
+ Trunning = (Tidle - Twakeup ) x
+ (((P(opp)dyn + P(opp)static ) - P(opp)target) / P(opp)target )
--
2.17.1
As we introduced the idle injection cooling device called
cpuidle_cooling, let's be consistent and rename the cpu_cooling to
cpufreq_cooling as this one mitigates with OPPs changes.
Signed-off-by: Daniel Lezcano <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Reviewed-by: Amit Kucheria <[email protected]>
---
V4:
- Added Acked-by and Reviewed-by
V3:
- Fix missing name conversion (Viresh Kumar)
---
Documentation/driver-api/thermal/exynos_thermal.rst | 2 +-
MAINTAINERS | 2 +-
drivers/thermal/Makefile | 2 +-
drivers/thermal/clock_cooling.c | 2 +-
drivers/thermal/{cpu_cooling.c => cpufreq_cooling.c} | 6 +++---
include/linux/clock_cooling.h | 2 +-
6 files changed, 8 insertions(+), 8 deletions(-)
rename drivers/thermal/{cpu_cooling.c => cpufreq_cooling.c} (99%)
diff --git a/Documentation/driver-api/thermal/exynos_thermal.rst b/Documentation/driver-api/thermal/exynos_thermal.rst
index 5bd556566c70..d4e4a5b75805 100644
--- a/Documentation/driver-api/thermal/exynos_thermal.rst
+++ b/Documentation/driver-api/thermal/exynos_thermal.rst
@@ -67,7 +67,7 @@ TMU driver description:
The exynos thermal driver is structured as::
Kernel Core thermal framework
- (thermal_core.c, step_wise.c, cpu_cooling.c)
+ (thermal_core.c, step_wise.c, cpufreq_cooling.c)
^
|
|
diff --git a/MAINTAINERS b/MAINTAINERS
index d92e9eb14131..0d658a628296 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16337,7 +16337,7 @@ L: [email protected]
S: Supported
F: Documentation/driver-api/thermal/cpu-cooling-api.rst
F: Documentation/driver-api/thermal/cpu-idle-cooling.rst
-F: drivers/thermal/cpu_cooling.c
+F: drivers/thermal/cpufreq_cooling.c
F: drivers/thermal/cpuidle_cooling.c
F: include/linux/cpu_cooling.h
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 6929e6fad1ac..d502a597a717 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -19,7 +19,7 @@ thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR) += power_allocator.o
# cpufreq cooling
-thermal_sys-$(CONFIG_CPU_FREQ_THERMAL) += cpu_cooling.o
+thermal_sys-$(CONFIG_CPU_FREQ_THERMAL) += cpufreq_cooling.o
thermal_sys-$(CONFIG_CPU_IDLE_THERMAL) += cpuidle_cooling.o
# clock cooling
diff --git a/drivers/thermal/clock_cooling.c b/drivers/thermal/clock_cooling.c
index 3ad3256c48fd..7cb3ae4b44ee 100644
--- a/drivers/thermal/clock_cooling.c
+++ b/drivers/thermal/clock_cooling.c
@@ -7,7 +7,7 @@
* Copyright (C) 2013 Texas Instruments Inc.
* Contact: Eduardo Valentin <[email protected]>
*
- * Highly based on cpu_cooling.c.
+ * Highly based on cpufreq_cooling.c.
* Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
* Copyright (C) 2012 Amit Daniel <[email protected]>
*/
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpufreq_cooling.c
similarity index 99%
rename from drivers/thermal/cpu_cooling.c
rename to drivers/thermal/cpufreq_cooling.c
index 52569b27b426..1dcba6bcaf7b 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * linux/drivers/thermal/cpu_cooling.c
+ * linux/drivers/thermal/cpufreq_cooling.c
*
* Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
*
@@ -620,7 +620,7 @@ of_cpufreq_cooling_register(struct cpufreq_policy *policy)
struct thermal_cooling_device *cdev = NULL;
if (!np) {
- pr_err("cpu_cooling: OF node not available for cpu%d\n",
+ pr_err("cpufreq_cooling: OF node not available for cpu%d\n",
policy->cpu);
return NULL;
}
@@ -630,7 +630,7 @@ of_cpufreq_cooling_register(struct cpufreq_policy *policy)
cdev = __cpufreq_cooling_register(np, policy, em);
if (IS_ERR(cdev)) {
- pr_err("cpu_cooling: cpu%d failed to register as cooling device: %ld\n",
+ pr_err("cpufreq_cooling: cpu%d failed to register as cooling device: %ld\n",
policy->cpu, PTR_ERR(cdev));
cdev = NULL;
}
diff --git a/include/linux/clock_cooling.h b/include/linux/clock_cooling.h
index b5cebf766e02..4b0a69863656 100644
--- a/include/linux/clock_cooling.h
+++ b/include/linux/clock_cooling.h
@@ -7,7 +7,7 @@
* Copyright (C) 2013 Texas Instruments Inc.
* Contact: Eduardo Valentin <[email protected]>
*
- * Highly based on cpu_cooling.c.
+ * Highly based on cpufreq_cooling.c.
* Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
* Copyright (C) 2012 Amit Daniel <[email protected]>
*/
--
2.17.1
The cpu idle cooling device offers a new method to cool down a CPU by
injecting idle cycles at runtime.
It has some similarities with the intel power clamp driver but it is
actually designed to be more generic and relying on the idle injection
powercap framework.
The idle injection duration is fixed while the running duration is
variable. That allows to have control on the device reactivity for the
user experience.
An idle state powering down the CPU or the cluster will allow to drop
the static leakage, thus restoring the heat capacity of the SoC. It
can be set with a trip point between the hot and the critical points,
giving the opportunity to prevent a hard reset of the system when the
cpufreq cooling fails to cool down the CPU.
With more sophisticated boards having a per core sensor, the idle
cooling device allows to cool down a single core without throttling
the compute capacity of several cpus belonging to the same clock line,
so it could be used in collaboration with the cpufreq cooling device.
Signed-off-by: Daniel Lezcano <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
V5:
- Fixed NULL pointer check from idle_inject_register
- Fixed invalid zero value passed to idle_inject_set_duration
V4:
- Fixed typos in the kernel-doc and clarified the changelog (Amit Kucheria)
V3:
- Add missing parameter documentation (Viresh Kumar)
- Fixed function description (Viresh Kumar)
- Add entry in MAINTAINER file
V2:
- Remove idle_duration_us field and use idle_inject API instead (Viresh Kumar)
- Fixed function definition wheh CPU_IDLE_COOLING is not set
- Inverted the initialization in the init function (Viresh Kumar)
---
MAINTAINERS | 3 +
drivers/thermal/Kconfig | 7 +
drivers/thermal/Makefile | 1 +
drivers/thermal/cpuidle_cooling.c | 234 ++++++++++++++++++++++++++++++
include/linux/cpu_cooling.h | 22 +++
5 files changed, 267 insertions(+)
create mode 100644 drivers/thermal/cpuidle_cooling.c
diff --git a/MAINTAINERS b/MAINTAINERS
index bd5847e802de..d92e9eb14131 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16330,12 +16330,15 @@ F: Documentation/devicetree/bindings/thermal/
THERMAL/CPU_COOLING
M: Amit Daniel Kachhap <[email protected]>
+M: Daniel Lezcano <[email protected]>
M: Viresh Kumar <[email protected]>
M: Javi Merino <[email protected]>
L: [email protected]
S: Supported
F: Documentation/driver-api/thermal/cpu-cooling-api.rst
+F: Documentation/driver-api/thermal/cpu-idle-cooling.rst
F: drivers/thermal/cpu_cooling.c
+F: drivers/thermal/cpuidle_cooling.c
F: include/linux/cpu_cooling.h
THERMAL DRIVER FOR AMLOGIC SOCS
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 702b33dcdb35..0d93c652014f 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -170,6 +170,13 @@ config CPU_FREQ_THERMAL
This will be useful for platforms using the generic thermal interface
and not the ACPI interface.
+config CPU_IDLE_THERMAL
+ bool "CPU idle cooling device"
+ depends on IDLE_INJECT
+ help
+ This implements the CPU cooling mechanism through
+ idle injection. This will throttle the CPU by injecting
+ idle cycle.
endif
config CLOCK_THERMAL
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 3f3533c01aa3..6929e6fad1ac 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -20,6 +20,7 @@ thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR) += power_allocator.o
# cpufreq cooling
thermal_sys-$(CONFIG_CPU_FREQ_THERMAL) += cpu_cooling.o
+thermal_sys-$(CONFIG_CPU_IDLE_THERMAL) += cpuidle_cooling.o
# clock cooling
thermal_sys-$(CONFIG_CLOCK_THERMAL) += clock_cooling.o
diff --git a/drivers/thermal/cpuidle_cooling.c b/drivers/thermal/cpuidle_cooling.c
new file mode 100644
index 000000000000..46932d1bd7e5
--- /dev/null
+++ b/drivers/thermal/cpuidle_cooling.c
@@ -0,0 +1,234 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Linaro Limited.
+ *
+ * Author: Daniel Lezcano <[email protected]>
+ *
+ */
+#include <linux/cpu_cooling.h>
+#include <linux/cpuidle.h>
+#include <linux/err.h>
+#include <linux/idle_inject.h>
+#include <linux/idr.h>
+#include <linux/slab.h>
+#include <linux/thermal.h>
+
+/**
+ * struct cpuidle_cooling_device - data for the idle cooling device
+ * @ii_dev: an atomic to keep track of the last task exiting the idle cycle
+ * @state: a normalized integer giving the state of the cooling device
+ */
+struct cpuidle_cooling_device {
+ struct idle_inject_device *ii_dev;
+ unsigned long state;
+};
+
+static DEFINE_IDA(cpuidle_ida);
+
+/**
+ * cpuidle_cooling_runtime - Running time computation
+ * @idle_duration_us: the idle cooling device
+ * @state: a percentile based number
+ *
+ * The running duration is computed from the idle injection duration
+ * which is fixed. If we reach 100% of idle injection ratio, that
+ * means the running duration is zero. If we have a 50% ratio
+ * injection, that means we have equal duration for idle and for
+ * running duration.
+ *
+ * The formula is deduced as follows:
+ *
+ * running = idle x ((100 / ratio) - 1)
+ *
+ * For precision purpose for integer math, we use the following:
+ *
+ * running = (idle x 100) / ratio - idle
+ *
+ * For example, if we have an injected duration of 50%, then we end up
+ * with 10ms of idle injection and 10ms of running duration.
+ *
+ * Return: An unsigned int for a usec based runtime duration.
+ */
+static unsigned int cpuidle_cooling_runtime(unsigned int idle_duration_us,
+ unsigned long state)
+{
+ if (!state)
+ return 0;
+
+ return ((idle_duration_us * 100) / state) - idle_duration_us;
+}
+
+/**
+ * cpuidle_cooling_get_max_state - Get the maximum state
+ * @cdev : the thermal cooling device
+ * @state : a pointer to the state variable to be filled
+ *
+ * The function always returns 100 as the injection ratio. It is
+ * percentile based for consistency accross different platforms.
+ *
+ * Return: The function can not fail, it is always zero
+ */
+static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ /*
+ * Depending on the configuration or the hardware, the running
+ * cycle and the idle cycle could be different. We want to
+ * unify that to an 0..100 interval, so the set state
+ * interface will be the same whatever the platform is.
+ *
+ * The state 100% will make the cluster 100% ... idle. A 0%
+ * injection ratio means no idle injection at all and 50%
+ * means for 10ms of idle injection, we have 10ms of running
+ * time.
+ */
+ *state = 100;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_get_cur_state - Get the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: a pointer to the state
+ *
+ * The function just copies the state value from the private thermal
+ * cooling device structure, the mapping is 1 <-> 1.
+ *
+ * Return: The function can not fail, it is always zero
+ */
+static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+
+ *state = idle_cdev->state;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_set_cur_state - Set the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: the target state
+ *
+ * The function checks first if we are initiating the mitigation which
+ * in turn wakes up all the idle injection tasks belonging to the idle
+ * cooling device. In any case, it updates the internal state for the
+ * cooling device.
+ *
+ * Return: The function can not fail, it is always zero
+ */
+static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+ struct idle_inject_device *ii_dev = idle_cdev->ii_dev;
+ unsigned long current_state = idle_cdev->state;
+ unsigned int runtime_us, idle_duration_us;
+
+ idle_cdev->state = state;
+
+ idle_inject_get_duration(ii_dev, &runtime_us, &idle_duration_us);
+
+ runtime_us = cpuidle_cooling_runtime(idle_duration_us, state);
+
+ idle_inject_set_duration(ii_dev, runtime_us, idle_duration_us);
+
+ if (current_state == 0 && state > 0) {
+ idle_inject_start(ii_dev);
+ } else if (current_state > 0 && !state) {
+ idle_inject_stop(ii_dev);
+ }
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_ops - thermal cooling device ops
+ */
+static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
+ .get_max_state = cpuidle_cooling_get_max_state,
+ .get_cur_state = cpuidle_cooling_get_cur_state,
+ .set_cur_state = cpuidle_cooling_set_cur_state,
+};
+
+/**
+ * cpuidle_of_cooling_register - Idle cooling device initialization function
+ * @drv: a cpuidle driver structure pointer
+ * @np: a node pointer to a device tree cooling device node
+ *
+ * This function is in charge of creating a cooling device per cpuidle
+ * driver and register it to thermal framework.
+ *
+ * Return: A valid pointer to a thermal cooling device or a PTR_ERR
+ * corresponding to the error detected in the underlying subsystems.
+ */
+struct thermal_cooling_device *
+__init cpuidle_of_cooling_register(struct device_node *np,
+ struct cpuidle_driver *drv)
+{
+ struct idle_inject_device *ii_dev;
+ struct cpuidle_cooling_device *idle_cdev;
+ struct thermal_cooling_device *cdev;
+ char dev_name[THERMAL_NAME_LENGTH];
+ int id, ret;
+
+ idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
+ if (!idle_cdev) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ id = ida_simple_get(&cpuidle_ida, 0, 0, GFP_KERNEL);
+ if (id < 0) {
+ ret = id;
+ goto out_kfree;
+ }
+
+ ii_dev = idle_inject_register(drv->cpumask);
+ if (!ii_dev) {
+ ret = -EINVAL;
+ goto out_id;
+ }
+
+ idle_inject_set_duration(ii_dev, TICK_USEC, TICK_USEC);
+
+ idle_cdev->ii_dev = ii_dev;
+
+ snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", id);
+
+ cdev = thermal_of_cooling_device_register(np, dev_name, idle_cdev,
+ &cpuidle_cooling_ops);
+ if (IS_ERR(cdev)) {
+ ret = PTR_ERR(cdev);
+ goto out_unregister;
+ }
+
+ return cdev;
+
+out_unregister:
+ idle_inject_unregister(ii_dev);
+out_id:
+ ida_simple_remove(&cpuidle_ida, id);
+out_kfree:
+ kfree(idle_cdev);
+out:
+ return ERR_PTR(ret);
+}
+
+/**
+ * cpuidle_cooling_register - Idle cooling device initialization function
+ * @drv: a cpuidle driver structure pointer
+ *
+ * This function is in charge of creating a cooling device per cpuidle
+ * driver and register it to thermal framework.
+ *
+ * Return: A valid pointer to a thermal cooling device, a PTR_ERR
+ * corresponding to the error detected in the underlying subsystems.
+ */
+struct thermal_cooling_device *
+__init cpuidle_cooling_register(struct cpuidle_driver *drv)
+{
+ return cpuidle_of_cooling_register(NULL, drv);
+}
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index 3cdd85f987d7..da0970183d1f 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -60,4 +60,26 @@ of_cpufreq_cooling_register(struct cpufreq_policy *policy)
}
#endif /* CONFIG_CPU_FREQ_THERMAL */
+struct cpuidle_driver;
+
+#ifdef CONFIG_CPU_IDLE_THERMAL
+extern struct thermal_cooling_device *
+__init cpuidle_cooling_register(struct cpuidle_driver *drv);
+extern struct thermal_cooling_device *
+__init cpuidle_of_cooling_register(struct device_node *np,
+ struct cpuidle_driver *drv);
+#else /* CONFIG_CPU_IDLE_THERMAL */
+static inline struct thermal_cooling_device *
+__init cpuidle_cooling_register(struct cpuidle_driver *drv)
+{
+ return ERR_PTR(-EINVAL);
+}
+static inline struct thermal_cooling_device *
+__init cpuidle_of_cooling_register(struct device_node *np,
+ struct cpuidle_driver *drv)
+{
+ return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_CPU_IDLE_THERMAL */
+
#endif /* __CPU_COOLING_H__ */
--
2.17.1
On 11.12.19 23:43, Daniel Lezcano wrote:
> Provide some documentation for the idle injection cooling effect in
> order to let people to understand the rational of the approach for the
> idle injection CPU cooling device.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> Acked-by: Viresh Kumar <[email protected]>
Reviewed-by: Martin Kepplinger <[email protected]>
thanks
> ---
> V4:
> - Fixed typos, replaced 'period' per 'duty cycles', clarified some
> wording (Amit Kucheria)
> ---
> .../driver-api/thermal/cpu-idle-cooling.rst | 189 ++++++++++++++++++
> 1 file changed, 189 insertions(+)
> create mode 100644 Documentation/driver-api/thermal/cpu-idle-cooling.rst
>
> diff --git a/Documentation/driver-api/thermal/cpu-idle-cooling.rst b/Documentation/driver-api/thermal/cpu-idle-cooling.rst
> new file mode 100644
> index 000000000000..13d7fe4e8de8
> --- /dev/null
> +++ b/Documentation/driver-api/thermal/cpu-idle-cooling.rst
> @@ -0,0 +1,189 @@
> +
> +Situation:
> +----------
> +
> +Under certain circumstances a SoC can reach a critical temperature
> +limit and is unable to stabilize the temperature around a temperature
> +control. When the SoC has to stabilize the temperature, the kernel can
> +act on a cooling device to mitigate the dissipated power. When the
> +critical temperature is reached, a decision must be taken to reduce
> +the temperature, that, in turn impacts performance.
> +
> +Another situation is when the silicon temperature continues to
> +increase even after the dynamic leakage is reduced to its minimum by
> +clock gating the component. This runaway phenomenon can continue due
> +to the static leakage. The only solution is to power down the
> +component, thus dropping the dynamic and static leakage that will
> +allow the component to cool down.
> +
> +Last but not least, the system can ask for a specific power budget but
> +because of the OPP density, we can only choose an OPP with a power
> +budget lower than the requested one and under-utilize the CPU, thus
> +losing performance. In other words, one OPP under-utilizes the CPU
> +with a power less than the requested power budget and the next OPP
> +exceeds the power budget. An intermediate OPP could have been used if
> +it were present.
> +
> +Solutions:
> +----------
> +
> +If we can remove the static and the dynamic leakage for a specific
> +duration in a controlled period, the SoC temperature will
> +decrease. Acting on the idle state duration or the idle cycle
> +injection period, we can mitigate the temperature by modulating the
> +power budget.
> +
> +The Operating Performance Point (OPP) density has a great influence on
> +the control precision of cpufreq, however different vendors have a
> +plethora of OPP density, and some have large power gap between OPPs,
> +that will result in loss of performance during thermal control and
> +loss of power in other scenarios.
> +
> +At a specific OPP, we can assume that injecting idle cycle on all CPUs
> +belong to the same cluster, with a duration greater than the cluster
> +idle state target residency, we lead to dropping the static and the
> +dynamic leakage for this period (modulo the energy needed to enter
> +this state). So the sustainable power with idle cycles has a linear
> +relation with the OPP’s sustainable power and can be computed with a
> +coefficient similar to:
> +
> + Power(IdleCycle) = Coef x Power(OPP)
> +
> +Idle Injection:
> +---------------
> +
> +The base concept of the idle injection is to force the CPU to go to an
> +idle state for a specified time each control cycle, it provides
> +another way to control CPU power and heat in addition to
> +cpufreq. Ideally, if all CPUs belonging to the same cluster, inject
> +their idle cycles synchronously, the cluster can reach its power down
> +state with a minimum power consumption and reduce the static leakage
> +to almost zero. However, these idle cycles injection will add extra
> +latencies as the CPUs will have to wakeup from a deep sleep state.
> +
> +We use a fixed duration of idle injection that gives an acceptable
> +performance penalty and a fixed latency. Mitigation can be increased
> +or decreased by modulating the duty cycle of the idle injection.
> +
> + ^
> + |
> + |
> + |------- -------
> + |_______|_______________________|_______|___________
> +
> + <------>
> + idle <---------------------->
> + running
> +
> + <----------------------------->
> + duty cycle 25%
> +
> +
> +The implementation of the cooling device bases the number of states on
> +the duty cycle percentage. When no mitigation is happening the cooling
> +device state is zero, meaning the duty cycle is 0%.
> +
> +When the mitigation begins, depending on the governor's policy, a
> +starting state is selected. With a fixed idle duration and the duty
> +cycle (aka the cooling device state), the running duration can be
> +computed.
> +
> +The governor will change the cooling device state thus the duty cycle
> +and this variation will modulate the cooling effect.
> +
> + ^
> + |
> + |
> + |------- -------
> + |_______|_______________|_______|___________
> +
> + <------>
> + idle <-------------->
> + running
> +
> + <----------------------------->
> + duty cycle 33%
> +
> +
> + ^
> + |
> + |
> + |------- -------
> + |_______|_______|_______|___________
> +
> + <------>
> + idle <------>
> + running
> +
> + <------------->
> + duty cycle 50%
> +
> +The idle injection duration value must comply with the constraints:
> +
> +- It is less than or equal to the latency we tolerate when the
> + mitigation begins. It is platform dependent and will depend on the
> + user experience, reactivity vs performance trade off we want. This
> + value should be specified.
> +
> +- It is greater than the idle state’s target residency we want to go
> + for thermal mitigation, otherwise we end up consuming more energy.
> +
> +Power considerations
> +--------------------
> +
> +When we reach the thermal trip point, we have to sustain a specified
> +power for a specific temperature but at this time we consume:
> +
> + Power = Capacitance x Voltage^2 x Frequency x Utilisation
> +
> +... which is more than the sustainable power (or there is something
> +wrong in the system setup). The ‘Capacitance’ and ‘Utilisation’ are a
> +fixed value, ‘Voltage’ and the ‘Frequency’ are fixed artificially
> +because we don’t want to change the OPP. We can group the
> +‘Capacitance’ and the ‘Utilisation’ into a single term which is the
> +‘Dynamic Power Coefficient (Cdyn)’ Simplifying the above, we have:
> +
> + Pdyn = Cdyn x Voltage^2 x Frequency
> +
> +The power allocator governor will ask us somehow to reduce our power
> +in order to target the sustainable power defined in the device
> +tree. So with the idle injection mechanism, we want an average power
> +(Ptarget) resulting in an amount of time running at full power on a
> +specific OPP and idle another amount of time. That could be put in a
> +equation:
> +
> + P(opp)target = ((Trunning x (P(opp)running) + (Tidle x P(opp)idle)) /
> + (Trunning + Tidle)
> + ...
> +
> + Tidle = Trunning x ((P(opp)running / P(opp)target) - 1)
> +
> +At this point if we know the running period for the CPU, that gives us
> +the idle injection we need. Alternatively if we have the idle
> +injection duration, we can compute the running duration with:
> +
> + Trunning = Tidle / ((P(opp)running / P(opp)target) - 1)
> +
> +Practically, if the running power is less than the targeted power, we
> +end up with a negative time value, so obviously the equation usage is
> +bound to a power reduction, hence a higher OPP is needed to have the
> +running power greater than the targeted power.
> +
> +However, in this demonstration we ignore three aspects:
> +
> + * The static leakage is not defined here, we can introduce it in the
> + equation but assuming it will be zero most of the time as it is
> + difficult to get the values from the SoC vendors
> +
> + * The idle state wake up latency (or entry + exit latency) is not
> + taken into account, it must be added in the equation in order to
> + rigorously compute the idle injection
> +
> + * The injected idle duration must be greater than the idle state
> + target residency, otherwise we end up consuming more energy and
> + potentially invert the mitigation effect
> +
> +So the final equation is:
> +
> + Trunning = (Tidle - Twakeup ) x
> + (((P(opp)dyn + P(opp)static ) - P(opp)target) / P(opp)target )
>
On 11.12.19 23:43, Daniel Lezcano wrote:
> As we introduced the idle injection cooling device called
> cpuidle_cooling, let's be consistent and rename the cpu_cooling to
> cpufreq_cooling as this one mitigates with OPPs changes.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> Acked-by: Viresh Kumar <[email protected]>
> Reviewed-by: Amit Kucheria <[email protected]>
> ---
> V4:
> - Added Acked-by and Reviewed-by
> V3:
> - Fix missing name conversion (Viresh Kumar)
Tested-by: Martin Kepplinger <[email protected]>
thanks
On 11.12.19 23:43, Daniel Lezcano wrote:
> The cpu idle cooling device offers a new method to cool down a CPU by
> injecting idle cycles at runtime.
>
> It has some similarities with the intel power clamp driver but it is
> actually designed to be more generic and relying on the idle injection
> powercap framework.
>
> The idle injection duration is fixed while the running duration is
> variable. That allows to have control on the device reactivity for the
> user experience.
>
> An idle state powering down the CPU or the cluster will allow to drop
> the static leakage, thus restoring the heat capacity of the SoC. It
> can be set with a trip point between the hot and the critical points,
> giving the opportunity to prevent a hard reset of the system when the
> cpufreq cooling fails to cool down the CPU.
>
> With more sophisticated boards having a per core sensor, the idle
> cooling device allows to cool down a single core without throttling
> the compute capacity of several cpus belonging to the same clock line,
> so it could be used in collaboration with the cpufreq cooling device.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> Acked-by: Viresh Kumar <[email protected]>
Tested-by: Martin Kepplinger <[email protected]>
thanks