2018-01-23 15:39:16

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 0/8] CPU cooling device new strategies

The following series provides a new way to cool down a SoC by reducing
the dissipated power on the CPUs. Based on the initial work from Kevin
Wangtao, the series implements a CPU cooling device based on idle
injection, relying on the cpuidle framework and a combo CPU cooling
device combining the cooling effect of the cpufreq and the cpuidle
cooling device with the objective of getting the advantages of both.

The patchset is designed to have the current DT binding for the
cpufreq cooling device to be compatible with the new cooling devices.

Different cpu cooling devices can not co-exist on the system, the cpu
cooling device is enabled or not, and one cooling strategy is selected
(cpufreq, cpuidle or both with the combo). It is not possible to have
all of them available at the same time.

This series is divided into three parts.

The first part just provides trivial changes for the copyright and
removes an unused field in the cpu freq cooling device structure.

The second part provides the idle injection cooling device, allowing a SoC
without a cpufreq driver to use this cooling device as an alternative.

The third part provides the combo idle injection and frequency cooling
device.

The preliminary benchmarks show the following changes:

On the hikey6220, dhrystone shows a throughtput increase of 40% for an increase
of the latency of 16% while sysbench shows a latency increase of 5%.

On a hikey3660, the combo cooling device shows an improvement in term of
throughput for the little cluster of 5% and a decrease of 5% on the big
cluster. However, the hikey3660 support in mainline is still experimental and
there is no guarantee the numberis provided in the DT are accurate enough.

Daniel Lezcano (9):
thermal/drivers/cpu_cooling: Fixup the header and copyright
thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
thermal/drivers/cpu_cooling: Remove pointless field
thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
thermal/drivers/cpu_cooling: Add idle cooling device documentation
cpuidle/drivers/cpuidle-arm: Register the cooling device
thermal/drivers/cpu_cooling: Add the combo cpu cooling device

Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++
drivers/cpuidle/cpuidle-arm.c | 5 +
drivers/thermal/Kconfig | 37 +-
drivers/thermal/cpu_cooling.c | 869 ++++++++++++++++++++++++++++-
include/linux/cpu_cooling.h | 15 +-
6 files changed, 1068 insertions(+), 31 deletions(-)
create mode 100644 Documentation/thermal/cpu-idle-cooling.txt

--
2.7.4



2018-01-23 15:36:42

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

The next changes will add new way to cool down a CPU. In order to
sanitize and make the overall cpu cooling code consistent and robust
we must prevent the cpu cooling devices to co-exists with the same
purpose at the same time in the kernel.

Make the CPU cooling device a choice in the Kconfig, so only one CPU
cooling strategy can be chosen.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/Kconfig | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 315ae29..925e73b 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
allocating and limiting power to devices.

config CPU_THERMAL
- bool "generic cpu cooling support"
- depends on CPU_FREQ
+ bool "Generic cpu cooling support"
depends on THERMAL_OF
help
+ Enable the CPU cooling features. If the system has no active
+ cooling device available, this option allows to use the CPU
+ as a cooling device.
+
+choice
+ prompt "CPU cooling strategies"
+ depends on CPU_THERMAL
+ default CPU_FREQ_THERMAL
+ help
+ Select the CPU cooling strategy.
+
+config CPU_FREQ_THERMAL
+ bool "CPU frequency cooling strategy"
+ depends on CPU_FREQ
+ help
This implements the generic cpu cooling mechanism through frequency
reduction. An ACPI version of this already exists
(drivers/acpi/processor_thermal.c).
This will be useful for platforms using the generic thermal interface
and not the ACPI interface.

- If you want this support, you should say Y here.
+endchoice

config CLOCK_THERMAL
bool "Generic clock cooling support"
--
2.7.4


2018-01-23 15:36:53

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 6/8] thermal/drivers/cpu_cooling: Add idle cooling device documentation

Provide some documentation for the idle injection cooling effect in
order to let people to understand the rational of the approach for the
idle injection CPU cooling device.

Signed-off-by: Daniel Lezcano <[email protected]>
---
Documentation/thermal/cpu-idle-cooling.txt | 165 +++++++++++++++++++++++++++++
1 file changed, 165 insertions(+)
create mode 100644 Documentation/thermal/cpu-idle-cooling.txt

diff --git a/Documentation/thermal/cpu-idle-cooling.txt b/Documentation/thermal/cpu-idle-cooling.txt
new file mode 100644
index 0000000..29fc651
--- /dev/null
+++ b/Documentation/thermal/cpu-idle-cooling.txt
@@ -0,0 +1,165 @@
+
+Situation:
+----------
+
+Under certain circumstances, the SoC reaches a temperature exceeding
+the allocated power budget or the maximum temperature limit. The
+former must be mitigated to stabilize the SoC temperature around the
+temperature control using the defined cooling devices, the latter is a
+catastrophic situation where a radical decision must be taken to
+reduce the temperature under the critical threshold, that can impact
+the performances.
+
+Another situation is when the silicon reaches a certain temperature
+which continues to increase even if the dynamic leakage is reduced to
+its minimum by clock gating the component. The runaway phenomena will
+continue with the static leakage and only powering down the component,
+thus dropping the dynamic and static leakage will allow the component
+to cool down. This situation is critical.
+
+Last but not least, the system can ask for a specific power budget but
+because of the OPP density, we can only choose an OPP with a power
+budget lower than the requested one and underuse the CPU, thus losing
+performances. In other words, one OPP under uses the CPU with a power
+lesser than the power budget and the next OPP exceed the power budget,
+an intermediate OPP could have been used if it were present.
+
+Solutions:
+----------
+
+If we can remove the static and the dynamic leakage for a specific
+duration in a controlled period, the SoC temperature will
+decrease. Acting at the idle state duration or the idle cycle
+injection period, we can mitigate the temperature by modulating the
+power budget.
+
+The Operating Performance Point (OPP) density has a great influence on
+the control precision of cpufreq, however different vendors have a
+plethora of OPP density, and some have large power gap between OPPs,
+that will result in loss of performance during thermal control and
+loss of power in other scenes.
+
+At a specific OPP, we can assume injecting idle cycle on all CPUs,
+belonging to the same cluster, with a duration greater than the
+cluster idle state target residency, we drop the static and the
+dynamic leakage for this period (modulo the energy needed to enter
+this state). So the sustainable power with idle cycles has a linear
+relation with the OPP’s sustainable power and can be computed with a
+coefficient similar to:
+
+ Power(IdleCycle) = Coef x Power(OPP)
+
+Idle Injection:
+---------------
+
+The base concept of the idle injection is to force the CPU to go to an
+idle state for a specified time each control cycle, it provides
+another way to control CPU power and heat in addition to
+cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
+this cluster can get into the deepest idle state and achieve minimum
+power consumption, but that will also increase system response latency
+if we inject less than cpuidle latency.
+
+ ^
+ |
+ |
+ |------- ------- -------
+ |_______|_____|_______|_____|_______|___________
+
+ <----->
+ idle <---->
+ running
+
+With the fixed idle injection duration, we can give a value which is
+an acceptable performance drop off or latency when we reach a specific
+temperature and we begin to mitigate by varying the Idle injection
+period.
+
+The mitigation begins with a maximum period value which decrease when
+more cooling effect is requested. When the period duration is equal to
+the idle duration, then we are in a situation the platform can’t
+dissipate the heat enough and the mitigation fails. In this case the
+situation is considered critical and there is nothing to do. The idle
+injection duration must be changed by configuration and until we reach
+the cooling effect, otherwise an additionnal cooling device must be
+used or ultimately decrease the SoC performance by dropping the
+highest OPP point of the SoC.
+
+The idle injection duration value must comply with the constraints:
+
+- It is lesser or equal to the latency we tolerate when the mitigation
+ begins. It is platform dependent and will depend on the user
+ experience, reactivity vs performance trade off we want. This value
+ should be specified.
+
+- It is greater than the idle state’s target residency we want to go
+ for thermal mitigation, otherwise we end up consuming more energy.
+
+Minimum period
+--------------
+
+The idle injection duration being fixed, it is obvious the minimum
+period can’t be lesser than that, otherwise we will be scheduling the
+idle injection task right before the idle injection duration is
+complete, so waking up the CPU to put it asleep again.
+
+Maximum period
+--------------
+
+The maximum period is the initial period when the mitigation
+begins. Theoretically when we reach the thermal trip point, we have to
+sustain a specified power for specific temperature but at this time we
+consume:
+
+ Power = Capacitance x Voltage^2 x Frequency x Utilisation
+
+... which is more than the sustainable power (or there is something
+wrong on the system setup). The ‘Capacitance’ and ‘Utilisation’ are a
+fixed value, ‘Voltage’ and the ‘Frequency’ are fixed artificially
+because we don’t want to change the OPP. We can group the
+‘Capacitance’ and the ‘Utilisation’ into a single term which is the
+‘Dynamic Power Coefficient (Cdyn)’ Simplifying the above, we have:
+
+ Pdyn = Cdyn x Voltage^2 x Frequency
+
+The IPA will ask us somehow to reduce our power in order to target the
+sustainable power defined in the device tree. So with the idle
+injection mechanism, we want an average power (Ptarget) resulting on
+an amount of time running at full power on a specific OPP and idle
+another amount of time. That could be put in a equation:
+
+ P(opp)target = ((trunning x (P(opp)running) + (tidle P(opp)idle)) /
+ (trunning + tidle)
+ ...
+
+ tidle = trunning x ((P(opp)running / P(opp)target) - 1)
+
+At this point if we know the running period for the CPU, that gives us
+the idle injection, we need. Alternatively if we have the idle
+injection duration, we can compute the running duration with:
+
+ trunning = tidle / ((P(opp)running / P(opp)target) - 1)
+
+Practically, if the running power is lesses than the targeted power,
+we end up with a negative time value, so obviously the equation usage
+is bound to a power reduction, hence a higher OPP is needed to have
+the running power greater than the targeted power.
+
+However, in this demonstration we ignore three aspects:
+
+ * The static leakage is not defined here, we can introduce it in the
+ equation but assuming it will be zero most of the time as it is
+ difficult to get the values from the SoC vendors
+
+ * The idle state wake up latency (or entry + exit latency) is not
+ taken into account, it must be added in the equation in order to
+ rigorously compute the idle injection
+
+ * The injected idle duration must be greater than the idle state
+ target residency, otherwise we end up consuming more energy and
+ potentially invert the mitigation effect
+
+So the final equation is:
+
+ trunning = (tidle - twakeup ) x
+ (((P(opp)dyn + P(opp)static ) - P(opp)target) / P(opp)target )
--
2.7.4


2018-01-23 15:36:55

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

The cpu cooling device has two strategies to cool down a SoC. The
first one decreases the OPP, the second one forces the cpus to enter a
cluster power down state during an amount of time.

The first cooling device has the benefit to simply decrease the OPP
until the temperature goes below the threshold and then increases the
OPP again. The change of the OPP back and forth allows to keep the cpu
temperature around the specified threshold. Unfortunately, in some
cases, the gap between the OPPs is high and decreasing the OPP makes
the cpu performance unoptimal.

The second cooling device keeps injecting more and more idle cycles
until the temperature stabilizes around the specified threshold. That
is simple and efficient in terms of cooling effect but with the
drawback of increasing the latency.

This new cooling device combines the cooling effect of both the
cpuidle and the cpufreq cooling devices by injecting idle cycle at the
upper OPP boundary of the power interval. When the power of the OPP
minus the idle injection time is equal to the OPP beneath, the cooling
device decreases the OPP and resets the idle injection cycles.

By this way, we can artifically create intermediate OPP by using the
power information found in the DT where we sum the benefit of both
cooling effects without the drawbacks.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/Kconfig | 7 +
drivers/thermal/cpu_cooling.c | 475 +++++++++++++++++++++++++++++++++++++-----
include/linux/cpu_cooling.h | 21 +-
3 files changed, 438 insertions(+), 65 deletions(-)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 4bd4be7..200e1f49 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -176,6 +176,13 @@ config CPU_IDLE_THERMAL
will enter idle synchronously to reach the deepest idle
state.

+config CPU_THERMAL_COMBO
+ bool "CPU idle/freq combo cooling strategy"
+ depends on CPU_IDLE && CPU_FREQ
+ help
+ The cpu combo cooling device combines the cooling effect of the
+ cpufreq and the cpuidle cooling devices.
+
endchoice

config CLOCK_THERMAL
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 916a627..a2459d6 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -8,6 +8,8 @@
*
* Authors: Amit Daniel <[email protected]>
* Viresh Kumar <[email protected]>
+ * Daniel Lezcano <[email protected]>
+ * Kevin WangTao <[email protected]>
*
*/
#undef DEBUG
@@ -36,7 +38,7 @@

#include <uapi/linux/sched/types.h>

-#ifdef CONFIG_CPU_FREQ_THERMAL
+#if defined(CONFIG_CPU_FREQ_THERMAL) || defined (CONFIG_CPU_THERMAL_COMBO)
/*
* Cooling state <-> CPUFreq frequency
*
@@ -441,10 +443,9 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
*
* Return: 0 on success, an error code otherwise.
*/
-static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
- unsigned long state)
+static int __cpufreq_set_cur_state(struct cpufreq_cooling_device *cpufreq_cdev,
+ unsigned long state)
{
- struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
unsigned int clip_freq;

/* Request state should be less than max_level */
@@ -464,6 +465,14 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
return 0;
}

+static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long state)
+{
+ struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+
+ return __cpufreq_set_cur_state(cpufreq_cdev, state);
+}
+
/**
* cpufreq_get_requested_power() - get the current power
* @cdev: &thermal_cooling_device pointer
@@ -666,6 +675,25 @@ static unsigned int find_next_max(struct cpufreq_frequency_table *table,
return max;
}

+#ifdef CONFIG_CPU_FREQ_THERMAL
+static struct thermal_cooling_device *
+__cpufreq_cooling_thermal_register(struct device_node *np, char *dev_name,
+ struct cpufreq_cooling_device *cpufreq_cdev,
+ struct thermal_cooling_device_ops *ops)
+{
+ return thermal_of_cooling_device_register(np, dev_name, cpufreq_cdev,
+ ops);
+}
+#else
+static inline struct thermal_cooling_device *
+__cpufreq_cooling_thermal_register(struct device_node *np, char *dev_name,
+ struct cpufreq_cooling_device *cpufreq_cdev,
+ struct thermal_cooling_device_ops *ops)
+{
+ return NULL;
+}
+#endif
+
/**
* __cpufreq_cooling_register - helper function to create cpufreq cooling device
* @np: a valid struct device_node to the cooling device device tree node
@@ -769,7 +797,7 @@ __cpufreq_cooling_register(struct device_node *np,
cooling_ops = &cpufreq_cooling_ops;
}

- cdev = thermal_of_cooling_device_register(np, dev_name, cpufreq_cdev,
+ cdev = __cpufreq_cooling_thermal_register(np, dev_name, cpufreq_cdev,
cooling_ops);
if (IS_ERR(cdev))
goto remove_ida;
@@ -944,7 +972,7 @@ EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);

#endif /* CPU_FREQ_THERMAL */

-#ifdef CONFIG_CPU_IDLE_THERMAL
+#if defined(CONFIG_CPU_IDLE_THERMAL) || defined(CONFIG_CPU_THERMAL_COMBO)
/*
* The idle duration injection. As we don't have yet a way to specify
* from the DT configuration, let's default to a tick duration.
@@ -1130,6 +1158,60 @@ static int cpuidle_cooling_injection_thread(void *arg)
}

/**
+ * cpuidle_cooling_set_cur_state - Set the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: the target state
+ *
+ * The function checks first if we are initiating the mitigation which
+ * in turn wakes up all the idle injection tasks belonging to the idle
+ * cooling device. In any case, it updates the internal state for the
+ * cooling device.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int
+__cpuidle_cooling_set_cur_state(struct cpuidle_cooling_device *idle_cdev,
+ unsigned long state)
+{
+ unsigned long current_state = idle_cdev->state;
+
+ idle_cdev->state = state;
+
+ if (current_state == 0 && state > 0) {
+ pr_debug("Starting cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ cpuidle_cooling_wakeup(idle_cdev);
+ } else if (current_state > 0 && !state) {
+ pr_debug("Stopping cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ }
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_release - Kref based release helper
+ * @kref: a pointer to the kref structure
+ *
+ * This function is automatically called by the kref_put function when
+ * the idle cooling device refcount reaches zero. At this point, we
+ * have the guarantee the structure is no longer in use and we can
+ * safely release all the ressources.
+ */
+static void __init cpuidle_cooling_release(struct kref *kref)
+{
+ struct cpuidle_cooling_device *idle_cdev =
+ container_of(kref, struct cpuidle_cooling_device, kref);
+
+ thermal_cooling_device_unregister(idle_cdev->cdev);
+ kfree(idle_cdev->waitq);
+ kfree(idle_cdev->tsk);
+ kfree(idle_cdev);
+}
+
+#ifdef CONFIG_CPU_IDLE_THERMAL
+
+/**
* cpuidle_cooling_get_max_state - Get the maximum state
* @cdev : the thermal cooling device
* @state : a pointer to the state variable to be filled
@@ -1178,36 +1260,12 @@ static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
return 0;
}

-/**
- * cpuidle_cooling_set_cur_state - Set the current cooling state
- * @cdev: the thermal cooling device
- * @state: the target state
- *
- * The function checks first if we are initiating the mitigation which
- * in turn wakes up all the idle injection tasks belonging to the idle
- * cooling device. In any case, it updates the internal state for the
- * cooling device.
- *
- * The function can not fail, it returns always zero.
- */
static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
unsigned long state)
{
struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
- unsigned long current_state = idle_cdev->state;
-
- idle_cdev->state = state;
-
- if (current_state == 0 && state > 0) {
- pr_debug("Starting cooling cpus '%*pbl'\n",
- cpumask_pr_args(idle_cdev->cpumask));
- cpuidle_cooling_wakeup(idle_cdev);
- } else if (current_state > 0 && !state) {
- pr_debug("Stopping cooling cpus '%*pbl'\n",
- cpumask_pr_args(idle_cdev->cpumask));
- }

- return 0;
+ return __cpuidle_cooling_set_cur_state(idle_cdev, state);
}

/**
@@ -1219,25 +1277,30 @@ static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
.set_cur_state = cpuidle_cooling_set_cur_state,
};

-/**
- * cpuidle_cooling_release - Kref based release helper
- * @kref: a pointer to the kref structure
- *
- * This function is automatically called by the kref_put function when
- * the idle cooling device refcount reaches zero. At this point, we
- * have the guarantee the structure is no longer in use and we can
- * safely release all the ressources.
- */
-static void __init cpuidle_cooling_release(struct kref *kref)
+static int __cpuidle_cooling_thermal_register(struct device_node *np,
+ struct cpuidle_cooling_device *idle_cdev,
+ char *dev_name)
{
- struct cpuidle_cooling_device *idle_cdev =
- container_of(kref, struct cpuidle_cooling_device, kref);
+ struct thermal_cooling_device *cdev;

- thermal_cooling_device_unregister(idle_cdev->cdev);
- kfree(idle_cdev->waitq);
- kfree(idle_cdev->tsk);
- kfree(idle_cdev);
+ cdev = thermal_of_cooling_device_register(np, dev_name,
+ idle_cdev,
+ &cpuidle_cooling_ops);
+ if (IS_ERR(cdev))
+ return PTR_ERR(cdev);
+
+ idle_cdev->cdev = cdev;
+
+ return 0;
}
+#else
+static inline int __cpuidle_cooling_thermal_register(struct device_node *np,
+ struct cpuidle_cooling_device *idle_cdev,
+ char *dev_name)
+{
+ return 0;
+}
+#endif

/**
* cpuidle_cooling_register - Idle cooling device initialization function
@@ -1256,7 +1319,6 @@ static void __init cpuidle_cooling_release(struct kref *kref)
int cpuidle_cooling_register(void)
{
struct cpuidle_cooling_device *idle_cdev = NULL;
- struct thermal_cooling_device *cdev;
struct task_struct *tsk;
struct device_node *np;
cpumask_t *cpumask;
@@ -1319,15 +1381,10 @@ int cpuidle_cooling_register(void)
* The thermal cooling device name
*/
snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
- cdev = thermal_of_cooling_device_register(np, dev_name,
- idle_cdev,
- &cpuidle_cooling_ops);
- if (IS_ERR(cdev)) {
- ret = PTR_ERR(cdev);
- goto out_fail;
- }

- idle_cdev->cdev = cdev;
+ ret = __cpuidle_cooling_thermal_register(np, idle_cdev, dev_name);
+ if (ret)
+ goto out_fail;

idle_cdev->cpumask = cpumask;

@@ -1397,3 +1454,309 @@ int cpuidle_cooling_register(void)
return ret;
}
#endif
+
+#ifdef CONFIG_CPU_THERMAL_COMBO
+/**
+ * struct cpu_cooling_device - the cpu cooling device
+ * @cpuidle_cdev: a pointer to the instanciated cpuidle cooling device
+ * @cpufreq_cdev: a pointer to the instanciated cpufreq cooling device
+ * @max_power: the maximum power managed by the cooling device
+ * @state: the current cooling device state
+ *
+ * The SoC could have different designs. If the SoC is a single
+ * cluster, we have a single clock line for cpufreq and single cluster
+ * powerdown state. If the SoC is a dual cluster we can have a single
+ * clock line for cpufreq and a cluster power down, hence two cpuidle
+ * cooling device. Alternatively, we can have two clock lines.
+ *
+ * 1 cluster - 1 clock line (eg. db410c): There is one cpuidle cooling
+ * device and one cpufreq cooling device. Consequently, there is one
+ * cpu cooling device where the cpuidle_cdev and the cpufreq_cdev
+ * pointers point to the corresponding cooling device instances.
+ *
+ * 2 clusters - 1 clock line (eg. hi6220) : There are two cpuidle
+ * cooling devices and one cpufreq cooling device. It results in two
+ * cpu cooling devices where the cpuidle_cdev points to the cpuidle
+ * instance and the cpufreq_cdev contains a shared pointer to the
+ * cpufreq cooling device. This configuration makes the power
+ * computation to be ponderated by the number of cpus managed by the
+ * cpuidle cooling device.
+ *
+ * 2 clusters - 2 clock lines (eg. hi3660): There are two cpuidle
+ * cooling devices, two cpufreq cooling devices and two cpu cooling
+ * devices.
+ */
+struct cpu_cooling_device {
+ struct cpuidle_cooling_device *cpuidle_cdev;
+ struct cpufreq_cooling_device *cpufreq_cdev;
+ u32 max_power;
+ int state;
+};
+
+/*
+ * The combo CPU cooling device combines the OPP states and the idle
+ * injection cycles in order to provide an intermediate state where we
+ * meet the power budget but without decreasing the OPP. That allows
+ * to keep a higher OPP while reducing the dissipated power. For
+ * example, using the cpufreq cooling device only, we may have to
+ * downgrade the OPP because the current one dissipates too much power
+ * but by downgrading the OPP, we still have room for more power. So
+ * the perfect match would have be in between these two OPPs.
+ *
+ * For example, let's imagine we have 4 cores ruled by a cpufreq
+ * driver with 2 OPPs consuming respectively 250mW and 500mW per
+ * core. With all CPUs loaded at 100%, at the highest OPP, we have
+ * 2000mW of dissipated power for the cluster. Now the thermal
+ * framework allocates 1500mW of power budget. We can decrease to the
+ * other OPP where we end up consuming 1000mW but we still have room
+ * for 500mw. Alternatively, we can stay at the highest OPP but force
+ * to be idle 25% of the time (2000 - 1500) / 1500.
+ *
+ * By inserting idle cycles at a specific OPP, we can reduce the power
+ * without decreasing the OPP, which results on a better power /
+ * performance trade-off.
+ *
+ * The combo CPU cooling device works in a percentile way, the states
+ * represent the percentage of power we want to save. The combo device
+ * is in charge of setting the state for the idle injection cooling
+ * device and the cpufreq cooling device, as well as sorting out when
+ * to go to a specific OPP or/and insert idle cycles.
+ */
+
+/**
+ * cpu_cooling_get_max_state - Return the maximum number of states
+ * @cdev : the thermal cooling device
+ * @state : a pointer to the state variable to be filled
+ *
+ * The function gives always 100 as a percentage of the maximum power
+ * on the thermal zone.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int cpu_cooling_get_max_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ *state = 100;
+ return 0;
+}
+
+/**
+ * cpu_cooling_power_opp - Find the upper OPP for a specific power
+ * @cpufreq_cdev: the cpufreq cooling device
+ * @num_cpus: the number of cpus managed by the idle cooling device
+ * @power: the requested power
+ *
+ * The function returns the OPP which is the upper limit of the power
+ * interval between two OPPs. It is imposible the requested power is
+ * greater than the maximum power of the cluster.
+ *
+ * Returns an index in the freq_table on success, -EINVAL if the
+ * requested power is invalid (zero or greater than the maximum
+ * cluster power).
+ */
+static int cpu_cooling_power_opp(struct cpufreq_cooling_device *cpufreq_cdev,
+ int num_cpus, u32 power)
+{
+ struct freq_table *freq_table = cpufreq_cdev->freq_table;
+ int i;
+
+ if (!power || power > freq_table[0].power * num_cpus)
+ return -EINVAL;
+
+ for (i = 0; i < cpufreq_cdev->max_level - 1; i++) {
+
+ if (power <= (freq_table[i].power * num_cpus) &&
+ power > (freq_table[i + 1].power * num_cpus))
+ break;
+ }
+
+ return i;
+}
+
+/**
+ * cpu_cooling_set_cur_state - Sets a percentage of the max power
+ * @cdev: the thermal cooling device
+ * @state: the target state representing a ratio
+ *
+ * The function computes the power ratio of the OPP and the
+ * corresponding idle ratio to reach the requested state. The state is
+ * a percentage of the maximum power.
+ *
+ * The function returns zero on success, -EINVAL if the ratio
+ * computation fails any reason, < 0 for the set_cur_state subcalls
+ * failure on the cpuidle / cpufreq cooling devices.
+ */
+static int cpu_cooling_set_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long state)
+{
+ struct cpu_cooling_device *cpu_cdev = cdev->devdata;
+ struct cpuidle_cooling_device *cpuidle_cdev = cpu_cdev->cpuidle_cdev;
+ struct cpufreq_cooling_device *cpufreq_cdev = cpu_cdev->cpufreq_cdev;
+ int num_cpus = cpumask_weight(cpuidle_cdev->cpumask);
+ int opp_index, idle_state, ret;
+ u32 power, opp_power;
+
+ /*
+ * The state gives the percentage of the maximum power on the
+ * thermal zone the cooling device is handling.
+ *
+ * In order to find out which OPP must be selected and the
+ * percentage of idle time to be injected, we must compute
+ * first how much power represents the requested percentage.
+ *
+ * For this we apply a simple ratio:
+ *
+ * requested_power = (max_power * pct) / 100
+ */
+ power = (cpu_cdev->max_power * (100 - state)) / 100;
+
+ /*
+ * The second step is to sort out which OPP it does apply and
+ * how much power it represents. We must convert in a CPU
+ * basis to browse the freq table.
+ *
+ * Pitfall: Don't compare in the function with power /
+ * num_cpus but with opp.power * num_cpus. Otherwise, because
+ * of the rounding effect, we end up with a power lesser than
+ * the opp power and then with a negative value in the idle
+ * ratio computation a few lines below.
+ */
+ opp_index = cpu_cooling_power_opp(cpufreq_cdev, num_cpus, power);
+ if (opp_index < 0)
+ return opp_index;
+
+ /*
+ * The third step is to compute the percentage of idle time
+ * regarding the dissipated power for the selected OPP above.
+ */
+ opp_power = cpufreq_cdev->freq_table[opp_index].power * num_cpus;
+
+ idle_state = ((opp_power - power) * 100) / power;
+
+ /*
+ * Catch unexpected situation where we are out of bound of the
+ * idle state percentage values.
+ */
+ if (WARN_ON_ONCE(idle_state < 0 || idle_state > 100))
+ return -EINVAL;
+
+ /*
+ * Set the cpufreq OPP state
+ */
+ ret = __cpufreq_set_cur_state(cpufreq_cdev, opp_index);
+ if (ret)
+ return ret;
+
+ /*
+ * And inject idle cycles to reduce the power
+ */
+ ret = __cpuidle_cooling_set_cur_state(cpuidle_cdev, idle_state);
+ if (ret)
+ return ret;
+
+ cpu_cdev->state = state;
+
+ return 0;
+}
+
+/**
+ * cpu_cooling_get_cur_state - Gets the percentage of the max power
+ * @cdev : the thermal cooling device
+ * @state : a pointer to the state variable to be filled
+ *
+ * Fill the state pointer variable with the current state of the cpu
+ * cooling device, the value is between 0 and 100 (included).
+ *
+ * The function never fails and returns zero.
+ */
+static int cpu_cooling_get_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ struct cpu_cooling_device *cpu_cdev = cdev->devdata;
+
+ *state = cpu_cdev->state;
+
+ return 0;
+}
+
+/**
+ * cpu_cooling_ops - thermal cooling device ops
+ */
+static struct thermal_cooling_device_ops cpu_cooling_ops = {
+ .get_max_state = cpu_cooling_get_max_state,
+ .get_cur_state = cpu_cooling_get_cur_state,
+ .set_cur_state = cpu_cooling_set_cur_state,
+};
+
+static int __init cpu_cooling_init(void)
+{
+ struct thermal_cooling_device *cdev;
+ struct cpu_cooling_device *cpu_cdev;
+ struct cpuidle_cooling_device *cpuidle_cdev;
+ struct cpufreq_cooling_device *cpufreq_cdev;
+ struct device_node *np;
+ cpumask_t *cpumask;
+ char dev_name[THERMAL_NAME_LENGTH];
+ int cpu, index = 0;
+
+ for_each_possible_cpu(cpu) {
+
+ cpumask = topology_core_cpumask(cpu);
+
+ if (cpu != cpumask_first(cpumask))
+ continue;
+
+ np = of_cpu_device_node_get(cpu);
+
+ cpu_cdev = kzalloc(sizeof(*cpu_cdev), GFP_KERNEL);
+ if (!cpu_cdev)
+ return -ENOMEM;
+
+ list_for_each_entry(cpuidle_cdev,
+ &cpuidle_cdev_list, node) {
+
+ cpumask = cpuidle_cdev->cpumask;
+ if (!cpumask_test_cpu(cpu, cpumask))
+ continue;
+
+ cpu_cdev->cpuidle_cdev = cpuidle_cdev;
+ break;
+ }
+
+ list_for_each_entry(cpufreq_cdev,
+ &cpufreq_cdev_list, node) {
+
+ cpumask = cpufreq_cdev->policy->related_cpus;
+ if (!cpumask_test_cpu(cpu, cpumask))
+ continue;
+
+ cpu_cdev->cpufreq_cdev = cpufreq_cdev;
+ break;
+ }
+
+ if (!cpu_cdev->cpuidle_cdev || !cpu_cdev->cpufreq_cdev) {
+ pr_err("Something is going wrong with the CPU cooling device\n");
+ return -EINVAL;
+ }
+
+ if (!cpufreq_cdev->freq_table[0].power) {
+ pr_err("No power number for the platform\n");
+ return -EINVAL;
+ }
+
+ cpu_cdev->max_power = cpufreq_cdev->freq_table[0].power;
+ cpu_cdev->max_power *= cpumask_weight(cpuidle_cdev->cpumask);
+
+ snprintf(dev_name, sizeof(dev_name),
+ "thermal-cpu-%d", index++);
+ cdev = thermal_of_cooling_device_register(np, dev_name,
+ cpu_cdev,
+ &cpu_cooling_ops);
+ if (IS_ERR(cdev))
+ return PTR_ERR(cdev);
+ }
+
+ return 0;
+}
+late_initcall(cpu_cooling_init);
+#endif
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index 2b5950b..308a914 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -33,7 +33,16 @@ struct cpufreq_policy;
typedef int (*get_static_t)(cpumask_t *cpumask, int interval,
unsigned long voltage, u32 *power);

-#ifdef CONFIG_CPU_THERMAL
+#if defined(CONFIG_CPU_IDLE_THERMAL) || defined(CONFIG_CPU_THERMAL_COMBO)
+extern int cpuidle_cooling_register(void);
+#else
+static inline int cpuidle_cooling_register(void)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_CPU_FREQ_THERMAL) || defined(CONFIG_CPU_THERMAL_COMBO)
/**
* cpufreq_cooling_register - function to create cpufreq cooling device.
* @policy: cpufreq policy.
@@ -45,7 +54,6 @@ struct thermal_cooling_device *
cpufreq_power_cooling_register(struct cpufreq_policy *policy,
u32 capacitance, get_static_t plat_static_func);

-extern int cpuidle_cooling_register(void);
/**
* of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
* @np: a valid struct device_node to the cooling device device tree node.
@@ -85,7 +93,7 @@ of_cpufreq_power_cooling_register(struct device_node *np,
*/
void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev);

-#else /* !CONFIG_CPU_THERMAL */
+#else /* !CONFIG_CPU_FREQ_THERMAL */
static inline struct thermal_cooling_device *
cpufreq_cooling_register(struct cpufreq_policy *policy)
{
@@ -119,11 +127,6 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
{
return;
}
-
-static inline int cpuidle_cooling_register(void)
-{
- return 0;
-}
-#endif /* CONFIG_CPU_THERMAL */
+#endif /* CONFIG_CPU_FREQ_THERMAL */

#endif /* __CPU_COOLING_H__ */
--
2.7.4


2018-01-23 15:37:52

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 7/8] cpuidle/drivers/cpuidle-arm: Register the cooling device

Register the ARM generic cpuidle driver as a cooling device.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/cpuidle/cpuidle-arm.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
index ddee1b6..c100915 100644
--- a/drivers/cpuidle/cpuidle-arm.c
+++ b/drivers/cpuidle/cpuidle-arm.c
@@ -11,6 +11,7 @@

#define pr_fmt(fmt) "CPUidle arm: " fmt

+#include <linux/cpu_cooling.h>
#include <linux/cpuidle.h>
#include <linux/cpumask.h>
#include <linux/cpu_pm.h>
@@ -172,6 +173,10 @@ static int __init arm_idle_init(void)
goto out_fail;
}

+ ret = cpuidle_cooling_register();
+ if (ret)
+ pr_warn("Failed to register the cpuidle cooling device\n");
+
return 0;

out_fail:
--
2.7.4


2018-01-23 15:38:38

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 2/8] thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)

For license auditing purpose, let's add the SPDX tag.

Cc: Philippe Ombredanne <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/cpu_cooling.c | 16 +---------------
1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 988fc55..e62be75 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* linux/drivers/thermal/cpu_cooling.c
*
@@ -8,21 +9,6 @@
* Authors: Amit Daniel <[email protected]>
* Viresh Kumar <[email protected]>
*
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; version 2 of the License.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write to the Free Software Foundation, Inc.,
- * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
#include <linux/module.h>
#include <linux/thermal.h>
--
2.7.4


2018-01-23 15:39:06

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 3/8] thermal/drivers/cpu_cooling: Remove pointless field

The structure cpufreq_cooling_device provides a backpointer to the thermal
device but this one is used for a trace and to unregister. For the trace,
we don't really need this field and the unregister function as the same
pointer passed as parameter. Remove it.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/cpu_cooling.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index e62be75..d05bb73 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -88,7 +88,6 @@ struct cpufreq_cooling_device {
unsigned int clipped_freq;
unsigned int max_level;
struct freq_table *freq_table; /* In descending order */
- struct thermal_cooling_device *cdev;
struct cpufreq_policy *policy;
struct list_head node;
struct time_in_idle *idle_time;
@@ -197,8 +196,7 @@ static int update_freq_table(struct cpufreq_cooling_device *cpufreq_cdev,

dev = get_cpu_device(cpu);
if (unlikely(!dev)) {
- dev_warn(&cpufreq_cdev->cdev->device,
- "No cpu device for cpu %d\n", cpu);
+ pr_warn("No cpu device for cpu %d\n", cpu);
return -ENODEV;
}

@@ -762,7 +760,6 @@ __cpufreq_cooling_register(struct device_node *np,
goto remove_ida;

cpufreq_cdev->clipped_freq = cpufreq_cdev->freq_table[0].frequency;
- cpufreq_cdev->cdev = cdev;

mutex_lock(&cooling_list_lock);
/* Register the notifier for first cpufreq cooling device */
@@ -922,7 +919,7 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
cpufreq_unregister_notifier(&thermal_cpufreq_notifier_block,
CPUFREQ_POLICY_NOTIFIER);

- thermal_cooling_device_unregister(cpufreq_cdev->cdev);
+ thermal_cooling_device_unregister(cdev);
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
kfree(cpufreq_cdev->idle_time);
kfree(cpufreq_cdev->freq_table);
--
2.7.4


2018-01-23 15:39:19

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

The cpu idle cooling driver performs synchronized idle injection across all
cpus belonging to the same cluster and offers a new method to cool down a SoC.

Each cluster has its own idle cooling device, each core has its own idle
injection thread, each idle injection thread uses play_idle to enter idle. In
order to reach the deepest idle state, each cooling device has the idle
injection threads synchronized together.

It has some similarity with the intel power clamp driver but it is actually
designed to work on the ARM architecture via the DT with a mathematical proof
with the power model which comes with the Documentation.

The idle injection cycle is fixed while the running cycle is variable. That
allows to have control on the device reactivity for the user experience. At
the mitigation point the idle threads are unparked, they play idle the
specified amount of time and they schedule themselves. The last thread sets
the next idle injection deadline and when the timer expires it wakes up all
the threads which in turn play idle again. Meanwhile the running cycle is
changed by set_cur_state. When the mitigation ends, the threads are parked.
The algorithm is self adaptive, so there is no need to handle hotplugging.

If we take an example of the balanced point, we can use the DT for the hi6220.

The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
running at full blast at the maximum OPP consumes 5280mW. The first value is
given in the DT, the second is calculated from the OPP with the formula:

Pdyn = Cdyn x Voltage^2 x Frequency

As the SoC vendors don't want to share the static leakage values, we assume
it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.

In order to reduce the power to 3326mW, we have to apply a ratio to the
running time.

ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0,5874

We know the idle cycle which is fixed, let's assume 10ms. However from this
duration we have to substract the wake up latency for the cluster idle state.
In our case, it is 1.5ms. So for a 10ms latency for idle, we are really idle
8.5ms.

As we know the idle duration and the ratio, we can compute the running cycle.

running_cycle = 8.5 / 0.5874 = 14.47ms

So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
SoC to the balanced trip point of 75°C.

The driver has been tested on the hi6220 and it appears the temperature
stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and
running cycle of 14ms as expected by the theory above.

Signed-off-by: Kevin WangTao <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/Kconfig | 10 +
drivers/thermal/cpu_cooling.c | 471 ++++++++++++++++++++++++++++++++++++++++++
include/linux/cpu_cooling.h | 6 +
3 files changed, 487 insertions(+)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 925e73b..4bd4be7 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
This will be useful for platforms using the generic thermal interface
and not the ACPI interface.

+config CPU_IDLE_THERMAL
+ bool "CPU idle cooling strategy"
+ depends on CPU_IDLE
+ help
+ This implements the generic CPU cooling mechanism through
+ idle injection. This will throttle the CPU by injecting
+ fixed idle cycle. All CPUs belonging to the same cluster
+ will enter idle synchronously to reach the deepest idle
+ state.
+
endchoice

config CLOCK_THERMAL
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index d05bb73..916a627 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -10,18 +10,33 @@
* Viresh Kumar <[email protected]>
*
*/
+#undef DEBUG
+#define pr_fmt(fmt) "CPU cooling: " fmt
+
#include <linux/module.h>
#include <linux/thermal.h>
#include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
#include <linux/err.h>
+#include <linux/freezer.h>
#include <linux/idr.h>
+#include <linux/kthread.h>
#include <linux/pm_opp.h>
#include <linux/slab.h>
+#include <linux/sched/prio.h>
+#include <linux/sched/rt.h>
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
+#include <linux/wait.h>
+
+#include <linux/platform_device.h>
+#include <linux/of_platform.h>

#include <trace/events/thermal.h>

+#include <uapi/linux/sched/types.h>
+
+#ifdef CONFIG_CPU_FREQ_THERMAL
/*
* Cooling state <-> CPUFreq frequency
*
@@ -926,3 +941,459 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
kfree(cpufreq_cdev);
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
+
+#endif /* CPU_FREQ_THERMAL */
+
+#ifdef CONFIG_CPU_IDLE_THERMAL
+/*
+ * The idle duration injection. As we don't have yet a way to specify
+ * from the DT configuration, let's default to a tick duration.
+ */
+#define DEFAULT_IDLE_TIME_US TICK_USEC
+
+/**
+ * struct cpuidle_cooling_device - data for the idle cooling device
+ * @cdev: a pointer to a struct thermal_cooling_device
+ * @tsk: an array of pointer to the idle injection tasks
+ * @cpumask: a cpumask containing the CPU managed by the cooling device
+ * @timer: a hrtimer giving the tempo for the idle injection cycles
+ * @kref: a kernel refcount on this structure
+ * @waitq: the waiq for the idle injection tasks
+ * @count: an atomic to keep track of the last task exiting the idle cycle
+ * @idle_cycle: an integer defining the duration of the idle injection
+ * @state: an normalized integer giving the state of the cooling device
+ */
+struct cpuidle_cooling_device {
+ struct thermal_cooling_device *cdev;
+ struct task_struct **tsk;
+ struct cpumask *cpumask;
+ struct list_head node;
+ struct hrtimer timer;
+ struct kref kref;
+ wait_queue_head_t *waitq;
+ atomic_t count;
+ unsigned int idle_cycle;
+ unsigned int state;
+};
+
+static LIST_HEAD(cpuidle_cdev_list);
+
+/**
+ * cpuidle_cooling_wakeup - Wake up all idle injection threads
+ * @idle_cdev: the idle cooling device
+ *
+ * Every idle injection task belonging to the idle cooling device and
+ * running on an online cpu will be wake up by this call.
+ */
+static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
+{
+ int cpu;
+ int weight = cpumask_weight(idle_cdev->cpumask);
+
+ for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask)
+ wake_up_process(idle_cdev->tsk[cpu % weight]);
+}
+
+/**
+ * cpuidle_cooling_wakeup_fn - Running cycle timer callback
+ * @timer: a hrtimer structure
+ *
+ * When the mitigation is acting, the CPU is allowed to run an amount
+ * of time, then the idle injection happens for the specified delay
+ * and the idle task injection schedules itself until the timer event
+ * wakes the idle injection tasks again for a new idle injection
+ * cycle. The time between the end of the idle injection and the timer
+ * expiration is the allocated running time for the CPU.
+ *
+ * Returns always HRTIMER_NORESTART
+ */
+static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
+{
+ struct cpuidle_cooling_device *idle_cdev =
+ container_of(timer, struct cpuidle_cooling_device, timer);
+
+ cpuidle_cooling_wakeup(idle_cdev);
+
+ return HRTIMER_NORESTART;
+}
+
+/**
+ * cpuidle_cooling_runtime - Running time computation
+ * @idle_cdev: the idle cooling device
+ *
+ * The running duration is computed from the idle injection duration
+ * which is fixed. If we reach 100% of idle injection ratio, that
+ * means the running duration is zero. If we have a 50% ratio
+ * injection, that means we have equal duration for idle and for
+ * running duration.
+ *
+ * The formula is deduced as the following:
+ *
+ * running = idle x ((100 / ratio) - 1)
+ *
+ * For precision purpose for integer math, we use the following:
+ *
+ * running = (idle x 100) / ratio - idle
+ *
+ * For example, if we have an injected duration of 50%, then we end up
+ * with 10ms of idle injection and 10ms of running duration.
+ *
+ * Returns a s64 nanosecond based
+ */
+static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
+{
+ s64 next_wakeup;
+ int state = idle_cdev->state;
+
+ /*
+ * The function must never be called when there is no
+ * mitigation because:
+ * - that does not make sense
+ * - we end up with a division by zero
+ */
+ BUG_ON(!state);
+
+ next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
+ idle_cdev->idle_cycle;
+
+ return next_wakeup * NSEC_PER_USEC;
+}
+
+/**
+ * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
+ * @arg: a void pointer containing the idle cooling device address
+ *
+ * This main function does basically two operations:
+ *
+ * - Goes idle for a specific amount of time
+ *
+ * - Sets a timer to wake up all the idle injection threads after a
+ * running period
+ *
+ * That happens only when the mitigation is enabled, otherwise the
+ * task is scheduled out.
+ *
+ * In order to keep the tasks synchronized together, it is the last
+ * task exiting the idle period which is in charge of setting the
+ * timer.
+ *
+ * This function never returns.
+ */
+static int cpuidle_cooling_injection_thread(void *arg)
+{
+ struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
+ struct cpuidle_cooling_device *idle_cdev = arg;
+ int index = smp_processor_id() % cpumask_weight(idle_cdev->cpumask);
+ DEFINE_WAIT(wait);
+
+ set_freezable();
+
+ sched_setscheduler(current, SCHED_FIFO, &param);
+
+ while (1) {
+
+ s64 next_wakeup;
+
+ prepare_to_wait(&idle_cdev->waitq[index],
+ &wait, TASK_INTERRUPTIBLE);
+
+ schedule();
+
+ atomic_inc(&idle_cdev->count);
+
+ play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
+
+ /*
+ * The last CPU waking up is in charge of setting the
+ * timer. If the CPU is hotplugged, the timer will
+ * move to another CPU (which may not belong to the
+ * same cluster) but that is not a problem as the
+ * timer will be set again by another CPU belonging to
+ * the cluster, so this mechanism is self adaptive and
+ * does not require any hotplugging dance.
+ */
+ if (!atomic_dec_and_test(&idle_cdev->count))
+ continue;
+
+ if (!idle_cdev->state)
+ continue;
+
+ next_wakeup = cpuidle_cooling_runtime(idle_cdev);
+
+ hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
+ HRTIMER_MODE_REL_PINNED);
+ }
+
+ finish_wait(&idle_cdev->waitq[index], &wait);
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_get_max_state - Get the maximum state
+ * @cdev : the thermal cooling device
+ * @state : a pointer to the state variable to be filled
+ *
+ * The function gives always 100 as the injection ratio is percentile
+ * based for consistency accros different platforms.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ /*
+ * Depending on the configuration or the hardware, the running
+ * cycle and the idle cycle could be different. We want unify
+ * that to an 0..100 interval, so the set state interface will
+ * be the same whatever the platform is.
+ *
+ * The state 100% will make the cluster 100% ... idle. A 0%
+ * injection ratio means no idle injection at all and 50%
+ * means for 10ms of idle injection, we have 10ms of running
+ * time.
+ */
+ *state = 100;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_get_cur_state - Get the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: a pointer to the state
+ *
+ * The function just copy the state value from the private thermal
+ * cooling device structure, the mapping is 1 <-> 1.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+
+ *state = idle_cdev->state;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_set_cur_state - Set the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: the target state
+ *
+ * The function checks first if we are initiating the mitigation which
+ * in turn wakes up all the idle injection tasks belonging to the idle
+ * cooling device. In any case, it updates the internal state for the
+ * cooling device.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+ unsigned long current_state = idle_cdev->state;
+
+ idle_cdev->state = state;
+
+ if (current_state == 0 && state > 0) {
+ pr_debug("Starting cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ cpuidle_cooling_wakeup(idle_cdev);
+ } else if (current_state > 0 && !state) {
+ pr_debug("Stopping cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ }
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_ops - thermal cooling device ops
+ */
+static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
+ .get_max_state = cpuidle_cooling_get_max_state,
+ .get_cur_state = cpuidle_cooling_get_cur_state,
+ .set_cur_state = cpuidle_cooling_set_cur_state,
+};
+
+/**
+ * cpuidle_cooling_release - Kref based release helper
+ * @kref: a pointer to the kref structure
+ *
+ * This function is automatically called by the kref_put function when
+ * the idle cooling device refcount reaches zero. At this point, we
+ * have the guarantee the structure is no longer in use and we can
+ * safely release all the ressources.
+ */
+static void __init cpuidle_cooling_release(struct kref *kref)
+{
+ struct cpuidle_cooling_device *idle_cdev =
+ container_of(kref, struct cpuidle_cooling_device, kref);
+
+ thermal_cooling_device_unregister(idle_cdev->cdev);
+ kfree(idle_cdev->waitq);
+ kfree(idle_cdev->tsk);
+ kfree(idle_cdev);
+}
+
+/**
+ * cpuidle_cooling_register - Idle cooling device initialization function
+ *
+ * This function is in charge of creating a cooling device per cluster
+ * and register it to thermal framework. For this we rely on the
+ * topology as there is nothing yet describing better the idle state
+ * power domains.
+ *
+ * For each first CPU of the cluster's cpumask, we allocate the idle
+ * cooling device, initialize the general fields and then we initialze
+ * the rest in a per cpu basis.
+ *
+ * Returns zero on success, < 0 otherwise.
+ */
+int cpuidle_cooling_register(void)
+{
+ struct cpuidle_cooling_device *idle_cdev = NULL;
+ struct thermal_cooling_device *cdev;
+ struct task_struct *tsk;
+ struct device_node *np;
+ cpumask_t *cpumask;
+ char dev_name[THERMAL_NAME_LENGTH];
+ int weight;
+ int ret = -ENOMEM, cpu;
+ int index = 0;
+
+ for_each_possible_cpu(cpu) {
+
+ cpumask = topology_core_cpumask(cpu);
+ weight = cpumask_weight(cpumask);
+
+ /*
+ * This condition makes the first cpu belonging to the
+ * cluster to create a cooling device and allocates
+ * the structure. Others CPUs belonging to the same
+ * cluster will just increment the refcount on the
+ * cooling device structure and initialize it.
+ */
+ if (cpu == cpumask_first(cpumask)) {
+
+ np = of_cpu_device_node_get(cpu);
+
+ idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
+ if (!idle_cdev)
+ goto out_fail;
+
+ idle_cdev->tsk = kzalloc(sizeof(*idle_cdev->tsk) *
+ weight, GFP_KERNEL);
+ if (!idle_cdev->tsk)
+ goto out_fail;
+
+ idle_cdev->waitq = kzalloc(sizeof(*idle_cdev->waitq) *
+ weight, GFP_KERNEL);
+ if (!idle_cdev->waitq)
+ goto out_fail;
+
+ idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
+
+ atomic_set(&idle_cdev->count, 0);
+
+ kref_init(&idle_cdev->kref);
+
+ /*
+ * Initialize the timer to wakeup all the idle
+ * injection tasks
+ */
+ hrtimer_init(&idle_cdev->timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+ /*
+ * The wakeup function callback which is in
+ * charge of waking up all CPUs belonging to
+ * the same cluster
+ */
+ idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
+
+ /*
+ * The thermal cooling device name
+ */
+ snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
+ cdev = thermal_of_cooling_device_register(np, dev_name,
+ idle_cdev,
+ &cpuidle_cooling_ops);
+ if (IS_ERR(cdev)) {
+ ret = PTR_ERR(cdev);
+ goto out_fail;
+ }
+
+ idle_cdev->cdev = cdev;
+
+ idle_cdev->cpumask = cpumask;
+
+ list_add(&idle_cdev->node, &cpuidle_cdev_list);
+
+ pr_info("Created idle cooling device for cpus '%*pbl'\n",
+ cpumask_pr_args(cpumask));
+ }
+
+ kref_get(&idle_cdev->kref);
+
+ /*
+ * Each cooling device is per package. Each package
+ * has a set of cpus where the physical number is
+ * duplicate in the kernel namespace. We need a way to
+ * address the waitq[] and tsk[] arrays with index
+ * which are not Linux cpu numbered.
+ *
+ * One solution is to use the
+ * topology_core_id(cpu). Other solution is to use the
+ * modulo.
+ *
+ * eg. 2 x cluster - 4 cores.
+ *
+ * Physical numbering -> Linux numbering -> % nr_cpus
+ *
+ * Pkg0 - Cpu0 -> 0 -> 0
+ * Pkg0 - Cpu1 -> 1 -> 1
+ * Pkg0 - Cpu2 -> 2 -> 2
+ * Pkg0 - Cpu3 -> 3 -> 3
+ *
+ * Pkg1 - Cpu0 -> 4 -> 0
+ * Pkg1 - Cpu1 -> 5 -> 1
+ * Pkg1 - Cpu2 -> 6 -> 2
+ * Pkg1 - Cpu3 -> 7 -> 3
+ */
+ init_waitqueue_head(&idle_cdev->waitq[cpu % weight]);
+
+ tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
+ idle_cdev, cpu, "kidle_inject/%u");
+ if (IS_ERR(tsk)) {
+ ret = PTR_ERR(tsk);
+ goto out_fail;
+ }
+
+ idle_cdev->tsk[cpu % weight] = tsk;
+
+ wake_up_process(tsk);
+ }
+
+ return 0;
+
+out_fail:
+ list_for_each_entry(idle_cdev, &cpuidle_cdev_list, node) {
+
+ for_each_cpu(cpu, idle_cdev->cpumask) {
+
+ if (idle_cdev->tsk[cpu])
+ kthread_stop(idle_cdev->tsk[cpu]);
+
+ kref_put(&idle_cdev->kref, cpuidle_cooling_release);
+ }
+ }
+
+ pr_err("Failed to create idle cooling device (%d)\n", ret);
+
+ return ret;
+}
+#endif
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index d4292eb..2b5950b 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -45,6 +45,7 @@ struct thermal_cooling_device *
cpufreq_power_cooling_register(struct cpufreq_policy *policy,
u32 capacitance, get_static_t plat_static_func);

+extern int cpuidle_cooling_register(void);
/**
* of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
* @np: a valid struct device_node to the cooling device device tree node.
@@ -118,6 +119,11 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
{
return;
}
+
+static inline int cpuidle_cooling_register(void)
+{
+ return 0;
+}
#endif /* CONFIG_CPU_THERMAL */

#endif /* __CPU_COOLING_H__ */
--
2.7.4


2018-01-23 15:40:34

by Daniel Lezcano

[permalink] [raw]
Subject: [PATCH 1/8] thermal/drivers/cpu_cooling: Fixup the header and copyright

The copyright format does not conform to the format requested by
Linaro: https://wiki.linaro.org/Copyright

Fix it.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/cpu_cooling.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index dc63aba..988fc55 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -2,9 +2,11 @@
* linux/drivers/thermal/cpu_cooling.c
*
* Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
- * Copyright (C) 2012 Amit Daniel <[email protected]>
*
- * Copyright (C) 2014 Viresh Kumar <[email protected]>
+ * Copyright (C) 2018 Linaro Limited.
+ *
+ * Authors: Amit Daniel <[email protected]>
+ * Viresh Kumar <[email protected]>
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* This program is free software; you can redistribute it and/or modify
--
2.7.4


2018-01-24 16:40:58

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On Tue, Jan 23, 2018 at 04:34:27PM +0100, Daniel Lezcano wrote:
> The next changes will add new way to cool down a CPU. In order to
> sanitize and make the overall cpu cooling code consistent and robust
> we must prevent the cpu cooling devices to co-exists with the same
> purpose at the same time in the kernel.
>
> Make the CPU cooling device a choice in the Kconfig, so only one CPU
> cooling strategy can be chosen.

I puzzled by the role of Kconfig here.

IIUC a distro kernel will always be forced to select the combo strategy
otherwise it will never be able to cool systems that don't have cpufreq
(I hope the combo strategy treats such system as a special case with
only one OPP).

This raises the question what the other options (cpufreq-only
idle-injection-only) are for? Are they just for petrol heads who want to
save a few bytes of code or is idle-injection undesirable for some
users.

If the later, how can a distro kernel mitigate the undesired effects
whilst still selecting the combo strategy.


Daniel.


>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/Kconfig | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 315ae29..925e73b 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
> allocating and limiting power to devices.
>
> config CPU_THERMAL
> - bool "generic cpu cooling support"
> - depends on CPU_FREQ
> + bool "Generic cpu cooling support"
> depends on THERMAL_OF
> help
> + Enable the CPU cooling features. If the system has no active
> + cooling device available, this option allows to use the CPU
> + as a cooling device.
> +
> +choice
> + prompt "CPU cooling strategies"
> + depends on CPU_THERMAL
> + default CPU_FREQ_THERMAL
> + help
> + Select the CPU cooling strategy.
> +
> +config CPU_FREQ_THERMAL
> + bool "CPU frequency cooling strategy"
> + depends on CPU_FREQ
> + help
> This implements the generic cpu cooling mechanism through frequency
> reduction. An ACPI version of this already exists
> (drivers/acpi/processor_thermal.c).
> This will be useful for platforms using the generic thermal interface
> and not the ACPI interface.
>
> - If you want this support, you should say Y here.

... this may not be great advice... if you think you want this support
then you *probably* actually want the comboappears to be terrible advice.

This cooling strategies should only be selected by petrolheads making a
device specific *and* are obsessive about code size.

The
> +endchoice
>
> config CLOCK_THERMAL
> bool "Generic clock cooling support"
> --
> 2.7.4
>

2018-01-24 17:00:33

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 24/01/2018 17:34, Daniel Thompson wrote:
> On Tue, Jan 23, 2018 at 04:34:27PM +0100, Daniel Lezcano wrote:
>> The next changes will add new way to cool down a CPU. In order to
>> sanitize and make the overall cpu cooling code consistent and robust
>> we must prevent the cpu cooling devices to co-exists with the same
>> purpose at the same time in the kernel.
>>
>> Make the CPU cooling device a choice in the Kconfig, so only one CPU
>> cooling strategy can be chosen.
>
> I puzzled by the role of Kconfig here.
>
> IIUC a distro kernel will always be forced to select the combo strategy
> otherwise it will never be able to cool systems that don't have cpufreq
> (I hope the combo strategy treats such system as a special case with
> only one OPP).

Actually it does not make sense to select the combo if there is no
cpufreq support. The cpuidle cooling device must be used instead.

> This raises the question what the other options (cpufreq-only
> idle-injection-only) are for? Are they just for petrol heads who want to
> save a few bytes of code or is idle-injection undesirable for some
> users.

The combo cooling device must be used on a system with a proper support
for cpuidle and cpufreq and with the power numbers specified in the DT.

By proper support of cpuidle, I mean a cluster power down idle state,
fast enough. This idle state allows to drop the dynamic power *and* the
static leakage (the latter to prevent a thermal runaway).

If the system does not have power numbers, no (or bad) cpuidle, the
combo cooling device must not be used. If there is no cpufreq support,
the cpuidle cooling must be used and if there is no proper support for
both, the CPU cooling can't be used. In this case, you have to put a fan
on your board or reduce the frequency where the system stays in its
thermal envelope.

> If the later, how can a distro kernel mitigate the undesired effects
> whilst still selecting the combo strategy.

I'm not sure to understand the question. Distros always use the make
allmodconfig, so that chooses the cpufreq CPU cooling device which was
the case before without this change.

However, we are talking about distros here but the CPU cooling mechanism
is for mobile and in this case the kernel (usually Android based) comes
with a specific configuration file and this is where the SoC vendor has
to choose the right strategy.


>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>> drivers/thermal/Kconfig | 20 +++++++++++++++++---
>> 1 file changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 315ae29..925e73b 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
>> allocating and limiting power to devices.
>>
>> config CPU_THERMAL
>> - bool "generic cpu cooling support"
>> - depends on CPU_FREQ
>> + bool "Generic cpu cooling support"
>> depends on THERMAL_OF
>> help
>> + Enable the CPU cooling features. If the system has no active
>> + cooling device available, this option allows to use the CPU
>> + as a cooling device.
>> +
>> +choice
>> + prompt "CPU cooling strategies"
>> + depends on CPU_THERMAL
>> + default CPU_FREQ_THERMAL
>> + help
>> + Select the CPU cooling strategy.
>> +
>> +config CPU_FREQ_THERMAL
>> + bool "CPU frequency cooling strategy"
>> + depends on CPU_FREQ
>> + help
>> This implements the generic cpu cooling mechanism through frequency
>> reduction. An ACPI version of this already exists
>> (drivers/acpi/processor_thermal.c).
>> This will be useful for platforms using the generic thermal interface
>> and not the ACPI interface.
>>
>> - If you want this support, you should say Y here.
>
> ... this may not be great advice... if you think you want this support
> then you *probably* actually want the comboappears to be terrible advice.
>
> This cooling strategies should only be selected by petrolheads making a
> device specific *and* are obsessive about code size.
>
> The
>> +endchoice
>>
>> config CLOCK_THERMAL
>> bool "Generic clock cooling support"
>> --
>> 2.7.4
>>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-01-25 10:58:08

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On Wed, Jan 24, 2018 at 05:59:09PM +0100, Daniel Lezcano wrote:
> On 24/01/2018 17:34, Daniel Thompson wrote:
> > On Tue, Jan 23, 2018 at 04:34:27PM +0100, Daniel Lezcano wrote:
> >> The next changes will add new way to cool down a CPU. In order to
> >> sanitize and make the overall cpu cooling code consistent and robust
> >> we must prevent the cpu cooling devices to co-exists with the same
> >> purpose at the same time in the kernel.
> >>
> >> Make the CPU cooling device a choice in the Kconfig, so only one CPU
> >> cooling strategy can be chosen.
> >
> > I puzzled by the role of Kconfig here.
> >
> > IIUC a distro kernel will always be forced to select the combo strategy
> > otherwise it will never be able to cool systems that don't have cpufreq
> > (I hope the combo strategy treats such system as a special case with
> > only one OPP).
>
> Actually it does not make sense to select the combo if there is no
> cpufreq support. The cpuidle cooling device must be used instead.

Well, I said before what I hoped. This is what I feared! ;-)


> > This raises the question what the other options (cpufreq-only
> > idle-injection-only) are for? Are they just for petrol heads who want to
> > save a few bytes of code or is idle-injection undesirable for some
> > users.
>
> The combo cooling device must be used on a system with a proper support
> for cpuidle and cpufreq and with the power numbers specified in the DT.
>
> By proper support of cpuidle, I mean a cluster power down idle state,
> fast enough. This idle state allows to drop the dynamic power *and* the
> static leakage (the latter to prevent a thermal runaway).
>
> If the system does not have power numbers, no (or bad) cpuidle, the
> combo cooling device must not be used. If there is no cpufreq support,
> the cpuidle cooling must be used and if there is no proper support for
> both, the CPU cooling can't be used. In this case, you have to put a fan
> on your board or reduce the frequency where the system stays in its
> thermal envelope.

How can we know (in the general case) what is going to be in the DT at
compile time?


> > If the later, how can a distro kernel mitigate the undesired effects
> > whilst still selecting the combo strategy.
>
> I'm not sure to understand the question. Distros always use the make
> allmodconfig, so that chooses the cpufreq CPU cooling device which was
> the case before without this change.

So there's no regression. That's nice but doesn't that mean distros
cannot exploit the new features.


> However, we are talking about distros here but the CPU cooling mechanism
> is for mobile and in this case the kernel (usually Android based) comes
> with a specific configuration file and this is where the SoC vendor has
> to choose the right strategy.

I agree its hard to conceive of a modern Android device that doesn't implement
both the needed features but the performance figures in the covering
letter come from Hikey (and they look pretty good) and Hikey is
supported by a good number of regular Linux distros now.

Using Kconfig to force what should be runtime decisions to happen at
compile time means that Hikey becomes an example of a platform that
is unable to run at max performance on the distros that have added
support for it.


Daniel.


> >> Signed-off-by: Daniel Lezcano <[email protected]>
> >> ---
> >> drivers/thermal/Kconfig | 20 +++++++++++++++++---
> >> 1 file changed, 17 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> >> index 315ae29..925e73b 100644
> >> --- a/drivers/thermal/Kconfig
> >> +++ b/drivers/thermal/Kconfig
> >> @@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
> >> allocating and limiting power to devices.
> >>
> >> config CPU_THERMAL
> >> - bool "generic cpu cooling support"
> >> - depends on CPU_FREQ
> >> + bool "Generic cpu cooling support"
> >> depends on THERMAL_OF
> >> help
> >> + Enable the CPU cooling features. If the system has no active
> >> + cooling device available, this option allows to use the CPU
> >> + as a cooling device.
> >> +
> >> +choice
> >> + prompt "CPU cooling strategies"
> >> + depends on CPU_THERMAL
> >> + default CPU_FREQ_THERMAL
> >> + help
> >> + Select the CPU cooling strategy.
> >> +
> >> +config CPU_FREQ_THERMAL
> >> + bool "CPU frequency cooling strategy"
> >> + depends on CPU_FREQ
> >> + help
> >> This implements the generic cpu cooling mechanism through frequency
> >> reduction. An ACPI version of this already exists
> >> (drivers/acpi/processor_thermal.c).
> >> This will be useful for platforms using the generic thermal interface
> >> and not the ACPI interface.
> >>
> >> - If you want this support, you should say Y here.
> >
> > ... this may not be great advice... if you think you want this support
> > then you *probably* actually want the comboappears to be terrible advice.
> >
> > This cooling strategies should only be selected by petrolheads making a
> > device specific *and* are obsessive about code size.
> >
> > The
> >> +endchoice
> >>
> >> config CLOCK_THERMAL
> >> bool "Generic clock cooling support"
> >> --
> >> 2.7.4
> >>
>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2018-01-25 13:37:44

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 25/01/2018 11:57, Daniel Thompson wrote:
> On Wed, Jan 24, 2018 at 05:59:09PM +0100, Daniel Lezcano wrote:
>> On 24/01/2018 17:34, Daniel Thompson wrote:
>>> On Tue, Jan 23, 2018 at 04:34:27PM +0100, Daniel Lezcano wrote:
>>>> The next changes will add new way to cool down a CPU. In order to
>>>> sanitize and make the overall cpu cooling code consistent and robust
>>>> we must prevent the cpu cooling devices to co-exists with the same
>>>> purpose at the same time in the kernel.
>>>>
>>>> Make the CPU cooling device a choice in the Kconfig, so only one CPU
>>>> cooling strategy can be chosen.
>>>
>>> I puzzled by the role of Kconfig here.
>>>
>>> IIUC a distro kernel will always be forced to select the combo strategy
>>> otherwise it will never be able to cool systems that don't have cpufreq
>>> (I hope the combo strategy treats such system as a special case with
>>> only one OPP).
>>
>> Actually it does not make sense to select the combo if there is no
>> cpufreq support. The cpuidle cooling device must be used instead.
>
> Well, I said before what I hoped. This is what I feared! ;-)
>
>
>>> This raises the question what the other options (cpufreq-only
>>> idle-injection-only) are for? Are they just for petrol heads who want to
>>> save a few bytes of code or is idle-injection undesirable for some
>>> users.
>>
>> The combo cooling device must be used on a system with a proper support
>> for cpuidle and cpufreq and with the power numbers specified in the DT.
>>
>> By proper support of cpuidle, I mean a cluster power down idle state,
>> fast enough. This idle state allows to drop the dynamic power *and* the
>> static leakage (the latter to prevent a thermal runaway).
>>
>> If the system does not have power numbers, no (or bad) cpuidle, the
>> combo cooling device must not be used. If there is no cpufreq support,
>> the cpuidle cooling must be used and if there is no proper support for
>> both, the CPU cooling can't be used. In this case, you have to put a fan
>> on your board or reduce the frequency where the system stays in its
>> thermal envelope.
>
> How can we know (in the general case) what is going to be in the DT at
> compile time?
>
>
>>> If the later, how can a distro kernel mitigate the undesired effects
>>> whilst still selecting the combo strategy.
>>
>> I'm not sure to understand the question. Distros always use the make
>> allmodconfig, so that chooses the cpufreq CPU cooling device which was
>> the case before without this change.
>
> So there's no regression. That's nice but doesn't that mean distros
> cannot exploit the new features.
>
>
>> However, we are talking about distros here but the CPU cooling mechanism
>> is for mobile and in this case the kernel (usually Android based) comes
>> with a specific configuration file and this is where the SoC vendor has
>> to choose the right strategy.
>
> I agree its hard to conceive of a modern Android device that doesn't implement
> both the needed features but the performance figures in the covering
> letter come from Hikey (and they look pretty good) and Hikey is
> supported by a good number of regular Linux distros now.
>
> Using Kconfig to force what should be runtime decisions to happen at
> compile time means that Hikey becomes an example of a platform that
> is unable to run at max performance on the distros that have added
> support for it.

I disagree. The ARM64's defconfig is not distro ready. We have always to
change the default option and fix things in the Kconfig to make the
hikey to work (eg. the missing hi655x clock config), without speaking
about the hikey960 which is not yet ready for full support.

Hence, tweaking the Kconfig to choose the better strategy is not
necessarily a problem for this first iteration of code.

Note I'm not against changing the code to make it runtime selectable but
that will need a major rework of the current CPU cooling code especially
handling the change while the thermal framework is doing the mitigation
(and probably also changes of the thermal framework).

I prefer to keep simple self-encapsulated feature code and make it
evolve to something better instead of sending a blowing patch series
taking into account all possible combinations. Choosing the strategy at
compile time may be look restrictive but we can live with that without
problem and iteratively do the change until the choice becomes the
default strategy selection option.






<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-01-26 12:17:28

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On Thu, Jan 25, 2018 at 02:36:12PM +0100, Daniel Lezcano wrote:
> On 25/01/2018 11:57, Daniel Thompson wrote:
> > On Wed, Jan 24, 2018 at 05:59:09PM +0100, Daniel Lezcano wrote:
> >> On 24/01/2018 17:34, Daniel Thompson wrote:
> >> However, we are talking about distros here but the CPU cooling mechanism
> >> is for mobile and in this case the kernel (usually Android based) comes
> >> with a specific configuration file and this is where the SoC vendor has
> >> to choose the right strategy.
> >
> > I agree its hard to conceive of a modern Android device that doesn't implement
> > both the needed features but the performance figures in the covering
> > letter come from Hikey (and they look pretty good) and Hikey is
> > supported by a good number of regular Linux distros now.
> >
> > Using Kconfig to force what should be runtime decisions to happen at
> > compile time means that Hikey becomes an example of a platform that
> > is unable to run at max performance on the distros that have added
> > support for it.
>
> I disagree. The ARM64's defconfig is not distro ready. We have always to
> change the default option and fix things in the Kconfig to make the
> hikey to work (eg. the missing hi655x clock config), without speaking
> about the hikey960 which is not yet ready for full support.
>
> Hence, tweaking the Kconfig to choose the better strategy is not
> necessarily a problem for this first iteration of code.

The problem is not about tweaking config for a distro. No distro I know
defconfig kernels. Tweaking is normal and expected.

The problem is that a distro cannot generate a config that performs
optimally on all devices that it supports because enabling one form of
CPU cooling prevents other forms from being selected. There is no
correct value for us to select if we don't know in advance what board
the kernel image will be loaded on.

Given the massive work that has gone on (especially in ARM ecosystem)
to allow one kernel binary to support many boards at once it surprises
me to see new features proposed that cannot be exploited without
recompiling the kernel from platform to platform.


> Note I'm not against changing the code to make it runtime selectable but
> that will need a major rework of the current CPU cooling code especially
> handling the change while the thermal framework is doing the mitigation
> (and probably also changes of the thermal framework).

Perhaps I should have distinguished more between runtime-meaning-boot-time
and runtime-meaning-full-system-operation. To be clear none of my
comments are about being able to enable/disable idle injection on a
fully running system. It is the impact on multi-platform binaries that
attracted my attention.


> I prefer to keep simple self-encapsulated feature code and make it
> evolve to something better instead of sending a blowing patch series
> taking into account all possible combinations. Choosing the strategy at
> compile time may be look restrictive but we can live with that without
> problem and iteratively do the change until the choice becomes the
> default strategy selection option.

You won't find me arguing against iterative improvement. However I
did observe that the idle injection code consists almost exclusively
of new lines of code. Thus I don't understand why this new code
interferes with cpufreq cooling to the point that the existing cooling
options must compiled out of the kernel before we can exploit it.

I can see the combo code does have tentacles in more places but even so
I have the same question. What prevents the existing cooling strategy
from being compiled at the same time.

You appear to be saying that there's not yet enough infrastructure to
decide which type of cooler to instantiate[1] so its OK to hack around
the decision by forcing it to be made it at compile time. Even if we
want to force a decision by some means, is compile time really the best
point to do that?


Daniel.


[1] Is that because the DT isn't rich enough for us to figure out
the trade off between using real OPPs and virtual idle-injected
ones nor whether cpuidle entry/exit is fast enough for cooling
to be effective?

>
>
>
>
>
>
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2018-01-26 13:27:19

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 26/01/2018 13:16, Daniel Thompson wrote:
> On Thu, Jan 25, 2018 at 02:36:12PM +0100, Daniel Lezcano wrote:
>> On 25/01/2018 11:57, Daniel Thompson wrote:
>>> On Wed, Jan 24, 2018 at 05:59:09PM +0100, Daniel Lezcano wrote:
>>>> On 24/01/2018 17:34, Daniel Thompson wrote:
>>>> However, we are talking about distros here but the CPU cooling mechanism
>>>> is for mobile and in this case the kernel (usually Android based) comes
>>>> with a specific configuration file and this is where the SoC vendor has
>>>> to choose the right strategy.
>>>
>>> I agree its hard to conceive of a modern Android device that doesn't implement
>>> both the needed features but the performance figures in the covering
>>> letter come from Hikey (and they look pretty good) and Hikey is
>>> supported by a good number of regular Linux distros now.
>>>
>>> Using Kconfig to force what should be runtime decisions to happen at
>>> compile time means that Hikey becomes an example of a platform that
>>> is unable to run at max performance on the distros that have added
>>> support for it.
>>
>> I disagree. The ARM64's defconfig is not distro ready. We have always to
>> change the default option and fix things in the Kconfig to make the
>> hikey to work (eg. the missing hi655x clock config), without speaking
>> about the hikey960 which is not yet ready for full support.
>>
>> Hence, tweaking the Kconfig to choose the better strategy is not
>> necessarily a problem for this first iteration of code.
>
> The problem is not about tweaking config for a distro. No distro I know
> defconfig kernels. Tweaking is normal and expected.
>
> The problem is that a distro cannot generate a config that performs
> optimally on all devices that it supports because enabling one form of
> CPU cooling prevents other forms from being selected. There is no
> correct value for us to select if we don't know in advance what board
> the kernel image will be loaded on.
>
> Given the massive work that has gone on (especially in ARM ecosystem)
> to allow one kernel binary to support many boards at once it surprises
> me to see new features proposed that cannot be exploited without
> recompiling the kernel from platform to platform.
>
>
>> Note I'm not against changing the code to make it runtime selectable but
>> that will need a major rework of the current CPU cooling code especially
>> handling the change while the thermal framework is doing the mitigation
>> (and probably also changes of the thermal framework).
>
> Perhaps I should have distinguished more between runtime-meaning-boot-time
> and runtime-meaning-full-system-operation. To be clear none of my
> comments are about being able to enable/disable idle injection on a
> fully running system. It is the impact on multi-platform binaries that
> attracted my attention.
>
>
>> I prefer to keep simple self-encapsulated feature code and make it
>> evolve to something better instead of sending a blowing patch series
>> taking into account all possible combinations. Choosing the strategy at
>> compile time may be look restrictive but we can live with that without
>> problem and iteratively do the change until the choice becomes the
>> default strategy selection option.
>
> You won't find me arguing against iterative improvement. However I
> did observe that the idle injection code consists almost exclusively
> of new lines of code. Thus I don't understand why this new code
> interferes with cpufreq cooling to the point that the existing cooling
> options must compiled out of the kernel before we can exploit it.
>
> I can see the combo code does have tentacles in more places but even so
> I have the same question. What prevents the existing cooling strategy
> from being compiled at the same time.
>
> You appear to be saying that there's not yet enough infrastructure to
> decide which type of cooler to instantiate[1] so its OK to hack around
> the decision by forcing it to be made it at compile time. Even if we
> want to force a decision by some means, is compile time really the best
> point to do that?

For the moment, yes. We are backward compatible, there is no change with
the current feature.

I don't see really the point of being so afraid with this compilation
option, it is not the first time we have compile time code which
migrates to runtime code. The important thing is we don't break the
existing.

Having the runtime selection is the objective after improving the CPU
cooling devices. But this is far from being trivial and will need a
rework of the cooling device / framework (and the changes grouped in a
separate patch series).


> [1] Is that because the DT isn't rich enough for us to figure out
> the trade off between using real OPPs and virtual idle-injected
> ones nor whether cpuidle entry/exit is fast enough for cooling
> to be effective?

Yes, it is one aspect.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-01-31 05:20:12

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 1/8] thermal/drivers/cpu_cooling: Fixup the header and copyright

On 23-01-18, 16:34, Daniel Lezcano wrote:
> The copyright format does not conform to the format requested by
> Linaro: https://wiki.linaro.org/Copyright
>
> Fix it.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/cpu_cooling.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index dc63aba..988fc55 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -2,9 +2,11 @@
> * linux/drivers/thermal/cpu_cooling.c
> *
> * Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
> - * Copyright (C) 2012 Amit Daniel <[email protected]>
> *
> - * Copyright (C) 2014 Viresh Kumar <[email protected]>
> + * Copyright (C) 2018 Linaro Limited.
> + *

Shouldn't this be 2012-2018 ?

> + * Authors: Amit Daniel <[email protected]>
> + * Viresh Kumar <[email protected]>
> *
> * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> * This program is free software; you can redistribute it and/or modify
> --
> 2.7.4

--
viresh

2018-01-31 05:20:22

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 2/8] thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)

On 23-01-18, 16:34, Daniel Lezcano wrote:
> For license auditing purpose, let's add the SPDX tag.
>
> Cc: Philippe Ombredanne <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/cpu_cooling.c | 16 +---------------
> 1 file changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 988fc55..e62be75 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
> /*
> * linux/drivers/thermal/cpu_cooling.c
> *
> @@ -8,21 +9,6 @@
> * Authors: Amit Daniel <[email protected]>
> * Viresh Kumar <[email protected]>
> *
> - * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License as published by
> - * the Free Software Foundation; version 2 of the License.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - * General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License along
> - * with this program; if not, write to the Free Software Foundation, Inc.,
> - * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
> - *
> - * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> */
> #include <linux/module.h>
> #include <linux/thermal.h>

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-01-31 05:22:03

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 3/8] thermal/drivers/cpu_cooling: Remove pointless field

On 23-01-18, 16:34, Daniel Lezcano wrote:
> The structure cpufreq_cooling_device provides a backpointer to the thermal
> device but this one is used for a trace and to unregister. For the trace,
> we don't really need this field and the unregister function as the same
> pointer passed as parameter. Remove it.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/cpu_cooling.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index e62be75..d05bb73 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -88,7 +88,6 @@ struct cpufreq_cooling_device {
> unsigned int clipped_freq;
> unsigned int max_level;
> struct freq_table *freq_table; /* In descending order */
> - struct thermal_cooling_device *cdev;
> struct cpufreq_policy *policy;
> struct list_head node;
> struct time_in_idle *idle_time;
> @@ -197,8 +196,7 @@ static int update_freq_table(struct cpufreq_cooling_device *cpufreq_cdev,
>
> dev = get_cpu_device(cpu);
> if (unlikely(!dev)) {
> - dev_warn(&cpufreq_cdev->cdev->device,
> - "No cpu device for cpu %d\n", cpu);
> + pr_warn("No cpu device for cpu %d\n", cpu);
> return -ENODEV;
> }
>
> @@ -762,7 +760,6 @@ __cpufreq_cooling_register(struct device_node *np,
> goto remove_ida;
>
> cpufreq_cdev->clipped_freq = cpufreq_cdev->freq_table[0].frequency;
> - cpufreq_cdev->cdev = cdev;
>
> mutex_lock(&cooling_list_lock);
> /* Register the notifier for first cpufreq cooling device */
> @@ -922,7 +919,7 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> cpufreq_unregister_notifier(&thermal_cpufreq_notifier_block,
> CPUFREQ_POLICY_NOTIFIER);
>
> - thermal_cooling_device_unregister(cpufreq_cdev->cdev);
> + thermal_cooling_device_unregister(cdev);
> ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
> kfree(cpufreq_cdev->idle_time);
> kfree(cpufreq_cdev->freq_table);

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-01-31 07:05:00

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 1/8] thermal/drivers/cpu_cooling: Fixup the header and copyright

On 31/01/2018 06:17, Viresh Kumar wrote:
> On 23-01-18, 16:34, Daniel Lezcano wrote:
>> The copyright format does not conform to the format requested by
>> Linaro: https://wiki.linaro.org/Copyright
>>
>> Fix it.
>>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>> drivers/thermal/cpu_cooling.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index dc63aba..988fc55 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -2,9 +2,11 @@
>> * linux/drivers/thermal/cpu_cooling.c
>> *
>> * Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
>> - * Copyright (C) 2012 Amit Daniel <[email protected]>
>> *
>> - * Copyright (C) 2014 Viresh Kumar <[email protected]>
>> + * Copyright (C) 2018 Linaro Limited.
>> + *
>
> Shouldn't this be 2012-2018 ?

Yes, you are right.



--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-01-31 09:06:18

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

Hi Daniel,

On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:
> The cpu idle cooling driver performs synchronized idle injection across all
> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>
> Each cluster has its own idle cooling device, each core has its own idle
> injection thread, each idle injection thread uses play_idle to enter idle. In
> order to reach the deepest idle state, each cooling device has the idle
> injection threads synchronized together.
>
> It has some similarity with the intel power clamp driver but it is actually
> designed to work on the ARM architecture via the DT with a mathematical proof
> with the power model which comes with the Documentation.
>
> The idle injection cycle is fixed while the running cycle is variable. That
> allows to have control on the device reactivity for the user experience. At
> the mitigation point the idle threads are unparked, they play idle the
> specified amount of time and they schedule themselves. The last thread sets
> the next idle injection deadline and when the timer expires it wakes up all
> the threads which in turn play idle again. Meanwhile the running cycle is
> changed by set_cur_state. When the mitigation ends, the threads are parked.
> The algorithm is self adaptive, so there is no need to handle hotplugging.
>
> If we take an example of the balanced point, we can use the DT for the hi6220.
>
> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
> running at full blast at the maximum OPP consumes 5280mW. The first value is
> given in the DT, the second is calculated from the OPP with the formula:
>
> Pdyn = Cdyn x Voltage^2 x Frequency
>
> As the SoC vendors don't want to share the static leakage values, we assume
> it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
>
> In order to reduce the power to 3326mW, we have to apply a ratio to the
> running time.
>
> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0,5874
>
> We know the idle cycle which is fixed, let's assume 10ms. However from this
> duration we have to substract the wake up latency for the cluster idle state.
> In our case, it is 1.5ms. So for a 10ms latency for idle, we are really idle
> 8.5ms.
>
> As we know the idle duration and the ratio, we can compute the running cycle.
>
> running_cycle = 8.5 / 0.5874 = 14.47ms
>
> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
> SoC to the balanced trip point of 75°C.
>
> The driver has been tested on the hi6220 and it appears the temperature
> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and
> running cycle of 14ms as expected by the theory above.
>
> Signed-off-by: Kevin WangTao <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/Kconfig | 10 +
> drivers/thermal/cpu_cooling.c | 471 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/cpu_cooling.h | 6 +
> 3 files changed, 487 insertions(+)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 925e73b..4bd4be7 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
> This will be useful for platforms using the generic thermal interface
> and not the ACPI interface.
>
> +config CPU_IDLE_THERMAL
> + bool "CPU idle cooling strategy"
> + depends on CPU_IDLE
> + help
> + This implements the generic CPU cooling mechanism through
> + idle injection. This will throttle the CPU by injecting
> + fixed idle cycle. All CPUs belonging to the same cluster
> + will enter idle synchronously to reach the deepest idle
> + state.
> +
> endchoice
>
> config CLOCK_THERMAL
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index d05bb73..916a627 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -10,18 +10,33 @@
> * Viresh Kumar <[email protected]>
> *
> */
> +#undef DEBUG
> +#define pr_fmt(fmt) "CPU cooling: " fmt
> +
> #include <linux/module.h>
> #include <linux/thermal.h>
> #include <linux/cpufreq.h>
> +#include <linux/cpuidle.h>
> #include <linux/err.h>
> +#include <linux/freezer.h>
> #include <linux/idr.h>
> +#include <linux/kthread.h>
> #include <linux/pm_opp.h>
> #include <linux/slab.h>
> +#include <linux/sched/prio.h>
> +#include <linux/sched/rt.h>
> #include <linux/cpu.h>
> #include <linux/cpu_cooling.h>
> +#include <linux/wait.h>
> +
> +#include <linux/platform_device.h>
> +#include <linux/of_platform.h>
>
> #include <trace/events/thermal.h>
>
> +#include <uapi/linux/sched/types.h>
> +
> +#ifdef CONFIG_CPU_FREQ_THERMAL
> /*
> * Cooling state <-> CPUFreq frequency
> *
> @@ -926,3 +941,459 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> kfree(cpufreq_cdev);
> }
> EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
> +
> +#endif /* CPU_FREQ_THERMAL */
> +
> +#ifdef CONFIG_CPU_IDLE_THERMAL
> +/*
> + * The idle duration injection. As we don't have yet a way to specify
> + * from the DT configuration, let's default to a tick duration.
> + */
> +#define DEFAULT_IDLE_TIME_US TICK_USEC
> +
> +/**
> + * struct cpuidle_cooling_device - data for the idle cooling device
> + * @cdev: a pointer to a struct thermal_cooling_device
> + * @tsk: an array of pointer to the idle injection tasks
> + * @cpumask: a cpumask containing the CPU managed by the cooling device
> + * @timer: a hrtimer giving the tempo for the idle injection cycles
> + * @kref: a kernel refcount on this structure
> + * @waitq: the waiq for the idle injection tasks
> + * @count: an atomic to keep track of the last task exiting the idle cycle
> + * @idle_cycle: an integer defining the duration of the idle injection
> + * @state: an normalized integer giving the state of the cooling device
> + */
> +struct cpuidle_cooling_device {
> + struct thermal_cooling_device *cdev;
> + struct task_struct **tsk;
> + struct cpumask *cpumask;
> + struct list_head node;
> + struct hrtimer timer;
> + struct kref kref;
> + wait_queue_head_t *waitq;
> + atomic_t count;
> + unsigned int idle_cycle;
> + unsigned int state;
> +};
> +
> +static LIST_HEAD(cpuidle_cdev_list);
> +
> +/**
> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
> + * @idle_cdev: the idle cooling device
> + *
> + * Every idle injection task belonging to the idle cooling device and
> + * running on an online cpu will be wake up by this call.
> + */
> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
> +{
> + int cpu;
> + int weight = cpumask_weight(idle_cdev->cpumask);
> +
> + for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask)
> + wake_up_process(idle_cdev->tsk[cpu % weight]);
> +}
> +
> +/**
> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
> + * @timer: a hrtimer structure
> + *
> + * When the mitigation is acting, the CPU is allowed to run an amount
> + * of time, then the idle injection happens for the specified delay
> + * and the idle task injection schedules itself until the timer event
> + * wakes the idle injection tasks again for a new idle injection
> + * cycle. The time between the end of the idle injection and the timer
> + * expiration is the allocated running time for the CPU.
> + *
> + * Returns always HRTIMER_NORESTART
> + */
> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(timer, struct cpuidle_cooling_device, timer);
> +
> + cpuidle_cooling_wakeup(idle_cdev);
> +
> + return HRTIMER_NORESTART;
> +}
> +
> +/**
> + * cpuidle_cooling_runtime - Running time computation
> + * @idle_cdev: the idle cooling device
> + *
> + * The running duration is computed from the idle injection duration
> + * which is fixed. If we reach 100% of idle injection ratio, that
> + * means the running duration is zero. If we have a 50% ratio
> + * injection, that means we have equal duration for idle and for
> + * running duration.
> + *
> + * The formula is deduced as the following:
> + *
> + * running = idle x ((100 / ratio) - 1)
> + *
> + * For precision purpose for integer math, we use the following:
> + *
> + * running = (idle x 100) / ratio - idle
> + *
> + * For example, if we have an injected duration of 50%, then we end up
> + * with 10ms of idle injection and 10ms of running duration.
> + *
> + * Returns a s64 nanosecond based
> + */
> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
> +{
> + s64 next_wakeup;
> + int state = idle_cdev->state;
> +
> + /*
> + * The function must never be called when there is no
> + * mitigation because:
> + * - that does not make sense
> + * - we end up with a division by zero
> + */
> + BUG_ON(!state);
> +
> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
> + idle_cdev->idle_cycle;
> +
> + return next_wakeup * NSEC_PER_USEC;
> +}
> +
> +/**
> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
> + * @arg: a void pointer containing the idle cooling device address
> + *
> + * This main function does basically two operations:
> + *
> + * - Goes idle for a specific amount of time
> + *
> + * - Sets a timer to wake up all the idle injection threads after a
> + * running period
> + *
> + * That happens only when the mitigation is enabled, otherwise the
> + * task is scheduled out.
> + *
> + * In order to keep the tasks synchronized together, it is the last
> + * task exiting the idle period which is in charge of setting the
> + * timer.
> + *
> + * This function never returns.
> + */
> +static int cpuidle_cooling_injection_thread(void *arg)
> +{
> + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
> + struct cpuidle_cooling_device *idle_cdev = arg;
> + int index = smp_processor_id() % cpumask_weight(idle_cdev->cpumask);
> + DEFINE_WAIT(wait);
> +
> + set_freezable();
> +
> + sched_setscheduler(current, SCHED_FIFO, &param);
> +
> + while (1) {
> +
> + s64 next_wakeup;
> +
> + prepare_to_wait(&idle_cdev->waitq[index],
> + &wait, TASK_INTERRUPTIBLE);
> +
> + schedule();
> +
> + atomic_inc(&idle_cdev->count);
> +
> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> +
> + /*
> + * The last CPU waking up is in charge of setting the
> + * timer. If the CPU is hotplugged, the timer will
> + * move to another CPU (which may not belong to the
> + * same cluster) but that is not a problem as the
> + * timer will be set again by another CPU belonging to
> + * the cluster, so this mechanism is self adaptive and
> + * does not require any hotplugging dance.
> + */
> + if (!atomic_dec_and_test(&idle_cdev->count))
> + continue;
> +
> + if (!idle_cdev->state)
> + continue;
> +
> + next_wakeup = cpuidle_cooling_runtime(idle_cdev);
> +
> + hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
> + HRTIMER_MODE_REL_PINNED);
> + }
> +
> + finish_wait(&idle_cdev->waitq[index], &wait);
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_get_max_state - Get the maximum state
> + * @cdev : the thermal cooling device
> + * @state : a pointer to the state variable to be filled
> + *
> + * The function gives always 100 as the injection ratio is percentile
> + * based for consistency accros different platforms.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
> +{
> + /*
> + * Depending on the configuration or the hardware, the running
> + * cycle and the idle cycle could be different. We want unify
> + * that to an 0..100 interval, so the set state interface will
> + * be the same whatever the platform is.
> + *
> + * The state 100% will make the cluster 100% ... idle. A 0%
> + * injection ratio means no idle injection at all and 50%
> + * means for 10ms of idle injection, we have 10ms of running
> + * time.
> + */
> + *state = 100;
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_get_cur_state - Get the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: a pointer to the state
> + *
> + * The function just copy the state value from the private thermal
> + * cooling device structure, the mapping is 1 <-> 1.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> +
> + *state = idle_cdev->state;
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_set_cur_state - Set the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: the target state
> + *
> + * The function checks first if we are initiating the mitigation which
> + * in turn wakes up all the idle injection tasks belonging to the idle
> + * cooling device. In any case, it updates the internal state for the
> + * cooling device.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> + unsigned long current_state = idle_cdev->state;
> +
> + idle_cdev->state = state;
> +
> + if (current_state == 0 && state > 0) {
> + pr_debug("Starting cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + cpuidle_cooling_wakeup(idle_cdev);
> + } else if (current_state > 0 && !state) {
> + pr_debug("Stopping cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_ops - thermal cooling device ops
> + */
> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
> + .get_max_state = cpuidle_cooling_get_max_state,
> + .get_cur_state = cpuidle_cooling_get_cur_state,
> + .set_cur_state = cpuidle_cooling_set_cur_state,
> +};
> +
> +/**
> + * cpuidle_cooling_release - Kref based release helper
> + * @kref: a pointer to the kref structure
> + *
> + * This function is automatically called by the kref_put function when
> + * the idle cooling device refcount reaches zero. At this point, we
> + * have the guarantee the structure is no longer in use and we can
> + * safely release all the ressources.
> + */
> +static void __init cpuidle_cooling_release(struct kref *kref)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(kref, struct cpuidle_cooling_device, kref);
> +
> + thermal_cooling_device_unregister(idle_cdev->cdev);
> + kfree(idle_cdev->waitq);
> + kfree(idle_cdev->tsk);
> + kfree(idle_cdev);
> +}
> +
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and register it to thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * For each first CPU of the cluster's cpumask, we allocate the idle
> + * cooling device, initialize the general fields and then we initialze
> + * the rest in a per cpu basis.
> + *
> + * Returns zero on success, < 0 otherwise.
> + */
> +int cpuidle_cooling_register(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev = NULL;
> + struct thermal_cooling_device *cdev;
> + struct task_struct *tsk;
> + struct device_node *np;
> + cpumask_t *cpumask;
> + char dev_name[THERMAL_NAME_LENGTH];
> + int weight;
> + int ret = -ENOMEM, cpu;
> + int index = 0;
> +
> + for_each_possible_cpu(cpu) {
> +
> + cpumask = topology_core_cpumask(cpu);
> + weight = cpumask_weight(cpumask);
> +
> + /*
> + * This condition makes the first cpu belonging to the
> + * cluster to create a cooling device and allocates
> + * the structure. Others CPUs belonging to the same
> + * cluster will just increment the refcount on the
> + * cooling device structure and initialize it.
> + */
> + if (cpu == cpumask_first(cpumask)) {
> +
> + np = of_cpu_device_node_get(cpu);
> +
> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> + if (!idle_cdev)
> + goto out_fail;
> +
> + idle_cdev->tsk = kzalloc(sizeof(*idle_cdev->tsk) *
> + weight, GFP_KERNEL);
> + if (!idle_cdev->tsk)
> + goto out_fail;
> +
> + idle_cdev->waitq = kzalloc(sizeof(*idle_cdev->waitq) *
> + weight, GFP_KERNEL);
> + if (!idle_cdev->waitq)
> + goto out_fail;
> +
> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
> +
> + atomic_set(&idle_cdev->count, 0);
> +
> + kref_init(&idle_cdev->kref);
> +
> + /*
> + * Initialize the timer to wakeup all the idle
> + * injection tasks
> + */
> + hrtimer_init(&idle_cdev->timer,
> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> + /*
> + * The wakeup function callback which is in
> + * charge of waking up all CPUs belonging to
> + * the same cluster
> + */
> + idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
> +
> + /*
> + * The thermal cooling device name
> + */
> + snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
> + cdev = thermal_of_cooling_device_register(np, dev_name,
> + idle_cdev,
> + &cpuidle_cooling_ops);
> + if (IS_ERR(cdev)) {
> + ret = PTR_ERR(cdev);
> + goto out_fail;
> + }
> +
> + idle_cdev->cdev = cdev;
> +
> + idle_cdev->cpumask = cpumask;
> +
> + list_add(&idle_cdev->node, &cpuidle_cdev_list);
> +
> + pr_info("Created idle cooling device for cpus '%*pbl'\n",
> + cpumask_pr_args(cpumask));
> + }
> +
> + kref_get(&idle_cdev->kref);
> +
> + /*
> + * Each cooling device is per package. Each package
> + * has a set of cpus where the physical number is
> + * duplicate in the kernel namespace. We need a way to
> + * address the waitq[] and tsk[] arrays with index
> + * which are not Linux cpu numbered.
> + *
> + * One solution is to use the
> + * topology_core_id(cpu). Other solution is to use the
> + * modulo.
> + *
> + * eg. 2 x cluster - 4 cores.
> + *
> + * Physical numbering -> Linux numbering -> % nr_cpus
> + *
> + * Pkg0 - Cpu0 -> 0 -> 0
> + * Pkg0 - Cpu1 -> 1 -> 1
> + * Pkg0 - Cpu2 -> 2 -> 2
> + * Pkg0 - Cpu3 -> 3 -> 3
> + *
> + * Pkg1 - Cpu0 -> 4 -> 0
> + * Pkg1 - Cpu1 -> 5 -> 1
> + * Pkg1 - Cpu2 -> 6 -> 2
> + * Pkg1 - Cpu3 -> 7 -> 3


I'm not sure that the assumption above for the CPU numbering is safe.
Can't you use a per cpu structure to point to resources that are per
cpu instead ? so you will not have to rely on CPU ordering


> + */
> + init_waitqueue_head(&idle_cdev->waitq[cpu % weight]);
> +
> + tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
> + idle_cdev, cpu, "kidle_inject/%u");
> + if (IS_ERR(tsk)) {
> + ret = PTR_ERR(tsk);
> + goto out_fail;
> + }
> +
> + idle_cdev->tsk[cpu % weight] = tsk;
> +
> + wake_up_process(tsk);
> + }
> +
> + return 0;
> +
> +out_fail:
> + list_for_each_entry(idle_cdev, &cpuidle_cdev_list, node) {
> +
> + for_each_cpu(cpu, idle_cdev->cpumask) {
> +
> + if (idle_cdev->tsk[cpu])
> + kthread_stop(idle_cdev->tsk[cpu]);
> +
> + kref_put(&idle_cdev->kref, cpuidle_cooling_release);
> + }
> + }
> +
> + pr_err("Failed to create idle cooling device (%d)\n", ret);
> +
> + return ret;
> +}
> +#endif
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index d4292eb..2b5950b 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -45,6 +45,7 @@ struct thermal_cooling_device *
> cpufreq_power_cooling_register(struct cpufreq_policy *policy,
> u32 capacitance, get_static_t plat_static_func);
>
> +extern int cpuidle_cooling_register(void);
> /**
> * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
> * @np: a valid struct device_node to the cooling device device tree node.
> @@ -118,6 +119,11 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> {
> return;
> }
> +
> +static inline int cpuidle_cooling_register(void)
> +{
> + return 0;
> +}
> #endif /* CONFIG_CPU_THERMAL */
>
> #endif /* __CPU_COOLING_H__ */
> --
> 2.7.4
>

2018-01-31 09:42:35

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 31/01/2018 10:01, Vincent Guittot wrote:
> Hi Daniel,
>
> On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:

[ ... ] (please trim :)

>> + /*
>> + * Each cooling device is per package. Each package
>> + * has a set of cpus where the physical number is
>> + * duplicate in the kernel namespace. We need a way to
>> + * address the waitq[] and tsk[] arrays with index
>> + * which are not Linux cpu numbered.
>> + *
>> + * One solution is to use the
>> + * topology_core_id(cpu). Other solution is to use the
>> + * modulo.
>> + *
>> + * eg. 2 x cluster - 4 cores.
>> + *
>> + * Physical numbering -> Linux numbering -> % nr_cpus
>> + *
>> + * Pkg0 - Cpu0 -> 0 -> 0
>> + * Pkg0 - Cpu1 -> 1 -> 1
>> + * Pkg0 - Cpu2 -> 2 -> 2
>> + * Pkg0 - Cpu3 -> 3 -> 3
>> + *
>> + * Pkg1 - Cpu0 -> 4 -> 0
>> + * Pkg1 - Cpu1 -> 5 -> 1
>> + * Pkg1 - Cpu2 -> 6 -> 2
>> + * Pkg1 - Cpu3 -> 7 -> 3
>
>
> I'm not sure that the assumption above for the CPU numbering is safe.
> Can't you use a per cpu structure to point to resources that are per
> cpu instead ? so you will not have to rely on CPU ordering

Can you elaborate ? I don't get the part with the percpu structure.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-01-31 09:50:30

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 31 January 2018 at 10:33, Daniel Lezcano <[email protected]> wrote:
> On 31/01/2018 10:01, Vincent Guittot wrote:
>> Hi Daniel,
>>
>> On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:
>
> [ ... ] (please trim :)
>
>>> + /*
>>> + * Each cooling device is per package. Each package
>>> + * has a set of cpus where the physical number is
>>> + * duplicate in the kernel namespace. We need a way to
>>> + * address the waitq[] and tsk[] arrays with index
>>> + * which are not Linux cpu numbered.
>>> + *
>>> + * One solution is to use the
>>> + * topology_core_id(cpu). Other solution is to use the
>>> + * modulo.
>>> + *
>>> + * eg. 2 x cluster - 4 cores.
>>> + *
>>> + * Physical numbering -> Linux numbering -> % nr_cpus
>>> + *
>>> + * Pkg0 - Cpu0 -> 0 -> 0
>>> + * Pkg0 - Cpu1 -> 1 -> 1
>>> + * Pkg0 - Cpu2 -> 2 -> 2
>>> + * Pkg0 - Cpu3 -> 3 -> 3
>>> + *
>>> + * Pkg1 - Cpu0 -> 4 -> 0
>>> + * Pkg1 - Cpu1 -> 5 -> 1
>>> + * Pkg1 - Cpu2 -> 6 -> 2
>>> + * Pkg1 - Cpu3 -> 7 -> 3
>>
>>
>> I'm not sure that the assumption above for the CPU numbering is safe.
>> Can't you use a per cpu structure to point to resources that are per
>> cpu instead ? so you will not have to rely on CPU ordering
>
> Can you elaborate ? I don't get the part with the percpu structure.

Something like:

struct cpuidle_cooling_cpu {
struct task_struct *tsk;
wait_queue_head_t waitq;
};

DECLARE_PER_CPU(struct cpuidle_cooling_cpu *, cpu_data);

>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2018-01-31 09:52:54

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 31/01/2018 10:46, Vincent Guittot wrote:
> On 31 January 2018 at 10:33, Daniel Lezcano <[email protected]> wrote:
>> On 31/01/2018 10:01, Vincent Guittot wrote:
>>> Hi Daniel,
>>>
>>> On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:
>>
>> [ ... ] (please trim :)
>>
>>>> + /*
>>>> + * Each cooling device is per package. Each package
>>>> + * has a set of cpus where the physical number is
>>>> + * duplicate in the kernel namespace. We need a way to
>>>> + * address the waitq[] and tsk[] arrays with index
>>>> + * which are not Linux cpu numbered.
>>>> + *
>>>> + * One solution is to use the
>>>> + * topology_core_id(cpu). Other solution is to use the
>>>> + * modulo.
>>>> + *
>>>> + * eg. 2 x cluster - 4 cores.
>>>> + *
>>>> + * Physical numbering -> Linux numbering -> % nr_cpus
>>>> + *
>>>> + * Pkg0 - Cpu0 -> 0 -> 0
>>>> + * Pkg0 - Cpu1 -> 1 -> 1
>>>> + * Pkg0 - Cpu2 -> 2 -> 2
>>>> + * Pkg0 - Cpu3 -> 3 -> 3
>>>> + *
>>>> + * Pkg1 - Cpu0 -> 4 -> 0
>>>> + * Pkg1 - Cpu1 -> 5 -> 1
>>>> + * Pkg1 - Cpu2 -> 6 -> 2
>>>> + * Pkg1 - Cpu3 -> 7 -> 3
>>>
>>>
>>> I'm not sure that the assumption above for the CPU numbering is safe.
>>> Can't you use a per cpu structure to point to resources that are per
>>> cpu instead ? so you will not have to rely on CPU ordering
>>
>> Can you elaborate ? I don't get the part with the percpu structure.
>
> Something like:
>
> struct cpuidle_cooling_cpu {
> struct task_struct *tsk;
> wait_queue_head_t waitq;
> };
>
> DECLARE_PER_CPU(struct cpuidle_cooling_cpu *, cpu_data);

I got this part but I don't get how that fixes the ordering thing.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-01-31 09:57:35

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 31 January 2018 at 10:50, Daniel Lezcano <[email protected]> wrote:
> On 31/01/2018 10:46, Vincent Guittot wrote:
>> On 31 January 2018 at 10:33, Daniel Lezcano <[email protected]> wrote:
>>> On 31/01/2018 10:01, Vincent Guittot wrote:
>>>> Hi Daniel,
>>>>
>>>> On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:
>>>
>>> [ ... ] (please trim :)
>>>
>>>>> + /*
>>>>> + * Each cooling device is per package. Each package
>>>>> + * has a set of cpus where the physical number is
>>>>> + * duplicate in the kernel namespace. We need a way to
>>>>> + * address the waitq[] and tsk[] arrays with index
>>>>> + * which are not Linux cpu numbered.
>>>>> + *
>>>>> + * One solution is to use the
>>>>> + * topology_core_id(cpu). Other solution is to use the
>>>>> + * modulo.
>>>>> + *
>>>>> + * eg. 2 x cluster - 4 cores.
>>>>> + *
>>>>> + * Physical numbering -> Linux numbering -> % nr_cpus
>>>>> + *
>>>>> + * Pkg0 - Cpu0 -> 0 -> 0
>>>>> + * Pkg0 - Cpu1 -> 1 -> 1
>>>>> + * Pkg0 - Cpu2 -> 2 -> 2
>>>>> + * Pkg0 - Cpu3 -> 3 -> 3
>>>>> + *
>>>>> + * Pkg1 - Cpu0 -> 4 -> 0
>>>>> + * Pkg1 - Cpu1 -> 5 -> 1
>>>>> + * Pkg1 - Cpu2 -> 6 -> 2
>>>>> + * Pkg1 - Cpu3 -> 7 -> 3
>>>>
>>>>
>>>> I'm not sure that the assumption above for the CPU numbering is safe.
>>>> Can't you use a per cpu structure to point to resources that are per
>>>> cpu instead ? so you will not have to rely on CPU ordering
>>>
>>> Can you elaborate ? I don't get the part with the percpu structure.
>>
>> Something like:
>>
>> struct cpuidle_cooling_cpu {
>> struct task_struct *tsk;
>> wait_queue_head_t waitq;
>> };
>>
>> DECLARE_PER_CPU(struct cpuidle_cooling_cpu *, cpu_data);
>
> I got this part but I don't get how that fixes the ordering thing.

Because you don't care of the CPU ordering to retrieve the data as
they are stored per cpu directly

>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2018-01-31 10:10:29

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On Fri, Jan 26, 2018 at 02:25:58PM +0100, Daniel Lezcano wrote:
> > Perhaps I should have distinguished more between runtime-meaning-boot-time
> > and runtime-meaning-full-system-operation. To be clear none of my
> > comments are about being able to enable/disable idle injection on a
> > fully running system. It is the impact on multi-platform binaries that
> > attracted my attention.
> >
> >
> >> I prefer to keep simple self-encapsulated feature code and make it
> >> evolve to something better instead of sending a blowing patch series
> >> taking into account all possible combinations. Choosing the strategy at
> >> compile time may be look restrictive but we can live with that without
> >> problem and iteratively do the change until the choice becomes the
> >> default strategy selection option.
> >
> > You won't find me arguing against iterative improvement. However I
> > did observe that the idle injection code consists almost exclusively
> > of new lines of code. Thus I don't understand why this new code
> > interferes with cpufreq cooling to the point that the existing cooling
> > options must compiled out of the kernel before we can exploit it.
> >
> > I can see the combo code does have tentacles in more places but even so
> > I have the same question. What prevents the existing cooling strategy
> > from being compiled at the same time.
> >
> > You appear to be saying that there's not yet enough infrastructure to
> > decide which type of cooler to instantiate[1] so its OK to hack around
> > the decision by forcing it to be made it at compile time. Even if we
> > want to force a decision by some means, is compile time really the best
> > point to do that?
>
> For the moment, yes. We are backward compatible, there is no change with
> the current feature.
>
> I don't see really the point of being so afraid with this compilation
> option, it is not the first time we have compile time code which
> migrates to runtime code. The important thing is we don't break the
> existing.

I guess its because the code that implements the cooling devices doesn't
look like there is any conflict preventing them being enabled at the
same time except. There isn't even that much problem choosing between
cpufreq-only and cpuidle-only during instatiation (preferring cpufreq
for the benefit of existing platforms) although I can see that we cannot
choose between cpufreq-only and combo without extending the DT bindings.


> Having the runtime selection is the objective after improving the CPU
> cooling devices. But this is far from being trivial and will need a
> rework of the cooling device / framework (and the changes grouped in a
> separate patch series).

I guess if I'm in a minority of one then I can agree to disagree at
this point. I don't like build-time only features (for all the distro
centric reasons I've already mentioned) but I'm not that heavily
invested... and look forward to seeing runtime selection implemented.


Daniel.

2018-01-31 15:28:59

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 31/01/2018 10:56, Vincent Guittot wrote:
> On 31 January 2018 at 10:50, Daniel Lezcano <[email protected]> wrote:
>> On 31/01/2018 10:46, Vincent Guittot wrote:
>>> On 31 January 2018 at 10:33, Daniel Lezcano <[email protected]> wrote:
>>>> On 31/01/2018 10:01, Vincent Guittot wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:
>>>>
>>>> [ ... ] (please trim :)
>>>>
>>>>>> + /*
>>>>>> + * Each cooling device is per package. Each package
>>>>>> + * has a set of cpus where the physical number is
>>>>>> + * duplicate in the kernel namespace. We need a way to
>>>>>> + * address the waitq[] and tsk[] arrays with index
>>>>>> + * which are not Linux cpu numbered.
>>>>>> + *
>>>>>> + * One solution is to use the
>>>>>> + * topology_core_id(cpu). Other solution is to use the
>>>>>> + * modulo.
>>>>>> + *
>>>>>> + * eg. 2 x cluster - 4 cores.
>>>>>> + *
>>>>>> + * Physical numbering -> Linux numbering -> % nr_cpus
>>>>>> + *
>>>>>> + * Pkg0 - Cpu0 -> 0 -> 0
>>>>>> + * Pkg0 - Cpu1 -> 1 -> 1
>>>>>> + * Pkg0 - Cpu2 -> 2 -> 2
>>>>>> + * Pkg0 - Cpu3 -> 3 -> 3
>>>>>> + *
>>>>>> + * Pkg1 - Cpu0 -> 4 -> 0
>>>>>> + * Pkg1 - Cpu1 -> 5 -> 1
>>>>>> + * Pkg1 - Cpu2 -> 6 -> 2
>>>>>> + * Pkg1 - Cpu3 -> 7 -> 3
>>>>>
>>>>>
>>>>> I'm not sure that the assumption above for the CPU numbering is safe.
>>>>> Can't you use a per cpu structure to point to resources that are per
>>>>> cpu instead ? so you will not have to rely on CPU ordering
>>>>
>>>> Can you elaborate ? I don't get the part with the percpu structure.
>>>
>>> Something like:
>>>
>>> struct cpuidle_cooling_cpu {
>>> struct task_struct *tsk;
>>> wait_queue_head_t waitq;
>>> };
>>>
>>> DECLARE_PER_CPU(struct cpuidle_cooling_cpu *, cpu_data);
>>
>> I got this part but I don't get how that fixes the ordering thing.
>
> Because you don't care of the CPU ordering to retrieve the data as
> they are stored per cpu directly

That's what I did initially, but for consistency reasons with the
cpufreq cpu cooling device which is stored in a list and the combo cpu
cooling device, the cpuidle cooling device must be per cluster and
stored in a list.

Alternatively I can do:

struct cpuidle_cooling_device {
struct thermal_cooling_device *cdev;
- struct task_struct **tsk;
+ struct task_struct __percpu *tsk;
struct cpumask *cpumask;
struct list_head node;
struct hrtimer timer;
struct kref kref;
- wait_queue_head_t *waitq;
+ wait_queue_head_t __percpu waitq;
atomic_t count;
unsigned int idle_cycle;
unsigned int state;
};


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-01 07:59:17

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 31 January 2018 at 16:27, Daniel Lezcano <[email protected]> wrote:
> On 31/01/2018 10:56, Vincent Guittot wrote:
>> On 31 January 2018 at 10:50, Daniel Lezcano <[email protected]> wrote:
>>> On 31/01/2018 10:46, Vincent Guittot wrote:
>>>> On 31 January 2018 at 10:33, Daniel Lezcano <[email protected]> wrote:
>>>>> On 31/01/2018 10:01, Vincent Guittot wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:
>>>>>
>>>>> [ ... ] (please trim :)
>>>>>
>>>>>>> + /*
>>>>>>> + * Each cooling device is per package. Each package
>>>>>>> + * has a set of cpus where the physical number is
>>>>>>> + * duplicate in the kernel namespace. We need a way to
>>>>>>> + * address the waitq[] and tsk[] arrays with index
>>>>>>> + * which are not Linux cpu numbered.
>>>>>>> + *
>>>>>>> + * One solution is to use the
>>>>>>> + * topology_core_id(cpu). Other solution is to use the
>>>>>>> + * modulo.
>>>>>>> + *
>>>>>>> + * eg. 2 x cluster - 4 cores.
>>>>>>> + *
>>>>>>> + * Physical numbering -> Linux numbering -> % nr_cpus
>>>>>>> + *
>>>>>>> + * Pkg0 - Cpu0 -> 0 -> 0
>>>>>>> + * Pkg0 - Cpu1 -> 1 -> 1
>>>>>>> + * Pkg0 - Cpu2 -> 2 -> 2
>>>>>>> + * Pkg0 - Cpu3 -> 3 -> 3
>>>>>>> + *
>>>>>>> + * Pkg1 - Cpu0 -> 4 -> 0
>>>>>>> + * Pkg1 - Cpu1 -> 5 -> 1
>>>>>>> + * Pkg1 - Cpu2 -> 6 -> 2
>>>>>>> + * Pkg1 - Cpu3 -> 7 -> 3
>>>>>>
>>>>>>
>>>>>> I'm not sure that the assumption above for the CPU numbering is safe.
>>>>>> Can't you use a per cpu structure to point to resources that are per
>>>>>> cpu instead ? so you will not have to rely on CPU ordering
>>>>>
>>>>> Can you elaborate ? I don't get the part with the percpu structure.
>>>>
>>>> Something like:
>>>>
>>>> struct cpuidle_cooling_cpu {
>>>> struct task_struct *tsk;
>>>> wait_queue_head_t waitq;
>>>> };
>>>>
>>>> DECLARE_PER_CPU(struct cpuidle_cooling_cpu *, cpu_data);
>>>
>>> I got this part but I don't get how that fixes the ordering thing.
>>
>> Because you don't care of the CPU ordering to retrieve the data as
>> they are stored per cpu directly
>
> That's what I did initially, but for consistency reasons with the
> cpufreq cpu cooling device which is stored in a list and the combo cpu
> cooling device, the cpuidle cooling device must be per cluster and
> stored in a list.

I'm not sure to catch your problem. You can still have cpuidle cooling
device per cluster and stored in the list but keep per cpu data in a

AFAICT, you will not have more than one cpu cooling device registered
per CPU so one per cpu variable that will gathers cpu private data
should be enough ?

>
> Alternatively I can do:
>
> struct cpuidle_cooling_device {
> struct thermal_cooling_device *cdev;
> - struct task_struct **tsk;
> + struct task_struct __percpu *tsk;
> struct cpumask *cpumask;
> struct list_head node;
> struct hrtimer timer;
> struct kref kref;
> - wait_queue_head_t *waitq;
> + wait_queue_head_t __percpu waitq;
> atomic_t count;
> unsigned int idle_cycle;
> unsigned int state;
> };

struct cpuidle_cooling_device {
struct thermal_cooling_device *cdev;
struct cpumask *cpumask;
struct list_head node;
struct hrtimer timer;
struct kref kref;
atomic_t count;
unsigned int idle_cycle;
unsigned int state;
};

struct cpuidle_cooling_cpu {
struct task_struct *tsk;
wait_queue_head_t waitq;
};
DECLARE_PER_CPU(struct cpuidle_cooling_cpu *, cpu_data);

You continue to have cpuidle_cooling_device allocated dynamically per
cluster and added in the list but task and waitq are stored per cpu

>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2018-02-01 08:26:14

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 01/02/2018 08:57, Vincent Guittot wrote:
> On 31 January 2018 at 16:27, Daniel Lezcano <[email protected]> wrote:
>> On 31/01/2018 10:56, Vincent Guittot wrote:
>>> On 31 January 2018 at 10:50, Daniel Lezcano <[email protected]> wrote:
>>>> On 31/01/2018 10:46, Vincent Guittot wrote:
>>>>> On 31 January 2018 at 10:33, Daniel Lezcano <[email protected]> wrote:
>>>>>> On 31/01/2018 10:01, Vincent Guittot wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> On 23 January 2018 at 16:34, Daniel Lezcano <[email protected]> wrote:
>>>>>>
>>>>>> [ ... ] (please trim :)
>>>>>>

[ ... ]

> struct cpuidle_cooling_device {
> struct thermal_cooling_device *cdev;
> struct cpumask *cpumask;
> struct list_head node;
> struct hrtimer timer;
> struct kref kref;
> atomic_t count;
> unsigned int idle_cycle;
> unsigned int state;
> };
>
> struct cpuidle_cooling_cpu {
> struct task_struct *tsk;
> wait_queue_head_t waitq;
> };
> DECLARE_PER_CPU(struct cpuidle_cooling_cpu *, cpu_data);
>
> You continue to have cpuidle_cooling_device allocated dynamically per
> cluster and added in the list but task and waitq are stored per cpu

Ok. I will try that.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-02 10:45:01

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

Hi Daniel,

I have gone through the other review comments, specially from Daniel T.. While I
share some of his concerns, I have few more of mine.

On 23-01-18, 16:34, Daniel Lezcano wrote:
> +late_initcall(cpu_cooling_init);

For example, this thing isn't going to fly nicely as you have assumed cpufreq
and cpuidle drivers are going to be part of the kernel itself. What if they are
modules and are inserted after late-init ? There are more cases like this where
the cpufreq driver may unregister the cpufreq cooling device on the fly and then
add it back. And so this stuff is a bit tricky.

Here is how I see the whole thing now:

- Yes we need individual support for both cpufreq and cpuidle cooling devices,
and no one disagrees on that I believe.

- There is nothing in the thermal framework that disallows both cpufreq and
cpuidle cooling devices to co-exist. Both would be part of the same thermal
zone and so will get throttled with the same thermal sensor event. And so we
will end up trying to cool down the SoC using both cpufreq and cpuidle
technique.

- Now I am just wondering if we really need the "combo" functionality or not.
Can we fine tune the DT cpu-cooling properties (existing ones) for a platform,
so that it automatically acts as a combo cooling device? I am not 100% sure
its gonna fly, but just wanted to make sure its not possible to work around
with and then only try the combo device thing.

For example, suppose that with just cpufreq-cooling device we need to take the
CPU down to 1 GHz from 2 GHz if we cross temperature 'X'. What if we can change
this policy from DT and say the cpufreq-cooling device goes to 1.5 GHz and
cpuidle-cooling device takes us to idle for 'y' us, and the effect of
combination of these two is >= the effect of the 1 GHz for just the
cpufreq-cooling device.

Is there any possibility of this to work ?

--
viresh

2018-02-02 14:31:36

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 02/02/2018 11:42, Viresh Kumar wrote:
> Hi Daniel,

Hi Viresh,


> I have gone through the other review comments, specially from Daniel T.. While I
> share some of his concerns, I have few more of mine.
>
> On 23-01-18, 16:34, Daniel Lezcano wrote:
>> +late_initcall(cpu_cooling_init);
>
> For example, this thing isn't going to fly nicely as you have assumed cpufreq
> and cpuidle drivers are going to be part of the kernel itself. What if they are
> modules and are inserted after late-init ? There are more cases like this where
> the cpufreq driver may unregister the cpufreq cooling device on the fly and then
> add it back. And so this stuff is a bit tricky.

The cpuidle driver can not be compiled as a module.

Agree the cpufreq driver can be unloaded and this part needs some
adjustments.

> Here is how I see the whole thing now:
>
> - Yes we need individual support for both cpufreq and cpuidle cooling devices,
> and no one disagrees on that I believe.
>
> - There is nothing in the thermal framework that disallows both cpufreq and
> cpuidle cooling devices to co-exist. Both would be part of the same thermal
> zone and so will get throttled with the same thermal sensor event. And so we
> will end up trying to cool down the SoC using both cpufreq and cpuidle
> technique.

No. It does not work because we will need different state for each
cooling device and we need some logic behind.

> - Now I am just wondering if we really need the "combo" functionality or not.
> Can we fine tune the DT cpu-cooling properties (existing ones) for a platform,
> so that it automatically acts as a combo cooling device? I am not 100% sure
> its gonna fly, but just wanted to make sure its not possible to work around
> with and then only try the combo device thing.
>
> For example, suppose that with just cpufreq-cooling device we need to take the
> CPU down to 1 GHz from 2 GHz if we cross temperature 'X'. What if we can change
> this policy from DT and say the cpufreq-cooling device goes to 1.5 GHz and
> cpuidle-cooling device takes us to idle for 'y' us, and the effect of
> combination of these two is >= the effect of the 1 GHz for just the
> cpufreq-cooling device.
>
> Is there any possibility of this to work ?

It does not make sense. The combo does that automatically by computing
the power equivalence more precisely.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-05 04:19:50

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 02-02-18, 15:30, Daniel Lezcano wrote:
> On 02/02/2018 11:42, Viresh Kumar wrote:
> > Here is how I see the whole thing now:
> >
> > - Yes we need individual support for both cpufreq and cpuidle cooling devices,
> > and no one disagrees on that I believe.
> >
> > - There is nothing in the thermal framework that disallows both cpufreq and
> > cpuidle cooling devices to co-exist. Both would be part of the same thermal
> > zone and so will get throttled with the same thermal sensor event. And so we
> > will end up trying to cool down the SoC using both cpufreq and cpuidle
> > technique.
>
> No. It does not work because we will need different state for each
> cooling device and we need some logic behind.

Right, but I thought the cooling-maps can help us specify different cooling
states for different cooling devices for the same trip point. Maybe my
understanding of that is incorrect.

> > - Now I am just wondering if we really need the "combo" functionality or not.
> > Can we fine tune the DT cpu-cooling properties (existing ones) for a platform,
> > so that it automatically acts as a combo cooling device? I am not 100% sure
> > its gonna fly, but just wanted to make sure its not possible to work around
> > with and then only try the combo device thing.
> >
> > For example, suppose that with just cpufreq-cooling device we need to take the
> > CPU down to 1 GHz from 2 GHz if we cross temperature 'X'. What if we can change
> > this policy from DT and say the cpufreq-cooling device goes to 1.5 GHz and
> > cpuidle-cooling device takes us to idle for 'y' us, and the effect of
> > combination of these two is >= the effect of the 1 GHz for just the
> > cpufreq-cooling device.
> >
> > Is there any possibility of this to work ?
>
> It does not make sense. The combo does that automatically by computing
> the power equivalence more precisely.

Sure, but that works by creating a virtual combo-cooling device instead of two
separate cooling devices and then there are several limitation (at least right
now) where it doesn't sense the real situation automagically. For example I
would expect the combo to just work with cpuidle if cpufreq isn't present and as
soon as cpufreq comes in, covert itself to cpufreq+cpuidle. I was just trying to
present another view at solving the problem at hand, not that one is better
than the other.

--
viresh

2018-02-05 10:33:29

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 05/02/2018 05:17, Viresh Kumar wrote:
> On 02-02-18, 15:30, Daniel Lezcano wrote:
>> On 02/02/2018 11:42, Viresh Kumar wrote:
>>> Here is how I see the whole thing now:
>>>
>>> - Yes we need individual support for both cpufreq and cpuidle cooling devices,
>>> and no one disagrees on that I believe.
>>>
>>> - There is nothing in the thermal framework that disallows both cpufreq and
>>> cpuidle cooling devices to co-exist. Both would be part of the same thermal
>>> zone and so will get throttled with the same thermal sensor event. And so we
>>> will end up trying to cool down the SoC using both cpufreq and cpuidle
>>> technique.
>>
>> No. It does not work because we will need different state for each
>> cooling device and we need some logic behind.
>
> Right, but I thought the cooling-maps can help us specify different cooling
> states for different cooling devices for the same trip point. Maybe my
> understanding of that is incorrect.
>
>>> - Now I am just wondering if we really need the "combo" functionality or not.
>>> Can we fine tune the DT cpu-cooling properties (existing ones) for a platform,
>>> so that it automatically acts as a combo cooling device? I am not 100% sure
>>> its gonna fly, but just wanted to make sure its not possible to work around
>>> with and then only try the combo device thing.
>>>
>>> For example, suppose that with just cpufreq-cooling device we need to take the
>>> CPU down to 1 GHz from 2 GHz if we cross temperature 'X'. What if we can change
>>> this policy from DT and say the cpufreq-cooling device goes to 1.5 GHz and
>>> cpuidle-cooling device takes us to idle for 'y' us, and the effect of
>>> combination of these two is >= the effect of the 1 GHz for just the
>>> cpufreq-cooling device.
>>>
>>> Is there any possibility of this to work ?
>>
>> It does not make sense. The combo does that automatically by computing
>> the power equivalence more precisely.
>
> Sure, but that works by creating a virtual combo-cooling device instead of two
> separate cooling devices and then there are several limitation (at least right
> now) where it doesn't sense the real situation automagically. For example I
> would expect the combo to just work with cpuidle if cpufreq isn't present and as
> soon as cpufreq comes in, covert itself to cpufreq+cpuidle. I was just trying to
> present another view at solving the problem at hand, not that one is better
> than the other.
At the first glance, it sounds interesting but I'm afraid that raises
more corner-cases than it solves because we have to take into account
all the combinations: cpuidle=0 && cpufreq=1, cpuidle=1 && cpufreq=0,
cpuidle=1 && cpufreq=1 with dynamic code changes when the cpufreq driver
is loaded/unloaded.

I'm not against this approach as well as merging all the cpu cooling
devices into a single one but that won't be trivial and will need
several iterations before reaching this level of features.

IMO, we should keep the current approach (but handle the cpufreq
loading/unloading) and then iteratively merge all the cooling device
into a single one with policy change at runtime which will automatically
handle the cpufreq load/unload.

However I'm open to suggestion.



--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-05 13:55:53

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 23/01/18 15:34, Daniel Lezcano wrote:
> +/**
> + * cpuidle_cooling_get_max_state - Get the maximum state
> + * @cdev : the thermal cooling device
> + * @state : a pointer to the state variable to be filled
> + *
> + * The function gives always 100 as the injection ratio is percentile
> + * based for consistency accros different platforms.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
> +{
> + /*
> + * Depending on the configuration or the hardware, the running
> + * cycle and the idle cycle could be different. We want unify
> + * that to an 0..100 interval, so the set state interface will
> + * be the same whatever the platform is.
> + *
> + * The state 100% will make the cluster 100% ... idle. A 0%
> + * injection ratio means no idle injection at all and 50%
> + * means for 10ms of idle injection, we have 10ms of running
> + * time.
> + */
> + *state = 100;

Doesn't this requires DTs to be changed?

Basically the existing bindings for cpu cooling dictate that
cooling-max-levels == num OPPs - 1 .

Likewise I think cooling-max-levels *defines* the scale used by the
cooling maps in the DT. Introducing an alternative scale means that the
OPP limits in the cooling-map will be misinterpreted when we bind the
cooling device.

Note that we get away with this on Hikey because its mainline cooling
map is, AFAICT, deeply sub-optimal and doesn't set any cooling limits.
If its cooling map has been tuned (as the Exynos ones are) then I
suspect there could be overheating problems when the cpuidle (or combo)
CPU coolers are enabled.


Daniel.



> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_get_cur_state - Get the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: a pointer to the state
> + *
> + * The function just copy the state value from the private thermal
> + * cooling device structure, the mapping is 1 <-> 1.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> +
> + *state = idle_cdev->state;
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_set_cur_state - Set the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: the target state
> + *
> + * The function checks first if we are initiating the mitigation which
> + * in turn wakes up all the idle injection tasks belonging to the idle
> + * cooling device. In any case, it updates the internal state for the
> + * cooling device.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> + unsigned long current_state = idle_cdev->state;
> +
> + idle_cdev->state = state;
> +
> + if (current_state == 0 && state > 0) {
> + pr_debug("Starting cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + cpuidle_cooling_wakeup(idle_cdev);
> + } else if (current_state > 0 && !state) {
> + pr_debug("Stopping cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_ops - thermal cooling device ops
> + */
> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
> + .get_max_state = cpuidle_cooling_get_max_state,
> + .get_cur_state = cpuidle_cooling_get_cur_state,
> + .set_cur_state = cpuidle_cooling_set_cur_state,
> +};
> +
> +/**
> + * cpuidle_cooling_release - Kref based release helper
> + * @kref: a pointer to the kref structure
> + *
> + * This function is automatically called by the kref_put function when
> + * the idle cooling device refcount reaches zero. At this point, we
> + * have the guarantee the structure is no longer in use and we can
> + * safely release all the ressources.
> + */
> +static void __init cpuidle_cooling_release(struct kref *kref)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(kref, struct cpuidle_cooling_device, kref);
> +
> + thermal_cooling_device_unregister(idle_cdev->cdev);
> + kfree(idle_cdev->waitq);
> + kfree(idle_cdev->tsk);
> + kfree(idle_cdev);
> +}
> +
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and register it to thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * For each first CPU of the cluster's cpumask, we allocate the idle
> + * cooling device, initialize the general fields and then we initialze
> + * the rest in a per cpu basis.
> + *
> + * Returns zero on success, < 0 otherwise.
> + */
> +int cpuidle_cooling_register(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev = NULL;
> + struct thermal_cooling_device *cdev;
> + struct task_struct *tsk;
> + struct device_node *np;
> + cpumask_t *cpumask;
> + char dev_name[THERMAL_NAME_LENGTH];
> + int weight;
> + int ret = -ENOMEM, cpu;
> + int index = 0;
> +
> + for_each_possible_cpu(cpu) {
> +
> + cpumask = topology_core_cpumask(cpu);
> + weight = cpumask_weight(cpumask);
> +
> + /*
> + * This condition makes the first cpu belonging to the
> + * cluster to create a cooling device and allocates
> + * the structure. Others CPUs belonging to the same
> + * cluster will just increment the refcount on the
> + * cooling device structure and initialize it.
> + */
> + if (cpu == cpumask_first(cpumask)) {
> +
> + np = of_cpu_device_node_get(cpu);
> +
> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> + if (!idle_cdev)
> + goto out_fail;
> +
> + idle_cdev->tsk = kzalloc(sizeof(*idle_cdev->tsk) *
> + weight, GFP_KERNEL);
> + if (!idle_cdev->tsk)
> + goto out_fail;
> +
> + idle_cdev->waitq = kzalloc(sizeof(*idle_cdev->waitq) *
> + weight, GFP_KERNEL);
> + if (!idle_cdev->waitq)
> + goto out_fail;
> +
> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
> +
> + atomic_set(&idle_cdev->count, 0);
> +
> + kref_init(&idle_cdev->kref);
> +
> + /*
> + * Initialize the timer to wakeup all the idle
> + * injection tasks
> + */
> + hrtimer_init(&idle_cdev->timer,
> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> + /*
> + * The wakeup function callback which is in
> + * charge of waking up all CPUs belonging to
> + * the same cluster
> + */
> + idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
> +
> + /*
> + * The thermal cooling device name
> + */
> + snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
> + cdev = thermal_of_cooling_device_register(np, dev_name,
> + idle_cdev,
> + &cpuidle_cooling_ops);
> + if (IS_ERR(cdev)) {
> + ret = PTR_ERR(cdev);
> + goto out_fail;
> + }
> +
> + idle_cdev->cdev = cdev;
> +
> + idle_cdev->cpumask = cpumask;
> +
> + list_add(&idle_cdev->node, &cpuidle_cdev_list);
> +
> + pr_info("Created idle cooling device for cpus '%*pbl'\n",
> + cpumask_pr_args(cpumask));
> + }
> +
> + kref_get(&idle_cdev->kref);
> +
> + /*
> + * Each cooling device is per package. Each package
> + * has a set of cpus where the physical number is
> + * duplicate in the kernel namespace. We need a way to
> + * address the waitq[] and tsk[] arrays with index
> + * which are not Linux cpu numbered.
> + *
> + * One solution is to use the
> + * topology_core_id(cpu). Other solution is to use the
> + * modulo.
> + *
> + * eg. 2 x cluster - 4 cores.
> + *
> + * Physical numbering -> Linux numbering -> % nr_cpus
> + *
> + * Pkg0 - Cpu0 -> 0 -> 0
> + * Pkg0 - Cpu1 -> 1 -> 1
> + * Pkg0 - Cpu2 -> 2 -> 2
> + * Pkg0 - Cpu3 -> 3 -> 3
> + *
> + * Pkg1 - Cpu0 -> 4 -> 0
> + * Pkg1 - Cpu1 -> 5 -> 1
> + * Pkg1 - Cpu2 -> 6 -> 2
> + * Pkg1 - Cpu3 -> 7 -> 3
> + */
> + init_waitqueue_head(&idle_cdev->waitq[cpu % weight]);
> +
> + tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
> + idle_cdev, cpu, "kidle_inject/%u");
> + if (IS_ERR(tsk)) {
> + ret = PTR_ERR(tsk);
> + goto out_fail;
> + }
> +
> + idle_cdev->tsk[cpu % weight] = tsk;
> +
> + wake_up_process(tsk);
> + }
> +
> + return 0;
> +
> +out_fail:
> + list_for_each_entry(idle_cdev, &cpuidle_cdev_list, node) {
> +
> + for_each_cpu(cpu, idle_cdev->cpumask) {
> +
> + if (idle_cdev->tsk[cpu])
> + kthread_stop(idle_cdev->tsk[cpu]);
> +
> + kref_put(&idle_cdev->kref, cpuidle_cooling_release);
> + }
> + }
> +
> + pr_err("Failed to create idle cooling device (%d)\n", ret);
> +
> + return ret;
> +}
> +#endif
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index d4292eb..2b5950b 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -45,6 +45,7 @@ struct thermal_cooling_device *
> cpufreq_power_cooling_register(struct cpufreq_policy *policy,
> u32 capacitance, get_static_t plat_static_func);
>
> +extern int cpuidle_cooling_register(void);
> /**
> * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
> * @np: a valid struct device_node to the cooling device device tree node.
> @@ -118,6 +119,11 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> {
> return;
> }
> +
> +static inline int cpuidle_cooling_register(void)
> +{
> + return 0;
> +}
> #endif /* CONFIG_CPU_THERMAL */
>
> #endif /* __CPU_COOLING_H__ */
>

2018-02-06 04:29:45

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 05-02-18, 11:32, Daniel Lezcano wrote:
> On 05/02/2018 05:17, Viresh Kumar wrote:
> > Right, but I thought the cooling-maps can help us specify different cooling
> > states for different cooling devices for the same trip point. Maybe my
> > understanding of that is incorrect.

Any inputs on this? I am still wondering if this can be done.

> At the first glance, it sounds interesting but I'm afraid that raises
> more corner-cases than it solves because we have to take into account
> all the combinations: cpuidle=0 && cpufreq=1, cpuidle=1 && cpufreq=0,
> cpuidle=1 && cpufreq=1 with dynamic code changes when the cpufreq driver
> is loaded/unloaded.
>
> I'm not against this approach as well as merging all the cpu cooling
> devices into a single one but that won't be trivial and will need
> several iterations before reaching this level of features.
>
> IMO, we should keep the current approach (but handle the cpufreq
> loading/unloading) and then iteratively merge all the cooling device
> into a single one with policy change at runtime which will automatically
> handle the cpufreq load/unload.

Surely we can do one thing at a time if that's the way we choose to do it.

--
viresh

2018-02-06 10:50:04

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 06/02/2018 05:28, Viresh Kumar wrote:
> On 05-02-18, 11:32, Daniel Lezcano wrote:
>> On 05/02/2018 05:17, Viresh Kumar wrote:
>>> Right, but I thought the cooling-maps can help us specify different cooling
>>> states for different cooling devices for the same trip point. Maybe my
>>> understanding of that is incorrect.
>
> Any inputs on this? I am still wondering if this can be done.

Can you give an example? Or your understanding is incorrect or I missed
the point.

>> At the first glance, it sounds interesting but I'm afraid that raises
>> more corner-cases than it solves because we have to take into account
>> all the combinations: cpuidle=0 && cpufreq=1, cpuidle=1 && cpufreq=0,
>> cpuidle=1 && cpufreq=1 with dynamic code changes when the cpufreq driver
>> is loaded/unloaded.
>>
>> I'm not against this approach as well as merging all the cpu cooling
>> devices into a single one but that won't be trivial and will need
>> several iterations before reaching this level of features.
>>
>> IMO, we should keep the current approach (but handle the cpufreq
>> loading/unloading) and then iteratively merge all the cooling device
>> into a single one with policy change at runtime which will automatically
>> handle the cpufreq load/unload.
>
> Surely we can do one thing at a time if that's the way we choose to do it.

Easy to say :)

The current code is to introduce the feature without impacting the DT
bindings in order to keep focused on the thermal mitigation aspect.

There are still a lot of improvements to do after that. You are
basically asking me to implement the copy-on-write before the memory
management is complete.





--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-06 11:36:37

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 05/02/2018 14:54, Daniel Thompson wrote:
> On 23/01/18 15:34, Daniel Lezcano wrote:
>> +/**
>> + * cpuidle_cooling_get_max_state - Get the maximum state
>> + * @cdev  : the thermal cooling device
>> + * @state : a pointer to the state variable to be filled
>> + *
>> + * The function gives always 100 as the injection ratio is percentile
>> + * based for consistency accros different platforms.
>> + *
>> + * The function can not fail, it returns always zero.
>> + */
>> +static int cpuidle_cooling_get_max_state(struct
>> thermal_cooling_device *cdev,
>> +                     unsigned long *state)
>> +{
>> +    /*
>> +     * Depending on the configuration or the hardware, the running
>> +     * cycle and the idle cycle could be different. We want unify
>> +     * that to an 0..100 interval, so the set state interface will
>> +     * be the same whatever the platform is.
>> +     *
>> +     * The state 100% will make the cluster 100% ... idle. A 0%
>> +     * injection ratio means no idle injection at all and 50%
>> +     * means for 10ms of idle injection, we have 10ms of running
>> +     * time.
>> +     */
>> +    *state = 100;
>
> Doesn't this requires DTs to be changed?
>
> Basically the existing bindings for cpu cooling dictate that
> cooling-max-levels == num OPPs - 1 .
>
> Likewise I think cooling-max-levels *defines* the scale used by the
> cooling maps in the DT. Introducing an alternative scale means that the
> OPP limits in the cooling-map will be misinterpreted when we bind the
> cooling device.

The DT binding for the cpuidle cooling device just need to bind to the
thermal-zone, it does not care about the cooling-max-levels which is a
binding for the cpufreq cooling device. That makes both cooling devices
to be able to co-exists with the same DT.

However, as you mentioned it in the previous emails, new DT bindings
will need to be defined in order to extend the features. That is
something I would like to put apart in the discussion in order to be
focused in the thermal mitigation aspect of the cooling device.

The DT bindings discussions can be long and end up as a non-removable
definition, so that is something IMO which should be discussed in a
specific series.

> Note that we get away with this on Hikey because its mainline cooling
> map is, AFAICT, deeply sub-optimal and doesn't set any cooling limits.
> If its cooling map has been tuned (as the Exynos ones are) then I
> suspect there could be overheating problems when the cpuidle (or combo)
> CPU coolers are enabled.

Yes, that's a good point but from my POV, if we select here the cpuidle
cooling device then we should also change the cooling mapping which is
optimized for the cpufreq cooling device, a cooling device we
deliberately disabled by choosing the cpuidle cooling device.

The DT aspect is complex to solve, that's why I propose this first
simple approach to introduce the feature with current DT bindings.




--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-07 07:27:06

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 06-02-18, 11:48, Daniel Lezcano wrote:
> On 06/02/2018 05:28, Viresh Kumar wrote:

> > Surely we can do one thing at a time if that's the way we choose to do it.
>
> Easy to say :)
>
> The current code is to introduce the feature without impacting the DT
> bindings in order to keep focused on the thermal mitigation aspect.
>
> There are still a lot of improvements to do after that. You are
> basically asking me to implement the copy-on-write before the memory
> management is complete.

Perhaps I wasn't clear. What I was trying to say is that we can do "one thing at
a time" if we choose to create a "combo device" (the way you proposed). I am not
trying to force you to solve all the problems in one go :)

> Can you give an example? Or your understanding is incorrect or I missed
> the point.

So I tried to write it down and realized I was assuming that different
cooling-maps can be provided for different cooling strategies
(cpufreq/cpuidle) and obviously that's not the case as its per device.
Not sure if it would be correct to explore the possibility of doing
that.

--
viresh

2018-02-07 09:05:52

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 23-01-18, 16:34, Daniel Lezcano wrote:
> The next changes will add new way to cool down a CPU. In order to
> sanitize and make the overall cpu cooling code consistent and robust
> we must prevent the cpu cooling devices to co-exists with the same
> purpose at the same time in the kernel.
>
> Make the CPU cooling device a choice in the Kconfig, so only one CPU
> cooling strategy can be chosen.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/Kconfig | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 315ae29..925e73b 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
> allocating and limiting power to devices.
>
> config CPU_THERMAL
> - bool "generic cpu cooling support"
> - depends on CPU_FREQ
> + bool "Generic cpu cooling support"
> depends on THERMAL_OF
> help
> + Enable the CPU cooling features. If the system has no active
> + cooling device available, this option allows to use the CPU
> + as a cooling device.
> +
> +choice
> + prompt "CPU cooling strategies"
> + depends on CPU_THERMAL
> + default CPU_FREQ_THERMAL
> + help
> + Select the CPU cooling strategy.
> +
> +config CPU_FREQ_THERMAL

Perhaps you should start using this macro in code in this patch
itself.

> + bool "CPU frequency cooling strategy"
> + depends on CPU_FREQ
> + help
> This implements the generic cpu cooling mechanism through frequency
> reduction. An ACPI version of this already exists
> (drivers/acpi/processor_thermal.c).
> This will be useful for platforms using the generic thermal interface
> and not the ACPI interface.
>
> - If you want this support, you should say Y here.
> +endchoice
>
> config CLOCK_THERMAL
> bool "Generic clock cooling support"
> --
> 2.7.4

--
viresh

2018-02-07 09:14:26

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 23-01-18, 16:34, Daniel Lezcano wrote:
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c

> +/**
> + * cpuidle_cooling_ops - thermal cooling device ops
> + */
> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
> + .get_max_state = cpuidle_cooling_get_max_state,
> + .get_cur_state = cpuidle_cooling_get_cur_state,
> + .set_cur_state = cpuidle_cooling_set_cur_state,
> +};
> +
> +/**
> + * cpuidle_cooling_release - Kref based release helper
> + * @kref: a pointer to the kref structure
> + *
> + * This function is automatically called by the kref_put function when
> + * the idle cooling device refcount reaches zero. At this point, we
> + * have the guarantee the structure is no longer in use and we can
> + * safely release all the ressources.
> + */

Don't really need doc style comments for internal routines.

> +static void __init cpuidle_cooling_release(struct kref *kref)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(kref, struct cpuidle_cooling_device, kref);
> +
> + thermal_cooling_device_unregister(idle_cdev->cdev);
> + kfree(idle_cdev->waitq);
> + kfree(idle_cdev->tsk);
> + kfree(idle_cdev);

What about list-del here (cpuidle_cdev_list) ?

> +}
> +
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and register it to thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * For each first CPU of the cluster's cpumask, we allocate the idle
> + * cooling device, initialize the general fields and then we initialze
> + * the rest in a per cpu basis.
> + *
> + * Returns zero on success, < 0 otherwise.
> + */
> +int cpuidle_cooling_register(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev = NULL;
> + struct thermal_cooling_device *cdev;
> + struct task_struct *tsk;
> + struct device_node *np;
> + cpumask_t *cpumask;
> + char dev_name[THERMAL_NAME_LENGTH];
> + int weight;
> + int ret = -ENOMEM, cpu;
> + int index = 0;
> +
> + for_each_possible_cpu(cpu) {
> +

Perhaps this is coding choice, but just for the sake of consistency in
this driver should we remove such empty lines at the beginning of a
blocks ?

> + cpumask = topology_core_cpumask(cpu);
> + weight = cpumask_weight(cpumask);
> +
> + /*
> + * This condition makes the first cpu belonging to the
> + * cluster to create a cooling device and allocates
> + * the structure. Others CPUs belonging to the same
> + * cluster will just increment the refcount on the
> + * cooling device structure and initialize it.
> + */
> + if (cpu == cpumask_first(cpumask)) {
> +

Like here as well.

> + np = of_cpu_device_node_get(cpu);
> +
> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> + if (!idle_cdev)
> + goto out_fail;
> +
> + idle_cdev->tsk = kzalloc(sizeof(*idle_cdev->tsk) *
> + weight, GFP_KERNEL);
> + if (!idle_cdev->tsk)
> + goto out_fail;
> +
> + idle_cdev->waitq = kzalloc(sizeof(*idle_cdev->waitq) *
> + weight, GFP_KERNEL);
> + if (!idle_cdev->waitq)
> + goto out_fail;
> +
> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
> +
> + atomic_set(&idle_cdev->count, 0);
> +
> + kref_init(&idle_cdev->kref);
> +
> + /*
> + * Initialize the timer to wakeup all the idle
> + * injection tasks
> + */
> + hrtimer_init(&idle_cdev->timer,
> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> + /*
> + * The wakeup function callback which is in
> + * charge of waking up all CPUs belonging to
> + * the same cluster
> + */
> + idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
> +
> + /*
> + * The thermal cooling device name
> + */
> + snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
> + cdev = thermal_of_cooling_device_register(np, dev_name,
> + idle_cdev,
> + &cpuidle_cooling_ops);
> + if (IS_ERR(cdev)) {
> + ret = PTR_ERR(cdev);
> + goto out_fail;
> + }
> +
> + idle_cdev->cdev = cdev;
> +
> + idle_cdev->cpumask = cpumask;
> +
> + list_add(&idle_cdev->node, &cpuidle_cdev_list);
> +
> + pr_info("Created idle cooling device for cpus '%*pbl'\n",
> + cpumask_pr_args(cpumask));
> + }
> +
> + kref_get(&idle_cdev->kref);
> +
> + /*
> + * Each cooling device is per package. Each package
> + * has a set of cpus where the physical number is
> + * duplicate in the kernel namespace. We need a way to
> + * address the waitq[] and tsk[] arrays with index
> + * which are not Linux cpu numbered.
> + *
> + * One solution is to use the
> + * topology_core_id(cpu). Other solution is to use the
> + * modulo.
> + *
> + * eg. 2 x cluster - 4 cores.
> + *
> + * Physical numbering -> Linux numbering -> % nr_cpus
> + *
> + * Pkg0 - Cpu0 -> 0 -> 0
> + * Pkg0 - Cpu1 -> 1 -> 1
> + * Pkg0 - Cpu2 -> 2 -> 2
> + * Pkg0 - Cpu3 -> 3 -> 3
> + *
> + * Pkg1 - Cpu0 -> 4 -> 0
> + * Pkg1 - Cpu1 -> 5 -> 1
> + * Pkg1 - Cpu2 -> 6 -> 2
> + * Pkg1 - Cpu3 -> 7 -> 3
> + */
> + init_waitqueue_head(&idle_cdev->waitq[cpu % weight]);
> +
> + tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
> + idle_cdev, cpu, "kidle_inject/%u");
> + if (IS_ERR(tsk)) {
> + ret = PTR_ERR(tsk);
> + goto out_fail;
> + }
> +
> + idle_cdev->tsk[cpu % weight] = tsk;
> +
> + wake_up_process(tsk);
> + }
> +
> + return 0;
> +
> +out_fail:
> + list_for_each_entry(idle_cdev, &cpuidle_cdev_list, node) {
> +
> + for_each_cpu(cpu, idle_cdev->cpumask) {
> +
> + if (idle_cdev->tsk[cpu])
> + kthread_stop(idle_cdev->tsk[cpu]);
> +
> + kref_put(&idle_cdev->kref, cpuidle_cooling_release);
> + }
> + }
> +
> + pr_err("Failed to create idle cooling device (%d)\n", ret);
> +
> + return ret;
> +}

What about cpuidle_cooling_unregister() ?

> +#endif
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index d4292eb..2b5950b 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -45,6 +45,7 @@ struct thermal_cooling_device *
> cpufreq_power_cooling_register(struct cpufreq_policy *policy,
> u32 capacitance, get_static_t plat_static_func);
>
> +extern int cpuidle_cooling_register(void);
> /**
> * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
> * @np: a valid struct device_node to the cooling device device tree node.
> @@ -118,6 +119,11 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> {
> return;
> }
> +
> +static inline int cpuidle_cooling_register(void)

You need to use the new macros of cpufreq and cpuidle here as well,
else you will see compile time errors with some configurations.

> +{
> + return 0;
> +}
> #endif /* CONFIG_CPU_THERMAL */
>
> #endif /* __CPU_COOLING_H__ */
> --
> 2.7.4

--
viresh

2018-02-07 10:06:23

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 07/02/2018 08:26, Viresh Kumar wrote:
> On 06-02-18, 11:48, Daniel Lezcano wrote:
>> On 06/02/2018 05:28, Viresh Kumar wrote:
>
>>> Surely we can do one thing at a time if that's the way we choose to do it.
>>
>> Easy to say :)
>>
>> The current code is to introduce the feature without impacting the DT
>> bindings in order to keep focused on the thermal mitigation aspect.
>>
>> There are still a lot of improvements to do after that. You are
>> basically asking me to implement the copy-on-write before the memory
>> management is complete.
>
> Perhaps I wasn't clear. What I was trying to say is that we can do "one thing at
> a time" if we choose to create a "combo device" (the way you proposed). I am not
> trying to force you to solve all the problems in one go :)

Oh, ok :)

>> Can you give an example? Or your understanding is incorrect or I missed
>> the point.
>
> So I tried to write it down and realized I was assuming that different
> cooling-maps can be provided for different cooling strategies
> (cpufreq/cpuidle) and obviously that's not the case as its per device.
> Not sure if it would be correct to explore the possibility of doing
> that.
>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-07 10:17:41

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 07/02/2018 10:04, Viresh Kumar wrote:
> On 23-01-18, 16:34, Daniel Lezcano wrote:
>> The next changes will add new way to cool down a CPU. In order to
>> sanitize and make the overall cpu cooling code consistent and robust
>> we must prevent the cpu cooling devices to co-exists with the same
>> purpose at the same time in the kernel.
>>
>> Make the CPU cooling device a choice in the Kconfig, so only one CPU
>> cooling strategy can be chosen.
>>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>> drivers/thermal/Kconfig | 20 +++++++++++++++++---
>> 1 file changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 315ae29..925e73b 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
>> allocating and limiting power to devices.
>>
>> config CPU_THERMAL
>> - bool "generic cpu cooling support"
>> - depends on CPU_FREQ
>> + bool "Generic cpu cooling support"
>> depends on THERMAL_OF
>> help
>> + Enable the CPU cooling features. If the system has no active
>> + cooling device available, this option allows to use the CPU
>> + as a cooling device.
>> +
>> +choice
>> + prompt "CPU cooling strategies"
>> + depends on CPU_THERMAL
>> + default CPU_FREQ_THERMAL
>> + help
>> + Select the CPU cooling strategy.
>> +
>> +config CPU_FREQ_THERMAL
>
> Perhaps you should start using this macro in code in this patch
> itself.

Do you mean s/CPU_THERMAL/CPU_FREQ_THERMAL/ in include/linux/cpu_cooling.h ?


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-07 10:21:41

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 4/8] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 07-02-18, 11:15, Daniel Lezcano wrote:
> Do you mean s/CPU_THERMAL/CPU_FREQ_THERMAL/ in include/linux/cpu_cooling.h ?

Yes, but also in the .c file (which you have done in 5/8 anyway).

--
viresh

2018-02-07 10:35:37

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver


Hi Viresh,

thanks for reviewing.

On 07/02/2018 10:12, Viresh Kumar wrote:
> On 23-01-18, 16:34, Daniel Lezcano wrote:
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>
>> +/**
>> + * cpuidle_cooling_ops - thermal cooling device ops
>> + */
>> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
>> + .get_max_state = cpuidle_cooling_get_max_state,
>> + .get_cur_state = cpuidle_cooling_get_cur_state,
>> + .set_cur_state = cpuidle_cooling_set_cur_state,
>> +};
>> +
>> +/**
>> + * cpuidle_cooling_release - Kref based release helper
>> + * @kref: a pointer to the kref structure
>> + *
>> + * This function is automatically called by the kref_put function when
>> + * the idle cooling device refcount reaches zero. At this point, we
>> + * have the guarantee the structure is no longer in use and we can
>> + * safely release all the ressources.
>> + */
>
> Don't really need doc style comments for internal routines.

From Documentation/doc-guide/kernel-doc.rst:

"We also recommend providing kernel-doc formatted documentation for
private (file "static") routines, for consistency of kernel source code
layout. But this is lower priority and at the discretion of the
MAINTAINER of that kernel source file."

Vote is open :)

>> +static void __init cpuidle_cooling_release(struct kref *kref)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev =
>> + container_of(kref, struct cpuidle_cooling_device, kref);
>> +
>> + thermal_cooling_device_unregister(idle_cdev->cdev);
>> + kfree(idle_cdev->waitq);
>> + kfree(idle_cdev->tsk);
>> + kfree(idle_cdev);
>
> What about list-del here (cpuidle_cdev_list) ?

Yes, thanks for pointing this. I have to convert the calling loop with
the 'safe' list variant.

>> +}
>> +
>> +/**
>> + * cpuidle_cooling_register - Idle cooling device initialization function
>> + *
>> + * This function is in charge of creating a cooling device per cluster
>> + * and register it to thermal framework. For this we rely on the
>> + * topology as there is nothing yet describing better the idle state
>> + * power domains.
>> + *
>> + * For each first CPU of the cluster's cpumask, we allocate the idle
>> + * cooling device, initialize the general fields and then we initialze
>> + * the rest in a per cpu basis.
>> + *
>> + * Returns zero on success, < 0 otherwise.
>> + */
>> +int cpuidle_cooling_register(void)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev = NULL;
>> + struct thermal_cooling_device *cdev;
>> + struct task_struct *tsk;
>> + struct device_node *np;
>> + cpumask_t *cpumask;
>> + char dev_name[THERMAL_NAME_LENGTH];
>> + int weight;
>> + int ret = -ENOMEM, cpu;
>> + int index = 0;
>> +
>> + for_each_possible_cpu(cpu) {
>> +
>
> Perhaps this is coding choice, but just for the sake of consistency in
> this driver should we remove such empty lines at the beginning of a
> blocks ?

Yes, it is coding choice. I'm in favor of separated blocks. I can remove
the lines if that hurts.

>> + cpumask = topology_core_cpumask(cpu);
>> + weight = cpumask_weight(cpumask);
>> +
>> + /*
>> + * This condition makes the first cpu belonging to the
>> + * cluster to create a cooling device and allocates
>> + * the structure. Others CPUs belonging to the same
>> + * cluster will just increment the refcount on the
>> + * cooling device structure and initialize it.
>> + */
>> + if (cpu == cpumask_first(cpumask)) {
>> +
>
> Like here as well.

Ok.

[ ... ]

>> + pr_err("Failed to create idle cooling device (%d)\n", ret);
>> +
>> + return ret;
>> +}
>
> What about cpuidle_cooling_unregister() ?

The unregister function is not needed because cpuidle can't be unloaded.
The cpuidle cooling device is registered after the cpuidle successfully
initialized itself, there is no error path.

>> +#endif
>> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
>> index d4292eb..2b5950b 100644
>> --- a/include/linux/cpu_cooling.h
>> +++ b/include/linux/cpu_cooling.h
>> @@ -45,6 +45,7 @@ struct thermal_cooling_device *
>> cpufreq_power_cooling_register(struct cpufreq_policy *policy,
>> u32 capacitance, get_static_t plat_static_func);
>>
>> +extern int cpuidle_cooling_register(void);
>> /**
>> * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
>> * @np: a valid struct device_node to the cooling device device tree node.
>> @@ -118,6 +119,11 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>> {
>> return;
>> }
>> +
>> +static inline int cpuidle_cooling_register(void)
>
> You need to use the new macros of cpufreq and cpuidle here as well,
> else you will see compile time errors with some configurations.

Ok, I thought I tried the different combinations but I will double check.

Thanks

-- Daniel

>> +{
>> + return 0;
>> +}
>> #endif /* CONFIG_CPU_THERMAL */
>>
>> #endif /* __CPU_COOLING_H__ */
>> --
>> 2.7.4
>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-07 10:43:04

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 07-02-18, 11:34, Daniel Lezcano wrote:
> Ok, I thought I tried the different combinations but I will double check.

You perhaps tried after the combo patch. Try after this one :)

--
viresh

2018-02-08 14:07:09

by Philippe Ombredanne

[permalink] [raw]
Subject: Re: [PATCH 2/8] thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)

Daniel,

On Tue, Jan 23, 2018 at 4:34 PM, Daniel Lezcano
<[email protected]> wrote:
> For license auditing purpose, let's add the SPDX tag.
>
> Cc: Philippe Ombredanne <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/cpu_cooling.c | 16 +---------------
> 1 file changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 988fc55..e62be75 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
> /*
> * linux/drivers/thermal/cpu_cooling.c
> *
> @@ -8,21 +9,6 @@
> * Authors: Amit Daniel <[email protected]>
> * Viresh Kumar <[email protected]>
> *
> - * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License as published by
> - * the Free Software Foundation; version 2 of the License.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - * General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License along
> - * with this program; if not, write to the Free Software Foundation, Inc.,
> - * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
> - *
> - * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> */
> #include <linux/module.h>
> #include <linux/thermal.h>


Sorry for the late reply!
Thank you.
Acked-by: Philippe Ombredanne <[email protected]>

--
Philippe

2018-02-08 14:08:14

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 2/8] thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)

On 08/02/2018 15:05, Philippe Ombredanne wrote:
> Daniel,

[ ... ]

> Sorry for the late reply!
> Thank you.
> Acked-by: Philippe Ombredanne <[email protected]>

No worries.

Thanks for the ack.

-- Daniel


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-09 09:42:46

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 07-02-18, 11:34, Daniel Lezcano wrote:
> On 07/02/2018 10:12, Viresh Kumar wrote:
> > What about cpuidle_cooling_unregister() ?
>
> The unregister function is not needed because cpuidle can't be unloaded.
> The cpuidle cooling device is registered after the cpuidle successfully
> initialized itself, there is no error path.

Okay, then there are two more things here.

First, you don't need a kref in your patch and simple counter should
be used instead, as kref is obviously more heavy to be used for the
single error path here.

Secondly, what about CPU hotplug ? For example, the cpu-freq cooling
device gets removed currently if all CPUs of a cluster are
hotplugged-out. But with your code, even if the CPUs are gone, their
cpu-idle cooling device will stay.

--
viresh

2018-02-09 09:45:33

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 02-02-18, 15:30, Daniel Lezcano wrote:
> Agree the cpufreq driver can be unloaded and this part needs some
> adjustments.

Also note that the cpufreq cooling device has two separate set of
callbacks, one for the power-specific callbacks. You haven't taken
care of them in the combo thing.

--
viresh

2018-02-16 18:53:30

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 8/8] thermal/drivers/cpu_cooling: Add the combo cpu cooling device

On 09/02/2018 10:44, Viresh Kumar wrote:
> On 02-02-18, 15:30, Daniel Lezcano wrote:
>> Agree the cpufreq driver can be unloaded and this part needs some
>> adjustments.
>
> Also note that the cpufreq cooling device has two separate set of
> callbacks, one for the power-specific callbacks. You haven't taken
> care of them in the combo thing.

The plan is add them after.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-02-16 19:21:39

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver


Hi Viresh,

sorry for the late reply.


On 09/02/2018 10:41, Viresh Kumar wrote:
> On 07-02-18, 11:34, Daniel Lezcano wrote:
>> On 07/02/2018 10:12, Viresh Kumar wrote:
>>> What about cpuidle_cooling_unregister() ?
>>
>> The unregister function is not needed because cpuidle can't be unloaded.
>> The cpuidle cooling device is registered after the cpuidle successfully
>> initialized itself, there is no error path.
>
> Okay, then there are two more things here.
>
> First, you don't need a kref in your patch and simple counter should
> be used instead, as kref is obviously more heavy to be used for the
> single error path here.

I prefer to keep the kref for its API.

And I disagree about the heavy aspect :)

struct kref {
refcount_t refcount;
};


> Secondly, what about CPU hotplug ? For example, the cpu-freq cooling
> device gets removed currently if all CPUs of a cluster are
> hotplugged-out. But with your code, even if the CPUs are gone, their
> cpu-idle cooling device will stay.

Yes and it will continue to compute the state, so if new CPUs are
inserted the cooling device automatically uses the cooling state.
I don't see a problem with that.

--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog