2018-02-21 15:33:18

by Daniel Lezcano

Subject: [PATCH V2 0/7] CPU cooling device new strategies

Changelog:
V2:
- Dropped the cpu combo cooling device
- Added the acked-by tags
- Replaced task array by a percpu structure
- Fixed the copyright dates
- Fixed the extra lines
- Fixed the compilation macros to be reused
- Fixed the list node removal
- Massaged a couple of function names


The following series provides a new way to cool down a SoC by reducing
the dissipated power on the CPUs. Based on the initial work from Kevin
Wangtao, the series implements a CPU cooling device based on idle
injection, relying on the cpuidle framework.

The patchset is designed so that the current DT binding for the
cpufreq cooling device stays compatible with the new cooling devices.

Different CPU cooling devices cannot co-exist on the system: CPU
cooling is either enabled or not, and a single cooling strategy is
selected (cpufreq or cpuidle). It is not possible to have all of them
available at the same time. The longer-term goal is to enable them all
and be able to switch from one to another at runtime, but that needs a
rework of the thermal framework which is orthogonal to the feature we
are providing.
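
As an illustration (not taken from the patches themselves, but using
the Kconfig symbols this series introduces), selecting the idle
injection strategy would look like this in the kernel configuration:

  CONFIG_CPU_THERMAL=y
  # CONFIG_CPU_FREQ_THERMAL is not set
  CONFIG_CPU_IDLE_THERMAL=y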

This series is divided into two parts.

The first part provides trivial changes for the copyright and removes
an unused field in the cpufreq cooling device structure.

The second part provides the idle injection cooling device, allowing a SoC
without a cpufreq driver to use this cooling device as an alternative.

The preliminary benchmarks show the following changes:

On the hikey6220, Dhrystone shows a 40% throughput increase for a 16%
latency increase, while sysbench shows a 5% latency increase.

The first version also provided the cpuidle + cpufreq combo cooling
device, but according to the latest comments there is a mismatch
between the combo cooling device, which was designed assuming the
cpufreq cooling device is always present, and the way the cpufreq
cooling device behaves across suspend/resume, CPU hotplug and module
loading/unloading. This interaction is being investigated and the
combo cooling device will be posted separately after this series gets
merged.

Daniel Lezcano (7):
thermal/drivers/cpu_cooling: Fixup the header and copyright
thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
thermal/drivers/cpu_cooling: Remove pointless field
thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
thermal/drivers/cpu_cooling: Add idle cooling device documentation
thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
cpuidle/drivers/cpuidle-arm: Register the cooling device

Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
drivers/cpuidle/cpuidle-arm.c | 5 +
drivers/thermal/Kconfig | 30 +-
drivers/thermal/cpu_cooling.c | 480 +++++++++++++++++++++++++++--
include/linux/cpu_cooling.h | 15 +-
5 files changed, 668 insertions(+), 27 deletions(-)
create mode 100644 Documentation/thermal/cpu-idle-cooling.txt

--
2.7.4



2018-02-21 15:31:16

by Daniel Lezcano

Subject: [PATCH V2 7/7] cpuidle/drivers/cpuidle-arm: Register the cooling device

Register the ARM generic cpuidle driver as a cooling device.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/cpuidle/cpuidle-arm.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
index ddee1b6..c100915 100644
--- a/drivers/cpuidle/cpuidle-arm.c
+++ b/drivers/cpuidle/cpuidle-arm.c
@@ -11,6 +11,7 @@

#define pr_fmt(fmt) "CPUidle arm: " fmt

+#include <linux/cpu_cooling.h>
#include <linux/cpuidle.h>
#include <linux/cpumask.h>
#include <linux/cpu_pm.h>
@@ -172,6 +173,10 @@ static int __init arm_idle_init(void)
goto out_fail;
}

+ ret = cpuidle_cooling_register();
+ if (ret)
+ pr_warn("Failed to register the cpuidle cooling device\n");
+
return 0;

out_fail:
--
2.7.4


2018-02-21 15:31:16

by Daniel Lezcano

Subject: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation

Provide some documentation for the idle injection cooling effect in
order to help people understand the rationale behind the approach used
for the idle injection CPU cooling device.

Signed-off-by: Daniel Lezcano <[email protected]>
---
Documentation/thermal/cpu-idle-cooling.txt | 165 +++++++++++++++++++++++++++++
1 file changed, 165 insertions(+)
create mode 100644 Documentation/thermal/cpu-idle-cooling.txt

diff --git a/Documentation/thermal/cpu-idle-cooling.txt b/Documentation/thermal/cpu-idle-cooling.txt
new file mode 100644
index 0000000..29fc651
--- /dev/null
+++ b/Documentation/thermal/cpu-idle-cooling.txt
@@ -0,0 +1,165 @@
+
+Situation:
+----------
+
+Under certain circumstances, the SoC exceeds its allocated power
+budget or its maximum temperature limit. The former must be mitigated
+to stabilize the SoC temperature around the temperature control point
+using the defined cooling devices; the latter is a catastrophic
+situation where a radical decision must be taken to bring the
+temperature back under the critical threshold, which can impact
+performance.
+
+Another situation is when the silicon reaches a certain temperature
+which continues to increase even if the dynamic leakage is reduced to
+its minimum by clock gating the component. The runaway phenomenon will
+continue with the static leakage, and only powering down the
+component, thus dropping both the dynamic and the static leakage, will
+allow the component to cool down. This situation is critical.
+
+Last but not least, the system can ask for a specific power budget
+but, because of the OPP density, we can only choose an OPP with a
+power budget lower than the requested one and underuse the CPU, thus
+losing performance. In other words, one OPP underuses the CPU with a
+power lower than the requested power budget while the next OPP exceeds
+the power budget; an intermediate OPP could have been used if it were
+present.
+
+Solutions:
+----------
+
+If we can remove the static and the dynamic leakage for a specific
+duration in a controlled period, the SoC temperature will decrease. By
+acting on the idle state duration or the idle cycle injection period,
+we can mitigate the temperature by modulating the power budget.
+
+The Operating Performance Point (OPP) density has a great influence
+on the control precision of cpufreq. However, different vendors ship a
+plethora of OPP densities, and some have a large power gap between
+OPPs, which results in a loss of performance during thermal control
+and wasted power in other scenarios.
+
+At a specific OPP, if we inject idle cycles on all CPUs belonging to
+the same cluster with a duration greater than the cluster idle state
+target residency, we drop the static and the dynamic leakage for this
+period (modulo the energy needed to enter this state). So the
+sustainable power with idle cycles has a linear relation with the
+OPP's sustainable power and can be computed with a coefficient similar
+to:
+
+ Power(IdleCycle) = Coef x Power(OPP)
+
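+For example, borrowing the hi6220 numbers used later in this series
+(for illustration only): with a cluster consuming 5280mW at the chosen
+OPP and a 3326mW sustainable power target, the coefficient would be
+about 3326 / 5280 ~= 0.63.
+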
+Idle Injection:
+---------------
+
+The base concept of the idle injection is to force the CPU to go to
+an idle state for a specified time each control cycle; this provides
+another way to control CPU power and heat in addition to
+cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
+this cluster can get into the deepest idle state and achieve minimum
+power consumption, but that will also increase the system response
+latency if we inject for less than the cpuidle latency.
+
+ ^
+ |
+ |
+ |------- ------- -------
+ |_______|_____|_______|_____|_______|___________
+
+ <----->
+ idle <---->
+ running
+
+With the fixed idle injection duration, we can choose a value which
+gives an acceptable performance drop or latency. When we reach a
+specific temperature, we begin to mitigate by varying the idle
+injection period.
+
+The mitigation begins with a maximum period value which decreases
+when more cooling effect is requested. When the period duration is
+equal to the idle duration, we are in a situation where the platform
+cannot dissipate the heat fast enough and the mitigation fails. In
+this case the situation is considered critical and there is nothing
+more to do. The idle injection duration must then be changed by
+configuration until we reach the desired cooling effect; otherwise an
+additional cooling device must be used or, ultimately, the SoC
+performance must be decreased by dropping the highest OPP point of the
+SoC.
+
+The idle injection duration value must comply with the constraints:
+
+- It is less than or equal to the latency we tolerate when the
+  mitigation begins. It is platform dependent and will depend on the
+  user experience and the reactivity vs performance trade-off we
+  want. This value should be specified.
+
+- It is greater than the target residency of the idle state we want to
+  enter for thermal mitigation, otherwise we end up consuming more
+  energy.
+
+Minimum period
+--------------
+
+The idle injection duration being fixed, it is obvious the minimum
+period can't be less than that, otherwise we would be scheduling the
+idle injection task right before the idle injection duration is
+complete, waking up the CPU only to put it back to sleep again.
+
+Maximum period
+--------------
+
+The maximum period is the initial period when the mitigation
+begins. Theoretically, when we reach the thermal trip point, we have
+to sustain a specified power for a specific temperature but at this
+time we consume:
+
+ Power = Capacitance x Voltage^2 x Frequency x Utilisation
+
+... which is more than the sustainable power (or there is something
+wrong in the system setup). The ‘Capacitance’ and the ‘Utilisation’
+are fixed values, while the ‘Voltage’ and the ‘Frequency’ are fixed
+artificially because we don’t want to change the OPP. We can group the
+‘Capacitance’ and the ‘Utilisation’ into a single term which is the
+‘Dynamic Power Coefficient’ (Cdyn). Simplifying the above, we have:
+
+ Pdyn = Cdyn x Voltage^2 x Frequency
+
+The IPA will somehow ask us to reduce our power in order to target
+the sustainable power defined in the device tree. So with the idle
+injection mechanism, we want an average power (Ptarget) resulting from
+an amount of time running at full power at a specific OPP and being
+idle for another amount of time. That can be put into an equation:
+
+ P(opp)target = ((trunning x P(opp)running) + (tidle x P(opp)idle)) /
+                (trunning + tidle)
+ ...
+
+ tidle = trunning x ((P(opp)running / P(opp)target) - 1)
+
+At this point, if we know the running period for the CPU, that gives
+us the idle injection duration we need. Alternatively, if we have the
+idle injection duration, we can compute the running duration with:
+
+ trunning = tidle / ((P(opp)running / P(opp)target) - 1)
+
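+Using the same example numbers as above (for illustration only), with
+P(opp)running = 5280mW, P(opp)target = 3326mW and tidle = 10ms:
+
+ trunning = 10 / ((5280 / 3326) - 1) = 10 / 0.5874 ~= 17ms
+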
+Practically, if the running power is less than the targeted power, we
+end up with a negative time value, so obviously the equation usage is
+bound to a power reduction, hence a higher OPP is needed to have the
+running power greater than the targeted power.
+
+However, in this demonstration we ignore three aspects:
+
+ * The static leakage is not defined here; we can introduce it in the
+   equation but we assume it to be zero most of the time, as it is
+   difficult to get the values from the SoC vendors.
+
+ * The idle state wake up latency (or entry + exit latency) is not
+   taken into account; it must be added to the equation in order to
+   rigorously compute the idle injection duration.
+
+ * The injected idle duration must be greater than the idle state
+   target residency, otherwise we end up consuming more energy and
+   potentially inverting the mitigation effect.
+
+So the final equation is:
+
+ trunning = (tidle - twakeup) /
+             (((P(opp)dyn + P(opp)static) - P(opp)target) / P(opp)target)
--
2.7.4


2018-02-21 15:32:36

by Daniel Lezcano

Subject: [PATCH V2 2/7] thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)

For license auditing purposes, let's add the SPDX tag.

Signed-off-by: Daniel Lezcano <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Acked-by: Philippe Ombredanne <[email protected]>
---
drivers/thermal/cpu_cooling.c | 16 +---------------
1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 42110ee..d7528bc 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* linux/drivers/thermal/cpu_cooling.c
*
@@ -8,21 +9,6 @@
* Authors: Amit Daniel <[email protected]>
* Viresh Kumar <[email protected]>
*
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; version 2 of the License.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write to the Free Software Foundation, Inc.,
- * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
#include <linux/module.h>
#include <linux/thermal.h>
--
2.7.4


2018-02-21 18:56:55

by Daniel Lezcano

Subject: [PATCH V2 1/7] thermal/drivers/cpu_cooling: Fixup the header and copyright

The copyright format does not conform to the format requested by
Linaro: https://wiki.linaro.org/Copyright

Fix it.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/cpu_cooling.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index dc63aba..42110ee 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -2,9 +2,11 @@
* linux/drivers/thermal/cpu_cooling.c
*
* Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
- * Copyright (C) 2012 Amit Daniel <[email protected]>
*
- * Copyright (C) 2014 Viresh Kumar <[email protected]>
+ * Copyright (C) 2012-2018 Linaro Limited.
+ *
+ * Authors: Amit Daniel <[email protected]>
+ * Viresh Kumar <[email protected]>
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* This program is free software; you can redistribute it and/or modify
--
2.7.4


2018-02-21 18:57:05

by Daniel Lezcano

Subject: [PATCH V2 4/7] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

The next changes will add a new way to cool down a CPU. In order to
sanitize and make the overall CPU cooling code consistent and robust,
we must prevent CPU cooling devices serving the same purpose from
co-existing in the kernel.

Make the CPU cooling device a choice in the Kconfig, so only one CPU
cooling strategy can be chosen.

Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/Kconfig | 20 +++++++++++++++++---
drivers/thermal/cpu_cooling.c | 2 ++
include/linux/cpu_cooling.h | 6 +++---
3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index b6adc54..5aaae1b 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
allocating and limiting power to devices.

config CPU_THERMAL
- bool "generic cpu cooling support"
- depends on CPU_FREQ
+ bool "Generic cpu cooling support"
depends on THERMAL_OF
help
+ Enable the CPU cooling features. If the system has no active
+ cooling device available, this option allows to use the CPU
+ as a cooling device.
+
+choice
+ prompt "CPU cooling strategies"
+ depends on CPU_THERMAL
+ default CPU_FREQ_THERMAL
+ help
+ Select the CPU cooling strategy.
+
+config CPU_FREQ_THERMAL
+ bool "CPU frequency cooling strategy"
+ depends on CPU_FREQ
+ help
This implements the generic cpu cooling mechanism through frequency
reduction. An ACPI version of this already exists
(drivers/acpi/processor_thermal.c).
This will be useful for platforms using the generic thermal interface
and not the ACPI interface.

- If you want this support, you should say Y here.
+endchoice

config CLOCK_THERMAL
bool "Generic clock cooling support"
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 7bdc19f..5c219dc 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -22,6 +22,7 @@

#include <trace/events/thermal.h>

+#ifdef CONFIG_CPU_FREQ_THERMAL
/*
* Cooling state <-> CPUFreq frequency
*
@@ -926,3 +927,4 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
kfree(cpufreq_cdev);
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
+#endif /* CONFIG_CPU_FREQ_THERMAL */
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index d4292eb..c0accc7 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -33,7 +33,7 @@ struct cpufreq_policy;
typedef int (*get_static_t)(cpumask_t *cpumask, int interval,
unsigned long voltage, u32 *power);

-#ifdef CONFIG_CPU_THERMAL
+#ifdef CONFIG_CPU_FREQ_THERMAL
/**
* cpufreq_cooling_register - function to create cpufreq cooling device.
* @policy: cpufreq policy.
@@ -84,7 +84,7 @@ of_cpufreq_power_cooling_register(struct device_node *np,
*/
void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev);

-#else /* !CONFIG_CPU_THERMAL */
+#else /* !CONFIG_CPU_FREQ_THERMAL */
static inline struct thermal_cooling_device *
cpufreq_cooling_register(struct cpufreq_policy *policy)
{
@@ -118,6 +118,6 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
{
return;
}
-#endif /* CONFIG_CPU_THERMAL */
+#endif /* CONFIG_CPU_FREQ_THERMAL */

#endif /* __CPU_COOLING_H__ */
--
2.7.4


2018-02-21 18:57:08

by Daniel Lezcano

Subject: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

The cpu idle cooling driver performs synchronized idle injection across all
cpus belonging to the same cluster and offers a new method to cool down a SoC.

Each cluster has its own idle cooling device, each core has its own idle
injection thread, and each idle injection thread uses play_idle() to enter
idle. In order to reach the deepest idle state, each cooling device keeps
its idle injection threads synchronized.

It has some similarity with the Intel powerclamp driver but it is actually
designed to work on the ARM architecture via the DT, with a mathematical
justification based on the power model that comes with the documentation.

The idle injection cycle is fixed while the running cycle is variable. That
gives control over the device reactivity for the user experience. At the
mitigation point the idle threads are unparked, they play idle for the
specified amount of time and they schedule themselves out. The last thread
sets the next idle injection deadline and, when the timer expires, it wakes
up all the threads which in turn play idle again. Meanwhile the running
cycle is changed by set_cur_state(). When the mitigation ends, the threads
are parked. The algorithm is self-adaptive, so there is no need to handle
hotplugging.

If we take the balanced point as an example, we can use the DT for the hi6220.

The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
running at full blast at the maximum OPP consume 5280mW. The first value is
given in the DT, the second is calculated from the OPP with the formula:

Pdyn = Cdyn x Voltage^2 x Frequency

As the SoC vendors don't want to share the static leakage values, we assume
it is zero, so Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.

In order to reduce the power to 3326mW, we have to apply a ratio to the
running time.

ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874

We know the idle cycle, which is fixed; let's assume 10ms. However, from this
duration we have to subtract the wake up latency of the cluster idle state.
In our case, it is 1.5ms. So for a 10ms idle injection, we are really idle
for 8.5ms.

As we know the idle duration and the ratio, we can compute the running cycle.

running_cycle = 8.5 / 0.5874 = 14.47ms

So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
SoC to the balanced trip point of 75°C.

The driver has been tested on the hi6220 and it appears the temperature
stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and a
running cycle of 14ms, as expected from the theory above.
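
The arithmetic above can be reproduced with the following standalone
user space snippet (an illustration only, not part of the driver; the
values are the hi6220 numbers quoted above):

	#include <stdio.h>

	int main(void)
	{
		double p_run = 5280.0;    /* mW, eight cores at the maximum OPP */
		double p_target = 3326.0; /* mW, sustainable power from the DT */
		double t_idle = 10.0;     /* ms, fixed idle injection cycle */
		double t_wakeup = 1.5;    /* ms, cluster idle state exit latency */

		/* ratio = (Prun - Ptarget) / Ptarget */
		double ratio = (p_run - p_target) / p_target;

		/* effective idle time divided by the ratio gives the running cycle */
		double t_running = (t_idle - t_wakeup) / ratio;

		printf("ratio=%.4f running_cycle=%.2fms\n", ratio, t_running);

		return 0;
	}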

Signed-off-by: Kevin Wangtao <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/Kconfig | 10 +
drivers/thermal/cpu_cooling.c | 451 ++++++++++++++++++++++++++++++++++++++++++
include/linux/cpu_cooling.h | 9 +
3 files changed, 470 insertions(+)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 5aaae1b..6c34117 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
This will be useful for platforms using the generic thermal interface
and not the ACPI interface.

+config CPU_IDLE_THERMAL
+ bool "CPU idle cooling strategy"
+ depends on CPU_IDLE
+ help
+ This implements the generic CPU cooling mechanism through
+ idle injection. This will throttle the CPU by injecting
+ fixed idle cycle. All CPUs belonging to the same cluster
+ will enter idle synchronously to reach the deepest idle
+ state.
+
endchoice

config CLOCK_THERMAL
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 5c219dc..9340216 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -10,18 +10,32 @@
* Viresh Kumar <[email protected]>
*
*/
+#undef DEBUG
+#define pr_fmt(fmt) "CPU cooling: " fmt
+
#include <linux/module.h>
#include <linux/thermal.h>
#include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
#include <linux/err.h>
+#include <linux/freezer.h>
#include <linux/idr.h>
+#include <linux/kthread.h>
#include <linux/pm_opp.h>
#include <linux/slab.h>
+#include <linux/sched/prio.h>
+#include <linux/sched/rt.h>
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
+#include <linux/wait.h>
+
+#include <linux/platform_device.h>
+#include <linux/of_platform.h>

#include <trace/events/thermal.h>

+#include <uapi/linux/sched/types.h>
+
#ifdef CONFIG_CPU_FREQ_THERMAL
/*
* Cooling state <-> CPUFreq frequency
@@ -928,3 +942,440 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
#endif /* CONFIG_CPU_FREQ_THERMAL */
+
+#ifdef CONFIG_CPU_IDLE_THERMAL
+/*
+ * The idle duration injection. As we don't have yet a way to specify
+ * from the DT configuration, let's default to a tick duration.
+ */
+#define DEFAULT_IDLE_TIME_US TICK_USEC
+
+/**
+ * struct cpuidle_cooling_device - data for the idle cooling device
+ * @cdev: a pointer to a struct thermal_cooling_device
+ * @cpumask: a cpumask containing the CPU managed by the cooling device
+ * @timer: a hrtimer giving the tempo for the idle injection cycles
+ * @kref: a kernel refcount on this structure
+ * @count: an atomic to keep track of the last task exiting the idle cycle
+ * @idle_cycle: an integer defining the duration of the idle injection
+ * @state: a normalized integer giving the state of the cooling device
+ */
+struct cpuidle_cooling_device {
+ struct thermal_cooling_device *cdev;
+ struct cpumask *cpumask;
+ struct list_head node;
+ struct hrtimer timer;
+ struct kref kref;
+ atomic_t count;
+ unsigned int idle_cycle;
+ unsigned int state;
+};
+
+/**
+ * @tsk: a pointer to the idle injection task
+ * @waitq: the wait queue for the idle injection tasks
+ */
+struct cpuidle_cooling_tsk {
+ struct task_struct *tsk;
+ wait_queue_head_t waitq;
+};
+
+DEFINE_PER_CPU(struct cpuidle_cooling_tsk, cpuidle_cooling_tsk);
+
+static LIST_HEAD(cpuidle_cdev_list);
+
+/**
+ * cpuidle_cooling_wakeup - Wake up all idle injection threads
+ * @idle_cdev: the idle cooling device
+ *
+ * Every idle injection task belonging to the idle cooling device and
+ * running on an online cpu will be woken up by this call.
+ */
+static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
+{
+ int cpu;
+ struct cpuidle_cooling_tsk *cct;
+
+ for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
+ cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
+ wake_up_process(cct->tsk);
+ }
+}
+
+/**
+ * cpuidle_cooling_wakeup_fn - Running cycle timer callback
+ * @timer: a hrtimer structure
+ *
+ * When the mitigation is acting, the CPU is allowed to run an amount
+ * of time, then the idle injection happens for the specified delay
+ * and the idle task injection schedules itself until the timer event
+ * wakes the idle injection tasks again for a new idle injection
+ * cycle. The time between the end of the idle injection and the timer
+ * expiration is the allocated running time for the CPU.
+ *
+ * Returns always HRTIMER_NORESTART
+ */
+static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
+{
+ struct cpuidle_cooling_device *idle_cdev =
+ container_of(timer, struct cpuidle_cooling_device, timer);
+
+ cpuidle_cooling_wakeup(idle_cdev);
+
+ return HRTIMER_NORESTART;
+}
+
+/**
+ * cpuidle_cooling_runtime - Running time computation
+ * @idle_cdev: the idle cooling device
+ *
+ * The running duration is computed from the idle injection duration
+ * which is fixed. If we reach 100% of idle injection ratio, that
+ * means the running duration is zero. If we have a 50% ratio
+ * injection, that means we have equal duration for idle and for
+ * running duration.
+ *
+ * The formula is deduced as the following:
+ *
+ * running = idle x ((100 / ratio) - 1)
+ *
+ * For precision purpose for integer math, we use the following:
+ *
+ * running = (idle x 100) / ratio - idle
+ *
+ * For example, if we have an injected duration of 50%, then we end up
+ * with 10ms of idle injection and 10ms of running duration.
+ *
+ * Returns the running duration as a s64 value, in nanoseconds
+ */
+static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
+{
+ s64 next_wakeup;
+ int state = idle_cdev->state;
+
+ /*
+ * The function must never be called when there is no
+ * mitigation because:
+ * - that does not make sense
+ * - we end up with a division by zero
+ */
+ BUG_ON(!state);
+
+ next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
+ idle_cdev->idle_cycle;
+
+ return next_wakeup * NSEC_PER_USEC;
+}
+
+/**
+ * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
+ * @arg: a void pointer containing the idle cooling device address
+ *
+ * This main function does basically two operations:
+ *
+ * - Goes idle for a specific amount of time
+ *
+ * - Sets a timer to wake up all the idle injection threads after a
+ * running period
+ *
+ * That happens only when the mitigation is enabled, otherwise the
+ * task is scheduled out.
+ *
+ * In order to keep the tasks synchronized together, it is the last
+ * task exiting the idle period which is in charge of setting the
+ * timer.
+ *
+ * This function never returns.
+ */
+static int cpuidle_cooling_injection_thread(void *arg)
+{
+ struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
+ struct cpuidle_cooling_device *idle_cdev = arg;
+ struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
+ smp_processor_id());
+ DEFINE_WAIT(wait);
+
+ set_freezable();
+
+ sched_setscheduler(current, SCHED_FIFO, &param);
+
+ while (1) {
+ s64 next_wakeup;
+
+ prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
+
+ schedule();
+
+ atomic_inc(&idle_cdev->count);
+
+ play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
+
+ /*
+ * The last CPU waking up is in charge of setting the
+ * timer. If the CPU is hotplugged, the timer will
+ * move to another CPU (which may not belong to the
+ * same cluster) but that is not a problem as the
+ * timer will be set again by another CPU belonging to
+ * the cluster, so this mechanism is self adaptive and
+ * does not require any hotplugging dance.
+ */
+ if (!atomic_dec_and_test(&idle_cdev->count))
+ continue;
+
+ if (!idle_cdev->state)
+ continue;
+
+ next_wakeup = cpuidle_cooling_runtime(idle_cdev);
+
+ hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
+ HRTIMER_MODE_REL_PINNED);
+ }
+
+ finish_wait(&cct->waitq, &wait);
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_release - Kref based release helper
+ * @kref: a pointer to the kref structure
+ *
+ * This function is automatically called by the kref_put function when
+ * the idle cooling device refcount reaches zero. At this point, we
+ * have the guarantee the structure is no longer in use and we can
+ * safely release all the resources.
+ */
+static void __init cpuidle_cooling_release(struct kref *kref)
+{
+ struct cpuidle_cooling_device *idle_cdev =
+ container_of(kref, struct cpuidle_cooling_device, kref);
+
+ thermal_cooling_device_unregister(idle_cdev->cdev);
+ kfree(idle_cdev);
+}
+
+/**
+ * cpuidle_cooling_get_max_state - Get the maximum state
+ * @cdev : the thermal cooling device
+ * @state : a pointer to the state variable to be filled
+ *
+ * The function gives always 100 as the injection ratio is percentile
+ * based for consistency across different platforms.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ /*
+ * Depending on the configuration or the hardware, the running
+ * cycle and the idle cycle could be different. We want to unify
+ * that to a 0..100 interval, so the set state interface will
+ * be the same whatever the platform is.
+ *
+ * The state 100% will make the cluster 100% ... idle. A 0%
+ * injection ratio means no idle injection at all and 50%
+ * means for 10ms of idle injection, we have 10ms of running
+ * time.
+ */
+ *state = 100;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_get_cur_state - Get the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: a pointer to the state
+ *
+ * The function just copies the state value from the private thermal
+ * cooling device structure, the mapping is 1 <-> 1.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+
+ *state = idle_cdev->state;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_set_cur_state - Set the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: the target state
+ *
+ * The function checks first if we are initiating the mitigation which
+ * in turn wakes up all the idle injection tasks belonging to the idle
+ * cooling device. In any case, it updates the internal state for the
+ * cooling device.
+ *
+ * The function can not fail, it returns always zero.
+ */
+static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+ unsigned long current_state = idle_cdev->state;
+
+ idle_cdev->state = state;
+
+ if (current_state == 0 && state > 0) {
+ pr_debug("Starting cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ cpuidle_cooling_wakeup(idle_cdev);
+ } else if (current_state > 0 && !state) {
+ pr_debug("Stopping cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ }
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_ops - thermal cooling device ops
+ */
+static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
+ .get_max_state = cpuidle_cooling_get_max_state,
+ .get_cur_state = cpuidle_cooling_get_cur_state,
+ .set_cur_state = cpuidle_cooling_set_cur_state,
+};
+
+/**
+ * cpuidle_cooling_unregister - Idle cooling device exit function
+ *
+ * This function unregisters the cpuidle cooling device and frees the
+ * resources previously allocated by the init function. This function
+ * is called when the initialization fails.
+ */
+static void cpuidle_cooling_unregister(void)
+{
+ struct cpuidle_cooling_device *tmp, *idle_cdev = NULL;
+ struct cpuidle_cooling_tsk *cct;
+ int cpu;
+
+ list_for_each_entry_safe(idle_cdev, tmp, &cpuidle_cdev_list, node) {
+ for_each_cpu(cpu, idle_cdev->cpumask) {
+ cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
+ if (cct->tsk)
+ kthread_stop(cct->tsk);
+ kref_put(&idle_cdev->kref, cpuidle_cooling_release);
+ }
+ }
+}
+
+/**
+ * cpuidle_cooling_register - Idle cooling device initialization function
+ *
+ * This function is in charge of creating a cooling device per cluster
+ * and register it to thermal framework. For this we rely on the
+ * topology as there is nothing yet describing better the idle state
+ * power domains.
+ *
+ * For each first CPU of the cluster's cpumask, we allocate the idle
+ * cooling device, initialize the general fields and then we
+ * initialize the rest on a per cpu basis.
+ *
+ * Returns zero on success, < 0 otherwise.
+ */
+int cpuidle_cooling_register(void)
+{
+ struct cpuidle_cooling_device *idle_cdev = NULL;
+ struct thermal_cooling_device *cdev;
+ struct cpuidle_cooling_tsk *cct;
+ struct task_struct *tsk;
+ struct device_node *np;
+ cpumask_t *cpumask;
+ char dev_name[THERMAL_NAME_LENGTH];
+ int ret = -ENOMEM, cpu;
+ int index = 0;
+
+ for_each_possible_cpu(cpu) {
+ cpumask = topology_core_cpumask(cpu);
+
+ cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
+
+ /*
+ * This condition makes the first cpu belonging to the
+ * cluster to create a cooling device and allocates
+ * the structure. Others CPUs belonging to the same
+ * cluster will just increment the refcount on the
+ * cooling device structure and initialize it.
+ */
+ if (cpu == cpumask_first(cpumask)) {
+ np = of_cpu_device_node_get(cpu);
+
+ idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
+ if (!idle_cdev)
+ goto out_fail;
+
+ idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
+
+ atomic_set(&idle_cdev->count, 0);
+
+ kref_init(&idle_cdev->kref);
+
+ /*
+ * Initialize the timer to wakeup all the idle
+ * injection tasks
+ */
+ hrtimer_init(&idle_cdev->timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+ /*
+ * The wakeup function callback which is in
+ * charge of waking up all CPUs belonging to
+ * the same cluster
+ */
+ idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
+
+ /*
+ * The thermal cooling device name
+ */
+ snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
+ cdev = thermal_of_cooling_device_register(np, dev_name,
+ idle_cdev,
+ &cpuidle_cooling_ops);
+ if (IS_ERR(cdev)) {
+ ret = PTR_ERR(cdev);
+ goto out_fail;
+ }
+
+ idle_cdev->cdev = cdev;
+
+ idle_cdev->cpumask = cpumask;
+
+ list_add(&idle_cdev->node, &cpuidle_cdev_list);
+
+ pr_info("Created idle cooling device for cpus '%*pbl'\n",
+ cpumask_pr_args(cpumask));
+ }
+
+ kref_get(&idle_cdev->kref);
+
+ init_waitqueue_head(&cct->waitq);
+
+ tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
+ idle_cdev, cpu, "kidle_inject/%u");
+ if (IS_ERR(tsk)) {
+ ret = PTR_ERR(tsk);
+ goto out_fail;
+ }
+
+ cct->tsk = tsk;
+
+ wake_up_process(tsk);
+ }
+
+ return 0;
+
+out_fail:
+ cpuidle_cooling_unregister();
+ pr_err("Failed to create idle cooling device (%d)\n", ret);
+
+ return ret;
+}
+#endif /* CONFIG_CPU_IDLE_THERMAL */
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index c0accc7..fee2038 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -120,4 +120,13 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
}
#endif /* CONFIG_CPU_FREQ_THERMAL */

+#ifdef CONFIG_CPU_IDLE_THERMAL
+extern int cpuidle_cooling_register(void);
+#else /* CONFIG_CPU_IDLE_THERMAL */
+static inline int cpuidle_cooling_register(void)
+{
+ return 0;
+}
+#endif /* CONFIG_CPU_IDLE_THERMAL */
+
#endif /* __CPU_COOLING_H__ */
--
2.7.4


2018-02-21 18:57:17

by Daniel Lezcano

Subject: [PATCH V2 3/7] thermal/drivers/cpu_cooling: Remove pointless field

The structure cpufreq_cooling_device provides a backpointer to the thermal
device but this one is only used for a trace and to unregister. For the
trace, we don't really need this field, and the unregister function has the
same pointer passed as a parameter. Remove it.

Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/cpu_cooling.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index d7528bc..7bdc19f 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -88,7 +88,6 @@ struct cpufreq_cooling_device {
unsigned int clipped_freq;
unsigned int max_level;
struct freq_table *freq_table; /* In descending order */
- struct thermal_cooling_device *cdev;
struct cpufreq_policy *policy;
struct list_head node;
struct time_in_idle *idle_time;
@@ -197,8 +196,7 @@ static int update_freq_table(struct cpufreq_cooling_device *cpufreq_cdev,

dev = get_cpu_device(cpu);
if (unlikely(!dev)) {
- dev_warn(&cpufreq_cdev->cdev->device,
- "No cpu device for cpu %d\n", cpu);
+ pr_warn("No cpu device for cpu %d\n", cpu);
return -ENODEV;
}

@@ -762,7 +760,6 @@ __cpufreq_cooling_register(struct device_node *np,
goto remove_ida;

cpufreq_cdev->clipped_freq = cpufreq_cdev->freq_table[0].frequency;
- cpufreq_cdev->cdev = cdev;

mutex_lock(&cooling_list_lock);
/* Register the notifier for first cpufreq cooling device */
@@ -922,7 +919,7 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
cpufreq_unregister_notifier(&thermal_cpufreq_notifier_block,
CPUFREQ_POLICY_NOTIFIER);

- thermal_cooling_device_unregister(cpufreq_cdev->cdev);
+ thermal_cooling_device_unregister(cdev);
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
kfree(cpufreq_cdev->idle_time);
kfree(cpufreq_cdev->freq_table);
--
2.7.4


2018-02-23 04:33:49

by Viresh Kumar

Subject: Re: [PATCH V2 1/7] thermal/drivers/cpu_cooling: Fixup the header and copyright

On 21-02-18, 16:29, Daniel Lezcano wrote:
> The copyright format does not conform to the format requested by
> Linaro: https://wiki.linaro.org/Copyright
>
> Fix it.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/cpu_cooling.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index dc63aba..42110ee 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -2,9 +2,11 @@
> * linux/drivers/thermal/cpu_cooling.c
> *
> * Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
> - * Copyright (C) 2012 Amit Daniel <[email protected]>
> *
> - * Copyright (C) 2014 Viresh Kumar <[email protected]>
> + * Copyright (C) 2012-2018 Linaro Limited.
> + *
> + * Authors: Amit Daniel <[email protected]>
> + * Viresh Kumar <[email protected]>
> *
> * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> * This program is free software; you can redistribute it and/or modify

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-02-23 05:25:15

by Viresh Kumar

Subject: Re: [PATCH V2 4/7] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 21-02-18, 16:29, Daniel Lezcano wrote:
> The next changes will add new way to cool down a CPU. In order to
> sanitize and make the overall cpu cooling code consistent and robust
> we must prevent the cpu cooling devices to co-exists with the same
> purpose at the same time in the kernel.
>
> Make the CPU cooling device a choice in the Kconfig, so only one CPU
> cooling strategy can be chosen.

Daniel T. already raised his concern (which I share too) about the
multi-platform builds, where we would want this to be runtime
selectable. I am fine with your approach of making that possible later
on, but I would really like that to be merged before the combo thing
comes in. So, I would suggest to merge stuff in this order:

- this series
- runtime selectable strategy
- combo stuff

I hope that would be fine ?

> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/Kconfig | 20 +++++++++++++++++---
> drivers/thermal/cpu_cooling.c | 2 ++
> include/linux/cpu_cooling.h | 6 +++---
> 3 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index b6adc54..5aaae1b 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
> allocating and limiting power to devices.
>
> config CPU_THERMAL
> - bool "generic cpu cooling support"
> - depends on CPU_FREQ
> + bool "Generic cpu cooling support"
> depends on THERMAL_OF
> help
> + Enable the CPU cooling features. If the system has no active
> + cooling device available, this option allows to use the CPU
> + as a cooling device.
> +
> +choice
> + prompt "CPU cooling strategies"
> + depends on CPU_THERMAL
> + default CPU_FREQ_THERMAL
> + help
> + Select the CPU cooling strategy.
> +
> +config CPU_FREQ_THERMAL
> + bool "CPU frequency cooling strategy"
> + depends on CPU_FREQ
> + help
> This implements the generic cpu cooling mechanism through frequency
> reduction. An ACPI version of this already exists
> (drivers/acpi/processor_thermal.c).
> This will be useful for platforms using the generic thermal interface
> and not the ACPI interface.
>
> - If you want this support, you should say Y here.

Should this line be moved to the CPU_THERMAL section above ?

> +endchoice
>
> config CLOCK_THERMAL
> bool "Generic clock cooling support"
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 7bdc19f..5c219dc 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -22,6 +22,7 @@
>
> #include <trace/events/thermal.h>
>
> +#ifdef CONFIG_CPU_FREQ_THERMAL
> /*
> * Cooling state <-> CPUFreq frequency
> *
> @@ -926,3 +927,4 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> kfree(cpufreq_cdev);
> }
> EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
> +#endif /* CONFIG_CPU_FREQ_THERMAL */
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index d4292eb..c0accc7 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -33,7 +33,7 @@ struct cpufreq_policy;
> typedef int (*get_static_t)(cpumask_t *cpumask, int interval,
> unsigned long voltage, u32 *power);
>
> -#ifdef CONFIG_CPU_THERMAL
> +#ifdef CONFIG_CPU_FREQ_THERMAL
> /**
> * cpufreq_cooling_register - function to create cpufreq cooling device.
> * @policy: cpufreq policy.
> @@ -84,7 +84,7 @@ of_cpufreq_power_cooling_register(struct device_node *np,
> */
> void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev);
>
> -#else /* !CONFIG_CPU_THERMAL */
> +#else /* !CONFIG_CPU_FREQ_THERMAL */
> static inline struct thermal_cooling_device *
> cpufreq_cooling_register(struct cpufreq_policy *policy)
> {
> @@ -118,6 +118,6 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> {
> return;
> }
> -#endif /* CONFIG_CPU_THERMAL */
> +#endif /* CONFIG_CPU_FREQ_THERMAL */
>
> #endif /* __CPU_COOLING_H__ */

drivers/cpufreq/Kconfig: # if CPU_THERMAL is on and THERMAL=m, CPUFREQ_DT cannot be =y:
drivers/cpufreq/Kconfig: depends on !CPU_THERMAL || THERMAL
drivers/cpufreq/Kconfig: depends on !CPU_THERMAL || THERMAL
drivers/cpufreq/Kconfig.arm: # if CPU_THERMAL is on and THERMAL=m, ARM_BIT_LITTLE_CPUFREQ cannot be =y
drivers/cpufreq/Kconfig.arm: depends on !CPU_THERMAL || THERMAL
drivers/cpufreq/Kconfig.arm: depends on !CPU_THERMAL || THERMAL

All of these need to use CPU_FREQ_THERMAL now.

include/trace/events/thermal.h:#ifdef CONFIG_CPU_THERMAL
include/trace/events/thermal.h:#endif /* CONFIG_CPU_THERMAL */

And this too.

--
viresh

2018-02-23 05:28:01

by Viresh Kumar

Subject: Re: [PATCH V2 0/7] CPU cooling device new strategies

On Wed, Feb 21, 2018 at 8:59 PM, Daniel Lezcano
<[email protected]> wrote:
> Changelog:
> V2:
> - Dropped the cpu combo cooling device
> - Added the acked-by tags
> - Replaced task array by a percpu structure
> - Fixed the copyright dates
> - Fixed the extra lines
> - Fixed the compilation macros to be reused
> - Fixed the list node removal
> - Massaged a couple of function names

There are a few patches in this series, including the cover letter,
where your script forgot to cc me and probably other cpu-cooling
maintainers.

Maybe just explicitly cc us all next time :)

--
viresh

2018-02-23 05:36:00

by Viresh Kumar

Subject: Re: [PATCH V2 7/7] cpuidle/drivers/cpuidle-arm: Register the cooling device

On Wed, Feb 21, 2018 at 8:59 PM, Daniel Lezcano
<[email protected]> wrote:
> Register the ARM generic cpuidle driver as a cooling device.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/cpuidle/cpuidle-arm.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
> index ddee1b6..c100915 100644
> --- a/drivers/cpuidle/cpuidle-arm.c
> +++ b/drivers/cpuidle/cpuidle-arm.c
> @@ -11,6 +11,7 @@
>
> #define pr_fmt(fmt) "CPUidle arm: " fmt
>
> +#include <linux/cpu_cooling.h>
> #include <linux/cpuidle.h>
> #include <linux/cpumask.h>
> #include <linux/cpu_pm.h>
> @@ -172,6 +173,10 @@ static int __init arm_idle_init(void)
> goto out_fail;
> }
>
> + ret = cpuidle_cooling_register();
> + if (ret)
> + pr_warn("Failed to register the cpuidle cooling device\n");
> +

Because you aren't going to use the return value here, it might be better
to change the return type of cpuidle_cooling_register() to "void" and do
the error printing from within that routine.
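
Something along these lines (an untested sketch;
__cpuidle_cooling_register() is just a placeholder name for the
current function body):

	void cpuidle_cooling_register(void)
	{
		int ret;

		/* placeholder for the existing registration logic */
		ret = __cpuidle_cooling_register();
		if (ret)
			pr_err("Failed to register the cpuidle cooling device (%d)\n",
			       ret);
	}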

2018-02-23 07:38:13

by Viresh Kumar

Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 21-02-18, 16:29, Daniel Lezcano wrote:
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 5c219dc..9340216 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -10,18 +10,32 @@
> * Viresh Kumar <[email protected]>
> *
> */
> +#undef DEBUG

Why is this required ?

> +#define pr_fmt(fmt) "CPU cooling: " fmt

I think you can use the dev_***() routines instead, as you can
directly get the CPU device from anywhere.

> #include <linux/module.h>
> #include <linux/thermal.h>
> #include <linux/cpufreq.h>
> +#include <linux/cpuidle.h>
> #include <linux/err.h>
> +#include <linux/freezer.h>
> #include <linux/idr.h>
> +#include <linux/kthread.h>
> #include <linux/pm_opp.h>
> #include <linux/slab.h>
> +#include <linux/sched/prio.h>
> +#include <linux/sched/rt.h>
> #include <linux/cpu.h>
> #include <linux/cpu_cooling.h>
> +#include <linux/wait.h>
> +
> +#include <linux/platform_device.h>
> +#include <linux/of_platform.h>
>
> #include <trace/events/thermal.h>
>
> +#include <uapi/linux/sched/types.h>
> +
> #ifdef CONFIG_CPU_FREQ_THERMAL
> /*
> * Cooling state <-> CPUFreq frequency
> @@ -928,3 +942,440 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> }
> EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
> #endif /* CONFIG_CPU_FREQ_THERMAL */
> +
> +#ifdef CONFIG_CPU_IDLE_THERMAL
> +/*
> + * The idle duration injection. As we don't have yet a way to specify
> + * from the DT configuration, let's default to a tick duration.
> + */
> +#define DEFAULT_IDLE_TIME_US TICK_USEC

I think this macro is a bit unnecessary here. It's used only at
initialization and so the same comment can be present there and you
can use TICK_USEC there.

Else, Keep it as it is and remove the "idle_cycle" field from the
below structure, as it holds a constant forever.

> +
> +/**
> + * struct cpuidle_cooling_device - data for the idle cooling device
> + * @cdev: a pointer to a struct thermal_cooling_device
> + * @cpumask: a cpumask containing the CPU managed by the cooling device

Missed @node here.

> + * @timer: a hrtimer giving the tempo for the idle injection cycles
> + * @kref: a kernel refcount on this structure
> + * @count: an atomic to keep track of the last task exiting the idle cycle
> + * @idle_cycle: an integer defining the duration of the idle injection
> + * @state: a normalized integer giving the state of the cooling device
> + */
> +struct cpuidle_cooling_device {
> + struct thermal_cooling_device *cdev;
> + struct cpumask *cpumask;
> + struct list_head node;
> + struct hrtimer timer;
> + struct kref kref;
> + atomic_t count;
> + unsigned int idle_cycle;
> + unsigned int state;
> +};
> +
> +/**
> + * @tsk: a pointer to the idle injection task
> + * @waitq: the wait queue for the idle injection tasks
> + */
> +struct cpuidle_cooling_tsk {
> + struct task_struct *tsk;
> + wait_queue_head_t waitq;
> +};
> +
> +DEFINE_PER_CPU(struct cpuidle_cooling_tsk, cpuidle_cooling_tsk);

static ?

> +
> +static LIST_HEAD(cpuidle_cdev_list);
> +
> +/**
> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
> + * @idle_cdev: the idle cooling device
> + *
> + * Every idle injection task belonging to the idle cooling device and
> + * running on an online cpu will be woken up by this call.
> + */
> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
> +{
> + int cpu;
> + struct cpuidle_cooling_tsk *cct;
> +
> + for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
> + wake_up_process(cct->tsk);
> + }
> +}
> +
> +/**
> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
> + * @timer: a hrtimer structure
> + *
> + * When the mitigation is acting, the CPU is allowed to run an amount
> + * of time, then the idle injection happens for the specified delay
> + * and the idle task injection schedules itself until the timer event
> + * wakes the idle injection tasks again for a new idle injection
> + * cycle. The time between the end of the idle injection and the timer
> + * expiration is the allocated running time for the CPU.
> + *
> + * Returns always HRTIMER_NORESTART
> + */
> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(timer, struct cpuidle_cooling_device, timer);
> +
> + cpuidle_cooling_wakeup(idle_cdev);
> +
> + return HRTIMER_NORESTART;
> +}
> +
> +/**
> + * cpuidle_cooling_runtime - Running time computation
> + * @idle_cdev: the idle cooling device
> + *
> + * The running duration is computed from the idle injection duration
> + * which is fixed. If we reach 100% of idle injection ratio, that
> + * means the running duration is zero. If we have a 50% ratio
> + * injection, that means we have equal duration for idle and for
> + * running duration.
> + *
> + * The formula is deduced as the following:
> + *
> + * running = idle x ((100 / ratio) - 1)
> + *
> + * For precision purpose for integer math, we use the following:
> + *
> + * running = (idle x 100) / ratio - idle
> + *
> + * For example, if we have an injected duration of 50%, then we end up
> + * with 10ms of idle injection and 10ms of running duration.
> + *
> + * Returns the running duration as a s64 value, in nanoseconds
> + */
> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
> +{
> + s64 next_wakeup;
> + int state = idle_cdev->state;
> +
> + /*
> + * The function must never be called when there is no
> + * mitigation because:
> + * - that does not make sense
> + * - we end up with a division by zero
> + */
> + BUG_ON(!state);

As there is no locking in place, we can surely hit this case. What if
the state changed to 0 right before this routine was called ?

I would suggest we should just return 0 in that case and get away with
the BUG_ON().
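
I.e. something like this at the top of the function (untested sketch):

	if (!state)
		return 0;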

> +
> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
> + idle_cdev->idle_cycle;
> +
> + return next_wakeup * NSEC_PER_USEC;
> +}
> +
> +/**
> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
> + * @arg: a void pointer containing the idle cooling device address
> + *
> + * This main function does basically two operations:
> + *
> + * - Goes idle for a specific amount of time
> + *
> + * - Sets a timer to wake up all the idle injection threads after a
> + * running period
> + *
> + * That happens only when the mitigation is enabled, otherwise the
> + * task is scheduled out.
> + *
> + * In order to keep the tasks synchronized together, it is the last
> + * task exiting the idle period which is in charge of setting the
> + * timer.
> + *
> + * This function never returns.
> + */
> +static int cpuidle_cooling_injection_thread(void *arg)
> +{
> + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
> + struct cpuidle_cooling_device *idle_cdev = arg;
> + struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
> + smp_processor_id());

this_cpu_ptr ?

> + DEFINE_WAIT(wait);
> +
> + set_freezable();
> +
> + sched_setscheduler(current, SCHED_FIFO, &param);
> +
> + while (1) {
> + s64 next_wakeup;
> +
> + prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
> +
> + schedule();
> +
> + atomic_inc(&idle_cdev->count);
> +
> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> +
> + /*
> + * The last CPU waking up is in charge of setting the
> + * timer. If the CPU is hotplugged, the timer will
> + * move to another CPU (which may not belong to the
> + * same cluster) but that is not a problem as the
> + * timer will be set again by another CPU belonging to
> + * the cluster, so this mechanism is self adaptive and
> + * does not require any hotplugging dance.
> + */

Well this depends on how CPU hotplug really happens. What happens to
the per-cpu-tasks which are in the middle of something when hotplug
happens? Does hotplug wait for those per-cpu-tasks to finish ?

> + if (!atomic_dec_and_test(&idle_cdev->count))
> + continue;
> +
> + if (!idle_cdev->state)
> + continue;
> +
> + next_wakeup = cpuidle_cooling_runtime(idle_cdev);
> +
> + hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
> + HRTIMER_MODE_REL_PINNED);
> + }
> +
> + finish_wait(&cct->waitq, &wait);
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_release - Kref based release helper
> + * @kref: a pointer to the kref structure
> + *
> + * This function is automatically called by the kref_put function when
> + * the idle cooling device refcount reaches zero. At this point, we
> + * have the guarantee the structure is no longer in use and we can
> + * safely release all the resources.
> + */
> +static void __init cpuidle_cooling_release(struct kref *kref)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(kref, struct cpuidle_cooling_device, kref);
> +
> + thermal_cooling_device_unregister(idle_cdev->cdev);
> + kfree(idle_cdev);
> +}

Maybe just move it closer to the only function that calls it?

> +
> +/**
> + * cpuidle_cooling_get_max_state - Get the maximum state
> + * @cdev : the thermal cooling device
> + * @state : a pointer to the state variable to be filled
> + *
> + * The function gives always 100 as the injection ratio is percentile
> + * based for consistency across different platforms.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
> +{
> + /*
> + * Depending on the configuration or the hardware, the running
> + * cycle and the idle cycle could be different. We want to unify
> + * that to a 0..100 interval, so the set state interface will
> + * be the same whatever the platform is.
> + *
> + * The state 100% will make the cluster 100% ... idle. A 0%
> + * injection ratio means no idle injection at all and 50%
> + * means for 10ms of idle injection, we have 10ms of running
> + * time.
> + */
> + *state = 100;
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_get_cur_state - Get the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: a pointer to the state
> + *
> + * The function just copies the state value from the private thermal
> + * cooling device structure, the mapping is 1 <-> 1.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;

This line isn't aligned properly. Spaces are present instead of a tab.

> +
> + *state = idle_cdev->state;
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_set_cur_state - Set the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: the target state
> + *
> + * The function checks first if we are initiating the mitigation which
> + * in turn wakes up all the idle injection tasks belonging to the idle
> + * cooling device. In any case, it updates the internal state for the
> + * cooling device.
> + *
> + * The function can not fail, it returns always zero.

it always returns zero.

Same at other places as well.

> + */
> +static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> + unsigned long current_state = idle_cdev->state;
> +
> + idle_cdev->state = state;
> +
> + if (current_state == 0 && state > 0) {
> + pr_debug("Starting cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + cpuidle_cooling_wakeup(idle_cdev);
> + } else if (current_state > 0 && !state) {
> + pr_debug("Stopping cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_ops - thermal cooling device ops
> + */
> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
> + .get_max_state = cpuidle_cooling_get_max_state,
> + .get_cur_state = cpuidle_cooling_get_cur_state,
> + .set_cur_state = cpuidle_cooling_set_cur_state,
> +};
> +
> +/**
> + * cpuilde_cooling_unregister - Idle cooling device exit function
> + *
> + * This function unregisters the cpuidle cooling device and frees the
> + * ressources previously allocated by the init function. This function
> + * is called when the initialization fails.
> + */
> +static void cpuidle_cooling_unregister(void)
> +{
> + struct cpuidle_cooling_device *tmp, *idle_cdev = NULL;
> + struct cpuidle_cooling_tsk *cct;
> + int cpu;
> +
> + list_for_each_entry_safe(idle_cdev, tmp, &cpuidle_cdev_list, node) {
> + for_each_cpu(cpu, idle_cdev->cpumask) {
> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
> + if (cct->tsk)
> + kthread_stop(cct->tsk);

What about hrtimer ? Shouldn't that be stopped as well ?

> + kref_put(&idle_cdev->kref, cpuidle_cooling_release);
> + }
> + }
> +}
> +
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and register it to thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * For each first CPU of the cluster's cpumask, we allocate the idle
> + * cooling device, initialize the general fields and then we initialze
> + * the rest in a per cpu basis.
> + *
> + * Returns zero on success, < 0 otherwise.

This wouldn't get shown in the doc properly, it should be written as:

* Return: zero on success, < 0 otherwise.

> + */
> +int cpuidle_cooling_register(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev = NULL;
> + struct thermal_cooling_device *cdev;
> + struct cpuidle_cooling_tsk *cct;
> + struct task_struct *tsk;
> + struct device_node *np;
> + cpumask_t *cpumask;
> + char dev_name[THERMAL_NAME_LENGTH];
> + int ret = -ENOMEM, cpu;
> + int index = 0;
> +
> + for_each_possible_cpu(cpu) {
> + cpumask = topology_core_cpumask(cpu);
> +
> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
> +
> + /*
> + * This condition makes the first cpu belonging to the
> + * cluster to create a cooling device and allocates
> + * the structure. Others CPUs belonging to the same
> + * cluster will just increment the refcount on the
> + * cooling device structure and initialize it.
> + */
> + if (cpu == cpumask_first(cpumask)) {

Your function still has a few assumptions about CPU numbering and it will
break in some cases. What if the CPUs on a big.LITTLE system (4x4) are
present in this order: B L L L L B B B ??

This configuration can happen if the CPUs in DT are marked as: 0-3 LITTLE,
4-7 big, and a big CPU is used by the boot loader to bring up Linux.

> + np = of_cpu_device_node_get(cpu);
> +
> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> + if (!idle_cdev)
> + goto out_fail;
> +
> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
> +
> + atomic_set(&idle_cdev->count, 0);

This should already be 0, isn't it ?

> +
> + kref_init(&idle_cdev->kref);
> +
> + /*
> + * Initialize the timer to wakeup all the idle
> + * injection tasks
> + */
> + hrtimer_init(&idle_cdev->timer,
> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> + /*
> + * The wakeup function callback which is in
> + * charge of waking up all CPUs belonging to
> + * the same cluster
> + */
> + idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
> +
> + /*
> + * The thermal cooling device name
> + */
> + snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
> + cdev = thermal_of_cooling_device_register(np, dev_name,
> + idle_cdev,
> + &cpuidle_cooling_ops);

You registered the cooling device, while the idle_cdev is still
getting initialized. Ideally, we should register with the thermal core
(or any other framework) when we are fully ready. For example, any of
the callbacks present in cpuidle_cooling_ops() can get called by the
core after this point and you should be ready to handle them. It will
result in kernel crash in your case as idle_cdev isn't fully
initialized yet. For example, with the set-state callback you will end
up calling wake_up_task(NULL).

> + if (IS_ERR(cdev)) {
> + ret = PTR_ERR(cdev);
> + goto out_fail;
> + }
> +
> + idle_cdev->cdev = cdev;
> +
> + idle_cdev->cpumask = cpumask;
> +
> + list_add(&idle_cdev->node, &cpuidle_cdev_list);

I would have removed the above two blank lines as these can all go
together.

> +
> + pr_info("Created idle cooling device for cpus '%*pbl'\n",
> + cpumask_pr_args(cpumask));

I am not sure if it makes sense to print the message right here. We
can still fail while creating the cooling devices. Maybe a single
print at the very end of the function is more than enough?

> + }
> +
> + kref_get(&idle_cdev->kref);

Did you check if the kref thing is really working here or not? I think
your structure will never get freed on errors because kref_init()
initializes the count by 1 and then you do an additional kref_get()
here for the first CPU of the cluster.
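(For an N-CPU cluster that means the count ends up at 1 + N once the
cluster is set up, while the unregister path only does one kref_put() per
CPU, i.e. N puts, so the count never drops to zero and the structure is
never freed.)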

> +
> + init_waitqueue_head(&cct->waitq);
> +
> + tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
> + idle_cdev, cpu, "kidle_inject/%u");
> + if (IS_ERR(tsk)) {
> + ret = PTR_ERR(tsk);
> + goto out_fail;
> + }
> +
> + cct->tsk = tsk;
> +
> + wake_up_process(tsk);
> + }
> +
> + return 0;
> +
> +out_fail:
> + cpuidle_cooling_unregister();
> + pr_err("Failed to create idle cooling device (%d)\n", ret);

So you are already printing the error message here, just make the
return type void as the caller of this can't do anything anyway.

> +
> + return ret;
> +}
> +#endif /* CONFIG_CPU_IDLE_THERMAL */

--
viresh

2018-02-23 09:12:09

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 4/7] thermal/drivers/Kconfig: Convert the CPU cooling device to a choice

On 23/02/2018 06:24, Viresh Kumar wrote:
> On 21-02-18, 16:29, Daniel Lezcano wrote:
>> The next changes will add new way to cool down a CPU. In order to
>> sanitize and make the overall cpu cooling code consistent and robust
>> we must prevent the cpu cooling devices to co-exists with the same
>> purpose at the same time in the kernel.
>>
>> Make the CPU cooling device a choice in the Kconfig, so only one CPU
>> cooling strategy can be chosen.
>
> Daniel T. already raised his concern (which I share too) about the
> multi-platform builds, where we would want this to be runtime
> selectable. I am fine with your approach of making that possible later
> on, but I would really like that to be merged before the combo thing
> comes in. So, I would suggest to merge stuff in this order:
>
> - this series
> - runtime selectable strategy
> - combo stuff
>
> I hope that would be fine ?

Yes, for me it is ok. So meanwhile we can also check the cpufreq cooling
device unregistering at suspend/resume/hotplug and hopefully that will
simplify the addition of the combo cooling device.


>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>> drivers/thermal/Kconfig | 20 +++++++++++++++++---
>> drivers/thermal/cpu_cooling.c | 2 ++
>> include/linux/cpu_cooling.h | 6 +++---
>> 3 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index b6adc54..5aaae1b 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
>> allocating and limiting power to devices.
>>
>> config CPU_THERMAL
>> - bool "generic cpu cooling support"
>> - depends on CPU_FREQ
>> + bool "Generic cpu cooling support"
>> depends on THERMAL_OF
>> help
>> + Enable the CPU cooling features. If the system has no active
>> + cooling device available, this option allows to use the CPU
>> + as a cooling device.
>> +
>> +choice
>> + prompt "CPU cooling strategies"
>> + depends on CPU_THERMAL
>> + default CPU_FREQ_THERMAL
>> + help
>> + Select the CPU cooling strategy.
>> +
>> +config CPU_FREQ_THERMAL
>> + bool "CPU frequency cooling strategy"
>> + depends on CPU_FREQ
>> + help
>> This implements the generic cpu cooling mechanism through frequency
>> reduction. An ACPI version of this already exists
>> (drivers/acpi/processor_thermal.c).
>> This will be useful for platforms using the generic thermal interface
>> and not the ACPI interface.
>>
>> - If you want this support, you should say Y here.
>
> Should this line be moved to the CPU_THERMAL section above ?
>

Probably, I will double check it.

[ ... ]

>> -#endif /* CONFIG_CPU_THERMAL */
>> +#endif /* CONFIG_CPU_FREQ_THERMAL */
>>
>> #endif /* __CPU_COOLING_H__ */
>
> drivers/cpufreq/Kconfig: # if CPU_THERMAL is on and THERMAL=m, CPUFREQ_DT cannot be =y:
> drivers/cpufreq/Kconfig: depends on !CPU_THERMAL || THERMAL
> drivers/cpufreq/Kconfig: depends on !CPU_THERMAL || THERMAL
> drivers/cpufreq/Kconfig.arm: # if CPU_THERMAL is on and THERMAL=m, ARM_BIT_LITTLE_CPUFREQ cannot be =y
> drivers/cpufreq/Kconfig.arm: depends on !CPU_THERMAL || THERMAL
> drivers/cpufreq/Kconfig.arm: depends on !CPU_THERMAL || THERMAL
>
> All of these need to use CPU_FREQ_THERMAL now>
> include/trace/events/thermal.h:#ifdef CONFIG_CPU_THERMAL
> include/trace/events/thermal.h:#endif /* CONFIG_CPU_THERMAL */

Ok.



2018-02-23 09:12:45

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 0/7] CPU cooling device new strategies

On 23/02/2018 06:26, Viresh Kumar wrote:
> On Wed, Feb 21, 2018 at 8:59 PM, Daniel Lezcano
> <[email protected]> wrote:
>> Changelog:
>> V2:
>> - Dropped the cpu combo cooling device
>> - Added the acked-by tags
>> - Replaced task array by a percpu structure
>> - Fixed the copyright dates
>> - Fixed the extra lines
>> - Fixed the compilation macros to be reused
>> - Fixed the list node removal
>> - Massaged a couple of function names
>
> There are few patches in this series, including the cover-letter,
> where your script
> forgot to cc me and probably other cpu-cooling maintainers.
>
> Maybe just explicitly cc us all next time :)

Ah, sorry for that. Will take care next time.




2018-02-23 11:29:54

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 23/02/2018 08:34, Viresh Kumar wrote:
> On 21-02-18, 16:29, Daniel Lezcano wrote:
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 5c219dc..9340216 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -10,18 +10,32 @@
>> * Viresh Kumar <[email protected]>
>> *
>> */
>> +#undef DEBUG
>
> Why is this required ?

It is usually added so that, if you set the -DDEBUG flag when compiling, you
don't get all the pr_debug traces for all files, but just the ones
where you commented out the #undef above. pr_debug is a no-op otherwise.

>> +#define pr_fmt(fmt) "CPU cooling: " fmt
>
> I think you can use the dev_***() routines instead, as you can
> directly the CPU device from anywhere.

Can we postpone this change for later ? All the file is using pr_*
(cpufreq_cooling included). There is only one place where dev_err is
used but it is removed by the patch 3/7.

[ ... ]

>> +#ifdef CONFIG_CPU_IDLE_THERMAL
>> +/*
>> + * The idle duration injection. As we don't have yet a way to specify
>> + * from the DT configuration, let's default to a tick duration.
>> + */
>> +#define DEFAULT_IDLE_TIME_US TICK_USEC
>
> I think this macro is a bit unnecessary here. Its used only at
> initialization and so the same comment can be present there and you
> can use TICK_USEC there.
>
> Else, Keep it as it is and remove the "idle_cycle" field from the
> below structure, as it holds a constant forever.

Yes, makes sense.

>> +/**
>> + * struct cpuidle_cooling_device - data for the idle cooling device
>> + * @cdev: a pointer to a struct thermal_cooling_device
>> + * @cpumask: a cpumask containing the CPU managed by the cooling device
>
> Missed @node here.

Ok.

>> + * @timer: a hrtimer giving the tempo for the idle injection cycles
>> + * @kref: a kernel refcount on this structure
>> + * @count: an atomic to keep track of the last task exiting the idle cycle
>> + * @idle_cycle: an integer defining the duration of the idle injection
>> + * @state: an normalized integer giving the state of the cooling device
>> + */
>> +struct cpuidle_cooling_device {
>> + struct thermal_cooling_device *cdev;
>> + struct cpumask *cpumask;
>> + struct list_head node;
>> + struct hrtimer timer;
>> + struct kref kref;
>> + atomic_t count;
>> + unsigned int idle_cycle;
>> + unsigned int state;
>> +};
>> +
>> +/**
>> + * @tsk: an array of pointer to the idle injection tasks
>> + * @waitq: the waiq for the idle injection tasks
>> + */
>> +struct cpuidle_cooling_tsk {
>> + struct task_struct *tsk;
>> + wait_queue_head_t waitq;
>> +};
>> +
>> +DEFINE_PER_CPU(struct cpuidle_cooling_tsk, cpuidle_cooling_tsk);
>
> static ?

Ok.

[ ... ]

>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> + s64 next_wakeup;
>> + int state = idle_cdev->state;
>> +
>> + /*
>> + * The function must never be called when there is no
>> + * mitigation because:
>> + * - that does not make sense
>> + * - we end up with a division by zero
>> + */
>> + BUG_ON(!state);
>
> As there is no locking in place, we can surely hit this case. What if
> the state changed to 0 right before this routine was called ?
>
> I would suggest we should just return 0 in that case and get away with
> the BUG_ON().

Ok.
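
Something like this, then (untested sketch, keeping the existing fields):

static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
{
	s64 next_wakeup;
	int state = idle_cdev->state;

	/* No mitigation in progress, nothing to compute */
	if (!state)
		return 0;

	next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
		idle_cdev->idle_cycle;

	return next_wakeup * NSEC_PER_USEC;
}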

>> +
>> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
>> + idle_cdev->idle_cycle;
>> +
>> + return next_wakeup * NSEC_PER_USEC;
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
>> + * @arg: a void pointer containing the idle cooling device address
>> + *
>> + * This main function does basically two operations:
>> + *
>> + * - Goes idle for a specific amount of time
>> + *
>> + * - Sets a timer to wake up all the idle injection threads after a
>> + * running period
>> + *
>> + * That happens only when the mitigation is enabled, otherwise the
>> + * task is scheduled out.
>> + *
>> + * In order to keep the tasks synchronized together, it is the last
>> + * task exiting the idle period which is in charge of setting the
>> + * timer.
>> + *
>> + * This function never returns.
>> + */
>> +static int cpuidle_cooling_injection_thread(void *arg)
>> +{
>> + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
>> + struct cpuidle_cooling_device *idle_cdev = arg;
>> + struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
>> + smp_processor_id());
>
> this_cpu_ptr ?

yeah, even better.

>> + DEFINE_WAIT(wait);
>> +
>> + set_freezable();
>> +
>> + sched_setscheduler(current, SCHED_FIFO, &param);
>> +
>> + while (1) {
>> + s64 next_wakeup;
>> +
>> + prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
>> +
>> + schedule();
>> +
>> + atomic_inc(&idle_cdev->count);
>> +
>> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
>> +
>> + /*
>> + * The last CPU waking up is in charge of setting the
>> + * timer. If the CPU is hotplugged, the timer will
>> + * move to another CPU (which may not belong to the
>> + * same cluster) but that is not a problem as the
>> + * timer will be set again by another CPU belonging to
>> + * the cluster, so this mechanism is self adaptive and
>> + * does not require any hotplugging dance.
>> + */
>
> Well this depends on how CPU hotplug really happens. What happens to
> the per-cpu-tasks which are in the middle of something when hotplug
> happens? Does hotplug wait for those per-cpu-tasks to finish ?
>
>> + if (!atomic_dec_and_test(&idle_cdev->count))
>> + continue;
>> +
>> + if (!idle_cdev->state)
>> + continue;
>> +
>> + next_wakeup = cpuidle_cooling_runtime(idle_cdev);
>> +
>> + hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
>> + HRTIMER_MODE_REL_PINNED);
>> + }
>> +
>> + finish_wait(&cct->waitq, &wait);
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_release - Kref based release helper
>> + * @kref: a pointer to the kref structure
>> + *
>> + * This function is automatically called by the kref_put function when
>> + * the idle cooling device refcount reaches zero. At this point, we
>> + * have the guarantee the structure is no longer in use and we can
>> + * safely release all the ressources.
>> + */
>> +static void __init cpuidle_cooling_release(struct kref *kref)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev =
>> + container_of(kref, struct cpuidle_cooling_device, kref);
>> +
>> + thermal_cooling_device_unregister(idle_cdev->cdev);
>> + kfree(idle_cdev);
>> +}
>
> Maybe just move it closer to the only function that calls it?

Ok.

[ ... ]

>> +/**
>> + * cpuidle_cooling_get_cur_state - Get the current cooling state
>> + * @cdev: the thermal cooling device
>> + * @state: a pointer to the state
>> + *
>> + * The function just copy the state value from the private thermal
>> + * cooling device structure, the mapping is 1 <-> 1.
>> + *
>> + * The function can not fail, it returns always zero.
>> + */
>> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
>> + unsigned long *state)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
>
> This line isn't aligned properly. Spaces are present instead of a tab.

Ok.

[ ... ]

> Same at other places as well.

Ok, I will double check it.

[ ... ]

>> +static void cpuidle_cooling_unregister(void)
>> +{
>> + struct cpuidle_cooling_device *tmp, *idle_cdev = NULL;
>> + struct cpuidle_cooling_tsk *cct;
>> + int cpu;
>> +
>> + list_for_each_entry_safe(idle_cdev, tmp, &cpuidle_cdev_list, node) {
>> + for_each_cpu(cpu, idle_cdev->cpumask) {
>> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>> + if (cct->tsk)
>> + kthread_stop(cct->tsk);
>
> What about hrtimer ? Shouldn't that be stopped as well ?

IMO, practically it is not possible to have a timer running at this
point, but rigorously it must be stopped in the release routine, so I
will add it.
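
Something like this in the release helper, I guess (untested sketch
reusing the existing fields):

static void cpuidle_cooling_release(struct kref *kref)
{
	struct cpuidle_cooling_device *idle_cdev =
		container_of(kref, struct cpuidle_cooling_device, kref);

	/* Make sure no wakeup timer is still pending before tearing down */
	hrtimer_cancel(&idle_cdev->timer);

	thermal_cooling_device_unregister(idle_cdev->cdev);
	kfree(idle_cdev);
}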

>> + kref_put(&idle_cdev->kref, cpuidle_cooling_release);
>> + }
>> + }
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_register - Idle cooling device initialization function
>> + *
>> + * This function is in charge of creating a cooling device per cluster
>> + * and register it to thermal framework. For this we rely on the
>> + * topology as there is nothing yet describing better the idle state
>> + * power domains.
>> + *
>> + * For each first CPU of the cluster's cpumask, we allocate the idle
>> + * cooling device, initialize the general fields and then we initialze
>> + * the rest in a per cpu basis.
>> + *
>> + * Returns zero on success, < 0 otherwise.
>
> This wouldn't get shown in the doc properly, it should be written as:
>
> * Return: zero on success, < 0 otherwise.

Ok.

>> + */
>> +int cpuidle_cooling_register(void)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev = NULL;
>> + struct thermal_cooling_device *cdev;
>> + struct cpuidle_cooling_tsk *cct;
>> + struct task_struct *tsk;
>> + struct device_node *np;
>> + cpumask_t *cpumask;
>> + char dev_name[THERMAL_NAME_LENGTH];
>> + int ret = -ENOMEM, cpu;
>> + int index = 0;
>> +
>> + for_each_possible_cpu(cpu) {
>> + cpumask = topology_core_cpumask(cpu);
>> +
>> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>> +
>> + /*
>> + * This condition makes the first cpu belonging to the
>> + * cluster to create a cooling device and allocates
>> + * the structure. Others CPUs belonging to the same
>> + * cluster will just increment the refcount on the
>> + * cooling device structure and initialize it.
>> + */
>> + if (cpu == cpumask_first(cpumask)) {
>
> Your function still have few assumptions of cpu numbering and it will
> break in few cases. What if the CPUs on a big Little system (4x4) are
> present in this order: B L L L L B B B ??
>
> This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,
> 4-7 big and a big CPU is used by the boot loader to bring up Linux.

Ok, how can I sort it out ?

>> + np = of_cpu_device_node_get(cpu);
>> +
>> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
>> + if (!idle_cdev)
>> + goto out_fail;
>> +
>> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
>> +
>> + atomic_set(&idle_cdev->count, 0);
>
> This should already be 0, isn't it ?

Yes.

>> +
>> + kref_init(&idle_cdev->kref);
>> +
>> + /*
>> + * Initialize the timer to wakeup all the idle
>> + * injection tasks
>> + */
>> + hrtimer_init(&idle_cdev->timer,
>> + CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +
>> + /*
>> + * The wakeup function callback which is in
>> + * charge of waking up all CPUs belonging to
>> + * the same cluster
>> + */
>> + idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
>> +
>> + /*
>> + * The thermal cooling device name
>> + */
>> + snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
>> + cdev = thermal_of_cooling_device_register(np, dev_name,
>> + idle_cdev,
>> + &cpuidle_cooling_ops);
>
> You registered the cooling device, while the idle_cdev is still
> getting initialized. Ideally, we should register with the thermal core
> (or any other framework) when we are fully ready. For example, any of
> the callbacks present in cpuidle_cooling_ops() can get called by the
> core after this point and you should be ready to handle them. It will
> result in kernel crash in your case as idle_cdev isn't fully
> initialized yet. For example, with the set-state callback you will end
> up calling wake_up_task(NULL).
>
>> + if (IS_ERR(cdev)) {
>> + ret = PTR_ERR(cdev);
>> + goto out_fail;
>> + }
>> +
>> + idle_cdev->cdev = cdev;
>> +
>> + idle_cdev->cpumask = cpumask;
>> +
>> + list_add(&idle_cdev->node, &cpuidle_cdev_list);
>
> I would have removed the above two blank lines as these can all go
> together.

Ok.

>> +
>> + pr_info("Created idle cooling device for cpus '%*pbl'\n",
>> + cpumask_pr_args(cpumask));
>
> I am not sure if it makes sense to print the message right here. We
> can still fail while creating the cooling devices. Maybe a single
> print at the very end of the function is more than enough?

Ok.

>> + }
>> +
>> + kref_get(&idle_cdev->kref);
>
> Did you check if the kref thing is really working here or not? I think
> your structure will never get freed on errors because kref_init()
> initializes the count by 1 and then you do an additional kref_get()
> here for the first CPU of the cluster.

Yes, you are right. I will fix that.

>> +
>> + init_waitqueue_head(&cct->waitq);
>> +
>> + tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
>> + idle_cdev, cpu, "kidle_inject/%u");
>> + if (IS_ERR(tsk)) {
>> + ret = PTR_ERR(tsk);
>> + goto out_fail;
>> + }
>> +
>> + cct->tsk = tsk;
>> +
>> + wake_up_process(tsk);
>> + }
>> +
>> + return 0;
>> +
>> +out_fail:
>> + cpuidle_cooling_unregister();
>> + pr_err("Failed to create idle cooling device (%d)\n", ret);
>
> So you are already printing the error message here, just make the
> return type void as the caller of this can't do anything anyway.

Ok.

>> +
>> + return ret;
>> +}
>> +#endif /* CONFIG_CPU_IDLE_THERMAL */


Thanks for the review.

-- Daniel




2018-02-23 15:28:14

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

Hi Daniel,

On 21 February 2018 at 16:29, Daniel Lezcano <[email protected]> wrote:
> +
> +/**
> + * struct cpuidle_cooling_device - data for the idle cooling device
> + * @cdev: a pointer to a struct thermal_cooling_device
> + * @cpumask: a cpumask containing the CPU managed by the cooling device
> + * @timer: a hrtimer giving the tempo for the idle injection cycles
> + * @kref: a kernel refcount on this structure
> + * @count: an atomic to keep track of the last task exiting the idle cycle
> + * @idle_cycle: an integer defining the duration of the idle injection
> + * @state: an normalized integer giving the state of the cooling device
> + */
> +struct cpuidle_cooling_device {
> + struct thermal_cooling_device *cdev;
> + struct cpumask *cpumask;
> + struct list_head node;
> + struct hrtimer timer;
> + struct kref kref;
> + atomic_t count;
> + unsigned int idle_cycle;
> + unsigned int state;
> +};
> +
> +/**
> + * @tsk: an array of pointer to the idle injection tasks
> + * @waitq: the waiq for the idle injection tasks
> + */
> +struct cpuidle_cooling_tsk {
> + struct task_struct *tsk;
> + wait_queue_head_t waitq;

Why are you creating one wait_queue_head_t per cpu instead of one per
cooling device and then save a pointer in the per cpu struct
cpuidle_cooling_tsk ?
Then, you can use wake_up_interruptible_all() to wake up all threads
instead of using for_each_cpu ... wake_up_process() loop in
cpuidle_cooling_wakeup() ?

> +};
> +
> +DEFINE_PER_CPU(struct cpuidle_cooling_tsk, cpuidle_cooling_tsk);
> +
> +static LIST_HEAD(cpuidle_cdev_list);
> +
> +/**
> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
> + * @idle_cdev: the idle cooling device
> + *
> + * Every idle injection task belonging to the idle cooling device and
> + * running on an online cpu will be wake up by this call.
> + */
> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
> +{
> + int cpu;
> + struct cpuidle_cooling_tsk *cct;
> +
> + for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
> + wake_up_process(cct->tsk);
> + }
> +}
> +
> +/**
> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
> + * @timer: a hrtimer structure
> + *
> + * When the mitigation is acting, the CPU is allowed to run an amount
> + * of time, then the idle injection happens for the specified delay
> + * and the idle task injection schedules itself until the timer event
> + * wakes the idle injection tasks again for a new idle injection
> + * cycle. The time between the end of the idle injection and the timer
> + * expiration is the allocated running time for the CPU.
> + *
> + * Returns always HRTIMER_NORESTART
> + */
> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(timer, struct cpuidle_cooling_device, timer);
> +
> + cpuidle_cooling_wakeup(idle_cdev);
> +
> + return HRTIMER_NORESTART;
> +}
> +
> +/**

2018-02-24 02:52:54

by Tao Wang

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] cpuidle/drivers/cpuidle-arm: Register the cooling device



On 2018/2/21 23:29, Daniel Lezcano wrote:
> Register the ARM generic cpuidle driver as a cooling device.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/cpuidle/cpuidle-arm.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
> index ddee1b6..c100915 100644
> --- a/drivers/cpuidle/cpuidle-arm.c
> +++ b/drivers/cpuidle/cpuidle-arm.c
> @@ -11,6 +11,7 @@
>
> #define pr_fmt(fmt) "CPUidle arm: " fmt
>
> +#include <linux/cpu_cooling.h>
> #include <linux/cpuidle.h>
> #include <linux/cpumask.h>
> #include <linux/cpu_pm.h>
> @@ -172,6 +173,10 @@ static int __init arm_idle_init(void)
> goto out_fail;
> }
>
> + ret = cpuidle_cooling_register();
> + if (ret)
> + pr_warn("Failed to register the cpuidle cooling device\n");
> +
SoCs which use the big.LITTLE architecture don't use the cpuidle-arm driver. Is it
better to put cpuidle_cooling_register into cpuidle_register_driver, and use the
cpumask of the cpuidle driver to register the cpuidle cooling device instead of
getting it from the topology?
> return 0;
>
> out_fail:
>


2018-02-24 22:55:50

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 7/7] cpuidle/drivers/cpuidle-arm: Register the cooling device

On 24/02/2018 03:50, Wangtao (Kevin, Kirin) wrote:
>
>
> On 2018/2/21 23:29, Daniel Lezcano wrote:
>> Register the ARM generic cpuidle driver as a cooling device.
>>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>>   drivers/cpuidle/cpuidle-arm.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
>> index ddee1b6..c100915 100644
>> --- a/drivers/cpuidle/cpuidle-arm.c
>> +++ b/drivers/cpuidle/cpuidle-arm.c
>> @@ -11,6 +11,7 @@
>>     #define pr_fmt(fmt) "CPUidle arm: " fmt
>>   +#include <linux/cpu_cooling.h>
>>   #include <linux/cpuidle.h>
>>   #include <linux/cpumask.h>
>>   #include <linux/cpu_pm.h>
>> @@ -172,6 +173,10 @@ static int __init arm_idle_init(void)
>>               goto out_fail;
>>       }
>>   +    ret = cpuidle_cooling_register();
>> +    if (ret)
>> +        pr_warn("Failed to register the cpuidle cooling device\n");
>> +
> SoC which uses Big-Little architecture doesn't use cpuidle-arm driver,
> is it
> better to put cpuidle_cooling_register to cpuidle_register_driver, and
> use the
> cpumask of cpuidle driver to register cpuidle cooling device intead of
> get it
> from topology?

The bL is supported by the cpuidle-arm driver; however, due to how the
cpumask is built, we can't use this in cpuidle_register_driver for that.
It is slippery ground for now.




2018-02-24 23:02:08

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 23/02/2018 16:26, Vincent Guittot wrote:
> Hi Daniel,
>
> On 21 February 2018 at 16:29, Daniel Lezcano <[email protected]> wrote:
>> +
>> +/**
>> + * struct cpuidle_cooling_device - data for the idle cooling device
>> + * @cdev: a pointer to a struct thermal_cooling_device
>> + * @cpumask: a cpumask containing the CPU managed by the cooling device
>> + * @timer: a hrtimer giving the tempo for the idle injection cycles
>> + * @kref: a kernel refcount on this structure
>> + * @count: an atomic to keep track of the last task exiting the idle cycle
>> + * @idle_cycle: an integer defining the duration of the idle injection
>> + * @state: an normalized integer giving the state of the cooling device
>> + */
>> +struct cpuidle_cooling_device {
>> + struct thermal_cooling_device *cdev;
>> + struct cpumask *cpumask;
>> + struct list_head node;
>> + struct hrtimer timer;
>> + struct kref kref;
>> + atomic_t count;
>> + unsigned int idle_cycle;
>> + unsigned int state;
>> +};
>> +
>> +/**
>> + * @tsk: an array of pointer to the idle injection tasks
>> + * @waitq: the waiq for the idle injection tasks
>> + */
>> +struct cpuidle_cooling_tsk {
>> + struct task_struct *tsk;
>> + wait_queue_head_t waitq;
>
> Why are you creating one wait_queue_head_t per cpu instead of one per
> cooling device and then save a pointer in the per cpu struct
> cpuidle_cooling_tsk ?
> Then, you can use wake_up_interruptible_all() to wake up all threads
> instead of using for_each_cpu ... wake_up_process() loop in
> cpuidle_cooling_wakeup() ?
Yes, that should do the trick. I will give it a try.
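
Something along these lines, I suppose (rough, untested sketch; the
per-cpu pointer back to the cooling device is just one way to get at the
shared waitqueue):

struct cpuidle_cooling_device {
	struct thermal_cooling_device *cdev;
	struct cpumask *cpumask;
	struct list_head node;
	struct hrtimer timer;
	struct kref kref;
	wait_queue_head_t waitq;	/* shared by the cluster's injection tasks */
	atomic_t count;
	unsigned int idle_cycle;
	unsigned int state;
};

struct cpuidle_cooling_tsk {
	struct task_struct *tsk;
	struct cpuidle_cooling_device *idle_cdev;	/* gives access to ->waitq */
};

static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
{
	wake_up_interruptible_all(&idle_cdev->waitq);
}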





2018-02-26 04:31:22

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 23-02-18, 12:28, Daniel Lezcano wrote:
> On 23/02/2018 08:34, Viresh Kumar wrote:
> > On 21-02-18, 16:29, Daniel Lezcano wrote:
> >> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> >> index 5c219dc..9340216 100644
> >> --- a/drivers/thermal/cpu_cooling.c
> >> +++ b/drivers/thermal/cpu_cooling.c
> >> @@ -10,18 +10,32 @@
> >> * Viresh Kumar <[email protected]>
> >> *
> >> */
> >> +#undef DEBUG
> >
> > Why is this required ?
>
> It is usually added, so if you set the -DDEBUG flag when compiling, you
> don't get all the pr_debug traces for all files, but the just the ones
> where you commented the #undef above. pr_debug is a no-op otherwise.

Yeah, but this is a mess as you need to go edit the files before
enabling debug with it. Everyone prefers the dynamic debug thing now,
where we don't need such stuff. Just drop it.
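
(With CONFIG_DYNAMIC_DEBUG enabled, the pr_debug() callsites of this file
can be turned on at runtime through debugfs, e.g. by writing
'file cpu_cooling.c +p' to dynamic_debug/control, without having to edit
the source.)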

> >> +#define pr_fmt(fmt) "CPU cooling: " fmt
> >
> > I think you can use the dev_***() routines instead, as you can
> > directly the CPU device from anywhere.
>
> Can we postpone this change for later ? All the file is using pr_*
> (cpufreq_cooling included). There is only one place where dev_err is
> used but it is removed by the patch 3/7.

okay.

> >> + while (1) {
> >> + s64 next_wakeup;
> >> +
> >> + prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
> >> +
> >> + schedule();
> >> +
> >> + atomic_inc(&idle_cdev->count);
> >> +
> >> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> >> +
> >> + /*
> >> + * The last CPU waking up is in charge of setting the
> >> + * timer. If the CPU is hotplugged, the timer will
> >> + * move to another CPU (which may not belong to the
> >> + * same cluster) but that is not a problem as the
> >> + * timer will be set again by another CPU belonging to
> >> + * the cluster, so this mechanism is self adaptive and
> >> + * does not require any hotplugging dance.
> >> + */
> >
> > Well this depends on how CPU hotplug really happens. What happens to
> > the per-cpu-tasks which are in the middle of something when hotplug
> > happens? Does hotplug wait for those per-cpu-tasks to finish ?

Missed this one ?

> >> +int cpuidle_cooling_register(void)
> >> +{
> >> + struct cpuidle_cooling_device *idle_cdev = NULL;
> >> + struct thermal_cooling_device *cdev;
> >> + struct cpuidle_cooling_tsk *cct;
> >> + struct task_struct *tsk;
> >> + struct device_node *np;
> >> + cpumask_t *cpumask;
> >> + char dev_name[THERMAL_NAME_LENGTH];
> >> + int ret = -ENOMEM, cpu;
> >> + int index = 0;
> >> +
> >> + for_each_possible_cpu(cpu) {
> >> + cpumask = topology_core_cpumask(cpu);
> >> +
> >> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
> >> +
> >> + /*
> >> + * This condition makes the first cpu belonging to the
> >> + * cluster to create a cooling device and allocates
> >> + * the structure. Others CPUs belonging to the same
> >> + * cluster will just increment the refcount on the
> >> + * cooling device structure and initialize it.
> >> + */
> >> + if (cpu == cpumask_first(cpumask)) {
> >
> > Your function still have few assumptions of cpu numbering and it will
> > break in few cases. What if the CPUs on a big Little system (4x4) are
> > present in this order: B L L L L B B B ??
> >
> > This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,
> > 4-7 big and a big CPU is used by the boot loader to bring up Linux.
>
> Ok, how can I sort it out ?

I would do something like this:

cpumask_copy(possible, cpu_possible_mask);

while (!cpumask_empty(possible)) {
	first = cpumask_first(possible);
	cpumask = topology_core_cpumask(first);
	cpumask_andnot(possible, possible, cpumask);

	allocate_cooling_dev(first); //This is most of this function in your patch.

	/* Init the remaining CPUs of the cluster without modifying the
	 * mask returned by topology_core_cpumask(). */
	for_each_cpu(temp, cpumask) {
		//rest init "temp"
	}

	//Everything done, register cooling device for cpumask.
}

> >> + np = of_cpu_device_node_get(cpu);
> >> +
> >> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> >> + if (!idle_cdev)
> >> + goto out_fail;
> >> +
> >> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
> >> +
> >> + atomic_set(&idle_cdev->count, 0);
> >
> > This should already be 0, isn't it ?
>
> Yes.

I read it as, "I will drop it" :)

--
viresh

2018-03-06 23:21:31

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation

Hi!

> --- /dev/null
> +++ b/Documentation/thermal/cpu-idle-cooling.txt
> @@ -0,0 +1,165 @@
> +
> +Situation:
> +----------
> +

Can we have some real header here? Also if this is .rst, maybe it
should be marked so?

> +Under certain circumstances, the SoC reaches a temperature exceeding
> +the allocated power budget or the maximum temperature limit. The

I don't understand. Power budget is in W, temperature is in
kelvin. Temperature can't exceed power budget AFAICT.

> +former must be mitigated to stabilize the SoC temperature around the
> +temperature control using the defined cooling devices, the latter

later?

> +catastrophic situation where a radical decision must be taken to
> +reduce the temperature under the critical threshold, that can impact
> +the performances.

performance.

> +Another situation is when the silicon reaches a certain temperature
> +which continues to increase even if the dynamic leakage is reduced to
> +its minimum by clock gating the component. The runaway phenomena will
> +continue with the static leakage and only powering down the component,
> +thus dropping the dynamic and static leakage will allow the component
> +to cool down. This situation is critical.

Critical here, critical there. I have trouble following
it. Theoretically hardware should protect itself, because you don't
want kernel bug to damage your CPU?

> +Last but not least, the system can ask for a specific power budget but
> +because of the OPP density, we can only choose an OPP with a power
> +budget lower than the requested one and underuse the CPU, thus losing
> +performances. In other words, one OPP under uses the CPU with a

performance.

> +lesser than the power budget and the next OPP exceed the power budget,
> +an intermediate OPP could have been used if it were present.

was.

> +Solutions:
> +----------
> +
> +If we can remove the static and the dynamic leakage for a specific
> +duration in a controlled period, the SoC temperature will
> +decrease. Acting at the idle state duration or the idle cycle

"should" decrease? If you are in bad environment..

> +The Operating Performance Point (OPP) density has a great influence on
> +the control precision of cpufreq, however different vendors have a
> +plethora of OPP density, and some have large power gap between OPPs,
> +that will result in loss of performance during thermal control and
> +loss of power in other scenes.

scene seems to be wrong word here.

> +At a specific OPP, we can assume injecting idle cycle on all CPUs,

Extra comma?

> +Idle Injection:
> +---------------
> +
> +The base concept of the idle injection is to force the CPU to go to an
> +idle state for a specified time each control cycle, it provides
> +another way to control CPU power and heat in addition to
> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
> +this cluster can get into the deepest idle state and achieve minimum
> +power consumption, but that will also increase system response latency
> +if we inject less than cpuidle latency.

I don't understand last sentence.

> +The mitigation begins with a maximum period value which decrease

decreases?

> +more cooling effect is requested. When the period duration is equal
> to
> +the idle duration, then we are in a situation the platform can’t
> +dissipate the heat enough and the mitigation fails. In this case

fast enough?

> +situation is considered critical and there is nothing to do. The idle

Nothing to do? Maybe power the system down?

> +The idle injection duration value must comply with the constraints:
> +
> +- It is lesser or equal to the latency we tolerate when the mitigation

less ... than the latency

> +Minimum period
> +--------------
> +
> +The idle injection duration being fixed, it is obvious the minimum
> +period can’t be lesser than that, otherwise we will be scheduling the

less.

> +Practically, if the running power is lesses than the targeted power,

less.

> +However, in this demonstration we ignore three aspects:
> +
> + * The static leakage is not defined here, we can introduce it in the
> + equation but assuming it will be zero most of the time as it is

, but?

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-03-07 11:44:06

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation

On 07/03/2018 00:19, Pavel Machek wrote:
> Hi!

Hi Pavel,

thanks for taking the time to review the documentation.

>> --- /dev/null
>> +++ b/Documentation/thermal/cpu-idle-cooling.txt
>> @@ -0,0 +1,165 @@
>> +
>> +Situation:
>> +----------
>> +
>
> Can we have some real header here? Also if this is .rst, maybe it
> should be marked so?

Ok, I will fix it.

>> +Under certain circumstances, the SoC reaches a temperature exceeding
>> +the allocated power budget or the maximum temperature limit. The
>
> I don't understand. Power budget is in W, temperature is in
> kelvin. Temperature can't exceed power budget AFAICT.

Yes, it is badly worded. Is the following better ?

"
Under certain circumstances a SoC can reach the maximum temperature
limit, or it can be unable to stabilize the temperature around a
temperature control point.

When the SoC has to stabilize the temperature, the kernel can act on a
cooling device to mitigate the dissipated power.

When the maximum temperature is reached, to prevent a catastrophic
situation a radical decision must be taken to bring the temperature back
under the critical threshold, which impacts the performance.

"

>> +former must be mitigated to stabilize the SoC temperature around the
>> +temperature control using the defined cooling devices, the latter
>
> later?
>
>> +catastrophic situation where a radical decision must be taken to
>> +reduce the temperature under the critical threshold, that can impact
>> +the performances.
>
> performance.
>
>> +Another situation is when the silicon reaches a certain temperature
>> +which continues to increase even if the dynamic leakage is reduced to
>> +its minimum by clock gating the component. The runaway phenomena will
>> +continue with the static leakage and only powering down the component,
>> +thus dropping the dynamic and static leakage will allow the component
>> +to cool down. This situation is critical.
>
> Critical here, critical there. I have trouble following
> it. Theoretically hardware should protect itself, because you don't
> want kernel bug to damage your CPU?

There are several levels of protection. The first level is mitigating
the temperature from the kernel; then, in the temperature sensor, a reset
line will trigger the reboot of the CPUs. Usually it is a register where
you write the maximum temperature from the driver itself. I never tried
writing 1000°C in this register to see if I can burn the board.

I know some boards have another level of thermal protection in the
hardware itself and some others don't.

In any case, from a kernel point of view, it is a critical situation as
we are about to hard reboot the system, and in this case it is preferable
to drastically drop the performance but give the system the opportunity
to run in a degraded mode.

>> +Last but not least, the system can ask for a specific power budget but
>> +because of the OPP density, we can only choose an OPP with a power
>> +budget lower than the requested one and underuse the CPU, thus losing
>> +performances. In other words, one OPP under uses the CPU with a
>
> performance.
>
>> +lesser than the power budget and the next OPP exceed the power budget,
>> +an intermediate OPP could have been used if it were present.
>
> was.
>
>> +Solutions:
>> +----------
>> +
>> +If we can remove the static and the dynamic leakage for a specific
>> +duration in a controlled period, the SoC temperature will
>> +decrease. Acting at the idle state duration or the idle cycle
>
> "should" decrease? If you are in bad environment..

No, it will decrease in any case because of the static leakage drop. The
bad environment will impact the speed of this decrease.

>> +The Operating Performance Point (OPP) density has a great influence on
>> +the control precision of cpufreq, however different vendors have a
>> +plethora of OPP density, and some have large power gap between OPPs,
>> +that will result in loss of performance during thermal control and
>> +loss of power in other scenes.
>
> scene seems to be wrong word here.

yes, 'scenario' will be better :)

>> +At a specific OPP, we can assume injecting idle cycle on all CPUs,
>
> Extra comma?
>
>> +Idle Injection:
>> +---------------
>> +
>> +The base concept of the idle injection is to force the CPU to go to an
>> +idle state for a specified time each control cycle, it provides
>> +another way to control CPU power and heat in addition to
>> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
>> +this cluster can get into the deepest idle state and achieve minimum
>> +power consumption, but that will also increase system response latency
>> +if we inject less than cpuidle latency.
>
> I don't understand last sentence.

Is it better ?

"Ideally, if all CPUs, belonging to the same cluster, inject their idle
cycle synchronously, the cluster can reach its power down state with a
minimum power consumption and static leakage drop. However, these idle
cycles injection will add extra latencies as the CPUs will have to
wakeup from a deep sleep state."


>> +The mitigation begins with a maximum period value which decrease
>
> decreases?
>
>> +more cooling effect is requested. When the period duration is equal
>> to
>> +the idle duration, then we are in a situation the platform can’t
>> +dissipate the heat enough and the mitigation fails. In this case
>
> fast enough?
>
>> +situation is considered critical and there is nothing to do. The idle
>
> Nothing to do? Maybe power the system down?

Nothing to do == the mitigation can't handle the situation, it reached
its limit. We can't do better.

Solution: add an emergency thermal shutdown (which is an orthogonal
feature to be added to the thermal framework).

Sidenote: it is a very unlikely case, as we are idle most of the time
when the heat is hard to dissipate. I tested this with a proto-SoC with
an interesting thermal behavior (the temperature jumps insanely high),
running at full blast and with bad heat dissipation, and the mitigation
never reached the limit.

>> +The idle injection duration value must comply with the constraints:
>> +
>> +- It is lesser or equal to the latency we tolerate when the mitigation
>
> less ... than the latency
>
>> +Minimum period
>> +--------------
>> +
>> +The idle injection duration being fixed, it is obvious the minimum
>> +period can’t be lesser than that, otherwise we will be scheduling the
>
> less.
>
>> +Practically, if the running power is lesses than the targeted power,
>
> less.
>
>> +However, in this demonstration we ignore three aspects:
>> +
>> + * The static leakage is not defined here, we can introduce it in the
>> + equation but assuming it will be zero most of the time as it is
>
> , but?
>
> Best regards,

Thanks!




2018-03-07 17:11:31

by Eduardo Valentin

[permalink] [raw]
Subject: Re: [PATCH V2 0/7] CPU cooling device new strategies

Hello Daniel,

On Wed, Feb 21, 2018 at 04:29:21PM +0100, Daniel Lezcano wrote:
> Changelog:
> V2:
> - Dropped the cpu combo cooling device
> - Added the acked-by tags
> - Replaced task array by a percpu structure
> - Fixed the copyright dates
> - Fixed the extra lines
> - Fixed the compilation macros to be reused
> - Fixed the list node removal
> - Massaged a couple of function names
>
>
> The following series provides a new way to cool down a SoC by reducing
> the dissipated power on the CPUs. Based on the initial work from Kevin
> Wangtao, the series implements a CPU cooling device based on idle
> injection, relying on the cpuidle framework.

Nice! Glad to see that Linaro took this work again. I have a few
questions, as follows.

>
> The patchset is designed to have the current DT binding for the
> cpufreq cooling device to be compatible with the new cooling devices.
>
> Different cpu cooling devices can not co-exist on the system, the cpu
> cooling device is enabled or not, and one cooling strategy is selected
> (cpufreq or cpuidle). It is not possible to have all of them available
> at the same time. However, the goal is to enable them all and be able
> to switch from one to another at runtime but that needs a rework of the
> thermal framework which is orthogonal to the feature we are providing.
>

Can you please elaborate on the limitations you found? Please be more
specific.

> This series is divided into two parts.
>
> The first part just provides trivial changes for the copyright and
> removes an unused field in the cpu freq cooling device structure.
>

Ok..

> The second part provides the idle injection cooling device, allowing a SoC
> without a cpufreq driver to use this cooling device as an alternative.
>

which is awesome!


> The preliminary benchmarks show the following changes:
>
> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
> increase of the latency of 16% while sysbench shows a latency increase
> of 5%.

I don't follow these numbers. Throughput increase while injecting idle?
Compared to what? Percentages of what? Please be more specific to better
describe your work.

>
> Initially, the first version provided also the cpuidle + cpufreq combo
> cooling device but regarding the latest comments, there is a misfit with
> how the cpufreq cooling device and suspend/resume/cpu hotplug/module
> loading|unloading behave together while the combo cooling device was
> designed assuming the cpufreq cooling device was always there. This
> dynamic is investigated and the combo cooling device will be posted
> separetely after this series gets merged.

Yeah, this is one of the confusing parts. Could you please
remind us of the limitations here? Why can't we enable CPUfreq
on higher trip points and CPUidle on lower trip points, for example?
Specially from a system design point of view, the system engineer
typically would benefit of idle injections to achieve overall
average CPU frequencies in a more granular fashion, for example,
achieving performance vs. cooling between available real
frequencies, avoiding real switches.

Also, there is a major design question here. After Linaro's attempt
to send a cpufreq+cpuidle cooling device(s), there was an attempt
to generalize and extend intel powerclamp driver. Do you mind
explaining why refactoring intel powerclamp is not possible? Basic
idea is the same, no?



>
> Daniel Lezcano (7):
> thermal/drivers/cpu_cooling: Fixup the header and copyright
> thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
> thermal/drivers/cpu_cooling: Remove pointless field
> thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
> thermal/drivers/cpu_cooling: Add idle cooling device documentation
> thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
> cpuidle/drivers/cpuidle-arm: Register the cooling device
>
> Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
> drivers/cpuidle/cpuidle-arm.c | 5 +
> drivers/thermal/Kconfig | 30 +-
> drivers/thermal/cpu_cooling.c | 480 +++++++++++++++++++++++++++--
> include/linux/cpu_cooling.h | 15 +-
> 5 files changed, 668 insertions(+), 27 deletions(-)
> create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
>
> --
> 2.7.4
>

2018-03-07 19:00:11

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 0/7] CPU cooling device new strategies


Hi Eduardo,


On 07/03/2018 18:09, Eduardo Valentin wrote:
> Hello Daniel,
>
> On Wed, Feb 21, 2018 at 04:29:21PM +0100, Daniel Lezcano wrote:
>> Changelog:
>> V2:
>> - Dropped the cpu combo cooling device
>> - Added the acked-by tags
>> - Replaced task array by a percpu structure
>> - Fixed the copyright dates
>> - Fixed the extra lines
>> - Fixed the compilation macros to be reused
>> - Fixed the list node removal
>> - Massaged a couple of function names
>>
>>
>> The following series provides a new way to cool down a SoC by reducing
>> the dissipated power on the CPUs. Based on the initial work from Kevin
>> Wangtao, the series implements a CPU cooling device based on idle
>> injection, relying on the cpuidle framework.
>
> Nice! Glad to see that Linaro took this work again. I have a few
> questions, as follows.
>
>>
>> The patchset is designed to have the current DT binding for the
>> cpufreq cooling device to be compatible with the new cooling devices.
>>
>> Different cpu cooling devices can not co-exist on the system, the cpu
>> cooling device is enabled or not, and one cooling strategy is selected
>> (cpufreq or cpuidle). It is not possible to have all of them available
>> at the same time. However, the goal is to enable them all and be able
>> to switch from one to another at runtime but that needs a rework of the
>> thermal framework which is orthogonal to the feature we are providing.
>>
>
> Can you please elaborate on the limitations you found? Please be more
> specific.

I did not find the limitation because the dynamic change was not
implemented. My concern is mainly about switching while we are
mitigating; I'm not sure the thermal framework supports that for the moment.

This is why Viresh proposed to first add the idle injection and then
support the dynamic change before resending the combo cooling device.

>> This series is divided into two parts.
>>
>> The first part just provides trivial changes for the copyright and
>> removes an unused field in the cpu freq cooling device structure.
>>
>
> Ok..
>
>> The second part provides the idle injection cooling device, allowing a SoC
>> without a cpufreq driver to use this cooling device as an alternative.
>>
>
> which is awesome!
>
>
>> The preliminary benchmarks show the following changes:
>>
>> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
>> increase of the latency of 16% while sysbench shows a latency increase
>> of 5%.
>
> I don't follow these numbers. Throughput increase while injecting idle?
> compared to what? percentages of what? Please be more specific to better
> describer your work..

The dhrystone throughput is based on the virtual timer: when we are
running, we run at the max OPP, so the throughput increases. But in terms
of real (wall-clock) time, it obviously takes longer to complete as we are
artificially inserting idle cycles. With cpufreq, we run at a lower OPP,
so the dhrystone throughput is lower but the run takes less time to
complete.

The percentages compare the cpufreq and cpuidle cooling devices. I will
take care of presenting the results in a clearer way in the next version.

>> Initially, the first version provided also the cpuidle + cpufreq combo
>> cooling device but regarding the latest comments, there is a misfit with
>> how the cpufreq cooling device and suspend/resume/cpu hotplug/module
>> loading|unloading behave together while the combo cooling device was
>> designed assuming the cpufreq cooling device was always there. This
>> dynamic is investigated and the combo cooling device will be posted
>> separetely after this series gets merged.
>
> Yeah, this is one of the confusing parts. Could you please
> remind us of the limitations here? Why can't we enable CPUfreq
> on higher trip points and CPUidle on lower trip points, for example?

Sorry, I'm not getting the question. We don't want to enable cpuidle
or cpufreq at a certain trip point, but to combine the cooling effect
of both in order to get the best power / performance tradeoff.

Let me give an example with a simple SoC - one core.

Let's say we have 4 OPPs and a core-sleep idle state. The OPPs
consume 100mW, 500mW, 2W and 4W respectively. Now the CPU is doing
intensive work at the highest OPP, thus consuming 4W. The temperature
increases and reaches 75°C, which is the mitigation point, where the
sustainable power is 1.7W.

- With the cpufreq cooling device, we can't have 4W, so we go back and
forth between 2W and 500mW.

- With the cpuidle cooling device, we are at the highest OPP (there is
no cpufreq driver) and we insert 47.5% of idle time.

- With the combo cooling device, we compute the round-up OPP (here 2W)
and we insert idle cycles to remove the remaining power and reach the
sustainable power, so 15%.

With the combo, we increase the performance for the same requested
power. There are no state2power callbacks yet, but we expect the
combination of dropping the static leakage and running at a higher OPP
to give better results in terms of performance and mitigation on
power-hungry CPUs like the recent big ARM CPUs with the IPA governor.
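
For illustration, here is a minimal sketch of the arithmetic behind the
combo case above, assuming a simplified model where the dissipated
power scales linearly with the running ratio and where static leakage
and idle-state power are ignored:

#include <stdio.h>

int main(void)
{
        double p_sustainable = 1.7;   /* W, power budget at the mitigation point */
        double p_opp = 2.0;           /* W, round-up OPP picked by the combo device */

        /* Fraction of each control cycle to spend in injected idle. */
        double idle_ratio = 1.0 - p_sustainable / p_opp;

        /* Prints "idle ratio at the 2.0W OPP: 15%". */
        printf("idle ratio at the %.1fW OPP: %.0f%%\n",
               p_opp, idle_ratio * 100.0);

        return 0;
}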

Going straight to the point of your question above, we can see that
the cpufreq cooling device and the cpuidle cooling device have to
collaborate. If we unregister the cpufreq device, we have to redo the
power math in the combo cooling device. It is not a problem by itself,
but it needs some extra thought in the current code.

> Specially from a system design point of view, the system engineer
> typically would benefit of idle injections to achieve overall
> average CPU frequencies in a more granular fashion, for example,
> achieving performance vs. cooling between available real
> frequencies, avoiding real switches.
>
> Also, there is a major design question here. After Linaro's attempt
> to send a cpufreq+cpuidle cooling device(s), there was an attempt
> to generalize and extend intel powerclamp driver.

I'm not aware of such attempt.

> Do you mind
> explaining why refactoring intel powerclamp is not possible? Basic
> idea is the same, no?

Basically the idea is the same: run synchronized idle threads and call
play_idle(). That is all. Leaving aside the fact that intel_powerclamp
is very x86-centric and contains a plethora of code that does not fit
our purpose, it increases the idle duration, while we increase the
number of idle cycles but keep the idle duration constant in order to
keep control of the latency for user interactivity. If you compare the
idle injection thread code of the two (powerclamp and the cpuidle
cooling device), you will also notice they are very different in terms
of implementation.

The combo cooling device collaborates with the cpufreq cooling device,
reuses the DT binding, and finally uses the power information provided
in the DT. The idle injection is a building block for the combo
cooling device.

Initially I thought we should refactor intel_powerclamp, but it turns
out the combo cooling device reuses the cpufreq and cpuidle cooling
devices, so it makes sense to have them all in a single file, evolving
towards a single cooling device with different strategies.



>> Daniel Lezcano (7):
>> thermal/drivers/cpu_cooling: Fixup the header and copyright
>> thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
>> thermal/drivers/cpu_cooling: Remove pointless field
>> thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
>> thermal/drivers/cpu_cooling: Add idle cooling device documentation
>> thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
>> cpuidle/drivers/cpuidle-arm: Register the cooling device
>>
>> Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
>> drivers/cpuidle/cpuidle-arm.c | 5 +
>> drivers/thermal/Kconfig | 30 +-
>> drivers/thermal/cpu_cooling.c | 480 +++++++++++++++++++++++++++--
>> include/linux/cpu_cooling.h | 15 +-
>> 5 files changed, 668 insertions(+), 27 deletions(-)
>> create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
>>
>> --
>> 2.7.4
>>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-03-08 09:02:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation

Hi!

> >> +Under certain circumstances, the SoC reaches a temperature exceeding
> >> +the allocated power budget or the maximum temperature limit. The
> >
> > I don't understand. Power budget is in W, temperature is in
> > kelvin. Temperature can't exceed power budget AFAICT.
>
> Yes, it is badly worded. Is the following better ?
>
> "
> Under certain circumstances a SoC can reach the maximum temperature
> limit or is unable to stabilize the temperature around a temperature
> control.
>
> When the SoC has to stabilize the temperature, the kernel can act on a
> cooling device to mitigate the dissipated power.
>
> When the maximum temperature is reached and to prevent a catastrophic
> situation a radical decision must be taken to reduce the temperature
> under the critical threshold, that impacts the performance.
>
> "

Actually... if hardware is expected to protect itself, I'd tone it
down. No need to be all catastrophic and critical... But yes, better.

> > Critical here, critical there. I have trouble following
> > it. Theoretically hardware should protect itself, because you don't
> > want kernel bug to damage your CPU?
>
> There are several levels of protection. The first level is mitigating
> the temperature from the kernel, then in the temperature sensor a reset
> line will trigger the reboot of the CPUs. Usually it is a register where
> you write the maximum temperature, from the driver itself. I never tried
> to write 1000°C in this register and see if I can burn the board.
>
> I know some boards have another level of thermal protection in the
> hardware itself and some other don't.
>
> In any case, from a kernel point of view, it is a critical situation as
> we are about to hard reboot the system and in this case it is preferable
> to drop drastically the performance but give the opportunity to the
> system to run in a degraded mode.

Agreed, you want to keep going. In the ACPI world, we shut down when
the critical trip point is reached, so this is somewhat confusing.

> >> +Solutions:
> >> +----------
> >> +
> >> +If we can remove the static and the dynamic leakage for a specific
> >> +duration in a controlled period, the SoC temperature will
> >> +decrease. Acting at the idle state duration or the idle cycle
> >
> > "should" decrease? If you are in bad environment..
>
> No, it will decrease in any case because of the static leakage drop. The
> bad environment will impact the speed of this decrease.

I meant... if the ambient temperature is 105°C, there's not much you
can do to cool the system down :-).

> >> +Idle Injection:
> >> +---------------
> >> +
> >> +The base concept of the idle injection is to force the CPU to go to an
> >> +idle state for a specified time each control cycle, it provides
> >> +another way to control CPU power and heat in addition to
> >> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
> >> +this cluster can get into the deepest idle state and achieve minimum
> >> +power consumption, but that will also increase system response latency
> >> +if we inject less than cpuidle latency.
> >
> > I don't understand last sentence.
>
> Is it better ?
>
> "Ideally, if all CPUs, belonging to the same cluster, inject their idle
> cycle synchronously, the cluster can reach its power down state with a
> minimum power consumption and static leakage drop. However, these idle
> cycles injection will add extra latencies as the CPUs will have to
> wakeup from a deep sleep state."

Extra comma "CPUs , belonging". But yes, better.

> Thanks!

You are welcome. Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-03-08 11:56:08

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH V2 5/7] thermal/drivers/cpu_cooling: Add idle cooling device documentation

On Thu, Mar 08, 2018 at 09:59:49AM +0100, Pavel Machek wrote:
> Hi!
>
> > >> +Under certain circumstances, the SoC reaches a temperature exceeding
> > >> +the allocated power budget or the maximum temperature limit. The
> > >
> > > I don't understand. Power budget is in W, temperature is in
> > > kelvin. Temperature can't exceed power budget AFAICT.
> >
> > Yes, it is badly worded. Is the following better ?
> >
> > "
> > Under certain circumstances a SoC can reach the maximum temperature
> > limit or is unable to stabilize the temperature around a temperature
> > control.
> >
> > When the SoC has to stabilize the temperature, the kernel can act on a
> > cooling device to mitigate the dissipated power.
> >
> > When the maximum temperature is reached and to prevent a catastrophic
> > situation a radical decision must be taken to reduce the temperature
> > under the critical threshold, that impacts the performance.
> >
> > "
>
> Actually... if hardware is expected to protect itself, I'd tone it
> down. No need to be all catastrophic and critical... But yes, better.

Makes sense. For a thermally overcommitted but passively cooled device
working close to its max operating temperature, it is not a critical
situation requiring a radical reaction, it is normal operation.

Put another way, it would be severely bogus to attach KERN_CRIT
messages to reaching the cooling threshold.


Daniel.


> > > Critical here, critical there. I have trouble following
> > > it. Theoretically hardware should protect itself, because you don't
> > > want kernel bug to damage your CPU?
> >
> > There are several levels of protection. The first level is mitigating
> > the temperature from the kernel, then in the temperature sensor a reset
> > line will trigger the reboot of the CPUs. Usually it is a register where
> > you write the maximum temperature, from the driver itself. I never tried
> > to write 1000°C in this register and see if I can burn the board.
> >
> > I know some boards have another level of thermal protection in the
> > hardware itself and some other don't.
> >
> > In any case, from a kernel point of view, it is a critical situation as
> > we are about to hard reboot the system and in this case it is preferable
> > to drop drastically the performance but give the opportunity to the
> > system to run in a degraded mode.
>
> Agreed you want to keep going. In ACPI world, we shutdown when
> critical trip point is reached, so this is somehow confusing.
>
> > >> +Solutions:
> > >> +----------
> > >> +
> > >> +If we can remove the static and the dynamic leakage for a specific
> > >> +duration in a controlled period, the SoC temperature will
> > >> +decrease. Acting at the idle state duration or the idle cycle
> > >
> > > "should" decrease? If you are in bad environment..
> >
> > No, it will decrease in any case because of the static leakage drop. The
> > bad environment will impact the speed of this decrease.
>
> I meant... if ambient temperature is 105C, there's not much you can do
> to cool system down :-).
>
> > >> +Idle Injection:
> > >> +---------------
> > >> +
> > >> +The base concept of the idle injection is to force the CPU to go to an
> > >> +idle state for a specified time each control cycle, it provides
> > >> +another way to control CPU power and heat in addition to
> > >> +cpufreq. Ideally, if all CPUs of a cluster inject idle synchronously,
> > >> +this cluster can get into the deepest idle state and achieve minimum
> > >> +power consumption, but that will also increase system response latency
> > >> +if we inject less than cpuidle latency.
> > >
> > > I don't understand last sentence.
> >
> > Is it better ?
> >
> > "Ideally, if all CPUs, belonging to the same cluster, inject their idle
> > cycle synchronously, the cluster can reach its power down state with a
> > minimum power consumption and static leakage drop. However, these idle
> > cycles injection will add extra latencies as the CPUs will have to
> > wakeup from a deep sleep state."
>
> Extra comma "CPUs , belonging". But yes, better.
>
> > Thanks!
>
> You are welcome. Best regards,
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-03-08 12:05:15

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH V2 0/7] CPU cooling device new strategies

On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:
> >> The preliminary benchmarks show the following changes:
> >>
> >> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
> >> increase of the latency of 16% while sysbench shows a latency increase
> >> of 5%.
> >
> > I don't follow these numbers. Throughput increase while injecting idle?
> > compared to what? percentages of what? Please be more specific to better
> > describer your work..
>
> The dhrystone throughput is based on the virtual timer, when we are
> running, it is at max opp, so the throughput increases. But regarding
> the real time, it takes obviously more time to achieve as we are
> artificially inserting idle cycles. With the cpufreq governor, we run at
> a lower opp, so the throughput is less for dhrystone but it takes less
> time to achieve.
>
> Percentages are comparing cpufreq vs cpuidle cooling devices. I will
> take care of presenting the results in a more clear way in the next version.

I think we should also note that the current hikey settings for cpufreq
are very badly tuned for this platform. It has a single temp threshold
and it jumps from max freq to min freq.

IIRC Leo's work on Hikey thermals correctly, it would be much better if
it used the power-allocator thermal governor, or if it copied some of
the Samsung 32-bit platforms by configuring the step governor with a
slightly lower threshold that moves two steps back in the OPP table
(which is still a fairly high clock speed... but is thermally
sustainable).


Daniel.


> >> Initially, the first version provided also the cpuidle + cpufreq combo
> >> cooling device but regarding the latest comments, there is a misfit with
> >> how the cpufreq cooling device and suspend/resume/cpu hotplug/module
> >> loading|unloading behave together while the combo cooling device was
> >> designed assuming the cpufreq cooling device was always there. This
> >> dynamic is investigated and the combo cooling device will be posted
> >> separetely after this series gets merged.
> >
> > Yeah, this is one of the confusing parts. Could you please
> > remind us of the limitations here? Why can't we enable CPUfreq
> > on higher trip points and CPUidle on lower trip points, for example?
>
> Sorry, I'm not getting the question. We don't want to enable cpuidle or
> cpufreq at certain point but combine the cooling effect of both in order
> to get the best tradeoff power / performance.
>
> Let me give an example with a simple SoC - one core.
>
> Let's say we have 4 OPPs and a core-sleep idle state. Respectively, the
> OPPs consume 100mW, 500mW, 2W, 4W. Now the CPU is in an intensive work
> running at the highest OPP, thus consuming 4W. The temperature increases
> and reaches 75°C which is the mitigation point and where the sustainable
> power is 1.7W.
>
> - With the cpufreq cooling device, we can't have 4W, so we go back and
> forth between 2W and 500mW.
>
> - With the cpuidle cooling device, we are at the highest OPP (there is
> no cpufreq driver) and we insert 47.5% of idle duration
>
> - With the combo cooling device, we compute the round-up OPP (here 2W)
> and we insert idle cycles for the remaining power to reach the
> sustainable power, so 15%.
>
> With the combo, we increase the performances for the same requested
> power. There is no yet the state2power callbacks but we expect the
> combination of dropping the static leakage and the higher OPP to give
> better results in terms of performance and mitigation on energy eager
> CPUs like the recent big ARM cpus with the IPA governor.
>
> Going straight to the point of your question above, we can see the
> cpufreq cooling device and the cpuidle cooling device have to
> collaborate. If we unregister the cpufreq device, we have to do the math
> for the power again in the combo cooling device. It is not a problem by
> itself but needs an extra reflexion in the current code.
>
> > Specially from a system design point of view, the system engineer
> > typically would benefit of idle injections to achieve overall
> > average CPU frequencies in a more granular fashion, for example,
> > achieving performance vs. cooling between available real
> > frequencies, avoiding real switches.
> >
> > Also, there is a major design question here. After Linaro's attempt
> > to send a cpufreq+cpuidle cooling device(s), there was an attempt
> > to generalize and extend intel powerclamp driver.
>
> I'm not aware of such attempt.
>
> > Do you mind
> > explaining why refactoring intel powerclamp is not possible? Basic
> > idea is the same, no?
>
> Basically the idea is the same: run synchronized idle thread and call
> play_idle(). That is all. Putting apart the intel_powerclamp is very x86
> centric and contains a plethora of code not fitting our purpose, it
> increases the idle duration while we are increasing the number of idle
> cycles but keep the idle duration constant in order to have a control on
> the latency for the user interactivity. If you compare the idle
> injection threads codes (powerclamp and cpuidle cooling device), you
> will also notice they are very different in terms of implementation.
>
> The combo cooling device collaborates with the cpufreq cooling device
> and reuses the DT binding, and finally it uses the power information
> provided in the DT. The idle injection is a brick to the combo cooling
> device.
>
> Initially I thought we should refactor the intel_powerclamp but it
> appears the combo cooling device reuses the cpufreq and cpuidle cooling
> device, making sense to have them all in a single file and evolving to a
> single cooling device with different strategies.
>
>
>
> >> Daniel Lezcano (7):
> >> thermal/drivers/cpu_cooling: Fixup the header and copyright
> >> thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
> >> thermal/drivers/cpu_cooling: Remove pointless field
> >> thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
> >> thermal/drivers/cpu_cooling: Add idle cooling device documentation
> >> thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
> >> cpuidle/drivers/cpuidle-arm: Register the cooling device
> >>
> >> Documentation/thermal/cpu-idle-cooling.txt | 165 ++++++++++
> >> drivers/cpuidle/cpuidle-arm.c | 5 +
> >> drivers/thermal/Kconfig | 30 +-
> >> drivers/thermal/cpu_cooling.c | 480 +++++++++++++++++++++++++++--
> >> include/linux/cpu_cooling.h | 15 +-
> >> 5 files changed, 668 insertions(+), 27 deletions(-)
> >> create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
> >>
> >> --
> >> 2.7.4
> >>
>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
>

2018-03-13 19:16:59

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 26/02/2018 05:30, Viresh Kumar wrote:

[ ... ]

>>>> + /*
>>>> + * The last CPU waking up is in charge of setting the
>>>> + * timer. If the CPU is hotplugged, the timer will
>>>> + * move to another CPU (which may not belong to the
>>>> + * same cluster) but that is not a problem as the
>>>> + * timer will be set again by another CPU belonging to
>>>> + * the cluster, so this mechanism is self adaptive and
>>>> + * does not require any hotplugging dance.
>>>> + */
>>>
>>> Well this depends on how CPU hotplug really happens. What happens to
>>> the per-cpu-tasks which are in the middle of something when hotplug
>>> happens? Does hotplug wait for those per-cpu-tasks to finish ?
>
> Missed this one ?

To be frank, I don't know. I have been through the hotplug code and
didn't find the answer. AFAICT, sched_cpu_dying() in sched/core.c
disables the rq and then migrates the tasks. I assume this kthread is
not migrated and stays in a sleeping state until the rq is online
again. As the wakeup callback discards offline CPUs, the kthread is
never woken up in the meantime.

I created a situation where that happens: mitigation (raised with
dhrystone) + CPU hotplugging (offline/online loop), and I never hit
any issue.

However, I'm wondering whether the 'struct smp_hotplug_thread'
infrastructure (git show f97f8f06) shouldn't be used instead.
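
For reference, here is a rough sketch (not taken from the patch) of
what using that infrastructure could look like. The field and function
names are those of <linux/smpboot.h>; the callbacks and the per-cpu
'injection_requested' flag are hypothetical stand-ins for the driver
state:

#include <linux/cpu.h>
#include <linux/percpu.h>
#include <linux/smpboot.h>

static DEFINE_PER_CPU(struct task_struct *, idle_inject_tsk);
/* Hypothetical stand-in for idle_cdev->state. */
static DEFINE_PER_CPU(bool, injection_requested);

static int idle_injection_should_run(unsigned int cpu)
{
        /* Run only while the cooling device requests mitigation. */
        return per_cpu(injection_requested, cpu);
}

static void idle_injection_fn(unsigned int cpu)
{
        /*
         * Inject one fixed idle period (the patch's constant); the last
         * CPU coming out of idle would still re-arm the wakeup timer,
         * as in the patch.
         */
        play_idle(DEFAULT_IDLE_TIME_US / USEC_PER_MSEC);
}

static struct smp_hotplug_thread idle_injection_threads = {
        .store                  = &idle_inject_tsk,
        .thread_should_run      = idle_injection_should_run,
        .thread_fn              = idle_injection_fn,
        .thread_comm            = "idle_inject/%u",
};

/* At init time: smpboot_register_percpu_thread(&idle_injection_threads); */

The smpboot core then creates, parks and unparks the threads across CPU
hotplug, which is presumably the point of using it here.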


>>>> +int cpuidle_cooling_register(void)
>>>> +{
>>>> + struct cpuidle_cooling_device *idle_cdev = NULL;
>>>> + struct thermal_cooling_device *cdev;
>>>> + struct cpuidle_cooling_tsk *cct;
>>>> + struct task_struct *tsk;
>>>> + struct device_node *np;
>>>> + cpumask_t *cpumask;
>>>> + char dev_name[THERMAL_NAME_LENGTH];
>>>> + int ret = -ENOMEM, cpu;
>>>> + int index = 0;
>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + cpumask = topology_core_cpumask(cpu);
>>>> +
>>>> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>>>> +
>>>> + /*
>>>> + * This condition makes the first cpu belonging to the
>>>> + * cluster to create a cooling device and allocates
>>>> + * the structure. Others CPUs belonging to the same
>>>> + * cluster will just increment the refcount on the
>>>> + * cooling device structure and initialize it.
>>>> + */
>>>> + if (cpu == cpumask_first(cpumask)) {
>>>
>>> Your function still have few assumptions of cpu numbering and it will
>>> break in few cases. What if the CPUs on a big Little system (4x4) are
>>> present in this order: B L L L L B B B ??
>>>
>>> This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,
>>> 4-7 big and a big CPU is used by the boot loader to bring up Linux.
>>
>> Ok, how can I sort it out ?
>
> I would do something like this:
>
> cpumask_copy(possible, cpu_possible_mask);
>
> while (!cpumask_empty(possible)) {
> first = cpumask_first(possible);
> cpumask = topology_core_cpumask(first);
> cpumask_andnot(possible, possible, cpumask);
>
> allocate_cooling_dev(first); //This is most of this function in your patch.
>
> while (!cpumask_empty(cpumask)) {
> temp = cpumask_first(possible);
> //rest init "temp"
> cpumask_clear_cpu(temp, cpumask);
> }
>
> //Everything done, register cooling device for cpumask.
> }
>
>>>> + np = of_cpu_device_node_get(cpu);
>>>> +
>>>> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
>>>> + if (!idle_cdev)
>>>> + goto out_fail;
>>>> +
>>>> + idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
>>>> +
>>>> + atomic_set(&idle_cdev->count, 0);
>>>
>>> This should already be 0, isn't it ?
>>
>> Yes.
>
> I read it as, "I will drop it" :)
>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-03-26 14:32:34

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 0/7] CPU cooling device new strategies

On Thu, Mar 08, 2018 at 12:03:52PM +0000, Daniel Thompson wrote:
> On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:
> > >> The preliminary benchmarks show the following changes:
> > >>
> > >> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
> > >> increase of the latency of 16% while sysbench shows a latency increase
> > >> of 5%.
> > >
> > > I don't follow these numbers. Throughput increase while injecting idle?
> > > compared to what? percentages of what? Please be more specific to better
> > > describer your work..
> >
> > The dhrystone throughput is based on the virtual timer, when we are
> > running, it is at max opp, so the throughput increases. But regarding
> > the real time, it takes obviously more time to achieve as we are
> > artificially inserting idle cycles. With the cpufreq governor, we run at
> > a lower opp, so the throughput is less for dhrystone but it takes less
> > time to achieve.
> >
> > Percentages are comparing cpufreq vs cpuidle cooling devices. I will
> > take care of presenting the results in a more clear way in the next version.
>
> I think we should also note that the current hikey settings for cpufreq
> are very badly tuned for this platform. It has a single temp threshold
> and it jumps from max freq to min freq.
>
> IIRC Leo's work on Hikey thermals correctly it would be much better if
> it used the power-allocator thermal governor or if if copied some of
> the Samsung 32-bit platform by configuring the step governor with a
> graduated with a slightly lower threshold that moves two stops back in
> the OPP table (which is still fairly high clock speed... but it
> thermally sustainable).

I think Daniel L. is working on this patch set with the
'power-allocator' governor, and the parameters 'sustainable-power =
<3326>' and 'dynamic-power-coefficient = <311>' are profiled values for
the Hikey platform. For now we only consider dynamic power and skip
static leakage for the 'power-allocator' governor. All these parameters
are merged into the mainline Linux kernel.

Daniel L. could correct me if I misunderstand the testing conditions.

Thanks,
Leo Yan

2018-03-27 02:04:48

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

Hi Daniel,

On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all
> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>
> Each cluster has its own idle cooling device, each core has its own idle
> injection thread, each idle injection thread uses play_idle to enter idle. In
> order to reach the deepest idle state, each cooling device has the idle
> injection threads synchronized together.
>
> It has some similarity with the intel power clamp driver but it is actually
> designed to work on the ARM architecture via the DT with a mathematical proof
> with the power model which comes with the Documentation.
>
> The idle injection cycle is fixed while the running cycle is variable. That
> allows to have control on the device reactivity for the user experience. At
> the mitigation point the idle threads are unparked, they play idle the
> specified amount of time and they schedule themselves. The last thread sets
> the next idle injection deadline and when the timer expires it wakes up all
> the threads which in turn play idle again. Meanwhile the running cycle is
> changed by set_cur_state. When the mitigation ends, the threads are parked.
> The algorithm is self adaptive, so there is no need to handle hotplugging.

The idle injection threads are RT threads (FIFO) and I saw that
play_idle() sets/clears the PF_IDLE flag for them. Will the idle
injection threads' utilization be accounted into the RT utilization?

If the idle injection threads' utilization is accounted as RT task
utilization, will this impact the CPUFreq governor 'schedutil' for OPP
selection?

[...]

Thanks,
Leo Yan

2018-03-27 03:37:32

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:

[...]

> +/**
> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
> + * @arg: a void pointer containing the idle cooling device address
> + *
> + * This main function does basically two operations:
> + *
> + * - Goes idle for a specific amount of time
> + *
> + * - Sets a timer to wake up all the idle injection threads after a
> + * running period
> + *
> + * That happens only when the mitigation is enabled, otherwise the
> + * task is scheduled out.
> + *
> + * In order to keep the tasks synchronized together, it is the last
> + * task exiting the idle period which is in charge of setting the
> + * timer.
> + *
> + * This function never returns.
> + */
> +static int cpuidle_cooling_injection_thread(void *arg)
> +{
> + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };

I am just wondering if we should set the priority to (MAX_RT_PRIO - 1)?
Otherwise I am concerned that the cluster might not be able to enter
the deep idle state when any CPU's idle injection thread is preempted
by a higher-priority RT thread, so the CPUs are no longer aligned for
idle state entering/exiting.

> + struct cpuidle_cooling_device *idle_cdev = arg;
> + struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
> + smp_processor_id());
> + DEFINE_WAIT(wait);
> +
> + set_freezable();
> +
> + sched_setscheduler(current, SCHED_FIFO, &param);
> +
> + while (1) {
> + s64 next_wakeup;
> +
> + prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
> +
> + schedule();
> +
> + atomic_inc(&idle_cdev->count);
> +
> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> +
> + /*
> + * The last CPU waking up is in charge of setting the
> + * timer. If the CPU is hotplugged, the timer will
> + * move to another CPU (which may not belong to the
> + * same cluster) but that is not a problem as the
> + * timer will be set again by another CPU belonging to
> + * the cluster, so this mechanism is self adaptive and
> + * does not require any hotplugging dance.
> + */
> + if (!atomic_dec_and_test(&idle_cdev->count))
> + continue;
> +
> + if (!idle_cdev->state)
> + continue;
> +
> + next_wakeup = cpuidle_cooling_runtime(idle_cdev);
> +
> + hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
> + HRTIMER_MODE_REL_PINNED);

If the SoC temperature decreases below the tipping point, will the
timer be disabled in this case? Or will the next timer event be set
with a large value of next_wakeup?

[...]

Thanks,
Leo Yan

2018-03-27 03:45:04

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On Fri, Feb 23, 2018 at 12:28:51PM +0100, Daniel Lezcano wrote:
> On 23/02/2018 08:34, Viresh Kumar wrote:
> > On 21-02-18, 16:29, Daniel Lezcano wrote:

> [ ... ]
>
> >> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
> >> +{
> >> + s64 next_wakeup;
> >> + int state = idle_cdev->state;
> >> +
> >> + /*
> >> + * The function must never be called when there is no
> >> + * mitigation because:
> >> + * - that does not make sense
> >> + * - we end up with a division by zero
> >> + */
> >> + BUG_ON(!state);
> >
> > As there is no locking in place, we can surely hit this case. What if
> > the state changed to 0 right before this routine was called ?
> >
> > I would suggest we should just return 0 in that case and get away with
> > the BUG_ON().

Here, if 'state' equals 0 and we return 0, then the return value will
be the same as when 'state' = 100, which makes the return value
confusing.

I think for 'state' = 0, should we return -1 to indicate that the
hrtimer will not be set in this case?

Thanks,
Leo Yan

2018-03-27 09:37:06

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 0/7] CPU cooling device new strategies

On 26/03/2018 16:30, Leo Yan wrote:
> On Thu, Mar 08, 2018 at 12:03:52PM +0000, Daniel Thompson wrote:
>> On Wed, Mar 07, 2018 at 07:57:17PM +0100, Daniel Lezcano wrote:
>>>>> The preliminary benchmarks show the following changes:
>>>>>
>>>>> On the hikey6220, dhrystone shows a throughtput increase of 40% for an
>>>>> increase of the latency of 16% while sysbench shows a latency increase
>>>>> of 5%.
>>>>
>>>> I don't follow these numbers. Throughput increase while injecting idle?
>>>> compared to what? percentages of what? Please be more specific to better
>>>> describer your work..
>>>
>>> The dhrystone throughput is based on the virtual timer, when we are
>>> running, it is at max opp, so the throughput increases. But regarding
>>> the real time, it takes obviously more time to achieve as we are
>>> artificially inserting idle cycles. With the cpufreq governor, we run at
>>> a lower opp, so the throughput is less for dhrystone but it takes less
>>> time to achieve.
>>>
>>> Percentages are comparing cpufreq vs cpuidle cooling devices. I will
>>> take care of presenting the results in a more clear way in the next version.
>>
>> I think we should also note that the current hikey settings for cpufreq
>> are very badly tuned for this platform. It has a single temp threshold
>> and it jumps from max freq to min freq.
>>
>> IIRC Leo's work on Hikey thermals correctly it would be much better if
>> it used the power-allocator thermal governor or if if copied some of
>> the Samsung 32-bit platform by configuring the step governor with a
>> graduated with a slightly lower threshold that moves two stops back in
>> the OPP table (which is still fairly high clock speed... but it
>> thermally sustainable).
>
> I think Daniel L. is working on this patch set with 'power-allocator'
> governor, and the parameters 'sustainable-power = <3326>' and
> 'dynamic-power-coefficient = <311>' are profiling value on Hikey
> platform. Now we only consider dynamic power and skip static leakage
> for 'power-allocator' governor. And all these parameters are merged
> into Linux mainline kernel.
>
> Daniel L. could correct me if I misunderstand the testing conditions.

Well, the first iteration is without the power-allocator governor API;
it was tested with the step-wise governor only. But you are right in
saying it will use the dynamic-power-coefficient and sustainable-power
values and will implement the power-allocator version of the API. I'm
working on the power-allocator version for the idle injection + OPP
change, as we need to compute the capacity equivalence between the
idle-injection cycles at one OPP and a lower OPP in order to change the
OPP for an optimized power / performance trade-off.
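
As a rough illustration of that equivalence (a sketch, not from the
series, assuming capacity is proportional to frequency and ignoring the
idle state entry/exit overhead):

#include <stdio.h>

int main(void)
{
        unsigned int freq_khz = 1200000;  /* current OPP (example value) */
        unsigned int idle_pct = 15;       /* injected idle ratio */

        /* Effective capacity once the idle cycles are injected. */
        unsigned int eff_khz = freq_khz * (100 - idle_pct) / 100;

        /* Prints "1200000 kHz with 15% idle injection ~ 1020000 kHz". */
        printf("%u kHz with %u%% idle injection ~ %u kHz\n",
               freq_khz, idle_pct, eff_khz);

        return 0;
}

The existing OPP whose frequency is closest to that effective value
would then be the natural candidate to switch to.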


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-03-27 10:34:18

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 27/03/2018 04:03, Leo Yan wrote:
> Hi Daniel,
>
> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
>> The cpu idle cooling driver performs synchronized idle injection across all
>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>>
>> Each cluster has its own idle cooling device, each core has its own idle
>> injection thread, each idle injection thread uses play_idle to enter idle. In
>> order to reach the deepest idle state, each cooling device has the idle
>> injection threads synchronized together.
>>
>> It has some similarity with the intel power clamp driver but it is actually
>> designed to work on the ARM architecture via the DT with a mathematical proof
>> with the power model which comes with the Documentation.
>>
>> The idle injection cycle is fixed while the running cycle is variable. That
>> allows to have control on the device reactivity for the user experience. At
>> the mitigation point the idle threads are unparked, they play idle the
>> specified amount of time and they schedule themselves. The last thread sets
>> the next idle injection deadline and when the timer expires it wakes up all
>> the threads which in turn play idle again. Meanwhile the running cycle is
>> changed by set_cur_state. When the mitigation ends, the threads are parked.
>> The algorithm is self adaptive, so there is no need to handle hotplugging.
>
> The idle injection threads are RT threads (FIFO) and I saw in
> play_idle() set/clear flag PF_IDLE for it. Will these idle injection
> threads utilization be accounted into RT utilization?
>
> If idle injection threads utilization is accounted as RT tasks
> utilization, will this impact CPUFreq governor 'schedutil' for OPP
> selection?

Hi Leo,

The idle injection task has a very low utilization when it is not in
the play_idle() function: basically it wakes up, sets a timer and calls
play_idle().

Regarding the use case, the idle injection is the base brick for a
combo cooling device with cpufreq + cpuidle. When the idle injection is
used alone, it is because there is no cpufreq driver for the platform.
If there is a cpufreq driver, then we should end up with the cpu
cooling device, where we have control of the OPP (and there are no idle
injection threads), or the combo cooling device.

Unless I'm missing something, the idle injection threads won't impact
the OPP selection.





--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-03-27 10:57:46

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 27/03/2018 05:35, Leo Yan wrote:
> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
>
> [...]
>
>> +/**
>> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
>> + * @arg: a void pointer containing the idle cooling device address
>> + *
>> + * This main function does basically two operations:
>> + *
>> + * - Goes idle for a specific amount of time
>> + *
>> + * - Sets a timer to wake up all the idle injection threads after a
>> + * running period
>> + *
>> + * That happens only when the mitigation is enabled, otherwise the
>> + * task is scheduled out.
>> + *
>> + * In order to keep the tasks synchronized together, it is the last
>> + * task exiting the idle period which is in charge of setting the
>> + * timer.
>> + *
>> + * This function never returns.
>> + */
>> +static int cpuidle_cooling_injection_thread(void *arg)
>> +{
>> + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
>
> I am just wandering if should set priority to (MAX_RT_PRIO - 1)?
> Otherwise I am concern it might be cannot enter deep idle state when
> any CPU idle injection thread is preempted by other higher priority RT
> threads so all CPUs have no alignment for idle state entering/exiting.

I do believe we should consider other RT tasks more important than the
idle injection threads.

>> + struct cpuidle_cooling_device *idle_cdev = arg;
>> + struct cpuidle_cooling_tsk *cct = per_cpu_ptr(&cpuidle_cooling_tsk,
>> + smp_processor_id());
>> + DEFINE_WAIT(wait);
>> +
>> + set_freezable();
>> +
>> + sched_setscheduler(current, SCHED_FIFO, &param);
>> +
>> + while (1) {
>> + s64 next_wakeup;
>> +
>> + prepare_to_wait(&cct->waitq, &wait, TASK_INTERRUPTIBLE);
>> +
>> + schedule();
>> +
>> + atomic_inc(&idle_cdev->count);
>> +
>> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
>> +
>> + /*
>> + * The last CPU waking up is in charge of setting the
>> + * timer. If the CPU is hotplugged, the timer will
>> + * move to another CPU (which may not belong to the
>> + * same cluster) but that is not a problem as the
>> + * timer will be set again by another CPU belonging to
>> + * the cluster, so this mechanism is self adaptive and
>> + * does not require any hotplugging dance.
>> + */
>> + if (!atomic_dec_and_test(&idle_cdev->count))
>> + continue;
>> +
>> + if (!idle_cdev->state)
>> + continue;
>> +
>> + next_wakeup = cpuidle_cooling_runtime(idle_cdev);
>> +
>> + hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
>> + HRTIMER_MODE_REL_PINNED);
>
> If SoC temperature descreases under tipping point, will the timer be
> disabled for this case? Or will here set next timer event with big
> value from next_wakeup?

Another timer (the polling one) will update the 'state' variable to
zero via set_cur_state(). In the worst case, we check idle_cdev->state
right before it is updated and we end up with one extra idle injection
cycle, which is perfectly fine.





--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-03-27 11:11:54

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 27/03/2018 05:43, Leo Yan wrote:
> On Fri, Feb 23, 2018 at 12:28:51PM +0100, Daniel Lezcano wrote:
>> On 23/02/2018 08:34, Viresh Kumar wrote:
>>> On 21-02-18, 16:29, Daniel Lezcano wrote:
>
>> [ ... ]
>>
>>>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>>>> +{
>>>> + s64 next_wakeup;
>>>> + int state = idle_cdev->state;
>>>> +
>>>> + /*
>>>> + * The function must never be called when there is no
>>>> + * mitigation because:
>>>> + * - that does not make sense
>>>> + * - we end up with a division by zero
>>>> + */
>>>> + BUG_ON(!state);
>>>
>>> As there is no locking in place, we can surely hit this case. What if
>>> the state changed to 0 right before this routine was called ?
>>>
>>> I would suggest we should just return 0 in that case and get away with
>>> the BUG_ON().
>
> Here if 'state' equals to 0 and we return 0, then the return value will
> be same with when 'state' = 100; this lets the return value confused.
>
> I think for 'state' = 0, should we return -1 so indicate the hrtimer
> will not be set for this case?

Yeah, I will look at how to make this smoother.
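
A minimal sketch of one possible shape for that, along the lines Leo
suggests (assuming the V2 semantics where 'state' is the idle
percentage and 'idle_cycle' is in microseconds; this is not the actual
follow-up patch):

static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
{
        int state = idle_cdev->state;

        /* No mitigation in progress: tell the caller not to arm the timer. */
        if (!state)
                return -1;

        /*
         * Running time such that idle_cycle is 'state' percent of the
         * period, converted to nanoseconds for hrtimer_start().
         */
        return ((s64)idle_cdev->idle_cycle * (100 - state) / state) *
                NSEC_PER_USEC;
}

/*
 * In the injection thread:
 *
 *      next_wakeup = cpuidle_cooling_runtime(idle_cdev);
 *      if (next_wakeup < 0)
 *              continue;
 *
 *      hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
 *                    HRTIMER_MODE_REL_PINNED);
 */

With this shape, 'state' = 100 returns 0 and re-arms the timer with a
zero delay (continuous injection), so it stays distinguishable from the
"no mitigation" case Leo points out.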

Thanks

-- Daniel


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-03-27 12:30:28

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

Hi Daniel,

On 27/03/18 12:26, Daniel Lezcano wrote:
> On 27/03/2018 04:03, Leo Yan wrote:
> > Hi Daniel,
> >
> > On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> >> The cpu idle cooling driver performs synchronized idle injection across all
> >> cpus belonging to the same cluster and offers a new method to cool down a SoC.
> >>
> >> Each cluster has its own idle cooling device, each core has its own idle
> >> injection thread, each idle injection thread uses play_idle to enter idle. In
> >> order to reach the deepest idle state, each cooling device has the idle
> >> injection threads synchronized together.
> >>
> >> It has some similarity with the intel power clamp driver but it is actually
> >> designed to work on the ARM architecture via the DT with a mathematical proof
> >> with the power model which comes with the Documentation.
> >>
> >> The idle injection cycle is fixed while the running cycle is variable. That
> >> allows to have control on the device reactivity for the user experience. At
> >> the mitigation point the idle threads are unparked, they play idle the
> >> specified amount of time and they schedule themselves. The last thread sets
> >> the next idle injection deadline and when the timer expires it wakes up all
> >> the threads which in turn play idle again. Meanwhile the running cycle is
> >> changed by set_cur_state. When the mitigation ends, the threads are parked.
> >> The algorithm is self adaptive, so there is no need to handle hotplugging.
> >
> > The idle injection threads are RT threads (FIFO) and I saw in
> > play_idle() set/clear flag PF_IDLE for it. Will these idle injection
> > threads utilization be accounted into RT utilization?
> >
> > If idle injection threads utilization is accounted as RT tasks
> > utilization, will this impact CPUFreq governor 'schedutil' for OPP
> > selection?
>
> Hi Leo,
>
> The idle injection task has a very low utilization when it is not in the
> play_idle function, basically it wakes up, sets a timer and play_idle().
>
> Regarding the use case, the idle injection is the base brick for an
> combo cooling device with cpufreq + cpuidle. When the idle injection is
> used alone, it is because there is no cpufreq driver for the platform.
> If there is a cpufreq driver, then we should endup with the cpu cooling
> device where we have control of the OPP (and there is no idle injection
> threads) or the combo cooling device.
>
> Except I'm missing something, the idle injection threads won't impact
> the OPP selection.

Mmm, they actually might. schedutil selects max OPP as soon as it sees
an RT thread. I fear these guys might generate unwanted spikes. Maybe
you can filter them out?

Best,

- Juri

2018-03-27 12:34:02

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 27/03/2018 14:28, Juri Lelli wrote:
> Hi Daniel,
>
> On 27/03/18 12:26, Daniel Lezcano wrote:
>> On 27/03/2018 04:03, Leo Yan wrote:
>>> Hi Daniel,
>>>
>>> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
>>>> The cpu idle cooling driver performs synchronized idle injection across all
>>>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>>>>
>>>> Each cluster has its own idle cooling device, each core has its own idle
>>>> injection thread, each idle injection thread uses play_idle to enter idle. In
>>>> order to reach the deepest idle state, each cooling device has the idle
>>>> injection threads synchronized together.
>>>>
>>>> It has some similarity with the intel power clamp driver but it is actually
>>>> designed to work on the ARM architecture via the DT with a mathematical proof
>>>> with the power model which comes with the Documentation.
>>>>
>>>> The idle injection cycle is fixed while the running cycle is variable. That
>>>> allows to have control on the device reactivity for the user experience. At
>>>> the mitigation point the idle threads are unparked, they play idle the
>>>> specified amount of time and they schedule themselves. The last thread sets
>>>> the next idle injection deadline and when the timer expires it wakes up all
>>>> the threads which in turn play idle again. Meanwhile the running cycle is
>>>> changed by set_cur_state. When the mitigation ends, the threads are parked.
>>>> The algorithm is self adaptive, so there is no need to handle hotplugging.
>>>
>>> The idle injection threads are RT threads (FIFO) and I saw in
>>> play_idle() set/clear flag PF_IDLE for it. Will these idle injection
>>> threads utilization be accounted into RT utilization?
>>>
>>> If idle injection threads utilization is accounted as RT tasks
>>> utilization, will this impact CPUFreq governor 'schedutil' for OPP
>>> selection?
>>
>> Hi Leo,
>>
>> The idle injection task has a very low utilization when it is not in the
>> play_idle function, basically it wakes up, sets a timer and play_idle().
>>
>> Regarding the use case, the idle injection is the base brick for an
>> combo cooling device with cpufreq + cpuidle. When the idle injection is
>> used alone, it is because there is no cpufreq driver for the platform.
>> If there is a cpufreq driver, then we should endup with the cpu cooling
>> device where we have control of the OPP (and there is no idle injection
>> threads) or the combo cooling device.
>>
>> Except I'm missing something, the idle injection threads won't impact
>> the OPP selection.
>
> Mmm, they actually might. schedutil selects max OPP as soon as it sees
> an RT thread. I fear these guys might generate unwanted spikes. Maybe
> you can filter them out?

Yes, absolutely. Leo also pointed this out.

We might want to change the check at:

https://elixir.bootlin.com/linux/v4.16-rc7/source/kernel/sched/cpufreq_schedutil.c#L364

in order to ignore PF_IDLE tagged tasks.
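
A rough sketch of the kind of filter being discussed, using the v4.16
naming at that link (this is an assumption about where the check would
go, not a tested patch, and whether 'current' is the right handle at
that callsite is exactly the kind of detail such a change would have to
settle):

/*
 * In sugov_update_shared(), around the check linked above: take the
 * "RT/DL task => max OPP" shortcut only when the RT activity does not
 * come from an idle injection thread, which play_idle() tags with
 * PF_IDLE.
 */
if (sugov_should_update_freq(sg_policy, time)) {
        if ((flags & SCHED_CPUFREQ_RT_DL) &&
            !(current->flags & PF_IDLE))
                next_f = sg_policy->policy->cpuinfo.max_freq;
        else
                next_f = sugov_next_freq_shared(sg_cpu, time);

        sugov_update_commit(sg_policy, time, next_f);
}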



--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-03-27 13:11:14

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 27/03/18 14:31, Daniel Lezcano wrote:
> On 27/03/2018 14:28, Juri Lelli wrote:
> > Hi Daniel,
> >
> > On 27/03/18 12:26, Daniel Lezcano wrote:
> >> On 27/03/2018 04:03, Leo Yan wrote:
> >>> Hi Daniel,
> >>>
> >>> On Wed, Feb 21, 2018 at 04:29:27PM +0100, Daniel Lezcano wrote:
> >>>> The cpu idle cooling driver performs synchronized idle injection across all
> >>>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
> >>>>
> >>>> Each cluster has its own idle cooling device, each core has its own idle
> >>>> injection thread, each idle injection thread uses play_idle to enter idle. In
> >>>> order to reach the deepest idle state, each cooling device has the idle
> >>>> injection threads synchronized together.
> >>>>
> >>>> It has some similarity with the intel power clamp driver but it is actually
> >>>> designed to work on the ARM architecture via the DT with a mathematical proof
> >>>> with the power model which comes with the Documentation.
> >>>>
> >>>> The idle injection cycle is fixed while the running cycle is variable. That
> >>>> allows to have control on the device reactivity for the user experience. At
> >>>> the mitigation point the idle threads are unparked, they play idle the
> >>>> specified amount of time and they schedule themselves. The last thread sets
> >>>> the next idle injection deadline and when the timer expires it wakes up all
> >>>> the threads which in turn play idle again. Meanwhile the running cycle is
> >>>> changed by set_cur_state. When the mitigation ends, the threads are parked.
> >>>> The algorithm is self adaptive, so there is no need to handle hotplugging.
> >>>
> >>> The idle injection threads are RT threads (FIFO) and I saw in
> >>> play_idle() set/clear flag PF_IDLE for it. Will these idle injection
> >>> threads utilization be accounted into RT utilization?
> >>>
> >>> If idle injection threads utilization is accounted as RT tasks
> >>> utilization, will this impact CPUFreq governor 'schedutil' for OPP
> >>> selection?
> >>
> >> Hi Leo,
> >>
> >> The idle injection task has a very low utilization when it is not in the
> >> play_idle function, basically it wakes up, sets a timer and play_idle().
> >>
> >> Regarding the use case, the idle injection is the base brick for an
> >> combo cooling device with cpufreq + cpuidle. When the idle injection is
> >> used alone, it is because there is no cpufreq driver for the platform.
> >> If there is a cpufreq driver, then we should endup with the cpu cooling
> >> device where we have control of the OPP (and there is no idle injection
> >> threads) or the combo cooling device.
> >>
> >> Except I'm missing something, the idle injection threads won't impact
> >> the OPP selection.
> >
> > Mmm, they actually might. schedutil selects max OPP as soon as it sees
> > an RT thread. I fear these guys might generate unwanted spikes. Maybe
> > you can filter them out?
>
> Yes, absolutely. Leo pointed it also.
>
> We might want to change the check at:
>
> https://elixir.bootlin.com/linux/v4.16-rc7/source/kernel/sched/cpufreq_schedutil.c#L364
>
> in order to ignore PF_IDLE tagged tasks.

We might, yes. And also for the update_single cases, I guess.

Best,

- Juri

2018-04-04 08:51:46

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 26/02/2018 05:30, Viresh Kumar wrote:

[ ... ]

>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + cpumask = topology_core_cpumask(cpu);
>>>> +
>>>> + cct = per_cpu_ptr(&cpuidle_cooling_tsk, cpu);
>>>> +
>>>> + /*
>>>> + * This condition makes the first cpu belonging to the
>>>> + * cluster to create a cooling device and allocates
>>>> + * the structure. Others CPUs belonging to the same
>>>> + * cluster will just increment the refcount on the
>>>> + * cooling device structure and initialize it.
>>>> + */
>>>> + if (cpu == cpumask_first(cpumask)) {
>>>
>>> Your function still have few assumptions of cpu numbering and it will
>>> break in few cases. What if the CPUs on a big Little system (4x4) are
>>> present in this order: B L L L L B B B ??
>>>
>>> This configuration can happen if CPUs in DT are marked as: 0-3 LITTLE,
>>> 4-7 big and a big CPU is used by the boot loader to bring up Linux.
>>
>> Ok, how can I sort it out ?
>
> I would do something like this:
>
> cpumask_copy(possible, cpu_possible_mask);
>
> while (!cpumask_empty(possible)) {
> first = cpumask_first(possible);
> cpumask = topology_core_cpumask(first);
> cpumask_andnot(possible, possible, cpumask);
>
> allocate_cooling_dev(first); //This is most of this function in your patch.
>
> while (!cpumask_empty(cpumask)) {
> temp = cpumask_first(possible);
> //rest init "temp"
> cpumask_clear_cpu(temp, cpumask);
> }
>
> //Everything done, register cooling device for cpumask.
> }


Mmh, that sounds very complex. Maybe it is simpler to count the number
of clusters, initialize an idle_cdev for each cluster, and then go
through this loop with the cluster cpumask.
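
For what it's worth, here is a sketch of Viresh's loop made a bit more
concrete. The allocate/init/register helpers are hypothetical
placeholders for the corresponding parts of the patch, and the inner
loop walks the cluster mask with for_each_cpu() instead of clearing
bits in the mask returned by topology_core_cpumask(), which is shared
topology data:

#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/topology.h>

static int __init cpuidle_cooling_register_all(void)
{
        cpumask_var_t possible;
        int first, cpu;

        if (!alloc_cpumask_var(&possible, GFP_KERNEL))
                return -ENOMEM;

        cpumask_copy(possible, cpu_possible_mask);

        while (!cpumask_empty(possible)) {
                const struct cpumask *cluster;

                /* First CPU of a cluster we have not handled yet. */
                first = cpumask_first(possible);
                cluster = topology_core_cpumask(first);
                cpumask_andnot(possible, possible, cluster);

                /* Hypothetical: allocate the per-cluster cooling device. */
                allocate_cooling_dev(first);

                /* Initialize the per-cpu state of every CPU in the cluster. */
                for_each_cpu(cpu, cluster)
                        init_cooling_tsk(cpu);

                /* Hypothetical: register the cooling device for this cluster. */
                register_cooling_dev(cluster);
        }

        free_cpumask_var(possible);
        return 0;
}

This does not rely on any particular CPU numbering, so the
B L L L L B B B layout Viresh mentions is handled naturally.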




--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2018-04-05 04:51:06

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH V2 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

On 04-04-18, 10:50, Daniel Lezcano wrote:
> Mmh, that sounds very complex. May be it is simpler to count the number
> of cluster and initialize the idle_cdev for each cluster and then go for
> this loop with the cluster cpumask.

Maybe, not sure. I have had such code in the past and it was quite
straightforward to understand :)

You can go with whichever version you like.

--
viresh