Changelog:
V3:
- Clarified the numbers in this description
- Addressed the review comments on the documentation
- Switched to smpboot threads, which use the hotplug API
- Removed usage of the waitq as the code now relies on the smpboot threads
- Removed the DEFAULT_IDLE_TIME_US macro and used TICK_USEC directly
- Removed the list to store the cooling devices
- Fixed static for the percpu cpuidle_cooling_tsk
- Removed the BUG_ON in the cpuidle_cooling_runtime function and returned 0 instead
- Used this_cpu_ptr instead of per_cpu_ptr(smp_processor_id())
- Moved the function cpuidle_cooling_release closer to its caller
- Fixed spaces used instead of tabs
- Fixed the function descriptions
- Cancel timer on release
- Fixed return value doc
- Changed the initialization so the cpu numbering is no longer a problem
- Removed useless atomic_set(0)
- Moved the thermal cooling device creation after the cpuidle cooling device
- Left the refcount at zero at init
- Moved the initialization message at the end of the function
- Changed cpuidle_cooling_register to return void
- Removed the message in the cpuidle-arm driver when cpuidle_cooling_register fails
V2:
- Dropped the cpu combo cooling device
- Added the acked-by tags
- Replaced task array by a percpu structure
- Fixed the copyright dates
- Fixed the extra lines
- Fixed the compilation macros to be reused
- Fixed the list node removal
- Massaged a couple of function names
The following series provides a new way to cool down a SoC by reducing
the dissipated power on the CPUs. Based on the initial work from Kevin
Wangtao, the series implements a CPU cooling device based on idle
injection, relying on the cpuidle framework.
The patchset is designed so that the current DT binding for the
cpufreq cooling device stays compatible with the new cooling devices.
Different CPU cooling devices can not co-exist on the system: the CPU
cooling device is either enabled or not, and a single cooling strategy
is selected (cpufreq or cpuidle). It is not possible to have all of
them available at the same time. The goal is eventually to enable them
all and be able to switch from one to another at runtime, but the
thermal framework may need some attention regarding mitigation while
switching the cooling device at runtime.
This series is divided into two parts.
The first part just provides trivial changes for the copyright and
removes an unused field in the cpufreq cooling device structure.
The second part provides the idle injection cooling device, allowing a SoC
without a cpufreq driver to use this cooling device as an alternative.
The cpuidle cooling device has been tested against the cpufreq cooling
device under the same conditions: same board and a fan on top of the
SoC. For an optimal trade-off between performance and cooling effect,
the cpuidle cooling device acts on one cluster only; on the big.LITTLE
SoC, the big cluster is idle-mitigated. The targeted temperature is 75°C.
Command:
--------
time ./dhrystone -t 8 -l 100
 ----------------------------------------------------------
 |               |     hikey6220     ||     hikey3660     |
 |               | cpuidle | cpufreq || cpuidle | cpufreq |
 ----------------------------------------------------------
 | DMIPS avg     |  1007   |   922   ||  2279   |  2250   |
 ----------------------------------------------------------
 | rtime (sec)   |  1.21   |  1.19   ||   49    |   51    |
 ----------------------------------------------------------
 | temp avg (mC) |  75043  |  74862  ||  75640  |  75978  |
 ----------------------------------------------------------
 | temp stddev   | 309.357 |  6205   ||  3258   |  1950   |
 ----------------------------------------------------------
 | temp min (mC) |  74235  |  71880  ||  66395  |  71315  |
 ----------------------------------------------------------
 | temp max (mC) |  75805  |  77375  ||  84640  |  81360  |
 ----------------------------------------------------------
We can see that both cooling methods have similar behavior in this
configuration.
Daniel Lezcano (7):
thermal/drivers/cpu_cooling: Fixup the header and copyright
thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
thermal/drivers/cpu_cooling: Remove pointless field
thermal/drivers/Kconfig: Convert the CPU cooling device to a choice
thermal/drivers/cpu_cooling: Add idle cooling device documentation
thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
cpuidle/drivers/cpuidle-arm: Register the cooling device
Documentation/thermal/cpu-idle-cooling.txt | 166 ++++++++++
drivers/cpuidle/cpuidle-arm.c | 3 +
drivers/thermal/Kconfig | 30 +-
drivers/thermal/cpu_cooling.c | 508 +++++++++++++++++++++++++++--
include/linux/cpu_cooling.h | 12 +-
5 files changed, 692 insertions(+), 27 deletions(-)
create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
--
2.7.4
The cpufreq_cooling_device structure provides a backpointer to the thermal
cooling device, but it is only used for a trace and to unregister. The
trace does not really need this field, and the unregister function already
receives the same pointer as a parameter. Remove it.
Acked-by: Viresh Kumar <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/cpu_cooling.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index d7528bc..7bdc19f 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -88,7 +88,6 @@ struct cpufreq_cooling_device {
unsigned int clipped_freq;
unsigned int max_level;
struct freq_table *freq_table; /* In descending order */
- struct thermal_cooling_device *cdev;
struct cpufreq_policy *policy;
struct list_head node;
struct time_in_idle *idle_time;
@@ -197,8 +196,7 @@ static int update_freq_table(struct cpufreq_cooling_device *cpufreq_cdev,
dev = get_cpu_device(cpu);
if (unlikely(!dev)) {
- dev_warn(&cpufreq_cdev->cdev->device,
- "No cpu device for cpu %d\n", cpu);
+ pr_warn("No cpu device for cpu %d\n", cpu);
return -ENODEV;
}
@@ -762,7 +760,6 @@ __cpufreq_cooling_register(struct device_node *np,
goto remove_ida;
cpufreq_cdev->clipped_freq = cpufreq_cdev->freq_table[0].frequency;
- cpufreq_cdev->cdev = cdev;
mutex_lock(&cooling_list_lock);
/* Register the notifier for first cpufreq cooling device */
@@ -922,7 +919,7 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
cpufreq_unregister_notifier(&thermal_cpufreq_notifier_block,
CPUFREQ_POLICY_NOTIFIER);
- thermal_cooling_device_unregister(cpufreq_cdev->cdev);
+ thermal_cooling_device_unregister(cdev);
ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
kfree(cpufreq_cdev->idle_time);
kfree(cpufreq_cdev->freq_table);
--
2.7.4
The copyright format does not conform to the format requested by
Linaro: https://wiki.linaro.org/Copyright
Fix it.
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/cpu_cooling.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index dc63aba..42110ee 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -2,9 +2,11 @@
* linux/drivers/thermal/cpu_cooling.c
*
* Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
- * Copyright (C) 2012 Amit Daniel <[email protected]>
*
- * Copyright (C) 2014 Viresh Kumar <[email protected]>
+ * Copyright (C) 2012-2018 Linaro Limited.
+ *
+ * Authors: Amit Daniel <[email protected]>
+ * Viresh Kumar <[email protected]>
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* This program is free software; you can redistribute it and/or modify
--
2.7.4
Register the ARM generic cpuidle driver as a cooling device.
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/cpuidle/cpuidle-arm.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
index ddee1b6..44b4fe3 100644
--- a/drivers/cpuidle/cpuidle-arm.c
+++ b/drivers/cpuidle/cpuidle-arm.c
@@ -11,6 +11,7 @@
#define pr_fmt(fmt) "CPUidle arm: " fmt
+#include <linux/cpu_cooling.h>
#include <linux/cpuidle.h>
#include <linux/cpumask.h>
#include <linux/cpu_pm.h>
@@ -172,6 +173,8 @@ static int __init arm_idle_init(void)
goto out_fail;
}
+ cpuidle_cooling_register();
+
return 0;
out_fail:
--
2.7.4
The cpu idle cooling driver performs synchronized idle injection across all
cpus belonging to the same cluster and offers a new method to cool down a SoC.
Each cluster has its own idle cooling device, each core has its own idle
injection thread, each idle injection thread uses play_idle to enter idle. In
order to reach the deepest idle state, each cooling device has the idle
injection threads synchronized together.
It has some similarity with the Intel power clamp driver but it is actually
designed to work on the ARM architecture via the DT, with a mathematical
proof based on the power model which comes with the Documentation.
The idle injection cycle is fixed while the running cycle is variable. That
allows control over the device reactivity for the user experience. At
the mitigation point the idle threads are unparked, they play idle for the
specified amount of time and they schedule themselves out. The last thread
sets the next idle injection deadline and, when the timer expires, it wakes
up all the threads, which in turn play idle again. Meanwhile the running
cycle is changed by set_cur_state. When the mitigation ends, the threads
are parked. The algorithm is self adaptive, so there is no need to handle
hotplugging.
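
The "last thread out arms the timer" synchronization described above can be
illustrated with a small user-space analogue (a sketch only: pthreads and
C11 atomics stand in for the smpboot threads, usleep() stands in for
play_idle(), and the thread count is arbitrary):

  /* User-space analogue of the idle injection synchronization. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NR_THREADS 4

  static atomic_int count;

  static void *injection(void *arg)
  {
          long id = (long)arg;

          atomic_fetch_add(&count, 1);
          usleep(10 * 1000);              /* stand-in for play_idle() */

          /* Only the last thread leaving the idle period arms the timer. */
          if (atomic_fetch_sub(&count, 1) == 1)
                  printf("thread %ld arms the wakeup timer\n", id);

          return NULL;
  }

  int main(void)
  {
          pthread_t tid[NR_THREADS];
          long i;

          for (i = 0; i < NR_THREADS; i++)
                  pthread_create(&tid[i], NULL, injection, (void *)i);

          for (i = 0; i < NR_THREADS; i++)
                  pthread_join(tid[i], NULL);

          return 0;
  }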
If we take the example of the balance point, we can use the DT for the
hi6220. The sustainable power for the SoC is 3326mW to mitigate at 75°C.
Eight cores running at full blast at the maximum OPP consume 5280mW. The
first value is given in the DT, the second is calculated from the OPP with
the formula:
Pdyn = Cdyn x Voltage^2 x Frequency
As the SoC vendors don't want to share the static leakage values, we assume
it is zero, so Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
In order to reduce the power to 3326mW, we have to apply a ratio to the
running time.
ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874
We know the idle cycle, which is fixed; let's assume 10ms. However, from
this duration we have to subtract the wake up latency for the cluster idle
state, in our case 1.5ms. So for a 10ms idle injection, we are really idle
for 8.5ms.
As we know the idle duration and the ratio, we can compute the running cycle.
running_cycle = 8.5 / 0.5874 = 14.47ms
So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
SoC to the balanced trip point of 75°C.
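
The arithmetic above can be checked with a trivial stand-alone program
(illustrative only, not part of the driver; the values are the ones quoted
in this log):

  #include <stdio.h>

  int main(void)
  {
          double p_run = 5280.0;       /* mW, eight cores at the max OPP */
          double p_target = 3326.0;    /* mW, sustainable power from the DT */
          double t_idle = 10.0 - 1.5;  /* ms, injection minus wakeup latency */

          double ratio = (p_run - p_target) / p_target;
          double t_running = t_idle / ratio;

          /* Prints: ratio = 0.587, running cycle = 14.47 ms */
          printf("ratio = %.3f, running cycle = %.2f ms\n", ratio, t_running);

          return 0;
  }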
The driver has been tested on the hi6220 and it appears the temperature
stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and a
running cycle of 14ms, as expected from the theory above.
Signed-off-by: Kevin Wangtao <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/Kconfig | 10 +
drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++
include/linux/cpu_cooling.h | 6 +
3 files changed, 495 insertions(+)
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 5aaae1b..6c34117 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
This will be useful for platforms using the generic thermal interface
and not the ACPI interface.
+config CPU_IDLE_THERMAL
+ bool "CPU idle cooling strategy"
+ depends on CPU_IDLE
+ help
+ This implements the generic CPU cooling mechanism through
+ idle injection. This will throttle the CPU by injecting a
+ fixed idle cycle. All CPUs belonging to the same cluster
+ will enter idle synchronously to reach the deepest idle
+ state.
+
endchoice
config CLOCK_THERMAL
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 5c219dc..1eec8d6 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -10,18 +10,33 @@
* Viresh Kumar <[email protected]>
*
*/
+#define pr_fmt(fmt) "CPU cooling: " fmt
+
#include <linux/module.h>
#include <linux/thermal.h>
#include <linux/cpufreq.h>
+#include <linux/cpuidle.h>
#include <linux/err.h>
+#include <linux/freezer.h>
#include <linux/idr.h>
+#include <linux/kthread.h>
#include <linux/pm_opp.h>
#include <linux/slab.h>
+#include <linux/sched/prio.h>
+#include <linux/sched/rt.h>
+#include <linux/smpboot.h>
#include <linux/cpu.h>
#include <linux/cpu_cooling.h>
+#include <linux/ratelimit.h>
+
+#include <linux/platform_device.h>
+#include <linux/of_platform.h>
+
#include <trace/events/thermal.h>
+#include <uapi/linux/sched/types.h>
+
#ifdef CONFIG_CPU_FREQ_THERMAL
/*
* Cooling state <-> CPUFreq frequency
@@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
#endif /* CONFIG_CPU_FREQ_THERMAL */
+
+#ifdef CONFIG_CPU_IDLE_THERMAL
+/**
+ * struct cpuidle_cooling_device - data for the idle cooling device
+ * @cdev: a pointer to a struct thermal_cooling_device
+ * @cpumask: a cpumask containing the CPUs managed by the cooling device
+ * @timer: a hrtimer giving the tempo for the idle injection cycles
+ * @kref: a kernel refcount on this structure
+ * @count: an atomic to keep track of the last task exiting the idle cycle
+ * @idle_cycle: an integer defining the duration of the idle injection
+ * @state: a normalized integer giving the state of the cooling device
+ */
+struct cpuidle_cooling_device {
+ struct thermal_cooling_device *cdev;
+ struct cpumask *cpumask;
+ struct hrtimer timer;
+ struct kref kref;
+ atomic_t count;
+ unsigned int idle_cycle;
+ unsigned long state;
+};
+
+struct cpuidle_cooling_thread {
+ struct task_struct *tsk;
+ int should_run;
+};
+
+static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);
+static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);
+
+/**
+ * cpuidle_cooling_wakeup - Wake up all idle injection threads
+ * @idle_cdev: the idle cooling device
+ *
+ * Every idle injection task belonging to the idle cooling device and
+ * running on an online cpu will be woken up by this call.
+ */
+static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
+{
+ struct cpuidle_cooling_thread *cct;
+ int cpu;
+
+ for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
+ cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);
+ cct->should_run = 1;
+ wake_up_process(cct->tsk);
+ }
+}
+
+/**
+ * cpuidle_cooling_wakeup_fn - Running cycle timer callback
+ * @timer: a hrtimer structure
+ *
+ * When the mitigation is acting, the CPU is allowed to run an amount
+ * of time, then the idle injection happens for the specified delay
+ * and the idle injection task schedules itself out until the timer event
+ * wakes the idle injection tasks again for a new idle injection
+ * cycle. The time between the end of the idle injection and the timer
+ * expiration is the allocated running time for the CPU.
+ *
+ * Always returns HRTIMER_NORESTART
+ */
+static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
+{
+ struct cpuidle_cooling_device *idle_cdev =
+ container_of(timer, struct cpuidle_cooling_device, timer);
+
+ cpuidle_cooling_wakeup(idle_cdev);
+
+ return HRTIMER_NORESTART;
+}
+
+/**
+ * cpuidle_cooling_runtime - Running time computation
+ * @idle_cdev: the idle cooling device
+ *
+ * The running duration is computed from the idle injection duration
+ * which is fixed. If we reach 100% of idle injection ratio, that
+ * means the running duration is zero. If we have a 50% ratio
+ * injection, that means we have equal duration for idle and for
+ * running duration.
+ *
+ * The formula is deduced as the following:
+ *
+ * running = idle x ((100 / ratio) - 1)
+ *
+ * For precision purposes with integer math, we use the following:
+ *
+ * running = (idle x 100) / ratio - idle
+ *
+ * For example, with an injection ratio of 50%, we end up
+ * with 10ms of idle injection and 10ms of running duration.
+ *
+ * Returns the running duration in nanoseconds, as an s64
+ */
+static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
+{
+ s64 next_wakeup;
+ unsigned long state = idle_cdev->state;
+
+ /*
+ * The function should not be called when there is no
+ * mitigation because:
+ * - that does not make sense
+ * - we end up with a division by zero
+ */
+ if (!state)
+ return 0;
+
+ next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
+ idle_cdev->idle_cycle;
+
+ return next_wakeup * NSEC_PER_USEC;
+}
+
+/**
+ * cpuidle_cooling_injection - Idle injection mainloop thread function
+ * @cpu: an integer giving the cpu number the thread is pinned on
+ *
+ * This main function does basically two operations:
+ *
+ * - Goes idle for a specific amount of time
+ *
+ * - Sets a timer to wake up all the idle injection threads after a
+ * running period
+ *
+ * That happens only when the mitigation is enabled, otherwise the
+ * task is scheduled out.
+ *
+ * In order to keep the tasks synchronized together, it is the last
+ * task exiting the idle period which is in charge of setting the
+ * timer.
+ *
+ * This function is called by the smpboot core each time the thread
+ * should run.
+ */
+static void cpuidle_cooling_injection(unsigned int cpu)
+{
+ s64 next_wakeup;
+
+ struct cpuidle_cooling_device *idle_cdev =
+ per_cpu(cpuidle_cooling_device, cpu);
+
+ struct cpuidle_cooling_thread *cct =
+ per_cpu_ptr(&cpuidle_cooling_thread, cpu);
+
+ atomic_inc(&idle_cdev->count);
+
+ cct->should_run = 0;
+
+ play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
+
+ /*
+ * The last CPU waking up is in charge of setting the
+ * timer. If the CPU is hotplugged, the timer will
+ * move to another CPU (which may not belong to the
+ * same cluster) but that is not a problem as the
+ * timer will be set again by another CPU belonging to
+ * the cluster, so this mechanism is self adaptive and
+ * does not require any hotplugging dance.
+ */
+ if (!atomic_dec_and_test(&idle_cdev->count))
+ return;
+
+ next_wakeup = cpuidle_cooling_runtime(idle_cdev);
+ if (next_wakeup)
+ hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void cpuidle_cooling_setup(unsigned int cpu)
+{
+ struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
+
+ set_freezable();
+
+ sched_setscheduler(current, SCHED_FIFO, &param);
+}
+
+static int cpuidle_cooling_should_run(unsigned int cpu)
+{
+ struct cpuidle_cooling_thread *cct =
+ per_cpu_ptr(&cpuidle_cooling_thread, cpu);
+
+ return cct->should_run;
+}
+
+static struct smp_hotplug_thread cpuidle_cooling_threads = {
+ .store = &cpuidle_cooling_thread.tsk,
+ .thread_fn = cpuidle_cooling_injection,
+ .thread_comm = "thermal-idle/%u",
+ .thread_should_run = cpuidle_cooling_should_run,
+ .setup = cpuidle_cooling_setup,
+};
+
+/**
+ * cpuidle_cooling_get_max_state - Get the maximum state
+ * @cdev : the thermal cooling device
+ * @state : a pointer to the state variable to be filled
+ *
+ * The function always gives 100, as the injection ratio is percentage
+ * based for consistency across different platforms.
+ *
+ * The function can not fail, it always returns zero.
+ */
+static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ /*
+ * Depending on the configuration or the hardware, the running
+ * cycle and the idle cycle could be different. We want to unify
+ * that to a 0..100 interval, so the set state interface will
+ * be the same whatever the platform is.
+ *
+ * The state 100% will make the cluster 100% ... idle. A 0%
+ * injection ratio means no idle injection at all and 50%
+ * means for 10ms of idle injection, we have 10ms of running
+ * time.
+ */
+ *state = 100;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_get_cur_state - Get the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: a pointer to the state
+ *
+ * The function just copies the state value from the private thermal
+ * cooling device structure; the mapping is 1 <-> 1.
+ *
+ * The function can not fail, it always returns zero.
+ */
+static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long *state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+
+ *state = idle_cdev->state;
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_set_cur_state - Set the current cooling state
+ * @cdev: the thermal cooling device
+ * @state: the target state
+ *
+ * The function first checks if we are initiating the mitigation,
+ * which in turn wakes up all the idle injection tasks belonging to
+ * the idle cooling device. In any case, it updates the internal
+ * state of the cooling device.
+ *
+ * The function can not fail, it always returns zero.
+ */
+static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
+ unsigned long state)
+{
+ struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
+ unsigned long current_state = idle_cdev->state;
+
+ idle_cdev->state = state;
+
+ if (current_state == 0 && state > 0) {
+ pr_debug("Starting cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ cpuidle_cooling_wakeup(idle_cdev);
+ } else if (current_state > 0 && !state) {
+ pr_debug("Stopping cooling cpus '%*pbl'\n",
+ cpumask_pr_args(idle_cdev->cpumask));
+ }
+
+ return 0;
+}
+
+/**
+ * cpuidle_cooling_ops - thermal cooling device ops
+ */
+static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
+ .get_max_state = cpuidle_cooling_get_max_state,
+ .get_cur_state = cpuidle_cooling_get_cur_state,
+ .set_cur_state = cpuidle_cooling_set_cur_state,
+};
+
+/**
+ * cpuidle_cooling_release - Kref based release helper
+ * @kref: a pointer to the kref structure
+ *
+ * This function is automatically called by the kref_put function when
+ * the idle cooling device refcount reaches zero. At this point, we
+ * have the guarantee the structure is no longer in use and we can
+ * safely release all the resources.
+ */
+static void __init cpuidle_cooling_release(struct kref *kref)
+{
+ struct cpuidle_cooling_device *idle_cdev =
+ container_of(kref, struct cpuidle_cooling_device, kref);
+
+ if (idle_cdev->cdev)
+ thermal_cooling_device_unregister(idle_cdev->cdev);
+
+ hrtimer_cancel(&idle_cdev->timer);
+ kfree(idle_cdev);
+}
+
+/**
+ * cpuidle_cooling_unregister - Idle cooling device exit function
+ *
+ * This function unregisters the cpuidle cooling device and frees the
+ * resources previously allocated by the init function. This function
+ * is called when the initialization fails.
+ */
+static void __init cpuidle_cooling_unregister(void)
+{
+ struct cpuidle_cooling_device *idle_cdev;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ idle_cdev = per_cpu(cpuidle_cooling_device, cpu);
+ if (idle_cdev)
+ kref_put(&idle_cdev->kref, cpuidle_cooling_release);
+ }
+}
+
+
+/**
+ * cpuidle_cooling_alloc - Allocate and initialize an idle cooling device
+ * @cpumask: a cpumask containing all the cpus handled by the cooling device
+ *
+ * The function is called at init time only. It allocates and
+ * initializes the different fields of the cpuidle cooling device.
+ *
+ * It returns a pointer to a cpuidle_cooling_device structure on
+ * success, NULL on error.
+ */
+static struct cpuidle_cooling_device * __init cpuidle_cooling_alloc(
+ cpumask_t *cpumask)
+{
+ struct cpuidle_cooling_device *idle_cdev;
+ int cpu;
+
+ idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
+ if (!idle_cdev)
+ return NULL;
+
+ /*
+ * The idle injection duration. As we don't yet have a way to
+ * specify it from the DT configuration, let's default to a tick
+ * duration.
+ */
+ idle_cdev->idle_cycle = TICK_USEC;
+
+ /*
+ * Initialize the timer to wakeup all the idle injection tasks
+ */
+ hrtimer_init(&idle_cdev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+ /*
+ * The wakeup function callback which is in charge of waking
+ * up all CPUs belonging to the same cluster
+ */
+ idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
+
+ idle_cdev->cpumask = cpumask;
+
+ /*
+ * For each cpu belonging to the cluster, assign the per-cpu
+ * cpuidle_cooling_device pointer and increment the refcount
+ * on it.
+ */
+ for_each_cpu(cpu, cpumask) {
+ kref_get(&idle_cdev->kref);
+ per_cpu(cpuidle_cooling_device, cpu) = idle_cdev;
+ }
+
+ return idle_cdev;
+}
+
+/**
+ * cpuidle_cooling_register - Idle cooling device initialization function
+ *
+ * This function is in charge of creating a cooling device per cluster
+ * and registering it with the thermal framework. For this we rely
+ * on the topology as there is nothing yet that better describes the
+ * idle state power domains.
+ *
+ * We create a cpuidle cooling device per cluster. For this reason we
+ * must, for each cluster, allocate and initialize the cooling device
+ * and, for each cpu belonging to this cluster, do the initialization
+ * on a per-cpu basis.
+ *
+ * This approach for creating the cooling device is needed as we don't
+ * have the guarantee the CPU numbering is sequential.
+ *
+ * Unfortunately, there is no API to browse from top to bottom the
+ * topology, cluster->cpu, only the usual for_each_possible_cpu loop.
+ * In order to solve that, we use a cpumask to flag the cluster_id we
+ * already processed. The cpumask will always have enough room for all
+ * the clusters because it is based on NR_CPUS and it is not possible
+ * to have more clusters than cpus.
+ *
+ */
+void __init cpuidle_cooling_register(void)
+{
+ struct cpuidle_cooling_device *idle_cdev = NULL;
+ struct thermal_cooling_device *cdev;
+ struct device_node *np;
+ cpumask_var_t cpumask;
+ char dev_name[THERMAL_NAME_LENGTH];
+ int ret = -ENOMEM, cpu;
+ int cluster_id;
+
+ if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
+ return;
+
+ for_each_possible_cpu(cpu) {
+
+ cluster_id = topology_physical_package_id(cpu);
+ if (cpumask_test_cpu(cluster_id, cpumask))
+ continue;
+
+ /*
+ * Allocate the cpuidle cooling device with the list
+ * of the cpus belonging to the cluster.
+ */
+ idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));
+ if (!idle_cdev)
+ goto out;
+
+ /*
+ * The thermal cooling device name: we use the
+ * cluster_id as the numbering index for the idle
+ * cooling device.
+ */
+ snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d",
+ cluster_id);
+
+ np = of_cpu_device_node_get(cpu);
+ cdev = thermal_of_cooling_device_register(np, dev_name,
+ idle_cdev,
+ &cpuidle_cooling_ops);
+ if (IS_ERR(cdev)) {
+ ret = PTR_ERR(cdev);
+ goto out;
+ }
+
+ idle_cdev->cdev = cdev;
+ cpumask_set_cpu(cluster_id, cpumask);
+ }
+
+ ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);
+ if (ret)
+ goto out;
+
+ pr_info("Created cpuidle cooling device\n");
+out:
+ free_cpumask_var(cpumask);
+
+ if (ret) {
+ cpuidle_cooling_unregister();
+ pr_err("Failed to create idle cooling device (%d)\n", ret);
+ }
+}
+#endif /* CONFIG_CPU_IDLE_THERMAL */
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index c0accc7..af5520d 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -120,4 +120,10 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
}
#endif /* CONFIG_CPU_FREQ_THERMAL */
+#ifdef CONFIG_CPU_IDLE_THERMAL
+extern void __init cpuidle_cooling_register(void);
+#else /* CONFIG_CPU_IDLE_THERMAL */
+static inline void __init cpuidle_cooling_register(void) { }
+#endif /* CONFIG_CPU_IDLE_THERMAL */
+
#endif /* __CPU_COOLING_H__ */
--
2.7.4
For license auditing purposes, let's add the SPDX tag.
Signed-off-by: Daniel Lezcano <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Acked-by: Philippe Ombredanne <[email protected]>
---
drivers/thermal/cpu_cooling.c | 16 +---------------
1 file changed, 1 insertion(+), 15 deletions(-)
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 42110ee..d7528bc 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* linux/drivers/thermal/cpu_cooling.c
*
@@ -8,21 +9,6 @@
* Authors: Amit Daniel <[email protected]>
* Viresh Kumar <[email protected]>
*
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; version 2 of the License.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write to the Free Software Foundation, Inc.,
- * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
#include <linux/module.h>
#include <linux/thermal.h>
--
2.7.4
Provide some documentation for the idle injection cooling effect in
order to let people understand the rationale of the approach for the
idle injection CPU cooling device.
Signed-off-by: Daniel Lezcano <[email protected]>
---
Documentation/thermal/cpu-idle-cooling.txt | 166 +++++++++++++++++++++++++++++
1 file changed, 166 insertions(+)
create mode 100644 Documentation/thermal/cpu-idle-cooling.txt
diff --git a/Documentation/thermal/cpu-idle-cooling.txt b/Documentation/thermal/cpu-idle-cooling.txt
new file mode 100644
index 0000000..457cd99
--- /dev/null
+++ b/Documentation/thermal/cpu-idle-cooling.txt
@@ -0,0 +1,166 @@
+
+Situation:
+----------
+
+Under certain circumstances, a SoC can reach its maximum temperature
+limit or be unable to stabilize the temperature around a temperature
+set point. When the SoC has to stabilize the temperature, the kernel
+can act on a cooling device to mitigate the dissipated power. When the
+maximum temperature is reached, and to prevent a reboot or a shutdown,
+a decision must be taken to bring the temperature back under the
+critical threshold, which impacts the performance.
+
+Another situation is when the silicon reaches a certain temperature
+which continues to increase even if the dynamic leakage is reduced to
+its minimum by clock gating the component. The runaway phenomenon will
+continue with the static leakage; only powering down the component,
+thus dropping both the dynamic and static leakage, will allow the
+component to cool down.
+
+Last but not least, the system can ask for a specific power budget but
+because of the OPP density, we can only choose an OPP with a power
+budget lower than the requested one and underuse the CPU, thus losing
+performance. In other words, one OPP underuses the CPU with a power
+lower than the power budget while the next OPP exceeds the power
+budget; an intermediate OPP could have been used if it were present.
+
+Solutions:
+----------
+
+If we can remove the static and the dynamic leakage for a specific
+duration in a controlled period, the SoC temperature will
+decrease. By acting on the idle state duration or the idle cycle
+injection period, we can mitigate the temperature by modulating the
+power budget.
+
+The Operating Performance Point (OPP) density has a great influence on
+the control precision of cpufreq; however, different vendors ship very
+different OPP densities, and some have a large power gap between OPPs,
+which will result in a loss of performance during thermal control and
+a loss of power in other scenarios.
+
+At a specific OPP, we can assume that by injecting idle cycles on all
+CPUs belonging to the same cluster, with a duration greater than the
+cluster idle state target residency, we drop the static and the
+dynamic leakage for this period (modulo the energy needed to enter
+this state). So the sustainable power with idle cycles has a linear
+relation with the OPP’s sustainable power and can be computed with a
+coefficient similar to:
+
+ Power(IdleCycle) = Coef x Power(OPP)
+
+Idle Injection:
+---------------
+
+The base concept of the idle injection is to force the CPU to go to an
+idle state for a specified time each control cycle; it provides
+another way to control CPU power and heat in addition to
+cpufreq. Ideally, if all CPUs belonging to the same cluster inject
+their idle cycles synchronously, the cluster can reach its power down
+state with a minimum power consumption and a static leakage
+drop. However, this idle cycle injection will add extra latency as
+the CPUs will have to wake up from a deep sleep state.
+
+ ^
+ |
+ |
+ |------- ------- -------
+ |_______|_____|_______|_____|_______|___________
+
+ <----->
+ idle <---->
+ running
+
+With the idle injection duration fixed, we can choose a value giving
+an acceptable performance drop-off or latency. When we reach a
+specific temperature, we begin to mitigate by varying the idle
+injection period.
+
+The mitigation begins with a maximum period value which decreases when
+more cooling effect is requested. When the period duration becomes
+equal to the idle duration, we are in a situation where the platform
+can’t dissipate the heat fast enough and the mitigation fails. In this
+case the situation is considered critical and there is nothing more to
+do. The idle injection duration must then be changed by configuration
+until we reach the cooling effect; otherwise an additional cooling
+device must be used, or ultimately the SoC performance decreased by
+dropping its highest OPP point.
+
+The idle injection duration value must comply with the constraints:
+
+- It is less than or equal to the latency we tolerate when the
+  mitigation begins. It is platform dependent and will depend on the
+  user experience and the reactivity vs performance trade-off we
+  want. This value should be specified.
+
+- It is greater than the target residency of the idle state we want
+  to use for thermal mitigation, otherwise we end up consuming more
+  energy.
+
+Minimum period
+--------------
+
+The idle injection duration being fixed, it is obvious the minimum
+period can’t be less than that; otherwise we would be scheduling the
+idle injection task right before the idle injection duration is
+complete, waking up the CPU only to put it back to sleep again.
+
+Maximum period
+--------------
+
+The maximum period is the initial period when the mitigation
+begins. Theoretically, when we reach the thermal trip point, we have
+to sustain a specified power at a specific temperature, but at this
+time we consume:
+
+ Power = Capacitance x Voltage^2 x Frequency x Utilisation
+
+... which is more than the sustainable power (or there is something
+wrong in the system setup). The ‘Capacitance’ and ‘Utilisation’ are
+fixed values; the ‘Voltage’ and the ‘Frequency’ are fixed artificially
+because we don’t want to change the OPP. We can group the
+‘Capacitance’ and the ‘Utilisation’ into a single term, the
+‘Dynamic Power Coefficient (Cdyn)’. Simplifying the above, we have:
+
+ Pdyn = Cdyn x Voltage^2 x Frequency
+
+The IPA will ask us somehow to reduce our power in order to target the
+sustainable power defined in the device tree. So with the idle
+injection mechanism, we want an average power (Ptarget) resulting from
+an amount of time running at full power at a specific OPP and idling
+for another amount of time. That can be put in an equation:
+
+ P(opp)target = ((trunning x P(opp)running) + (tidle x P(opp)idle)) /
+                (trunning + tidle)
+ ...
+
+ tidle = trunning x ((P(opp)running / P(opp)target) - 1)
+
+At this point, if we know the running period for the CPU, that gives
+us the idle injection duration we need. Alternatively, if we have the
+idle injection duration, we can compute the running duration with:
+
+ trunning = tidle / ((P(opp)running / P(opp)target) - 1)
+
+Practically, if the running power is less than the targeted power, we
+end up with a negative time value, so obviously the usage of the
+equation is bound to a power reduction; hence an OPP high enough to
+make the running power greater than the targeted power is needed.
+
+However, in this demonstration we ignore three aspects:
+
+ * The static leakage is not defined here; we can introduce it in the
+   equation but assume it will be zero most of the time, as it is
+   difficult to get the values from the SoC vendors
+
+ * The idle state wake up latency (or entry + exit latency) is not
+ taken into account, it must be added in the equation in order to
+ rigorously compute the idle injection
+
+ * The injected idle duration must be greater than the idle state
+ target residency, otherwise we end up consuming more energy and
+ potentially invert the mitigation effect
+
+So the final equation is:
+
+ trunning = (tidle - twakeup) /
+            (((P(opp)dyn + P(opp)static) - P(opp)target) / P(opp)target)
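
As a quick numerical check of the final equation (a stand-alone sketch, not
part of the documentation file; the figures are the hi6220 examples from
the idle cooling driver changelog, with the static leakage assumed zero):

  #include <stdio.h>

  /* trunning = (tidle - twakeup) /
   *            (((Pdyn + Pstatic) - Ptarget) / Ptarget) */
  static double running_time(double t_idle, double t_wakeup,
                             double p_dyn, double p_static,
                             double p_target)
  {
          return (t_idle - t_wakeup) /
                  (((p_dyn + p_static) - p_target) / p_target);
  }

  int main(void)
  {
          /* 10ms injection, 1.5ms wakeup latency, 5280mW dynamic power,
           * no static leakage, 3326mW sustainable power: ~14.47ms. */
          printf("trunning = %.2f ms\n",
                 running_time(10.0, 1.5, 5280.0, 0.0, 3326.0));

          return 0;
  }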
--
2.7.4
The next changes will add a new way to cool down a CPU. In order to
sanitize the overall cpu cooling code and make it consistent and
robust, we must prevent cpu cooling devices with the same purpose
from co-existing in the kernel at the same time.
Make the CPU cooling device a choice in the Kconfig, so only one CPU
cooling strategy can be chosen.
Signed-off-by: Daniel Lezcano <[email protected]>
---
drivers/thermal/Kconfig | 20 +++++++++++++++++---
drivers/thermal/cpu_cooling.c | 2 ++
include/linux/cpu_cooling.h | 6 +++---
3 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index b6adc54..5aaae1b 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -142,17 +142,31 @@ config THERMAL_GOV_POWER_ALLOCATOR
allocating and limiting power to devices.
config CPU_THERMAL
- bool "generic cpu cooling support"
- depends on CPU_FREQ
+ bool "Generic cpu cooling support"
depends on THERMAL_OF
help
+ Enable the CPU cooling features. If the system has no active
+ cooling device available, this option allows the CPU to be used
+ as a cooling device.
+
+choice
+ prompt "CPU cooling strategies"
+ depends on CPU_THERMAL
+ default CPU_FREQ_THERMAL
+ help
+ Select the CPU cooling strategy.
+
+config CPU_FREQ_THERMAL
+ bool "CPU frequency cooling strategy"
+ depends on CPU_FREQ
+ help
This implements the generic cpu cooling mechanism through frequency
reduction. An ACPI version of this already exists
(drivers/acpi/processor_thermal.c).
This will be useful for platforms using the generic thermal interface
and not the ACPI interface.
- If you want this support, you should say Y here.
+endchoice
config CLOCK_THERMAL
bool "Generic clock cooling support"
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 7bdc19f..5c219dc 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -22,6 +22,7 @@
#include <trace/events/thermal.h>
+#ifdef CONFIG_CPU_FREQ_THERMAL
/*
* Cooling state <-> CPUFreq frequency
*
@@ -926,3 +927,4 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
kfree(cpufreq_cdev);
}
EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
+#endif /* CONFIG_CPU_FREQ_THERMAL */
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index d4292eb..c0accc7 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -33,7 +33,7 @@ struct cpufreq_policy;
typedef int (*get_static_t)(cpumask_t *cpumask, int interval,
unsigned long voltage, u32 *power);
-#ifdef CONFIG_CPU_THERMAL
+#ifdef CONFIG_CPU_FREQ_THERMAL
/**
* cpufreq_cooling_register - function to create cpufreq cooling device.
* @policy: cpufreq policy.
@@ -84,7 +84,7 @@ of_cpufreq_power_cooling_register(struct device_node *np,
*/
void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev);
-#else /* !CONFIG_CPU_THERMAL */
+#else /* !CONFIG_CPU_FREQ_THERMAL */
static inline struct thermal_cooling_device *
cpufreq_cooling_register(struct cpufreq_policy *policy)
{
@@ -118,6 +118,6 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
{
return;
}
-#endif /* CONFIG_CPU_THERMAL */
+#endif /* CONFIG_CPU_FREQ_THERMAL */
#endif /* __CPU_COOLING_H__ */
--
2.7.4
On 05-04-18, 18:16, Daniel Lezcano wrote:
> The copyright format does not conform to the format requested by
> Linaro: https://wiki.linaro.org/Copyright
>
> Fix it.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
You forgot to include my Ack ?
> ---
> drivers/thermal/cpu_cooling.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index dc63aba..42110ee 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -2,9 +2,11 @@
> * linux/drivers/thermal/cpu_cooling.c
> *
> * Copyright (C) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com)
> - * Copyright (C) 2012 Amit Daniel <[email protected]>
> *
> - * Copyright (C) 2014 Viresh Kumar <[email protected]>
> + * Copyright (C) 2012-2018 Linaro Limited.
> + *
> + * Authors: Amit Daniel <[email protected]>
> + * Viresh Kumar <[email protected]>
> *
> * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> * This program is free software; you can redistribute it and/or modify
> --
> 2.7.4
--
viresh
On 05-04-18, 18:16, Daniel Lezcano wrote:
> The next changes will add new way to cool down a CPU. In order to
> sanitize and make the overall cpu cooling code consistent and robust
> we must prevent the cpu cooling devices to co-exists with the same
> purpose at the same time in the kernel.
>
> Make the CPU cooling device a choice in the Kconfig, so only one CPU
> cooling strategy can be chosen.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/Kconfig | 20 +++++++++++++++++---
> drivers/thermal/cpu_cooling.c | 2 ++
> include/linux/cpu_cooling.h | 6 +++---
> 3 files changed, 22 insertions(+), 6 deletions(-)
I don't see my review comments from V2 being addressed for this patch.
Can you please check that again ?
--
viresh
On 05-04-18, 18:16, Daniel Lezcano wrote:
> Register the ARM generic cpuidle driver as a cooling device.
>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/cpuidle/cpuidle-arm.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
> index ddee1b6..44b4fe3 100644
> --- a/drivers/cpuidle/cpuidle-arm.c
> +++ b/drivers/cpuidle/cpuidle-arm.c
> @@ -11,6 +11,7 @@
>
> #define pr_fmt(fmt) "CPUidle arm: " fmt
>
> +#include <linux/cpu_cooling.h>
> #include <linux/cpuidle.h>
> #include <linux/cpumask.h>
> #include <linux/cpu_pm.h>
> @@ -172,6 +173,8 @@ static int __init arm_idle_init(void)
> goto out_fail;
> }
>
> + cpuidle_cooling_register();
> +
> return 0;
>
> out_fail:
Acked-by: Viresh Kumar <[email protected]>
--
viresh
On 05-04-18, 18:16, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all
> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>
> Each cluster has its own idle cooling device, each core has its own idle
> injection thread, each idle injection thread uses play_idle to enter idle. In
> order to reach the deepest idle state, each cooling device has the idle
> injection threads synchronized together.
>
> It has some similarity with the intel power clamp driver but it is actually
> designed to work on the ARM architecture via the DT with a mathematical proof
> with the power model which comes with the Documentation.
>
> The idle injection cycle is fixed while the running cycle is variable. That
> allows to have control on the device reactivity for the user experience. At
> the mitigation point the idle threads are unparked, they play idle the
> specified amount of time and they schedule themselves. The last thread sets
> the next idle injection deadline and when the timer expires it wakes up all
> the threads which in turn play idle again. Meanwhile the running cycle is
> changed by set_cur_state. When the mitigation ends, the threads are parked.
> The algorithm is self adaptive, so there is no need to handle hotplugging.
>
> If we take an example of the balanced point, we can use the DT for the hi6220.
>
> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
> running at full blast at the maximum OPP consumes 5280mW. The first value is
> given in the DT, the second is calculated from the OPP with the formula:
>
> Pdyn = Cdyn x Voltage^2 x Frequency
>
> As the SoC vendors don't want to share the static leakage values, we assume
> it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
>
> In order to reduce the power to 3326mW, we have to apply a ratio to the
> running time.
>
> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0,5874
>
> We know the idle cycle which is fixed, let's assume 10ms. However from this
> duration we have to substract the wake up latency for the cluster idle state.
> In our case, it is 1.5ms. So for a 10ms latency for idle, we are really idle
> 8.5ms.
>
> As we know the idle duration and the ratio, we can compute the running cycle.
>
> running_cycle = 8.5 / 0.5874 = 14.47ms
>
> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
> SoC to the balanced trip point of 75°C.
>
> The driver has been tested on the hi6220 and it appears the temperature
> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and
> running cycle of 14ms as expected by the theory above.
>
> Signed-off-by: Kevin Wangtao <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/Kconfig | 10 +
> drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/cpu_cooling.h | 6 +
> 3 files changed, 495 insertions(+)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 5aaae1b..6c34117 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
> This will be useful for platforms using the generic thermal interface
> and not the ACPI interface.
>
> +config CPU_IDLE_THERMAL
> + bool "CPU idle cooling strategy"
> + depends on CPU_IDLE
> + help
> + This implements the generic CPU cooling mechanism through
> + idle injection. This will throttle the CPU by injecting
> + fixed idle cycle. All CPUs belonging to the same cluster
> + will enter idle synchronously to reach the deepest idle
> + state.
> +
> endchoice
>
> config CLOCK_THERMAL
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 5c219dc..1eec8d6 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -10,18 +10,33 @@
> * Viresh Kumar <[email protected]>
> *
> */
> +#define pr_fmt(fmt) "CPU cooling: " fmt
> +
> #include <linux/module.h>
> #include <linux/thermal.h>
> #include <linux/cpufreq.h>
> +#include <linux/cpuidle.h>
> #include <linux/err.h>
> +#include <linux/freezer.h>
> #include <linux/idr.h>
> +#include <linux/kthread.h>
> #include <linux/pm_opp.h>
> #include <linux/slab.h>
> +#include <linux/sched/prio.h>
> +#include <linux/sched/rt.h>
> +#include <linux/smpboot.h>
> #include <linux/cpu.h>
> #include <linux/cpu_cooling.h>
>
> +#include <linux/ratelimit.h>
What part of the code is using this one ?
> +
Why add above 2 blank lines ?
> +#include <linux/platform_device.h>
> +#include <linux/of_platform.h>
> +
> #include <trace/events/thermal.h>
>
> +#include <uapi/linux/sched/types.h>
> +
> #ifdef CONFIG_CPU_FREQ_THERMAL
> /*
> * Cooling state <-> CPUFreq frequency
> @@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> }
> EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
> #endif /* CONFIG_CPU_FREQ_THERMAL */
> +
> +#ifdef CONFIG_CPU_IDLE_THERMAL
> +/**
> + * struct cpuidle_cooling_device - data for the idle cooling device
> + * @cdev: a pointer to a struct thermal_cooling_device
> + * @cpumask: a cpumask containing the CPU managed by the cooling device
> + * @timer: a hrtimer giving the tempo for the idle injection cycles
> + * @kref: a kernel refcount on this structure
> + * @count: an atomic to keep track of the last task exiting the idle cycle
> + * @idle_cycle: an integer defining the duration of the idle injection
> + * @state: an normalized integer giving the state of the cooling device
> + */
> +struct cpuidle_cooling_device {
> + struct thermal_cooling_device *cdev;
> + struct cpumask *cpumask;
> + struct hrtimer timer;
> + struct kref kref;
> + atomic_t count;
> + unsigned int idle_cycle;
> + unsigned long state;
> +};
> +
> +struct cpuidle_cooling_thread {
> + struct task_struct *tsk;
> + int should_run;
> +};
> +
> +static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);
> +static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);
> +
> +/**
> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
> + * @idle_cdev: the idle cooling device
> + *
> + * Every idle injection task belonging to the idle cooling device and
> + * running on an online cpu will be wake up by this call.
> + */
> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
> +{
> + struct cpuidle_cooling_thread *cct;
> + int cpu;
> +
> + for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
> + cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);
> + cct->should_run = 1;
> + wake_up_process(cct->tsk);
> + }
> +}
> +
> +/**
> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
> + * @timer: a hrtimer structure
> + *
> + * When the mitigation is acting, the CPU is allowed to run an amount
> + * of time, then the idle injection happens for the specified delay
> + * and the idle task injection schedules itself until the timer event
> + * wakes the idle injection tasks again for a new idle injection
> + * cycle. The time between the end of the idle injection and the timer
> + * expiration is the allocated running time for the CPU.
> + *
> + * Always returns HRTIMER_NORESTART
> + */
> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(timer, struct cpuidle_cooling_device, timer);
> +
> + cpuidle_cooling_wakeup(idle_cdev);
> +
> + return HRTIMER_NORESTART;
> +}
> +
> +/**
> + * cpuidle_cooling_runtime - Running time computation
> + * @idle_cdev: the idle cooling device
> + *
> + * The running duration is computed from the idle injection duration
> + * which is fixed. If we reach 100% of idle injection ratio, that
> + * means the running duration is zero. If we have a 50% ratio
> + * injection, that means we have equal duration for idle and for
> + * running duration.
> + *
> + * The formula is deduced as the following:
> + *
> + * running = idle x ((100 / ratio) - 1)
> + *
> + * For precision purpose for integer math, we use the following:
> + *
> + * running = (idle x 100) / ratio - idle
> + *
> + * For example, if we have an injected duration of 50%, then we end up
> + * with 10ms of idle injection and 10ms of running duration.
> + *
> + * Returns a s64 nanosecond based
> + */
> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
> +{
> + s64 next_wakeup;
> + unsigned long state = idle_cdev->state;
> +
> + /*
> + * The function should not be called when there is no
> + * mitigation because:
> + * - that does not make sense
> + * - we end up with a division by zero
> + */
> + if (!state)
> + return 0;
> +
> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
> + idle_cdev->idle_cycle;
> +
> + return next_wakeup * NSEC_PER_USEC;
> +}
> +
> +/**
> + * cpuidle_cooling_injection - Idle injection mainloop thread function
> + * @cpu: an integer giving the cpu number the thread is pinned on
> + *
> + * This main function does basically two operations:
> + *
> + * - Goes idle for a specific amount of time
> + *
> + * - Sets a timer to wake up all the idle injection threads after a
> + * running period
> + *
> + * That happens only when the mitigation is enabled, otherwise the
> + * task is scheduled out.
> + *
> + * In order to keep the tasks synchronized together, it is the last
> + * task exiting the idle period which is in charge of setting the
> + * timer.
> + *
> + * This function never returns.
> + */
> +static void cpuidle_cooling_injection(unsigned int cpu)
> +{
> + s64 next_wakeup;
> +
> + struct cpuidle_cooling_device *idle_cdev =
> + per_cpu(cpuidle_cooling_device, cpu);
> +
> + struct cpuidle_cooling_thread *cct =
> + per_cpu_ptr(&cpuidle_cooling_thread, cpu);
> +
> + atomic_inc(&idle_cdev->count);
> +
> + cct->should_run = 0;
> +
> + play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> +
> + /*
> + * The last CPU waking up is in charge of setting the
> + * timer. If the CPU is hotplugged, the timer will
> + * move to another CPU (which may not belong to the
> + * same cluster) but that is not a problem as the
> + * timer will be set again by another CPU belonging to
> + * the cluster, so this mechanism is self adaptive and
> + * does not require any hotplugging dance.
> + */
> + if (!atomic_dec_and_test(&idle_cdev->count))
> + return;
> +
> + next_wakeup = cpuidle_cooling_runtime(idle_cdev);
> + if (next_wakeup)
> + hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
> + HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static void cpuidle_cooling_setup(unsigned int cpu)
> +{
> + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
Doesn't checkpatch report about space requirements around '/' ?
> +
> + set_freezable();
> +
> + sched_setscheduler(current, SCHED_FIFO, &param);
> +}
> +
> +static int cpuidle_cooling_should_run(unsigned int cpu)
> +{
> + struct cpuidle_cooling_thread *cct =
> + per_cpu_ptr(&cpuidle_cooling_thread, cpu);
> +
> + return cct->should_run;
> +}
> +
> +static struct smp_hotplug_thread cpuidle_cooling_threads = {
> + .store = &cpuidle_cooling_thread.tsk,
> + .thread_fn = cpuidle_cooling_injection,
> + .thread_comm = "thermal-idle/%u",
> + .thread_should_run = cpuidle_cooling_should_run,
> + .setup = cpuidle_cooling_setup,
> +};
> +
> +/**
> + * cpuidle_cooling_get_max_state - Get the maximum state
> + * @cdev : the thermal cooling device
> + * @state : a pointer to the state variable to be filled
> + *
> + * The function always gives 100 as the injection ratio is percentile
> + * based for consistency accros different platforms.
> + *
> + * The function can not fail, it always returns zero.
> + */
> +static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
This isn't aligned as rest of the routines in this driver. Try running
checkpatch with --strict option.
> +{
> + /*
> + * Depending on the configuration or the hardware, the running
> + * cycle and the idle cycle could be different. We want unify
> + * that to an 0..100 interval, so the set state interface will
> + * be the same whatever the platform is.
> + *
> + * The state 100% will make the cluster 100% ... idle. A 0%
> + * injection ratio means no idle injection at all and 50%
> + * means for 10ms of idle injection, we have 10ms of running
> + * time.
> + */
> + *state = 100;
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_get_cur_state - Get the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: a pointer to the state
> + *
> + * The function just copy the state value from the private thermal
> + * cooling device structure, the mapping is 1 <-> 1.
> + *
> + * The function can not fail, it always returns zero.
> + */
> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long *state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> +
> + *state = idle_cdev->state;
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_set_cur_state - Set the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: the target state
> + *
> + * The function checks first if we are initiating the mitigation which
> + * in turn wakes up all the idle injection tasks belonging to the idle
> + * cooling device. In any case, it updates the internal state for the
> + * cooling device.
> + *
> + * The function can not fail, it always returns zero.
> + */
> +static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
> + unsigned long state)
> +{
> + struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> + unsigned long current_state = idle_cdev->state;
> +
> + idle_cdev->state = state;
> +
> + if (current_state == 0 && state > 0) {
> + pr_debug("Starting cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + cpuidle_cooling_wakeup(idle_cdev);
> + } else if (current_state > 0 && !state) {
> + pr_debug("Stopping cooling cpus '%*pbl'\n",
> + cpumask_pr_args(idle_cdev->cpumask));
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_ops - thermal cooling device ops
> + */
> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
> + .get_max_state = cpuidle_cooling_get_max_state,
> + .get_cur_state = cpuidle_cooling_get_cur_state,
> + .set_cur_state = cpuidle_cooling_set_cur_state,
> +};
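For context, this is roughly how the thermal core ends up driving these
callbacks during a mitigation episode (a simplified sketch, not the
actual thermal core code):

	unsigned long max_state, cur_state;

	cdev->ops->get_max_state(cdev, &max_state);	/* always 100 here */
	cdev->ops->set_cur_state(cdev, 50);		/* 50% idle injection */
	cdev->ops->get_cur_state(cdev, &cur_state);	/* reads back 50 */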
> +
> +/**
> + * cpuidle_cooling_release - Kref based release helper
> + * @kref: a pointer to the kref structure
> + *
> + * This function is automatically called by the kref_put function when
> + * the idle cooling device refcount reaches zero. At this point, we
> + * have the guarantee that the structure is no longer in use and we can
> + * safely release all the resources.
> + */
> +static void __init cpuidle_cooling_release(struct kref *kref)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(kref, struct cpuidle_cooling_device, kref);
> +
> + if (idle_cdev->cdev)
> + thermal_cooling_device_unregister(idle_cdev->cdev);
> +
> + hrtimer_cancel(&idle_cdev->timer);
> + kfree(idle_cdev);
> +}
> +
> +/**
> + * cpuidle_cooling_unregister - Idle cooling device exit function
> + *
> + * This function unregisters the cpuidle cooling device and frees the
> + * resources previously allocated by the init function. This function
> + * is called when the initialization fails.
> + */
> +static void __init cpuidle_cooling_unregister(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + idle_cdev = per_cpu(cpuidle_cooling_device, cpu);
> + if (idle_cdev)
> + kref_put(&idle_cdev->kref, cpuidle_cooling_release);
> + }
> +}
> +
> +/**
> + * cpuidle_cooling_alloc - Allocate and initialize an idle cooling device
> + * @cpumask: a cpumask containing all the cpus handled by the cooling device
> + *
> + * The function is called at init time only. It allocates and
> + * initializes the different fields of the cpuidle cooling device.
> + *
> + * It returns a pointer to a cpuidle_cooling_device structure on
> + * success, NULL on error.
> + */
> +static struct cpuidle_cooling_device * __init cpuidle_cooling_alloc(
> + cpumask_t *cpumask)
I would rather break the line after __init.
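i.e. something like:

static struct cpuidle_cooling_device * __init
cpuidle_cooling_alloc(cpumask_t *cpumask)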
> +{
> + struct cpuidle_cooling_device *idle_cdev;
> + int cpu;
> +
> + idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> + if (!idle_cdev)
> + return NULL;
> +
> + /*
> + * The idle injection duration. As we don't yet have a way to
> + * specify it from the DT configuration, let's default to a tick
> + * duration.
> + */
> + idle_cdev->idle_cycle = TICK_USEC;
> +
> + /*
> + * Initialize the timer to wakeup all the idle injection tasks
> + */
> + hrtimer_init(&idle_cdev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> + /*
> + * The wakeup function callback which is in charge of waking
> + * up all CPUs belonging to the same cluster
Need a full-stop (.) at the end. Please check all comments for this.
> + */
> + idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
> +
> + idle_cdev->cpumask = cpumask;
> +
> + /*
> + * For each cpu belonging to the cluster, set the per cpu
> + * cpuidle_cooling_device pointer and take a reference on the
> + * cooling device.
> + */
> + for_each_cpu(cpu, cpumask) {
> + kref_get(&idle_cdev->kref);
> + per_cpu(cpuidle_cooling_device, cpu) = idle_cdev;
> + }
> +
> + return idle_cdev;
> +}
> +
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and registering it with the thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * We create a cpuidle cooling device per cluster. For this reason we
> + * must, for each cluster, allocate and initialize the cooling device
> + * and for each cpu belonging to this cluster, do the initialization
> + * on a cpu basis.
> + *
> + * This approach for creating the cooling device is needed as we don't
> + * have the guarantee the CPU numbering is sequential.
guarantee that the CPU
> + *
> + * Unfortunately, there is no API to browse from top to bottom the
> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.
> + * In order to solve that, we use a cpumask to flag the cluster_id we
> + * already processed. The cpumask will always have enough room for all
> + * the clusters because it is based on NR_CPUS and it is not possible
> + * to have more clusters than cpus.
> + *
> + */
> +void __init cpuidle_cooling_register(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev = NULL;
> + struct thermal_cooling_device *cdev;
> + struct device_node *np;
> + cpumask_var_t cpumask;
maybe call it clustermask ?
> + char dev_name[THERMAL_NAME_LENGTH];
> + int ret = -ENOMEM, cpu;
> + int cluster_id;
> +
> + if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> + return;
> +
> + for_each_possible_cpu(cpu) {
> +
> + cluster_id = topology_physical_package_id(cpu);
> + if (cpumask_test_cpu(cluster_id, cpumask))
> + continue;
> +
> + /*
> + * Allocate the cpuidle cooling device with the list
> + * of the cpus belonging to the cluster.
> + */
> + idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));
> + if (!idle_cdev)
> + goto out;
> +
> + /*
> + * For the thermal cooling device name, we use the
> + * cluster_id as the numbering index for the idle
> + * cooling device.
> + */
> + snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d",
> + cluster_id);
> +
> + np = of_cpu_device_node_get(cpu);
Check if np is valid ?
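Something like this, maybe (a sketch; bailing out with -ENODEV is an
assumption, skipping the cpu may be preferable):

+		np = of_cpu_device_node_get(cpu);
+		if (!np) {
+			ret = -ENODEV;
+			goto out;
+		}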
> + cdev = thermal_of_cooling_device_register(np, dev_name,
> + idle_cdev,
> + &cpuidle_cooling_ops);
> + if (IS_ERR(cdev)) {
> + ret = PTR_ERR(cdev);
> + goto out;
> + }
> +
> + idle_cdev->cdev = cdev;
> + cpumask_set_cpu(cluster_id, cpumask);
> + }
> +
> + ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);
> + if (ret)
> + goto out;
> +
> + pr_info("Created cpuidle cooling device\n");
> +out:
> + free_cpumask_var(cpumask);
> +
> + if (ret) {
> + cpuidle_cooling_unregister();
> + pr_err("Failed to create idle cooling device (%d)\n", ret);
> + }
Maybe rearrange it a bit:
+ ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);
+
+out:
+ if (ret) {
+ cpuidle_cooling_unregister();
+ pr_err("Failed to create idle cooling device (%d)\n", ret);
+ } else {
+ pr_info("Created cpuidle cooling devices\n");
+ }
+
+ free_cpumask_var(cpumask);
??
> +}
> +#endif /* CONFIG_CPU_IDLE_THERMAL */
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index c0accc7..af5520d 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -120,4 +120,10 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> }
> #endif /* CONFIG_CPU_FREQ_THERMAL */
>
> +#ifdef CONFIG_CPU_IDLE_THERMAL
> +extern void __init cpuidle_cooling_register(void);
> +#else /* CONFIG_CPU_IDLE_THERMAL */
> +static inline void __init cpuidle_cooling_register(void) { }
> +#endif /* CONFIG_CPU_IDLE_THERMAL */
> +
> #endif /* __CPU_COOLING_H__ */
> --
> 2.7.4
--
viresh
On 11/04/2018 08:15, Viresh Kumar wrote:
> On 05-04-18, 18:16, Daniel Lezcano wrote:
>> The copyright format does not conform to the format requested by
>> Linaro: https://wiki.linaro.org/Copyright
>>
>> Fix it.
>>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>
> You forgot to include my Ack ?
Yes :)
On 11/04/2018 08:18, Viresh Kumar wrote:
> On 05-04-18, 18:16, Daniel Lezcano wrote:
>> The next changes will add a new way to cool down a CPU. In order to
>> sanitize and make the overall cpu cooling code consistent and robust
>> we must prevent cpu cooling devices with the same purpose from
>> co-existing at the same time in the kernel.
>>
>> Make the CPU cooling device a choice in the Kconfig, so only one CPU
>> cooling strategy can be chosen.
>>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>> drivers/thermal/Kconfig | 20 +++++++++++++++++---
>> drivers/thermal/cpu_cooling.c | 2 ++
>> include/linux/cpu_cooling.h | 6 +++---
>> 3 files changed, 22 insertions(+), 6 deletions(-)
>
> I don't see my review comments from V2 being addressed for this patch.
> Can you please check that again ?
Right, I missed the end of the email :/
I will fix that.
Hi Viresh,
thanks for the review.
On 11/04/2018 10:51, Viresh Kumar wrote:
[ ... ]
>> +void __init cpuidle_cooling_register(void)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev = NULL;
>> + struct thermal_cooling_device *cdev;
>> + struct device_node *np;
>> + cpumask_var_t cpumask;
>
> maybe call it clustermask ?
Yeah, sounds better.
>> + cdev = thermal_of_cooling_device_register(np, dev_name,
>> + idle_cdev,
>> + &cpuidle_cooling_ops);
>> + if (IS_ERR(cdev)) {
>> + ret = PTR_ERR(cdev);
>> + goto out;
>> + }
>> +
>> + idle_cdev->cdev = cdev;
>> + cpumask_set_cpu(cluster_id, cpumask);
>> + }
>> +
>> + ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);
>> + if (ret)
>> + goto out;
>> +
>> + pr_info("Created cpuidle cooling device\n");
>> +out:
>> + free_cpumask_var(cpumask);
>> +
>> + if (ret) {
>> + cpuidle_cooling_unregister();
>> + pr_err("Failed to create idle cooling device (%d)\n", ret);
>> + }
>
> Maybe rearrange it a bit:
>
> + ret = smpboot_register_percpu_thread(&cpuidle_cooling_threads);
> +
> +out:
> + if (ret) {
> + cpuidle_cooling_unregister();
> + pr_err("Failed to create idle cooling device (%d)\n", ret);
> + } else {
> + pr_info("Created cpuidle cooling devices\n");
> + }
> +
> + free_cpumask_var(cpumask);
>
> ??
I usually tend to avoid using 'else' statements when possible (a coding
style practice) but if you prefer this version I'm fine with that.
Hi Daniel,
On 05/04/18 17:16, Daniel Lezcano wrote:
[...]
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and register it to thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * We create a cpuidle cooling device per cluster. For this reason we
> + * must, for each cluster, allocate and initialize the cooling device
> + * and for each cpu belonging to this cluster, do the initialization
> + * on a cpu basis.
> + *
> + * This approach for creating the cooling device is needed as we don't
> + * have the guarantee the CPU numbering is sequential.
> + *
> + * Unfortunately, there is no API to browse from top to bottom the
> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.
> + * In order to solve that, we use a cpumask to flag the cluster_id we
> + * already processed. The cpumask will always have enough room for all
> + * the cluster because it is based on NR_CPUS and it is not possible
> + * to have more clusters than cpus.
> + *
> + */
> +void __init cpuidle_cooling_register(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev = NULL;
> + struct thermal_cooling_device *cdev;
> + struct device_node *np;
> + cpumask_var_t cpumask;
> + char dev_name[THERMAL_NAME_LENGTH];
> + int ret = -ENOMEM, cpu;
> + int cluster_id;
> +
> + if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> + return;
> +
> + for_each_possible_cpu(cpu) {
> +
> + cluster_id = topology_physical_package_id(cpu);
> + if (cpumask_test_cpu(cluster_id, cpumask))
> + continue;
Sorry for chiming in randomly, I haven't read the patches in detail.
But it was brought to my notice that topology_physical_package_id is
being used for the cluster ID here. It's completely wrong and will
change soon with the ACPI topology related changes Jeremy is working on.
You will get the physical socket number (which is mostly 0 on a single SoC
system). Make sure that this won't break with that change.
Please note that with the cluster id not defined architecturally, relying
on it is simply problematic.
--
Regards,
Sudeep
On Thu, Apr 05, 2018 at 06:16:43PM +0200, Daniel Lezcano wrote:
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and register it to thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * We create a cpuidle cooling device per cluster. For this reason we
> + * must, for each cluster, allocate and initialize the cooling device
> + * and for each cpu belonging to this cluster, do the initialization
> + * on a cpu basis.
> + *
> + * This approach for creating the cooling device is needed as we don't
> + * have the guarantee the CPU numbering is sequential.
> + *
> + * Unfortunately, there is no API to browse from top to bottom the
> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.
> + * In order to solve that, we use a cpumask to flag the cluster_id we
> + * already processed. The cpumask will always have enough room for all
> + * the cluster because it is based on NR_CPUS and it is not possible
> + * to have more clusters than cpus.
> + *
> + */
> +void __init cpuidle_cooling_register(void)
> +{
> + struct cpuidle_cooling_device *idle_cdev = NULL;
> + struct thermal_cooling_device *cdev;
> + struct device_node *np;
> + cpumask_var_t cpumask;
> + char dev_name[THERMAL_NAME_LENGTH];
> + int ret = -ENOMEM, cpu;
> + int cluster_id;
> +
> + if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> + return;
> +
> + for_each_possible_cpu(cpu) {
> +
> + cluster_id = topology_physical_package_id(cpu);
> + if (cpumask_test_cpu(cluster_id, cpumask))
> + continue;
> +
> + /*
> + * Allocate the cpuidle cooling device with the list
> + * of the cpus belonging to the cluster.
> + */
> + idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));
> + if (!idle_cdev)
> + goto out;
Perhaps nitpicking but it might be better to set ret to -ENOMEM here
and avoid initializing it during the declarations.
There's no bug in the current form since ret is never assigned to
outside of the error paths, but the unwritten assumption that ret keeps
its original value throughout the loop seems like a bit of a landmine
w.r.t. future maintenance.
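i.e. (a sketch of the same suggestion):

+		idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));
+		if (!idle_cdev) {
+			ret = -ENOMEM;
+			goto out;
+		}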
Daniel.
On 13/04/2018 13:38, Daniel Thompson wrote:
[ ... ]
>> + /*
>> + * Allocate the cpuidle cooling device with the list
>> + * of the cpus belonging to the cluster.
>> + */
>> + idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));
>> + if (!idle_cdev)
>> + goto out;
>
> Perhaps nitpicking but it might be better to set ret to -ENOMEM here
> and avoid initializing it during the declarations.
>
> There's no bug in the current form since ret is never assigned to
> outside of the error paths, but the unwritten assumption that ret keeps
> its original value throughout the loop seems like a bit of a landmine
> w.r.t. future maintainance.
Agree.
On 13/04/2018 13:23, Sudeep Holla wrote:
> Hi Daniel,
>
> On 05/04/18 17:16, Daniel Lezcano wrote:
>
> [...]
>
>> +/**
>> + * cpuidle_cooling_register - Idle cooling device initialization function
>> + *
>> + * This function is in charge of creating a cooling device per cluster
>> + * and register it to thermal framework. For this we rely on the
>> + * topology as there is nothing yet describing better the idle state
>> + * power domains.
>> + *
>> + * We create a cpuidle cooling device per cluster. For this reason we
>> + * must, for each cluster, allocate and initialize the cooling device
>> + * and for each cpu belonging to this cluster, do the initialization
>> + * on a cpu basis.
>> + *
>> + * This approach for creating the cooling device is needed as we don't
>> + * have the guarantee the CPU numbering is sequential.
>> + *
>> + * Unfortunately, there is no API to browse from top to bottom the
>> + * topology, cluster->cpu, only the usual for_each_possible_cpu loop.
>> + * In order to solve that, we use a cpumask to flag the cluster_id we
>> + * already processed. The cpumask will always have enough room for all
>> + * the cluster because it is based on NR_CPUS and it is not possible
>> + * to have more clusters than cpus.
>> + *
>> + */
>> +void __init cpuidle_cooling_register(void)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev = NULL;
>> + struct thermal_cooling_device *cdev;
>> + struct device_node *np;
>> + cpumask_var_t cpumask;
>> + char dev_name[THERMAL_NAME_LENGTH];
>> + int ret = -ENOMEM, cpu;
>> + int cluster_id;
>> +
>> + if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>> + return;
>> +
>> + for_each_possible_cpu(cpu) {
>> +
>> + cluster_id = topology_physical_package_id(cpu);
>> + if (cpumask_test_cpu(cluster_id, cpumask))
>> + continue;
>
> Sorry for chiming in randomly, I haven't read the patches in detail.
> But it was brought to my notice that topology_physical_package_id is
> being used for cluster ID here. It's completely wrong and will
> changesoon with ACPI topology related changes Jeremy is working on.
>
> You will get the physical socket number(which is mostly 0 on single SoC
> system). Makes sure that this won't break with that change.
>
> Please note with cluster id not defined architecturally, relying on that
> is simply problematic.
Ok, noted. At the first glance, it should not be a problem.
On 13-04-18, 13:47, Daniel Lezcano wrote:
> Ok, noted. At the first glance, it should not be a problem.
Why do you think it wouldn't be a problem ?
--
viresh
On 16/04/2018 09:37, Viresh Kumar wrote:
> On 13-04-18, 13:47, Daniel Lezcano wrote:
>> Ok, noted. At the first glance, it should not be a problem.
>
> Why do you think it wouldn't be a problem ?
Because we rely on the number to identify the cluster and flag it
'processed'. The number itself is not important.
On 16-04-18, 11:45, Daniel Lezcano wrote:
> Can you elaborate a bit ? I'm not sure to get the point.
Sure. With your current code on Hikey960 (big/LITTLE), you end up
creating two cooling devices, one for the big cluster and one for
small cluster. Which is the right thing to do, as we also have two
cpufreq cooling devices.
But with the change Sudeep is referring to, the helper you used to get
cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
will end up creating a single cpuidle cooling device for all the CPUs.
Which would be wrong.
> BTW, Am I asked to change my code to stick to something which is not
> merged ?
Sudeep looked pretty confident on how the meaning of this routine is
going to change very soon. I will let him respond on what guarantees
we have that it will get merged :)
--
viresh
On 16/04/2018 11:50, Viresh Kumar wrote:
> On 16-04-18, 11:45, Daniel Lezcano wrote:
>> Can you elaborate a bit ? I'm not sure to get the point.
>
> Sure. With your current code on Hikey960 (big/LITTLE), you end up
> creating two cooling devices, one for the big cluster and one for
> small cluster. Which is the right thing to do, as we also have two
> cpufreq cooling devices.
>
> But with the change Sudeep is referring to, the helper you used to get
> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
> will end up creating a single cpuidle cooling device for all the CPUs.
> Which would be wrong.
Is the semantic of topology_physical_package_id changing ? I don't
understand the change Sudeep is referring to.
I saw this attempt:
https://patchwork.kernel.org/patch/9959977/
>> BTW, Am I asked to change my code to stick to something which is not
>> merged ?
>
> Sudeep looked pretty confident on how the meaning of this routine is
> going to change very soon. I will let him respond on what guarantees
> we have that it will get merged :)
On 16-04-18, 12:03, Daniel Lezcano wrote:
> On 16/04/2018 11:50, Viresh Kumar wrote:
> > On 16-04-18, 11:45, Daniel Lezcano wrote:
> >> Can you elaborate a bit ? I'm not sure to get the point.
> >
> > Sure. With your current code on Hikey960 (big/LITTLE), you end up
> > creating two cooling devices, one for the big cluster and one for
> > small cluster. Which is the right thing to do, as we also have two
> > cpufreq cooling devices.
> >
> > But with the change Sudeep is referring to, the helper you used to get
> > cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
> > will end up creating a single cpuidle cooling device for all the CPUs.
> > Which would be wrong.
>
> Is the semantic of topology_physical_package_id changing ?
That's what I understood from his email.
> I don't
> understand the change Sudeep is referring to.
>
> I saw this attempt:
>
> https://patchwork.kernel.org/patch/9959977/
@Sudeep: Is using topology_cod_id() okay in that case ?
--
viresh
On Mon, Apr 16, 2018 at 09:44:51AM +0200, Daniel Lezcano wrote:
> On 16/04/2018 09:37, Viresh Kumar wrote:
> > On 13-04-18, 13:47, Daniel Lezcano wrote:
> >> Ok, noted. At the first glance, it should not be a problem.
> >
> > Why do you think it wouldn't be a problem ?
>
> Because we rely on the number to identify the cluster and flag it
> 'processed'. The number itself is not important.
>
In that case, why can't cpu_possible_mask be used for simplicity if all
you need is the tracking and that need not be grouped under the "cluster"
banner?
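For instance, reusing the registration loop but flagging the already
covered cpus instead of treating cluster ids as bit positions (just a
sketch of the idea):

+	for_each_possible_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, cpumask))
+			continue;
+
+		/* flag all the cpus handled by this cooling device */
+		cpumask_or(cpumask, cpumask, topology_core_cpumask(cpu));
+
+		idle_cdev = cpuidle_cooling_alloc(topology_core_cpumask(cpu));
+		...
+	}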
--
Regards,
Sudeep
On 16-04-18, 09:44, Daniel Lezcano wrote:
> Because we rely on the number to identify the cluster and flag it
> 'processed'. The number itself is not important.
It is, because you are creating multiple groups (like cpufreq
policies) right now based on cluster id. Which will be zero for all
the CPUs. So in a big little system we will have two cpufreq cooling
devices and one cpuidle cooling device.
--
viresh
On 16/04/2018 11:37, Viresh Kumar wrote:
> On 16-04-18, 09:44, Daniel Lezcano wrote:
>> Because we rely on the number to identify the cluster and flag it
>> 'processed'. The number itself is not important.
>
> It is, because you are creating multiple groups (like cpufreq
> policies) right now based on cluster id. Which will be zero for all
> the CPUs. So in a big little system we will have two cpufreq cooling
> devices and one cpuidle cooling device.
Can you elaborate a bit ? I'm not sure to get the point.
BTW, Am I asked to change my code to stick to something which is not
merged ?
On Mon, Apr 16, 2018 at 03:20:06PM +0530, Viresh Kumar wrote:
> On 16-04-18, 11:45, Daniel Lezcano wrote:
> > Can you elaborate a bit ? I'm not sure to get the point.
>
> Sure. With your current code on Hikey960 (big/LITTLE), you end up
> creating two cooling devices, one for the big cluster and one for
> small cluster. Which is the right thing to do, as we also have two
> cpufreq cooling devices.
>
> But with the change Sudeep is referring to, the helper you used to get
> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
> will end up creating a single cpuidle cooling device for all the CPUs.
> Which would be wrong.
>
> > BTW, Am I asked to change my code to stick to something which is not
> > merged ?
>
> Sudeep looked pretty confident on how the meaning of this routine is
> going to change very soon. I will let him respond on what guarantees
> we have that it will get merged :)
>
Viresh has already summarised it correctly. I don't have much to add.
Unless the system has multiple packages, this routine will return 0 (or
some fixed socket number; I don't think it can be different from 0, but
theoretically it can be anything, and it has to be the same for all the CPUs).
--
Regards,
Sudeep
On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:
> On 16/04/2018 12:10, Viresh Kumar wrote:
> > On 16-04-18, 12:03, Daniel Lezcano wrote:
> >> On 16/04/2018 11:50, Viresh Kumar wrote:
> >>> On 16-04-18, 11:45, Daniel Lezcano wrote:
> >>>> Can you elaborate a bit ? I'm not sure to get the point.
> >>>
> >>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
> >>> creating two cooling devices, one for the big cluster and one for
> >>> small cluster. Which is the right thing to do, as we also have two
> >>> cpufreq cooling devices.
> >>>
> >>> But with the change Sudeep is referring to, the helper you used to get
> >>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
> >>> will end up creating a single cpuidle cooling device for all the CPUs.
> >>> Which would be wrong.
> >>
> >> Is the semantic of topology_physical_package_id changing ?
> >
> > That's what I understood from his email.
> >
> >> I don't
> >> understand the change Sudeep is referring to.
>
> Actually there is no impact with the change Sudeep is referring to. It
> is for ACPI, we are DT based. Confirmed with Jeremy.
>
No, it will change for DT. The aim is to be consistent irrespective of
h/w or f/w description (i.e. ACPI or DT).
--
Regards,
Sudeep
On Mon, Apr 16, 2018 at 02:49:35PM +0200, Daniel Lezcano wrote:
> On 16/04/2018 14:31, Sudeep Holla wrote:
> > On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:
> >> On 16/04/2018 12:10, Viresh Kumar wrote:
> >>> On 16-04-18, 12:03, Daniel Lezcano wrote:
> >>>> On 16/04/2018 11:50, Viresh Kumar wrote:
> >>>>> On 16-04-18, 11:45, Daniel Lezcano wrote:
> >>>>>> Can you elaborate a bit ? I'm not sure to get the point.
> >>>>>
> >>>>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
> >>>>> creating two cooling devices, one for the big cluster and one for
> >>>>> small cluster. Which is the right thing to do, as we also have two
> >>>>> cpufreq cooling devices.
> >>>>>
> >>>>> But with the change Sudeep is referring to, the helper you used to get
> >>>>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
> >>>>> will end up creating a single cpuidle cooling device for all the CPUs.
> >>>>> Which would be wrong.
> >>>>
> >>>> Is the semantic of topology_physical_package_id changing ?
> >>>
> >>> That's what I understood from his email.
> >>>
> >>>> I don't
> >>>> understand the change Sudeep is referring to.
> >>
> >> Actually there is no impact with the change Sudeep is referring to. It
> >> is for ACPI, we are DT based. Confirmed with Jeremy.
> >>
> >
> > No, it will change for DT. The aim is to be consistent irrespective of
> h/w or f/w description (i.e. ACPI or DT).
>
> What will happen with the code using topology_physical_package_id ?
>
> drivers/acpi/processor_thermal.c: int id = topology_physical_package_id(cpu);
> drivers/acpi/processor_thermal.c: if (topology_physical_package_id(i) == id)
> drivers/acpi/processor_thermal.c: if (topology_physical_package_id(i) ==
> drivers/acpi/processor_thermal.c: topology_physical_package_id(cpu))
>
ACPI, and hence will have the right notion of package.
> drivers/block/mtip32xx/mtip32xx.c: topology_physical_package_id(cpumask_first(node_mask)),
>
Not sure if it was ever used on ARM/ARM64
> drivers/cpufreq/arm_big_little.c: return topology_physical_package_id(cpu);
Yes, this is the main affected driver. I don't think it's used by any ARM64
platform anymore. This is one of the reasons I divorced the SCPI driver from
this recently. ARM32 may still retain the old definition of package_id (not
sure; it depends on whether anyone needs to change that and the maintainers'
view on it). If not, we need to fix this driver along with the code changing
the definition.
>
> drivers/crypto/qat/qat_common/adf_common_drv.h: return topology_physical_package_id(smp_processor_id());
> drivers/crypto/virtio/virtio_crypto_common.h: node = topology_physical_package_id(cpu);
>
x86 specific and never assumed cluster id.
> kernel/events/core.c: event_pkg = topology_physical_package_id(event_cpu);
> kernel/events/core.c: local_pkg = topology_physical_package_id(local_cpu);
>
Generic and hence never assumed cluster id.
--
Regards,
Sudeep
On 16/04/2018 14:30, Lorenzo Pieralisi wrote:
> On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:
>> On 16/04/2018 12:10, Viresh Kumar wrote:
>>> On 16-04-18, 12:03, Daniel Lezcano wrote:
>>>> On 16/04/2018 11:50, Viresh Kumar wrote:
>>>>> On 16-04-18, 11:45, Daniel Lezcano wrote:
>>>>>> Can you elaborate a bit ? I'm not sure to get the point.
>>>>>
>>>>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
>>>>> creating two cooling devices, one for the big cluster and one for
>>>>> small cluster. Which is the right thing to do, as we also have two
>>>>> cpufreq cooling devices.
>>>>>
>>>>> But with the change Sudeep is referring to, the helper you used to get
>>>>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
>>>>> will end up creating a single cpuidle cooling device for all the CPUs.
>>>>> Which would be wrong.
>>>>
>>>> Is the semantic of topology_physical_package_id changing ?
>>>
>>> That's what I understood from his email.
>>>
>>>> I don't
>>>> understand the change Sudeep is referring to.
>>
>> Actually there is no impact with the change Sudeep is referring to. It
>> is for ACPI, we are DT based. Confirmed with Jeremy.
>>
>> So AFAICT, it is not a problem.
>
> It is a problem - DT or ACPI alike. Sudeep was referring to the notion
> of "cluster" that has no architectural meaning whatsoever and using
> topology_physical_package_id() to detect a "cluster" was/is/will always
> be the wrong thing to do. The notion of cluster must not appear in the
> kernel at all, it has no architectural meaning. I understand you need
> to group CPUs but that has to be done in a different way, through
> cooling devices, thermal domains or power domains DT/ACPI bindings but
> not by using topology masks.
I don't get it. What is the cluster concept defined in the ARM
documentation?
ARM Cortex-A53 MPCore Processor Technical Reference Manual
4.5.2. Multiprocessor Affinity Register
I see the documentation says:
A cluster with two cores, three cores, ...
How the kernel can represent that if you kill the
topology_physical_package_id() ?
> You should be able to figure out this week at OSPM the reasoning behind
> what I am saying above.
On Mon, Apr 16, 2018 at 03:57:03PM +0200, Daniel Lezcano wrote:
> On 16/04/2018 14:30, Lorenzo Pieralisi wrote:
> > On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:
> >> On 16/04/2018 12:10, Viresh Kumar wrote:
> >>> On 16-04-18, 12:03, Daniel Lezcano wrote:
> >>>> On 16/04/2018 11:50, Viresh Kumar wrote:
> >>>>> On 16-04-18, 11:45, Daniel Lezcano wrote:
> >>>>>> Can you elaborate a bit ? I'm not sure to get the point.
> >>>>>
> >>>>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
> >>>>> creating two cooling devices, one for the big cluster and one for
> >>>>> small cluster. Which is the right thing to do, as we also have two
> >>>>> cpufreq cooling devices.
> >>>>>
> >>>>> But with the change Sudeep is referring to, the helper you used to get
> >>>>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
> >>>>> will end up creating a single cpuidle cooling device for all the CPUs.
> >>>>> Which would be wrong.
> >>>>
> >>>> Is the semantic of topology_physical_package_id changing ?
> >>>
> >>> That's what I understood from his email.
> >>>
> >>>> I don't
> >>>> understand the change Sudeep is referring to.
> >>
> >> Actually there is no impact with the change Sudeep is referring to. It
> >> is for ACPI, we are DT based. Confirmed with Jeremy.
> >>
> >> So AFAICT, it is not a problem.
> >
> > It is a problem - DT or ACPI alike. Sudeep was referring to the notion
> > of "cluster" that has no architectural meaning whatsoever and using
> > topology_physical_package_id() to detect a "cluster" was/is/will always
> > be the wrong thing to do. The notion of cluster must not appear in the
> > kernel at all, it has no architectural meaning. I understand you need
> > to group CPUs but that has to be done in a different way, through
> > cooling devices, thermal domains or power domains DT/ACPI bindings but
> > not by using topology masks.
>
> I don't get it. What is the cluster concept defined in the ARM
> documentation?
>
> ARM Cortex-A53 MPCore Processor Technical Reference Manual
>
> 4.5.2. Multiprocessor Affinity Register
>
> I see the documentation says:
>
> A cluster with two cores, three cores, ...
>
> How the kernel can represent that if you kill the
> topology_physical_package_id() ?
From an Arm ARM perspective (ARM v8 reference manual), the MPIDR_EL1 has
no notion of cluster which means that a cluster is not architecturally
defined on Arm systems.
Currently, as Morten explained today, topology_physical_package_id()
is supposed to represent a "cluster" and that's completely wrong
because a "cluster" cannot be defined from an architectural perspective.
It was a bodge used as a shortcut, wrongly. We should have never used
that API for that purpose and there must be no code in the kernel
relying on:
topology_physical_package_id()
to define a cluster; the information you require to group CPUs must
come from something else, which is firmware bindings(DT or ACPI) as
I mentioned.
Please speak to Sudeep who will fill you on the reasoning above.
Thanks,
Lorenzo
On 16/04/2018 12:10, Viresh Kumar wrote:
> On 16-04-18, 12:03, Daniel Lezcano wrote:
>> On 16/04/2018 11:50, Viresh Kumar wrote:
>>> On 16-04-18, 11:45, Daniel Lezcano wrote:
>>>> Can you elaborate a bit ? I'm not sure to get the point.
>>>
>>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
>>> creating two cooling devices, one for the big cluster and one for
>>> small cluster. Which is the right thing to do, as we also have two
>>> cpufreq cooling devices.
>>>
>>> But with the change Sudeep is referring to, the helper you used to get
>>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
>>> will end up creating a single cpuidle cooling device for all the CPUs.
>>> Which would be wrong.
>>
>> Is the semantic of topology_physical_package_id changing ?
>
> That's what I understood from his email.
>
>> I don't
>> understand the change Sudeep is referring to.
Actually there is no impact with the change Sudeep is referring to. It
is for ACPI, we are DT based. Confirmed with Jeremy.
So AFAICT, it is not a problem.
On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:
> On 16/04/2018 12:10, Viresh Kumar wrote:
> > On 16-04-18, 12:03, Daniel Lezcano wrote:
> >> On 16/04/2018 11:50, Viresh Kumar wrote:
> >>> On 16-04-18, 11:45, Daniel Lezcano wrote:
> >>>> Can you elaborate a bit ? I'm not sure to get the point.
> >>>
> >>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
> >>> creating two cooling devices, one for the big cluster and one for
> >>> small cluster. Which is the right thing to do, as we also have two
> >>> cpufreq cooling devices.
> >>>
> >>> But with the change Sudeep is referring to, the helper you used to get
> >>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
> >>> will end up creating a single cpuidle cooling device for all the CPUs.
> >>> Which would be wrong.
> >>
> >> Is the semantic of topology_physical_package_id changing ?
> >
> > That's what I understood from his email.
> >
> >> I don't
> >> understand the change Sudeep is referring to.
>
> Actually there is no impact with the change Sudeep is referring to. It
> is for ACPI, we are DT based. Confirmed with Jeremy.
>
> So AFAICT, it is not a problem.
It is a problem - DT or ACPI alike. Sudeep was referring to the notion
of "cluster" that has no architectural meaning whatsoever and using
topology_physical_package_id() to detect a "cluster" was/is/will always
be the wrong thing to do. The notion of cluster must not appear in the
kernel at all, it has no architectural meaning. I understand you need
to group CPUs but that has to be done in a different way, through
cooling devices, thermal domains or power domains DT/ACPI bindings but
not by using topology masks.
You should be able to figure out this week at OSPM the reasoning behind
what I am saying above.
Cheers,
Lorenzo
On 16/04/2018 14:31, Sudeep Holla wrote:
> On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:
>> On 16/04/2018 12:10, Viresh Kumar wrote:
>>> On 16-04-18, 12:03, Daniel Lezcano wrote:
>>>> On 16/04/2018 11:50, Viresh Kumar wrote:
>>>>> On 16-04-18, 11:45, Daniel Lezcano wrote:
>>>>>> Can you elaborate a bit ? I'm not sure to get the point.
>>>>>
>>>>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
>>>>> creating two cooling devices, one for the big cluster and one for
>>>>> small cluster. Which is the right thing to do, as we also have two
>>>>> cpufreq cooling devices.
>>>>>
>>>>> But with the change Sudeep is referring to, the helper you used to get
>>>>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
>>>>> will end up creating a single cpuidle cooling device for all the CPUs.
>>>>> Which would be wrong.
>>>>
>>>> Is the semantic of topology_physical_package_id changing ?
>>>
>>> That's what I understood from his email.
>>>
>>>> I don't
>>>> understand the change Sudeep is referring to.
>>
>> Actually there is no impact with the change Sudeep is referring to. It
>> is for ACPI, we are DT based. Confirmed with Jeremy.
>>
>
> No, it will change for DT. The aim is to be consistent irrespective of
> h/w or f/w description (i.e. ACPI or DT).
What will happen with the code using topology_physical_package_id ?
drivers/acpi/processor_thermal.c: int id = topology_physical_package_id(cpu);
drivers/acpi/processor_thermal.c: if (topology_physical_package_id(i) == id)
drivers/acpi/processor_thermal.c: if (topology_physical_package_id(i) ==
drivers/acpi/processor_thermal.c:     topology_physical_package_id(cpu))
drivers/block/mtip32xx/mtip32xx.c: topology_physical_package_id(cpumask_first(node_mask)),
drivers/cpufreq/arm_big_little.c: return topology_physical_package_id(cpu);
drivers/crypto/qat/qat_common/adf_common_drv.h: return topology_physical_package_id(smp_processor_id());
drivers/crypto/virtio/virtio_crypto_common.h: node = topology_physical_package_id(cpu);
kernel/events/core.c: event_pkg = topology_physical_package_id(event_cpu);
kernel/events/core.c: local_pkg = topology_physical_package_id(local_cpu);
On 16/04/2018 16:22, Lorenzo Pieralisi wrote:
> On Mon, Apr 16, 2018 at 03:57:03PM +0200, Daniel Lezcano wrote:
>> On 16/04/2018 14:30, Lorenzo Pieralisi wrote:
>>> On Mon, Apr 16, 2018 at 02:10:30PM +0200, Daniel Lezcano wrote:
>>>> On 16/04/2018 12:10, Viresh Kumar wrote:
>>>>> On 16-04-18, 12:03, Daniel Lezcano wrote:
>>>>>> On 16/04/2018 11:50, Viresh Kumar wrote:
>>>>>>> On 16-04-18, 11:45, Daniel Lezcano wrote:
>>>>>>>> Can you elaborate a bit ? I'm not sure to get the point.
>>>>>>>
>>>>>>> Sure. With your current code on Hikey960 (big/LITTLE), you end up
>>>>>>> creating two cooling devices, one for the big cluster and one for
>>>>>>> small cluster. Which is the right thing to do, as we also have two
>>>>>>> cpufreq cooling devices.
>>>>>>>
>>>>>>> But with the change Sudeep is referring to, the helper you used to get
>>>>>>> cluster id will return 0 (SoC id) for all the 8 CPUs. So your code
>>>>>>> will end up creating a single cpuidle cooling device for all the CPUs.
>>>>>>> Which would be wrong.
>>>>>>
>>>>>> Is the semantic of topology_physical_package_id changing ?
>>>>>
>>>>> That's what I understood from his email.
>>>>>
>>>>>> I don't
>>>>>> understand the change Sudeep is referring to.
>>>>
>>>> Actually there is no impact with the change Sudeep is referring to. It
>>>> is for ACPI, we are DT based. Confirmed with Jeremy.
>>>>
>>>> So AFAICT, it is not a problem.
>>>
>>> It is a problem - DT or ACPI alike. Sudeep was referring to the notion
>>> of "cluster" that has no architectural meaning whatsoever and using
>>> topology_physical_package_id() to detect a "cluster" was/is/will always
>>> be the wrong thing to do. The notion of cluster must not appear in the
>>> kernel at all, it has no architectural meaning. I understand you need
>>> to group CPUs but that has to be done in a different way, through
>>> cooling devices, thermal domains or power domains DT/ACPI bindings but
>>> not by using topology masks.
>>
>> I don't get it. What is the cluster concept defined in the ARM
>> documentation?
>>
>> ARM Cortex-A53 MPCore Processor Technical Reference Manual
>>
>> 4.5.2. Multiprocessor Affinity Register
>>
>> I see the documentation says:
>>
>> A cluster with two cores, three cores, ...
>>
>> How the kernel can represent that if you kill the
>> topology_physical_package_id() ?
>
> From an Arm ARM perspective (ARM v8 reference manual), the MPIDR_EL1 has
> no notion of cluster which means that a cluster is not architecturally
> defined on Arm systems.
Sorry, I'm lost :/ You say the MPIDR_EL1 has no notion of cluster but
the documentation describing this register is all talking about cluster.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500g/BABHBJCI.html
> Currently, as Morten explained today, topology_physical_package_id()
> is supposed to represent a "cluster" and that's completely wrong
> because a "cluster" cannot be defined from an architectural perspective.
>
> It was a bodge used as a shortcut, wrongly. We should have never used
> that API for that purpose and there must be no code in the kernel
> relying on:
>
> topology_physical_package_id()
>
> to define a cluster; the information you require to group CPUs must
> come from something else, which is firmware bindings(DT or ACPI) as
> I mentioned.
Why not ?
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index c4f2d50..ac0776d 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -14,7 +14,8 @@ struct cpu_topology {

 extern struct cpu_topology cpu_topology[NR_CPUS];

-#define topology_physical_package_id(cpu)	(cpu_topology[cpu].cluster_id)
+#define topology_physical_package_id(cpu)	(0)
+#define topology_physical_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
> Please speak to Sudeep who will fill you on the reasoning above.
Yes, Sudeep is next to me but I would prefer to keep the discussion on
the mailing list so everyone can get the reasoning.
On Tue, Apr 17, 2018 at 09:17:36AM +0200, Daniel Lezcano wrote:
[...]
> >>>> Actually there is no impact with the change Sudeep is referring to. It
> >>>> is for ACPI, we are DT based. Confirmed with Jeremy.
> >>>>
> >>>> So AFAICT, it is not a problem.
> >>>
> >>> It is a problem - DT or ACPI alike. Sudeep was referring to the notion
> >>> of "cluster" that has no architectural meaning whatsoever and using
> >>> topology_physical_package_id() to detect a "cluster" was/is/will always
> >>> be the wrong thing to do. The notion of cluster must not appear in the
> >>> kernel at all, it has no architectural meaning. I understand you need
> >>> to group CPUs but that has to be done in a different way, through
> >>> cooling devices, thermal domains or power domains DT/ACPI bindings but
> >>> not by using topology masks.
> >>
> >> I don't get it. What is the cluster concept defined in the ARM
> >> documentation?
> >>
> >> ARM Cortex-A53 MPCore Processor Technical Reference Manual
> >>
> >> 4.5.2. Multiprocessor Affinity Register
> >>
> >> I see the documentation says:
> >>
> >> A cluster with two cores, three cores, ...
> >>
> >> How the kernel can represent that if you kill the
> >> topology_physical_package_id() ?
> >
> > From an Arm ARM perspective (ARM v8 reference manual), the MPIDR_EL1 has
> > no notion of cluster which means that a cluster is not architecturally
> > defined on Arm systems.
>
> Sorry, I'm lost :/ You say the MPIDR_EL1 has no notion of cluster but
> the documentation describing this register is all talking about cluster.
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500g/BABHBJCI.html
I pointed you at the documentation I am referring to. You are referring
to A53 TRM, I am referring to the Arm architecture reference manual that
is the reference for all Arm cores.
> > Currently, as Morten explained today, topology_physical_package_id()
> > is supposed to represent a "cluster" and that's completely wrong
> > because a "cluster" cannot be defined from an architectural perspective.
> >
> > It was a bodge used as a shortcut, wrongly. We should have never used
> > that API for that purpose and there must be no code in the kernel
> > relying on:
> >
> > topology_physical_package_id()
> >
> > to define a cluster; the information you require to group CPUs must
> > come from something else, which is firmware bindings(DT or ACPI) as
> > I mentioned.
>
> Why not ?
I explained why not :). A cluster is not defined architecturally on Arm
- it is as simple as that and you can't rely on a given MPIDR_EL1
subfield to define what a cluster id is.
> diff --git a/arch/arm64/include/asm/topology.h
> b/arch/arm64/include/asm/topology.h
> index c4f2d50..ac0776d 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -14,7 +14,8 @@ struct cpu_topology {
>
> extern struct cpu_topology cpu_topology[NR_CPUS];
>
> -#define topology_physical_package_id(cpu) (cpu_topology[cpu].cluster_id)
> +#define topology_physical_package_id(cpu) (0)
> +#define topology_physical_cluster_id(cpu) (cpu_topology[cpu].cluster_id)
There is no such thing (and there is no architecturally defined
package id on Arm either).
> #define topology_core_id(cpu) (cpu_topology[cpu].core_id)
> #define topology_core_cpumask(cpu) (&cpu_topology[cpu].core_sibling)
> #define topology_sibling_cpumask(cpu) (&cpu_topology[cpu].thread_sibling)
>
>
> > Please speak to Sudeep who will fill you on the reasoning above.
>
> Yes, Sudeep is next to me but I would prefer to keep the discussion on
> the mailing list so everyone can get the reasoning.
It is not a reasoning - it is the Arm architecture. There is no
architecturally defined cluster id on Arm. The affinity bits in
MPIDR_EL1 must be treated as a unique number that represents a given
core/thread, how the bits are allocated across affinity levels is not
something that you can rely on architecturally - that's why DT/ACPI
topology bindings exist to group cpus in a hierarchical topology.
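For reference, this is roughly how the affinity fields are read on arm64
(a sketch using the MPIDR accessors from asm/cputype.h):

	u64 mpidr = read_cpuid_mpidr();
	unsigned int aff0 = MPIDR_AFFINITY_LEVEL(mpidr, 0);
	unsigned int aff1 = MPIDR_AFFINITY_LEVEL(mpidr, 1);
	unsigned int aff2 = MPIDR_AFFINITY_LEVEL(mpidr, 2);

	/*
	 * How cores are spread across aff1/aff2 is implementation
	 * defined, so no single level can be assumed to be "the
	 * cluster".
	 */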
HTH,
Lorenzo
---
On 05-04-18, 18:16, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all
> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>
> Each cluster has its own idle cooling device, each core has its own idle
> injection thread, each idle injection thread uses play_idle to enter idle. In
> order to reach the deepest idle state, each cooling device has the idle
> injection threads synchronized together.
>
> It has some similarity with the intel power clamp driver but it is actually
> designed to work on the ARM architecture via the DT with a mathematical proof
> with the power model which comes with the Documentation.
>
> The idle injection cycle is fixed while the running cycle is variable. That
> allows to have control on the device reactivity for the user experience. At
> the mitigation point the idle threads are unparked, they play idle the
> specified amount of time and they schedule themselves. The last thread sets
> the next idle injection deadline and when the timer expires it wakes up all
> the threads which in turn play idle again. Meanwhile the running cycle is
> changed by set_cur_state. When the mitigation ends, the threads are parked.
> The algorithm is self adaptive, so there is no need to handle hotplugging.
>
> If we take an example of the balanced point, we can use the DT for the hi6220.
>
> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
> running at full blast at the maximum OPP consumes 5280mW. The first value is
> given in the DT, the second is calculated from the OPP with the formula:
>
> Pdyn = Cdyn x Voltage^2 x Frequency
>
> As the SoC vendors don't want to share the static leakage values, we assume
> it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
>
> In order to reduce the power to 3326mW, we have to apply a ratio to the
> running time.
>
> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874
>
> We know the idle cycle which is fixed, let's assume 10ms. However from this
> duration we have to substract the wake up latency for the cluster idle state.
> In our case, it is 1.5ms. So for a 10ms latency for idle, we are really idle
> 8.5ms.
>
> As we know the idle duration and the ratio, we can compute the running cycle.
>
> running_cycle = 8.5 / 0.5874 = 14.47ms
>
> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
> SoC to the balanced trip point of 75°C.
>
> The driver has been tested on the hi6220 and it appears the temperature
> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and
> running cycle of 14ms as expected by the theory above.
>
> Signed-off-by: Kevin Wangtao <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> drivers/thermal/Kconfig | 10 +
> drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/cpu_cooling.h | 6 +
> 3 files changed, 495 insertions(+)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 5aaae1b..6c34117 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
> This will be useful for platforms using the generic thermal interface
> and not the ACPI interface.
>
> +config CPU_IDLE_THERMAL
> + bool "CPU idle cooling strategy"
> + depends on CPU_IDLE
> + help
> + This implements the generic CPU cooling mechanism through
> + idle injection. This will throttle the CPU by injecting
> + fixed idle cycles. All CPUs belonging to the same cluster
> + will enter idle synchronously to reach the deepest idle
> + state.
> +
> endchoice
>
> config CLOCK_THERMAL
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 5c219dc..1eec8d6 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -10,18 +10,33 @@
> * Viresh Kumar <[email protected]>
> *
> */
> +#define pr_fmt(fmt) "CPU cooling: " fmt
> +
> #include <linux/module.h>
> #include <linux/thermal.h>
> #include <linux/cpufreq.h>
> +#include <linux/cpuidle.h>
> #include <linux/err.h>
> +#include <linux/freezer.h>
> #include <linux/idr.h>
> +#include <linux/kthread.h>
> #include <linux/pm_opp.h>
> #include <linux/slab.h>
> +#include <linux/sched/prio.h>
> +#include <linux/sched/rt.h>
> +#include <linux/smpboot.h>
> #include <linux/cpu.h>
> #include <linux/cpu_cooling.h>
>
> +#include <linux/ratelimit.h>
> +
> +#include <linux/platform_device.h>
> +#include <linux/of_platform.h>
> +
> #include <trace/events/thermal.h>
>
> +#include <uapi/linux/sched/types.h>
> +
> #ifdef CONFIG_CPU_FREQ_THERMAL
> /*
> * Cooling state <-> CPUFreq frequency
> @@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
> }
> EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
> #endif /* CONFIG_CPU_FREQ_THERMAL */
> +
> +#ifdef CONFIG_CPU_IDLE_THERMAL
> +/**
> + * struct cpuidle_cooling_device - data for the idle cooling device
> + * @cdev: a pointer to a struct thermal_cooling_device
> + * @cpumask: a cpumask containing the CPU managed by the cooling device
> + * @timer: a hrtimer giving the tempo for the idle injection cycles
> + * @kref: a kernel refcount on this structure
> + * @count: an atomic to keep track of the last task exiting the idle cycle
> + * @idle_cycle: an integer defining the duration of the idle injection
> + * @state: a normalized integer giving the state of the cooling device
> + */
> +struct cpuidle_cooling_device {
> + struct thermal_cooling_device *cdev;
> + struct cpumask *cpumask;
> + struct hrtimer timer;
> + struct kref kref;
> + atomic_t count;
> + unsigned int idle_cycle;
> + unsigned long state;
> +};
> +
> +struct cpuidle_cooling_thread {
> + struct task_struct *tsk;
> + int should_run;
> +};
> +
> +static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);
> +static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);
> +
> +/**
> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
> + * @idle_cdev: the idle cooling device
> + *
> + * Every idle injection task belonging to the idle cooling device and
> + * running on an online cpu will be woken up by this call.
> + */
> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
> +{
> + struct cpuidle_cooling_thread *cct;
> + int cpu;
> +
> + for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
> + cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);
> + cct->should_run = 1;
> + wake_up_process(cct->tsk);
> + }
> +}
> +
> +/**
> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
> + * @timer: a hrtimer structure
> + *
> + * When the mitigation is acting, the CPU is allowed to run an amount
> + * of time, then the idle injection happens for the specified delay
> + * and the idle task injection schedules itself until the timer event
> + * wakes the idle injection tasks again for a new idle injection
> + * cycle. The time between the end of the idle injection and the timer
> + * expiration is the allocated running time for the CPU.
> + *
> + * Always returns HRTIMER_NORESTART
> + */
> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
> +{
> + struct cpuidle_cooling_device *idle_cdev =
> + container_of(timer, struct cpuidle_cooling_device, timer);
> +
> + cpuidle_cooling_wakeup(idle_cdev);
> +
> + return HRTIMER_NORESTART;
> +}
> +
> +/**
> + * cpuidle_cooling_runtime - Running time computation
> + * @idle_cdev: the idle cooling device
> + *
> + * The running duration is computed from the idle injection duration
> + * which is fixed. If we reach 100% of idle injection ratio, that
> + * means the running duration is zero. If we have a 50% ratio
> + * injection, that means we have equal duration for idle and for
> + * running duration.
> + *
> + * The formula is deduced as the following:
> + *
> + * running = idle x ((100 / ratio) - 1)
> + *
> + * For precision purposes with integer math, we use the following:
> + *
> + * running = (idle x 100) / ratio - idle
> + *
> + * For example, if we have an injection ratio of 50%, then we end up
> + * with 10ms of idle injection and 10ms of running duration.
> + *
> + * Returns the running duration in nanoseconds as a s64
> + */
> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
> +{
> + s64 next_wakeup;
> + unsigned long state = idle_cdev->state;
> +
> + /*
> + * The function should not be called when there is no
> + * mitigation because:
> + * - that does not make sense
> + * - we end up with a division by zero
> + */
> + if (!state)
> + return 0;
> +
> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
> + idle_cdev->idle_cycle;
> +
> + return next_wakeup * NSEC_PER_USEC;
> +}
> +
There is a bug in your calculation formula here when "state" becomes 100.
You return 0 for the injection rate, which is the same as "rate" being 0,
which is dangerous. You stop cooling when it's most necessary :)
I'm not sure how much sense being 100% idle really makes, so when testing
this I just say if (state == 100) { state = 99; }. Anyway, just don't return 0.
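Concretely, the workaround I'm testing with looks like this (only my local
band-aid on top of the function quoted above, not necessarily the right fix):

	static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
	{
		s64 next_wakeup;
		unsigned long state = idle_cdev->state;

		/* No mitigation requested, nothing to compute */
		if (!state)
			return 0;

		/*
		 * Clamp a 100% injection request so the computed running
		 * time stays non-zero and the caller cannot mistake it for
		 * "no mitigation".
		 */
		if (state == 100)
			state = 99;

		next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
			      idle_cdev->idle_cycle;

		return next_wakeup * NSEC_PER_USEC;
	}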
Daniel, thanks a lot for these additions! Could you send an update of this?
btw, that's what I'm referring to:
https://lore.kernel.org/linux-pm/[email protected]/
I know it's a little old already, but it seems like there hasn't been any
equivalent solution in the meantime, has there?
Using cpuidle for cooling is way more effective than cpufreq (which often
hardly is).
thanks again,
martin
On 05.08.19 07:11, Martin Kepplinger wrote:
> ---
>
> On 05-04-18, 18:16, Daniel Lezcano wrote:
>> The cpu idle cooling driver performs synchronized idle injection across all
>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>>
>> Each cluster has its own idle cooling device, each core has its own idle
>> injection thread, each idle injection thread uses play_idle to enter idle. In
>> order to reach the deepest idle state, each cooling device has the idle
>> injection threads synchronized together.
>>
>> It has some similarity with the intel power clamp driver but it is actually
>> designed to work on the ARM architecture via the DT with a mathematical proof
>> with the power model which comes with the Documentation.
>>
>> The idle injection cycle is fixed while the running cycle is variable. That
>> allows control over the device reactivity for the user experience. At
>> the mitigation point the idle threads are unparked, they play idle the
>> specified amount of time and they schedule themselves. The last thread sets
>> the next idle injection deadline and when the timer expires it wakes up all
>> the threads which in turn play idle again. Meanwhile the running cycle is
>> changed by set_cur_state. When the mitigation ends, the threads are parked.
>> The algorithm is self-adaptive, so there is no need to handle hotplugging.
>>
>> If we take an example of the balanced point, we can use the DT for the hi6220.
>>
>> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
>> running at full blast at the maximum OPP consume 5280mW. The first value is
>> given in the DT, the second is calculated from the OPP with the formula:
>>
>> Pdyn = Cdyn x Voltage^2 x Frequency
>>
>> As the SoC vendors don't want to share the static leakage values, we assume
>> it is zero, so Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
>>
>> In order to reduce the power to 3326mW, we have to apply a ratio to the
>> running time.
>>
>> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874
>>
>> We know the idle cycle, which is fixed; let's assume 10ms. However, from this
>> duration we have to subtract the wake-up latency for the cluster idle state.
>> In our case, it is 1.5ms. So for a 10ms idle duration, we are really idle for
>> 8.5ms.
>>
>> As we know the idle duration and the ratio, we can compute the running cycle.
>>
>> running_cycle = 8.5 / 0.5874 = 14.47ms
>>
>> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
>> SoC to the balanced trip point of 75°C.
>>
>> The driver has been tested on the hi6220 and it appears the temperature
>> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and
>> running cycle of 14ms as expected by the theory above.
>>
>> Signed-off-by: Kevin Wangtao <[email protected]>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>> ---
>> drivers/thermal/Kconfig | 10 +
>> drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++
>> include/linux/cpu_cooling.h | 6 +
>> 3 files changed, 495 insertions(+)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 5aaae1b..6c34117 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
>> This will be useful for platforms using the generic thermal interface
>> and not the ACPI interface.
>>
>> +config CPU_IDLE_THERMAL
>> + bool "CPU idle cooling strategy"
>> + depends on CPU_IDLE
>> + help
>> + This implements the generic CPU cooling mechanism through
>> + idle injection. This will throttle the CPU by injecting
>> + fixed idle cycles. All CPUs belonging to the same cluster
>> + will enter idle synchronously to reach the deepest idle
>> + state.
>> +
>> endchoice
>>
>> config CLOCK_THERMAL
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 5c219dc..1eec8d6 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -10,18 +10,33 @@
>> * Viresh Kumar <[email protected]>
>> *
>> */
>> +#define pr_fmt(fmt) "CPU cooling: " fmt
>> +
>> #include <linux/module.h>
>> #include <linux/thermal.h>
>> #include <linux/cpufreq.h>
>> +#include <linux/cpuidle.h>
>> #include <linux/err.h>
>> +#include <linux/freezer.h>
>> #include <linux/idr.h>
>> +#include <linux/kthread.h>
>> #include <linux/pm_opp.h>
>> #include <linux/slab.h>
>> +#include <linux/sched/prio.h>
>> +#include <linux/sched/rt.h>
>> +#include <linux/smpboot.h>
>> #include <linux/cpu.h>
>> #include <linux/cpu_cooling.h>
>>
>> +#include <linux/ratelimit.h>
>> +
>> +#include <linux/platform_device.h>
>> +#include <linux/of_platform.h>
>> +
>> #include <trace/events/thermal.h>
>>
>> +#include <uapi/linux/sched/types.h>
>> +
>> #ifdef CONFIG_CPU_FREQ_THERMAL
>> /*
>> * Cooling state <-> CPUFreq frequency
>> @@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>> }
>> EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
>> #endif /* CONFIG_CPU_FREQ_THERMAL */
>> +
>> +#ifdef CONFIG_CPU_IDLE_THERMAL
>> +/**
>> + * struct cpuidle_cooling_device - data for the idle cooling device
>> + * @cdev: a pointer to a struct thermal_cooling_device
>> + * @cpumask: a cpumask containing the CPUs managed by the cooling device
>> + * @timer: a hrtimer giving the tempo for the idle injection cycles
>> + * @kref: a kernel refcount on this structure
>> + * @count: an atomic to keep track of the last task exiting the idle cycle
>> + * @idle_cycle: an integer defining the duration of the idle injection
>> + * @state: a normalized integer giving the state of the cooling device
>> + */
>> +struct cpuidle_cooling_device {
>> + struct thermal_cooling_device *cdev;
>> + struct cpumask *cpumask;
>> + struct hrtimer timer;
>> + struct kref kref;
>> + atomic_t count;
>> + unsigned int idle_cycle;
>> + unsigned long state;
>> +};
>> +
>> +struct cpuidle_cooling_thread {
>> + struct task_struct *tsk;
>> + int should_run;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);
>> +static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);
>> +
>> +/**
>> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
>> + * @idle_cdev: the idle cooling device
>> + *
>> + * Every idle injection task belonging to the idle cooling device and
>> + * running on an online cpu will be woken up by this call.
>> + */
>> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> + struct cpuidle_cooling_thread *cct;
>> + int cpu;
>> +
>> + for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
>> + cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);
>> + cct->should_run = 1;
>> + wake_up_process(cct->tsk);
>> + }
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
>> + * @timer: a hrtimer structure
>> + *
>> + * When the mitigation is acting, the CPU is allowed to run an amount
>> + * of time, then the idle injection happens for the specified delay
>> + * and the idle injection tasks schedule themselves until the timer event
>> + * wakes the idle injection tasks again for a new idle injection
>> + * cycle. The time between the end of the idle injection and the timer
>> + * expiration is the allocated running time for the CPU.
>> + *
>> + * Always returns HRTIMER_NORESTART
>> + */
>> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev =
>> + container_of(timer, struct cpuidle_cooling_device, timer);
>> +
>> + cpuidle_cooling_wakeup(idle_cdev);
>> +
>> + return HRTIMER_NORESTART;
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_runtime - Running time computation
>> + * @idle_cdev: the idle cooling device
>> + *
>> + * The running duration is computed from the idle injection duration
>> + * which is fixed. If we reach 100% of idle injection ratio, that
>> + * means the running duration is zero. If we have a 50% injection
>> + * ratio, that means the idle and running durations are equal.
>> + *
>> + * The formula is deduced as follows:
>> + *
>> + * running = idle x ((100 / ratio) - 1)
>> + *
>> + * For precision purposes with integer math, we use the following:
>> + *
>> + * running = (idle x 100) / ratio - idle
>> + *
>> + * For example, with a fixed idle cycle of 10ms and an injection
>> + * ratio of 50%, we end up with 10ms of idle injection and 10ms of
>> + * running duration.
>> + *
>> + * Returns the running duration in nanoseconds as an s64
>> + */
>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> + s64 next_wakeup;
>> + unsigned long state = idle_cdev->state;
>> +
>> + /*
>> + * The function should not be called when there is no
>> + * mitigation because:
>> + * - that does not make sense
>> + * - we end up with a division by zero
>> + */
>> + if (!state)
>> + return 0;
>> +
>> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
>> + idle_cdev->idle_cycle;
>> +
>> + return next_wakeup * NSEC_PER_USEC;
>> +}
>> +
>
> There is a bug in your calculation formula here when "state" becomes 100.
> You return 0 for the injection rate, which is the same as "rate" being 0,
> which is dangerous. You stop cooling when it's most necessary :)
>
> I'm not sure how much sense being 100% idle really makes, so when testing
> this I just say if (state == 100) { state = 99; }. Anyway, just don't return 0.
>
oh and also, this breaks S3 suspend:
Aug 5 06:09:20 pureos kernel: [ 807.487887] PM: suspend entry (deep)
Aug 5 06:09:40 pureos kernel: [ 807.501148] Filesystems sync: 0.013 seconds
Aug 5 06:09:40 pureos kernel: [ 807.501591] Freezing user space processes ... (elapsed 0.003 seconds) done.
Aug 5 06:09:40 pureos kernel: [ 807.504741] OOM killer disabled.
Aug 5 06:09:40 pureos kernel: [ 807.504744] Freezing remaining freezable tasks ...
Aug 5 06:09:40 pureos kernel: [ 827.517712] Freezing of tasks failed after 20.002 seconds (4 tasks refusing to freeze, wq_busy=0):
Aug 5 06:09:40 pureos kernel: [ 827.527122] thermal-idle/0 S 0 161 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527131] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527148] __switch_to+0xb4/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527156] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527162] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527169] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527176] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527182] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527186] thermal-idle/1 S 0 162 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527192] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527197] __switch_to+0x188/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527203] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527208] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527213] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527218] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527222] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527226] thermal-idle/2 S 0 163 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527231] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527237] __switch_to+0xb4/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527242] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527247] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527259] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527264] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527268] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527272] thermal-idle/3 S 0 164 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527278] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527283] __switch_to+0xb4/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527288] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527293] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527298] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527303] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527308] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527375] Restarting kernel threads ... done.
Aug 5 06:09:40 pureos kernel: [ 827.527771] OOM killer enabled.
Aug 5 06:09:40 pureos kernel: [ 827.527772] Restarting tasks ... done.
Aug 5 06:09:40 pureos kernel: [ 827.528926] PM: suspend exit
do you know where things might go wrong here?
thanks,
martin
On 05/08/2019 07:11, Martin Kepplinger wrote:
> ---
[ ... ]
>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> + s64 next_wakeup;
>> + unsigned long state = idle_cdev->state;
>> +
>> + /*
>> + * The function should not be called when there is no
>> + * mitigation because:
>> + * - that does not make sense
>> + * - we end up with a division by zero
>> + */
>> + if (!state)
>> + return 0;
>> +
>> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
>> + idle_cdev->idle_cycle;
>> +
>> + return next_wakeup * NSEC_PER_USEC;
>> +}
>> +
>
> There is a bug in your calculation formula here when "state" becomes 100.
> You return 0 for the injection rate, which is the same as "rate" being 0,
> which is dangerous. You stop cooling when it's most necessary :)
Right, thanks for spotting this.
> I'm not sure how much sense being 100% idle really makes, so when testing
> this I just say if (state == 100) { state = 99; }. Anyway, just don't return 0.
>
> Daniel, thanks a lot for these additions! Could you send an update of this?
Yes, I'm working on a new version.
> btw, that's what I'm referring to:
> https://lore.kernel.org/linux-pm/[email protected]/
> I know it's a little old already, but it seems like there hasn't been any
> equivalent solution in the meantime, has there?
>
> Using cpuidle for cooling is way more effective than cpufreq (which often
> hardly is).
On which platform does that happen?
On 05/08/2019 08:53, Martin Kepplinger wrote:
[ ... ]
>>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>>> +{
>>> + s64 next_wakeup;
>>> + unsigned long state = idle_cdev->state;
>>> +
>>> + /*
>>> + * The function should not be called when there is no
>>> + * mitigation because:
>>> + * - that does not make sense
>>> + * - we end up with a division by zero
>>> + */
>>> + if (!state)
>>> + return 0;
>>> +
>>> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
>>> + idle_cdev->idle_cycle;
>>> +
>>> + return next_wakeup * NSEC_PER_USEC;
>>> +}
>>> +
>>
>> There is a bug in your calculation formula here when "state" becomes 100.
>> You return 0 for the injection rate, which is the same as "rate" being 0,
>> which is dangerous. You stop cooling when it's most necessary :)
>>
>> I'm not sure how much sense being 100% idle really makes, so when testing
>> this I just say if (state == 100) { state = 99; }. Anyway, just don't return 0.
>>
>
> oh and also, this breaks S3 suspend:
What breaks the S3 suspend? The idle cooling device or the bug above?
[ ... ]
On 05.08.19 09:37, Daniel Lezcano wrote:
> On 05/08/2019 07:11, Martin Kepplinger wrote:
>> ---
>
> [ ... ]
>>
>> There is a bug in your calculation formula here when "state" becomes 100.
>> You return 0 for the injection rate, which is the same as "rate" being 0,
>> which is dangerous. You stop cooling when it's most necessary :)
>
> Right, thanks for spotting this.
>
>> I'm not sure how much sense being 100% idle really makes, so when testing
>> this I just say if (state == 100) { state = 99; }. Anyway, just don't return 0.
>>
>> Daniel, thanks a lot for these additions! Could you send an update of this?
>
> Yes, I'm working on a new version.
great.
>
>> btw, that's what I'm referring to:
>> https://lore.kernel.org/linux-pm/[email protected]/
>> I know it's a little old already, but it seems like there hasn't been any
>> equivalent solution in the meantime, has there?
>>
>> Using cpuidle for cooling is way more effective than cpufreq (which often
>> hardly is).
>
> On which platform does that happen?
>
>
I'm running this on imx8mq.
thanks,
martin
On 05.08.19 09:39, Daniel Lezcano wrote:
> On 05/08/2019 08:53, Martin Kepplinger wrote:
>
> [ ... ]
>>>
>>> There is a bug in your calculation formula here when "state" becomes 100.
>>> You return 0 for the injection rate, which is the same as "rate" being 0,
>>> which is dangerous. You stop cooling when it's most necessary :)
>>>
>>> I'm not sure how much sense being 100% idle really makes, so when testing
>>> this I just say if (state == 100) { state = 99; }. Anyway, just don't return 0.
>>>
>>
>> oh and also, this breaks S3 suspend:
>
> What breaks the S3 suspend? The idle cooling device or the bug above?
The idle cooling device. I have to configure it out (remove
CONFIG_CPU_IDLE_THERMAL) to get suspend/resume working again. The
errors are in the kernel log quoted earlier.
[ ... ]
On 05/08/2019 09:42, Martin Kepplinger wrote:
> On 05.08.19 09:39, Daniel Lezcano wrote:
>> On 05/08/2019 08:53, Martin Kepplinger wrote:
>>
>> [ ... ]
>>>>
>>>> There is a bug in your calculation formula here when "state" becomes 100.
>>>> You return 0 for the injection rate, which is the same as "rate" being 0,
>>>> which is dangerous. You stop cooling when it's most necessary :)
>>>>
>>>> I'm not sure how much sense being 100% idle really makes, so when testing
>>>> this I just say if (state == 100) { state = 99; }. Anyway, just don't return 0.
>>>>
>>>
>>> oh and also, this breaks S3 suspend:
>>
>> What breaks the S3 suspend? The idle cooling device or the bug above?
>
> The idle cooling device. I have to configure it out (remove
> CONFIG_CPU_IDLE_THERMAL) to get suspend/resume working again. The
> errors are in the kernel log quoted earlier.
Ok, thanks for reporting. I'll fix the issue.
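A rough sketch of one direction I have in mind (assuming a struct
notifier_block 'pm_nb' is added to struct cpuidle_cooling_device and
registered with register_pm_notifier(); the final fix may look different):

	#include <linux/suspend.h>

	static int cpuidle_cooling_pm_notify(struct notifier_block *nb,
					     unsigned long mode, void *unused)
	{
		struct cpuidle_cooling_device *idle_cdev =
			container_of(nb, struct cpuidle_cooling_device, pm_nb);

		switch (mode) {
		case PM_SUSPEND_PREPARE:
			/* Quiesce the injection cycle before tasks are frozen */
			hrtimer_cancel(&idle_cdev->timer);
			break;
		case PM_POST_SUSPEND:
			/* Restart the cycle if a mitigation level is still set */
			if (idle_cdev->state)
				cpuidle_cooling_wakeup(idle_cdev);
			break;
		}

		return NOTIFY_OK;
	}

That alone does not make the smpboot threads freezable, but it keeps the
timer from waking them in the middle of the freezing window.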
[ ... ]
Hi Martin,
On 25/10/2019 13:22, Martin Kepplinger wrote:
> Hi Daniel,
>
> Quick note: You're missing a kref_init() in this version of the cooling driver.
>
> Also, as already mentioned, set_freezable() doesn't seem to work without
> supporting suspend/resume.
Right, it is removed.
> And since cpuidle got pretty much completely reworked in 5.4, could you create
> a new, rebased version of this? Let me know in case I missed an earlier update.
Yes, I'm on it. There is a bug in the cpuidle registration that I'm
investigating right now. If you want, I can put up a branch with the idle
injection cooling device as soon as it is fixed.
Note that most of the patches are now merged. The cpuidle cooling device
was split into the idle injection mechanism in drivers/powercap [1] and
the cooling device initialization, which is the part still to be merged.
The idle injection resolution was increased [2].
The idle state injection selection is currently in the review process [3].
The remaining part is the thermal cooling device itself, which is a
separate cooling device from the existing cpu one.
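For reference, the powercap idle injection API from [1] is used roughly
like this (a minimal sketch, error handling omitted, and the durations are
only example values):

	#include <linux/idle_inject.h>

	struct idle_inject_device *ii_dev;

	/* one injection device per cluster cpumask */
	ii_dev = idle_inject_register(cpumask);

	/* e.g. 40ms running and 10ms idle per cycle */
	idle_inject_set_duration(ii_dev, 40000, 10000);

	idle_inject_start(ii_dev);

	/* ... when the mitigation ends ... */
	idle_inject_stop(ii_dev);
	idle_inject_unregister(ii_dev);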
-- Daniel
[1] https://lwn.net/Articles/757393/
[2] https://patchwork.kernel.org/patch/11074003/
[3] https://lkml.org/lkml/2019/9/9/447
Hi Daniel,
Quick note: You're missing a kref_init() in this version of the cooling driver.
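For illustration, roughly what I mean (the 'kref' field is the one in
struct cpuidle_cooling_device; the release callback name here is only an
example of the pairing):

	/* at device creation, so the refcount starts at 1 */
	kref_init(&idle_cdev->kref);

	/* at teardown, free the device on the last put */
	kref_put(&idle_cdev->kref, cpuidle_cooling_release);

Without the kref_init(), the first kref_put() operates on an uninitialized
refcount.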
Also, as already mentioned, set_freezable() doesn't seem to work without
supporting suspend/resume.
And since cpuidle got pretty much completely reworked in 5.4, could you create
a new, rebased version of this? Let me know in case I missed an earlier update.
thanks,
martin
---
On 25.10.19 16:45, Daniel Lezcano wrote:
[ ... ]
>> And since cpuidle got pretty much completely reworked in 5.4, could you create
>> a new, rebased version of this? Let me know in case I missed an earlier update.
>
> Yes, I'm on it. There is a bug in the cpuidle registration that I'm
> investigating right now. If you want, I can put up a branch with the idle
> injection cooling device as soon as it is fixed.
[ ... ]
Hi Daniel,
Putting up a branch may make sense and seems like it would make testing
easy. I'll definitely try to test early as soon as you have the pieces
in place one way or another :)
thanks for your work on this,
martin
On 26/10/2019 20:23, Martin Kepplinger wrote:
[ ... ]
> Putting up a branch may make sense and seems like it would make testing
> easy. I'll definitely try to test early as soon as you have the pieces
> in place one way or another :)
https://git.linaro.org/people/daniel.lezcano/linux.git/log/?h=thermal/idle-cooling
The Kconfig options must be set:
CONFIG_POWERCAP=y
CONFIG_IDLE_INJECT=y
CONFIG_CPU_IDLE_THERMAL=y
The DT must be changed to add the trip point pointing to the idle state.
- Daniel