From: Vincent Guittot
Date: Wed, 31 Jan 2018 10:01:21 +0100
Subject: Re: [PATCH 5/8] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver
To: Daniel Lezcano
Cc: Eduardo Valentin, Kevin Wangtao, Leo Yan, Amit Kachhap, viresh kumar,
    linux-kernel, Zhang Rui, Javi Merino, "open list:THERMAL"
In-Reply-To: <1516721671-16360-6-git-send-email-daniel.lezcano@linaro.org>
References: <1516721671-16360-1-git-send-email-daniel.lezcano@linaro.org>
            <1516721671-16360-6-git-send-email-daniel.lezcano@linaro.org>

Hi Daniel,

On 23 January 2018 at 16:34, Daniel Lezcano wrote:
> The cpu idle cooling driver performs synchronized idle injection across all
> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>
> Each cluster has its own idle cooling device, each core has its own idle
> injection thread, each idle injection thread uses play_idle to enter idle. In
> order to reach the deepest idle state, each cooling device has the idle
> injection threads synchronized together.
>
> It has some similarity with the intel power clamp driver but it is actually
> designed to work on the ARM architecture via the DT with a mathematical proof
> with the power model which comes with the Documentation.
>
> The idle injection cycle is fixed while the running cycle is variable. That
> allows to have control on the device reactivity for the user experience. At
> the mitigation point the idle threads are unparked, they play idle the
> specified amount of time and they schedule themselves. The last thread sets
> the next idle injection deadline and when the timer expires it wakes up all
> the threads which in turn play idle again. Meanwhile the running cycle is
> changed by set_cur_state. When the mitigation ends, the threads are parked.
> The algorithm is self adaptive, so there is no need to handle hotplugging.
>
> If we take an example of the balanced point, we can use the DT for the hi6220.
>
> The sustainable power for the SoC is 3326mW to mitigate at 75°C. Eight cores
> running at full blast at the maximum OPP consumes 5280mW. The first value is
> given in the DT, the second is calculated from the OPP with the formula:
>
> Pdyn = Cdyn x Voltage^2 x Frequency
>
> As the SoC vendors don't want to share the static leakage values, we assume
> it is zero, so the Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
>
> In order to reduce the power to 3326mW, we have to apply a ratio to the
> running time.
>
> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874
>
> We know the idle cycle which is fixed, let's assume 10ms. However from this
> duration we have to substract the wake up latency for the cluster idle state.
> In our case, it is 1.5ms. So for a 10ms latency for idle, we are really idle
> 8.5ms.
>
> As we know the idle duration and the ratio, we can compute the running cycle.
>
> running_cycle = 8.5 / 0.5874 = 14.47ms
>
> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
> SoC to the balanced trip point of 75°C.
>
> The driver has been tested on the hi6220 and it appears the temperature
> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and
> running cycle of 14ms as expected by the theory above.
>
> Signed-off-by: Kevin WangTao
> Signed-off-by: Daniel Lezcano
> ---
>  drivers/thermal/Kconfig       |  10 +
>  drivers/thermal/cpu_cooling.c | 471 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/cpu_cooling.h   |   6 +
>  3 files changed, 487 insertions(+)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 925e73b..4bd4be7 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
>           This will be useful for platforms using the generic thermal interface
>           and not the ACPI interface.
>
> +config CPU_IDLE_THERMAL
> +       bool "CPU idle cooling strategy"
> +       depends on CPU_IDLE
> +       help
> +         This implements the generic CPU cooling mechanism through
> +         idle injection. This will throttle the CPU by injecting
> +         fixed idle cycle. All CPUs belonging to the same cluster
> +         will enter idle synchronously to reach the deepest idle
> +         state.
> +
>  endchoice
>
>  config CLOCK_THERMAL
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index d05bb73..916a627 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -10,18 +10,33 @@
>   *             Viresh Kumar
>   *
>   */
> +#undef DEBUG
> +#define pr_fmt(fmt) "CPU cooling: " fmt
> +
>  #include
>  #include
>  #include
> +#include
>  #include
> +#include
>  #include
> +#include
>  #include
>  #include
> +#include
> +#include
>  #include
>  #include
> +#include
> +
> +#include
> +#include
>
>  #include
>
> +#include
> +
> +#ifdef CONFIG_CPU_FREQ_THERMAL
>  /*
>   * Cooling state <-> CPUFreq frequency
>   *
> @@ -926,3 +941,459 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>         kfree(cpufreq_cdev);
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
> +
> +#endif /* CPU_FREQ_THERMAL */
> +
> +#ifdef CONFIG_CPU_IDLE_THERMAL
> +/*
> + * The idle duration injection. As we don't have yet a way to specify
> + * from the DT configuration, let's default to a tick duration.
> + */
> +#define DEFAULT_IDLE_TIME_US TICK_USEC
> +
> +/**
> + * struct cpuidle_cooling_device - data for the idle cooling device
> + * @cdev: a pointer to a struct thermal_cooling_device
> + * @tsk: an array of pointer to the idle injection tasks
> + * @cpumask: a cpumask containing the CPU managed by the cooling device
> + * @timer: a hrtimer giving the tempo for the idle injection cycles
> + * @kref: a kernel refcount on this structure
> + * @waitq: the waiq for the idle injection tasks
> + * @count: an atomic to keep track of the last task exiting the idle cycle
> + * @idle_cycle: an integer defining the duration of the idle injection
> + * @state: an normalized integer giving the state of the cooling device
> + */
> +struct cpuidle_cooling_device {
> +       struct thermal_cooling_device *cdev;
> +       struct task_struct **tsk;
> +       struct cpumask *cpumask;
> +       struct list_head node;
> +       struct hrtimer timer;
> +       struct kref kref;
> +       wait_queue_head_t *waitq;
> +       atomic_t count;
> +       unsigned int idle_cycle;
> +       unsigned int state;
> +};
> +
> +static LIST_HEAD(cpuidle_cdev_list);
> +
> +/**
> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
> + * @idle_cdev: the idle cooling device
> + *
> + * Every idle injection task belonging to the idle cooling device and
> + * running on an online cpu will be wake up by this call.
> + */
> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
> +{
> +       int cpu;
> +       int weight = cpumask_weight(idle_cdev->cpumask);
> +
> +       for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask)
> +               wake_up_process(idle_cdev->tsk[cpu % weight]);
> +}
> +
> +/**
> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
> + * @timer: a hrtimer structure
> + *
> + * When the mitigation is acting, the CPU is allowed to run an amount
> + * of time, then the idle injection happens for the specified delay
> + * and the idle task injection schedules itself until the timer event
> + * wakes the idle injection tasks again for a new idle injection
> + * cycle. The time between the end of the idle injection and the timer
> + * expiration is the allocated running time for the CPU.
> + *
> + * Returns always HRTIMER_NORESTART
> + */
> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
> +{
> +       struct cpuidle_cooling_device *idle_cdev =
> +               container_of(timer, struct cpuidle_cooling_device, timer);
> +
> +       cpuidle_cooling_wakeup(idle_cdev);
> +
> +       return HRTIMER_NORESTART;
> +}
> +
> +/**
> + * cpuidle_cooling_runtime - Running time computation
> + * @idle_cdev: the idle cooling device
> + *
> + * The running duration is computed from the idle injection duration
> + * which is fixed. If we reach 100% of idle injection ratio, that
> + * means the running duration is zero. If we have a 50% ratio
> + * injection, that means we have equal duration for idle and for
> + * running duration.
> + *
> + * The formula is deduced as the following:
> + *
> + *     running = idle x ((100 / ratio) - 1)
> + *
> + * For precision purpose for integer math, we use the following:
> + *
> + *     running = (idle x 100) / ratio - idle
> + *
> + * For example, if we have an injected duration of 50%, then we end up
> + * with 10ms of idle injection and 10ms of running duration.
> + *
> + * Returns a s64 nanosecond based
> + */
> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
> +{
> +       s64 next_wakeup;
> +       int state = idle_cdev->state;
> +
> +       /*
> +        * The function must never be called when there is no
> +        * mitigation because:
> +        * - that does not make sense
> +        * - we end up with a division by zero
> +        */
> +       BUG_ON(!state);
> +
> +       next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
> +               idle_cdev->idle_cycle;
> +
> +       return next_wakeup * NSEC_PER_USEC;
> +}
> +
> +/**
> + * cpuidle_cooling_injection_thread - Idle injection mainloop thread function
> + * @arg: a void pointer containing the idle cooling device address
> + *
> + * This main function does basically two operations:
> + *
> + * - Goes idle for a specific amount of time
> + *
> + * - Sets a timer to wake up all the idle injection threads after a
> + *   running period
> + *
> + * That happens only when the mitigation is enabled, otherwise the
> + * task is scheduled out.
> + *
> + * In order to keep the tasks synchronized together, it is the last
> + * task exiting the idle period which is in charge of setting the
> + * timer.
> + *
> + * This function never returns.
> + */
> +static int cpuidle_cooling_injection_thread(void *arg)
> +{
> +       struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
> +       struct cpuidle_cooling_device *idle_cdev = arg;
> +       int index = smp_processor_id() % cpumask_weight(idle_cdev->cpumask);
> +       DEFINE_WAIT(wait);
> +
> +       set_freezable();
> +
> +       sched_setscheduler(current, SCHED_FIFO, &param);
> +
> +       while (1) {
> +
> +               s64 next_wakeup;
> +
> +               prepare_to_wait(&idle_cdev->waitq[index],
> +                               &wait, TASK_INTERRUPTIBLE);
> +
> +               schedule();
> +
> +               atomic_inc(&idle_cdev->count);
> +
> +               play_idle(idle_cdev->idle_cycle / USEC_PER_MSEC);
> +
> +               /*
> +                * The last CPU waking up is in charge of setting the
> +                * timer. If the CPU is hotplugged, the timer will
> +                * move to another CPU (which may not belong to the
> +                * same cluster) but that is not a problem as the
> +                * timer will be set again by another CPU belonging to
> +                * the cluster, so this mechanism is self adaptive and
> +                * does not require any hotplugging dance.
> +                */
> +               if (!atomic_dec_and_test(&idle_cdev->count))
> +                       continue;
> +
> +               if (!idle_cdev->state)
> +                       continue;
> +
> +               next_wakeup = cpuidle_cooling_runtime(idle_cdev);
> +
> +               hrtimer_start(&idle_cdev->timer, ns_to_ktime(next_wakeup),
> +                             HRTIMER_MODE_REL_PINNED);
> +       }
> +
> +       finish_wait(&idle_cdev->waitq[index], &wait);
> +
> +       return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_get_max_state - Get the maximum state
> + * @cdev : the thermal cooling device
> + * @state : a pointer to the state variable to be filled
> + *
> + * The function gives always 100 as the injection ratio is percentile
> + * based for consistency accros different platforms.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_max_state(struct thermal_cooling_device *cdev,
> +                                        unsigned long *state)
> +{
> +       /*
> +        * Depending on the configuration or the hardware, the running
> +        * cycle and the idle cycle could be different. We want unify
> +        * that to an 0..100 interval, so the set state interface will
> +        * be the same whatever the platform is.
> +        *
> +        * The state 100% will make the cluster 100% ... idle. A 0%
> +        * injection ratio means no idle injection at all and 50%
> +        * means for 10ms of idle injection, we have 10ms of running
> +        * time.
> +        */
> +       *state = 100;
> +
> +       return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_get_cur_state - Get the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: a pointer to the state
> + *
> + * The function just copy the state value from the private thermal
> + * cooling device structure, the mapping is 1 <-> 1.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_get_cur_state(struct thermal_cooling_device *cdev,
> +                                        unsigned long *state)
> +{
> +       struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> +
> +       *state = idle_cdev->state;
> +
> +       return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_set_cur_state - Set the current cooling state
> + * @cdev: the thermal cooling device
> + * @state: the target state
> + *
> + * The function checks first if we are initiating the mitigation which
> + * in turn wakes up all the idle injection tasks belonging to the idle
> + * cooling device. In any case, it updates the internal state for the
> + * cooling device.
> + *
> + * The function can not fail, it returns always zero.
> + */
> +static int cpuidle_cooling_set_cur_state(struct thermal_cooling_device *cdev,
> +                                        unsigned long state)
> +{
> +       struct cpuidle_cooling_device *idle_cdev = cdev->devdata;
> +       unsigned long current_state = idle_cdev->state;
> +
> +       idle_cdev->state = state;
> +
> +       if (current_state == 0 && state > 0) {
> +               pr_debug("Starting cooling cpus '%*pbl'\n",
> +                        cpumask_pr_args(idle_cdev->cpumask));
> +               cpuidle_cooling_wakeup(idle_cdev);
> +       } else if (current_state > 0 && !state) {
> +               pr_debug("Stopping cooling cpus '%*pbl'\n",
> +                        cpumask_pr_args(idle_cdev->cpumask));
> +       }
> +
> +       return 0;
> +}
> +
> +/**
> + * cpuidle_cooling_ops - thermal cooling device ops
> + */
> +static struct thermal_cooling_device_ops cpuidle_cooling_ops = {
> +       .get_max_state = cpuidle_cooling_get_max_state,
> +       .get_cur_state = cpuidle_cooling_get_cur_state,
> +       .set_cur_state = cpuidle_cooling_set_cur_state,
> +};
> +
> +/**
> + * cpuidle_cooling_release - Kref based release helper
> + * @kref: a pointer to the kref structure
> + *
> + * This function is automatically called by the kref_put function when
> + * the idle cooling device refcount reaches zero. At this point, we
> + * have the guarantee the structure is no longer in use and we can
> + * safely release all the ressources.
> + */
> +static void __init cpuidle_cooling_release(struct kref *kref)
> +{
> +       struct cpuidle_cooling_device *idle_cdev =
> +               container_of(kref, struct cpuidle_cooling_device, kref);
> +
> +       thermal_cooling_device_unregister(idle_cdev->cdev);
> +       kfree(idle_cdev->waitq);
> +       kfree(idle_cdev->tsk);
> +       kfree(idle_cdev);
> +}
> +
> +/**
> + * cpuidle_cooling_register - Idle cooling device initialization function
> + *
> + * This function is in charge of creating a cooling device per cluster
> + * and register it to thermal framework. For this we rely on the
> + * topology as there is nothing yet describing better the idle state
> + * power domains.
> + *
> + * For each first CPU of the cluster's cpumask, we allocate the idle
> + * cooling device, initialize the general fields and then we initialze
> + * the rest in a per cpu basis.
> + *
> + * Returns zero on success, < 0 otherwise.
> + */
> +int cpuidle_cooling_register(void)
> +{
> +       struct cpuidle_cooling_device *idle_cdev = NULL;
> +       struct thermal_cooling_device *cdev;
> +       struct task_struct *tsk;
> +       struct device_node *np;
> +       cpumask_t *cpumask;
> +       char dev_name[THERMAL_NAME_LENGTH];
> +       int weight;
> +       int ret = -ENOMEM, cpu;
> +       int index = 0;
> +
> +       for_each_possible_cpu(cpu) {
> +
> +               cpumask = topology_core_cpumask(cpu);
> +               weight = cpumask_weight(cpumask);
> +
> +               /*
> +                * This condition makes the first cpu belonging to the
> +                * cluster to create a cooling device and allocates
> +                * the structure. Others CPUs belonging to the same
> +                * cluster will just increment the refcount on the
> +                * cooling device structure and initialize it.
> +                */
> +               if (cpu == cpumask_first(cpumask)) {
> +
> +                       np = of_cpu_device_node_get(cpu);
> +
> +                       idle_cdev = kzalloc(sizeof(*idle_cdev), GFP_KERNEL);
> +                       if (!idle_cdev)
> +                               goto out_fail;
> +
> +                       idle_cdev->tsk = kzalloc(sizeof(*idle_cdev->tsk) *
> +                                                weight, GFP_KERNEL);
> +                       if (!idle_cdev->tsk)
> +                               goto out_fail;
> +
> +                       idle_cdev->waitq = kzalloc(sizeof(*idle_cdev->waitq) *
> +                                                  weight, GFP_KERNEL);
> +                       if (!idle_cdev->waitq)
> +                               goto out_fail;
> +
> +                       idle_cdev->idle_cycle = DEFAULT_IDLE_TIME_US;
> +
> +                       atomic_set(&idle_cdev->count, 0);
> +
> +                       kref_init(&idle_cdev->kref);
> +
> +                       /*
> +                        * Initialize the timer to wakeup all the idle
> +                        * injection tasks
> +                        */
> +                       hrtimer_init(&idle_cdev->timer,
> +                                    CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> +                       /*
> +                        * The wakeup function callback which is in
> +                        * charge of waking up all CPUs belonging to
> +                        * the same cluster
> +                        */
> +                       idle_cdev->timer.function = cpuidle_cooling_wakeup_fn;
> +
> +                       /*
> +                        * The thermal cooling device name
> +                        */
> +                       snprintf(dev_name, sizeof(dev_name), "thermal-idle-%d", index++);
> +                       cdev = thermal_of_cooling_device_register(np, dev_name,
> +                                                                 idle_cdev,
> +                                                                 &cpuidle_cooling_ops);
> +                       if (IS_ERR(cdev)) {
> +                               ret = PTR_ERR(cdev);
> +                               goto out_fail;
> +                       }
> +
> +                       idle_cdev->cdev = cdev;
> +
> +                       idle_cdev->cpumask = cpumask;
> +
> +                       list_add(&idle_cdev->node, &cpuidle_cdev_list);
> +
> +                       pr_info("Created idle cooling device for cpus '%*pbl'\n",
> +                               cpumask_pr_args(cpumask));
> +               }
> +
> +               kref_get(&idle_cdev->kref);
> +
> +               /*
> +                * Each cooling device is per package. Each package
> +                * has a set of cpus where the physical number is
> +                * duplicate in the kernel namespace. We need a way to
> +                * address the waitq[] and tsk[] arrays with index
> +                * which are not Linux cpu numbered.
> +                *
> +                * One solution is to use the
> +                * topology_core_id(cpu). Other solution is to use the
> +                * modulo.
> +                *
> +                * eg. 2 x cluster - 4 cores.
> +                *
> +                * Physical numbering -> Linux numbering -> % nr_cpus
> +                *
> +                * Pkg0 - Cpu0 -> 0 -> 0
> +                * Pkg0 - Cpu1 -> 1 -> 1
> +                * Pkg0 - Cpu2 -> 2 -> 2
> +                * Pkg0 - Cpu3 -> 3 -> 3
> +                *
> +                * Pkg1 - Cpu0 -> 4 -> 0
> +                * Pkg1 - Cpu1 -> 5 -> 1
> +                * Pkg1 - Cpu2 -> 6 -> 2
> +                * Pkg1 - Cpu3 -> 7 -> 3

I'm not sure the assumption above about the CPU numbering is safe.
Can't you use a per-cpu structure to point to the per-cpu resources
instead, so you don't have to rely on CPU ordering?
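
Something along these lines, perhaps (completely untested sketch, the
struct and variable names below are made up, not part of your patch):

        struct cpuidle_cooling_cpu {
                struct cpuidle_cooling_device *idle_cdev; /* back pointer to the cluster device */
                struct task_struct *tsk;                   /* this cpu's injection thread */
                wait_queue_head_t waitq;                   /* this cpu's wait queue */
        };

        static DEFINE_PER_CPU(struct cpuidle_cooling_cpu, cpuidle_cooling_cpu);

        /* in the registration loop, for each cpu of the cluster: */
        struct cpuidle_cooling_cpu *cc = per_cpu_ptr(&cpuidle_cooling_cpu, cpu);

        init_waitqueue_head(&cc->waitq);
        cc->idle_cdev = idle_cdev;
        cc->tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
                                        cc, cpu, "kidle_inject/%u");

        /* in the injection thread, arg becomes the per-cpu struct, so
         * idle_cdev = cc->idle_cdev and no index computation is needed: */
        struct cpuidle_cooling_cpu *cc = this_cpu_ptr(&cpuidle_cooling_cpu);

        prepare_to_wait(&cc->waitq, &wait, TASK_INTERRUPTIBLE);

cpuidle_cooling_wakeup() could then simply do
wake_up_process(per_cpu_ptr(&cpuidle_cooling_cpu, cpu)->tsk) for the
online cpus of the cluster's cpumask, whatever the numbering is.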
> +                */
> +               init_waitqueue_head(&idle_cdev->waitq[cpu % weight]);
> +
> +               tsk = kthread_create_on_cpu(cpuidle_cooling_injection_thread,
> +                                           idle_cdev, cpu, "kidle_inject/%u");
> +               if (IS_ERR(tsk)) {
> +                       ret = PTR_ERR(tsk);
> +                       goto out_fail;
> +               }
> +
> +               idle_cdev->tsk[cpu % weight] = tsk;
> +
> +               wake_up_process(tsk);
> +       }
> +
> +       return 0;
> +
> +out_fail:
> +       list_for_each_entry(idle_cdev, &cpuidle_cdev_list, node) {
> +
> +               for_each_cpu(cpu, idle_cdev->cpumask) {
> +
> +                       if (idle_cdev->tsk[cpu])
> +                               kthread_stop(idle_cdev->tsk[cpu]);
> +
> +                       kref_put(&idle_cdev->kref, cpuidle_cooling_release);
> +               }
> +       }
> +
> +       pr_err("Failed to create idle cooling device (%d)\n", ret);
> +
> +       return ret;
> +}
> +#endif
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index d4292eb..2b5950b 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -45,6 +45,7 @@ struct thermal_cooling_device *
>  cpufreq_power_cooling_register(struct cpufreq_policy *policy,
>                                u32 capacitance, get_static_t plat_static_func);
>
> +extern int cpuidle_cooling_register(void);
>  /**
>   * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
>   * @np: a valid struct device_node to the cooling device device tree node.
> @@ -118,6 +119,11 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>  {
>         return;
>  }
> +
> +static inline int cpuidle_cooling_register(void)
> +{
> +       return 0;
> +}
>  #endif /* CONFIG_CPU_THERMAL */
>
>  #endif /* __CPU_COOLING_H__ */
> --
> 2.7.4
>
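One more remark, on the changelog rather than the code: it took me a
moment to connect the hi6220 numbers to the 0..100 state used by
set_cur_state(). If I read cpuidle_cooling_runtime() right, and taking
the 10ms idle cycle from your example (so ignoring the 1.5ms wakeup
latency):

        running = idle * 100 / state - idle

        state =  40:  running = 10 * 100 /  40 - 10 = 15 ms
        state =  50:  running = 10 * 100 /  50 - 10 = 10 ms
        state = 100:  running = 10 * 100 / 100 - 10 =  0 ms

so the ~14-15ms running cycle you measured on the hi6220 corresponds to
a state of roughly 40. It may be worth spelling that mapping out in the
changelog.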