Dropped the per core idle injection patch from this series. Hence the
subject of this series is changed. Now it is just for using
idle inject framework for intel_powerclamp.
Add additionl patch for module paramters to set cpumask and max idle percent.
The old cover letter:
This series introduces per CPU idle injection. In preparation for this
enhance the existing powercap/idle_inject and modify intel_powerclamp
to use this. Then add per core idle injection driver.
v5:
- additionl patch for module paramters to set cpumask and max idle percent
- One fix for update callback
- Rebased on top of patch
"thermal: intel_powerclamp: Fix cur_state for multi package system"
v4:
- Dropped the per core idle inject patch
v3
- Change callback from per CPU to per device and use in intel_powerclamp
- Remove unused var in per cpu idle injection module
v2
- Update based on feedback from Rafael on patch 2/4
- Kconfig dependency issue
Reported-by: kernel test robot <[email protected]>
Srinivas Pandruvada (4):
powercap: idle_inject: Export symbols
powercap: idle_inject: Add update callback
thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
thermal/drivers/intel_powerclamp: Add additional module params
.../driver-api/thermal/intel_powerclamp.rst | 22 +
drivers/powercap/idle_inject.c | 57 +-
drivers/thermal/intel/Kconfig | 2 +
drivers/thermal/intel/intel_powerclamp.c | 512 ++++++++++--------
include/linux/idle_inject.h | 3 +
5 files changed, 367 insertions(+), 229 deletions(-)
--
2.39.1
The powercap/idle_inject core uses play_idle_precise() to inject idle
time. But play_idle_precise() can't ensure that the CPU is fully idle
for the specified duration because of wakeups due to interrupts. To
compensate for the reduced idle time due to these wakes, the caller
can adjust requested idle time for the next cycle.
The goal of idle injection is to keep system at some idle percent on
average, so this is fine to overshoot or undershoot instantaneous idle
times.
The idle inject core provides an interface idle_inject_set_duration()
to set idle and runtime duration.
Some architectures provide interface to get actual idle time observed
by the hardware. So, the effective idle percent can be adjusted using
the hardware feedback. For example, Intel CPUs provides package idle
counters, which is currently used by Intel powerclamp driver to
readjust runtime duration.
When the caller's desired idle time over a period is less or greater
than the actual CPU idle time observed by the hardware, caller can
readjust idle and runtime duration for the next cycle.
The only way this can be done currently is by monitoring hardware idle
time from a different software thread and readjust idle and runtime
duration using idle_inject_set_duration().
This can be avoided by adding a callback which callers can register and
readjust from this callback function.
Add a capability to register an optional update() callback, which can be
called from the idle inject core before waking up CPUs for idle injection.
This callback can be registered via a new interface:
idle_inject_register_full().
During this process of constantly adjusting idle and runtime duration
there can be some cases where actual idle time is more than the desired.
In this case idle inject can be skipped for a cycle. If update() callback
returns false, then the idle inject core skips waking up CPUs for the
idle injection.
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
v4/v5:
- No change
v3:
- Replace prepare/complete callback with update callback
v2
- Replace begin/end with prepare/complete
- Add new interface idle_inject_register_full with callbacks
- Update kernel doc
- Update commit description
drivers/powercap/idle_inject.c | 50 ++++++++++++++++++++++++++++++----
include/linux/idle_inject.h | 3 ++
2 files changed, 47 insertions(+), 6 deletions(-)
diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
index ec02b370ec16..3ac81086d71f 100644
--- a/drivers/powercap/idle_inject.c
+++ b/drivers/powercap/idle_inject.c
@@ -63,13 +63,27 @@ struct idle_inject_thread {
* @idle_duration_us: duration of CPU idle time to inject
* @run_duration_us: duration of CPU run time to allow
* @latency_us: max allowed latency
+ * @update: Optional callback deciding whether or not to skip idle
+ * injection in the given cycle.
* @cpumask: mask of CPUs affected by idle injection
+ *
+ * This structure is used to define per instance idle inject device data. Each
+ * instance has an idle duration, a run duration and mask of CPUs to inject
+ * idle.
+ * Actual idle is injected by calling kernel scheduler interface
+ * play_idle_precise(). There is one optional callbacks which the caller can
+ * register by calling idle_inject_register_full():
+ * update() - This callback is called just before waking up CPUs to inject
+ * idle. If this callback returns false, CPUs are not woken up to inject idle
+ * for this cycle. Also gives opportunity to the caller to readjust idle
+ * and run duration by calling idle_inject_set_duration() for the next cycle.
*/
struct idle_inject_device {
struct hrtimer timer;
unsigned int idle_duration_us;
unsigned int run_duration_us;
unsigned int latency_us;
+ bool (*update)(void);
unsigned long cpumask[];
};
@@ -111,11 +125,12 @@ static enum hrtimer_restart idle_inject_timer_fn(struct hrtimer *timer)
struct idle_inject_device *ii_dev =
container_of(timer, struct idle_inject_device, timer);
+ if (!ii_dev->update || (ii_dev->update && ii_dev->update()))
+ idle_inject_wakeup(ii_dev);
+
duration_us = READ_ONCE(ii_dev->run_duration_us);
duration_us += READ_ONCE(ii_dev->idle_duration_us);
- idle_inject_wakeup(ii_dev);
-
hrtimer_forward_now(timer, ns_to_ktime(duration_us * NSEC_PER_USEC));
return HRTIMER_RESTART;
@@ -297,17 +312,22 @@ static int idle_inject_should_run(unsigned int cpu)
}
/**
- * idle_inject_register - initialize idle injection on a set of CPUs
+ * idle_inject_register_full - initialize idle injection on a set of CPUs
* @cpumask: CPUs to be affected by idle injection
+ * @update: This callback is called just before waking up CPUs to inject
+ * idle
*
* This function creates an idle injection control device structure for the
- * given set of CPUs and initializes the timer associated with it. It does not
- * start any injection cycles.
+ * given set of CPUs and initializes the timer associated with it. This
+ * function also allows to register update()callback.
+ * It does not start any injection cycles.
*
* Return: NULL if memory allocation fails, idle injection control device
* pointer on success.
*/
-struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
+
+struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
+ bool (*update)(void))
{
struct idle_inject_device *ii_dev;
int cpu, cpu_rb;
@@ -320,6 +340,7 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
ii_dev->timer.function = idle_inject_timer_fn;
ii_dev->latency_us = UINT_MAX;
+ ii_dev->update = update;
for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
@@ -344,6 +365,23 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
return NULL;
}
+EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
+
+/**
+ * idle_inject_register - initialize idle injection on a set of CPUs
+ * @cpumask: CPUs to be affected by idle injection
+ *
+ * This function creates an idle injection control device structure for the
+ * given set of CPUs and initializes the timer associated with it. It does not
+ * start any injection cycles.
+ *
+ * Return: NULL if memory allocation fails, idle injection control device
+ * pointer on success.
+ */
+struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
+{
+ return idle_inject_register_full(cpumask, NULL);
+}
EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
/**
diff --git a/include/linux/idle_inject.h b/include/linux/idle_inject.h
index fb88e23a99d3..a85d5dd40f72 100644
--- a/include/linux/idle_inject.h
+++ b/include/linux/idle_inject.h
@@ -13,6 +13,9 @@ struct idle_inject_device;
struct idle_inject_device *idle_inject_register(struct cpumask *cpumask);
+struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
+ bool (*update)(void));
+
void idle_inject_unregister(struct idle_inject_device *ii_dev);
int idle_inject_start(struct idle_inject_device *ii_dev);
--
2.39.1
There are two idle injection implementation in the Linux kernel. One
via intel_powerclamp and the other using powercap/idle_inject. Both
implementation end up in calling play_idle* function from a FIFO
priority thread. Both can't be used at the same time.
It is better to use one idle injection framework for better
maintainability. In this way, there is only one caller for play_idle.
Here powercap/idle_inject can be used for both per-core and for system
wide idle injection. This framework has a well defined interface which
allow registry for per-core or for all CPUs (system wide).
This reduces code complexity in the intel powerclamp driver as all the
per CPU kthreads, delayed work and calls to play_idle can be removed.
The changes include:
- Remove unneeded include files
- Remove per CPU kthread workers: balancing_work and idle_injection_work.
- Reuse the compensation related code by moving from previous worker
thread to idle_injection callback.
- Adjust the idle_duration and runtime by using powercap/idle_inject
interface.
- Remove all variables, which are not required once powercap/idle_inject
is used.
- Add mutex to avoid race during removal of idle injection during module
unload and user action to change idle inject percent. Also for
protection during dynamic adjustment of run and idle time from
update() callback.
- Remove online/offline callbacks to designate control CPU
- Use cpu_present_mask global variable for CPU mask
- Remove hot plug locks
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
v5:
- Can no longer sleep in update() callbak which is a hr timer callback
So skip if mutex will result in sleep.
v4:
- Remove local cpumask for present CPUs as there is one global one
also no need for hot plug locks as it can't change
- Add some comments on functions
v3:
- Use Update callback which is per device
- Remove use of control_cpu and online/offline callback to set this
v2:
- Use idle_inject_register_full instead of idle_inject_register
- Also fix dependency issue with POWERCAP config
Reported-by: kernel test robot <[email protected]>
drivers/thermal/intel/Kconfig | 2 +
drivers/thermal/intel/intel_powerclamp.c | 379 ++++++++++-------------
2 files changed, 158 insertions(+), 223 deletions(-)
diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index e50fd260484a..b346a646fffb 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -3,6 +3,8 @@ config INTEL_POWERCLAMP
tristate "Intel PowerClamp idle injection driver"
depends on X86
depends on CPU_SUP_INTEL
+ select POWERCAP
+ select IDLE_INJECT
help
Enable this to enable Intel PowerClamp idle injection driver. This
enforce idle time which results in more package C-state residency. The
diff --git a/drivers/thermal/intel/intel_powerclamp.c b/drivers/thermal/intel/intel_powerclamp.c
index 64f082c584b2..850195ebe5e0 100644
--- a/drivers/thermal/intel/intel_powerclamp.c
+++ b/drivers/thermal/intel/intel_powerclamp.c
@@ -2,7 +2,7 @@
/*
* intel_powerclamp.c - package c-state idle injection
*
- * Copyright (c) 2012, Intel Corporation.
+ * Copyright (c) 2012-2023, Intel Corporation.
*
* Authors:
* Arjan van de Ven <[email protected]>
@@ -27,21 +27,15 @@
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/delay.h>
-#include <linux/kthread.h>
#include <linux/cpu.h>
#include <linux/thermal.h>
-#include <linux/slab.h>
-#include <linux/tick.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
-#include <linux/sched/rt.h>
-#include <uapi/linux/sched/types.h>
+#include <linux/idle_inject.h>
-#include <asm/nmi.h>
#include <asm/msr.h>
#include <asm/mwait.h>
#include <asm/cpu_device_id.h>
-#include <asm/hardirq.h>
#define MAX_TARGET_RATIO (50U)
/* For each undisturbed clamping period (no extra wake ups during idle time),
@@ -59,35 +53,26 @@ static unsigned int target_mwait;
static struct dentry *debug_dir;
static bool poll_pkg_cstate_enable;
-/* user selected target */
-static unsigned int set_target_ratio;
+/* Idle ratio observed using package C-state counters */
static unsigned int current_ratio;
-static bool should_skip;
-static unsigned int control_cpu; /* The cpu assigned to collect stat and update
- * control parameters. default to BSP but BSP
- * can be offlined.
- */
-static bool clamping;
+/* Skip the idle injection till set to true */
+static bool should_skip;
-struct powerclamp_worker_data {
- struct kthread_worker *worker;
- struct kthread_work balancing_work;
- struct kthread_delayed_work idle_injection_work;
+struct powerclamp_data {
unsigned int cpu;
unsigned int count;
unsigned int guard;
unsigned int window_size_now;
unsigned int target_ratio;
- unsigned int duration_jiffies;
bool clamping;
};
-static struct powerclamp_worker_data __percpu *worker_data;
+static struct powerclamp_data powerclamp_data;
+
static struct thermal_cooling_device *cooling_dev;
-static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
- * clamping kthread worker
- */
+
+static DEFINE_MUTEX(powerclamp_lock);
static unsigned int duration;
static unsigned int pkg_cstate_ratio_cur;
@@ -306,7 +291,7 @@ static void adjust_compensation(int target_ratio, unsigned int win)
if (d->confidence >= CONFIDENCE_OK)
return;
- delta = set_target_ratio - current_ratio;
+ delta = powerclamp_data.target_ratio - current_ratio;
/* filter out bad data */
if (delta >= 0 && delta <= (1+target_ratio/10)) {
if (d->steady_comp)
@@ -345,82 +330,39 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
adjust_compensation(target_ratio, win);
/* if we are above target+guard, skip */
- return set_target_ratio + guard <= current_ratio;
+ return powerclamp_data.target_ratio + guard <= current_ratio;
}
-static void clamp_balancing_func(struct kthread_work *work)
+/*
+ * This function calculates runtime from the current target ratio.
+ * This function gets called under powerclamp_lock.
+ */
+static unsigned int get_run_time(void)
{
- struct powerclamp_worker_data *w_data;
- int sleeptime;
- unsigned long target_jiffies;
unsigned int compensated_ratio;
- int interval; /* jiffies to sleep for each attempt */
-
- w_data = container_of(work, struct powerclamp_worker_data,
- balancing_work);
+ unsigned int runtime;
/*
* make sure user selected ratio does not take effect until
* the next round. adjust target_ratio if user has changed
* target such that we can converge quickly.
*/
- w_data->target_ratio = READ_ONCE(set_target_ratio);
- w_data->guard = 1 + w_data->target_ratio / 20;
- w_data->window_size_now = window_size;
- w_data->duration_jiffies = msecs_to_jiffies(duration);
- w_data->count++;
+ powerclamp_data.guard = 1 + powerclamp_data.target_ratio / 20;
+ powerclamp_data.window_size_now = window_size;
/*
* systems may have different ability to enter package level
* c-states, thus we need to compensate the injected idle ratio
* to achieve the actual target reported by the HW.
*/
- compensated_ratio = w_data->target_ratio +
- get_compensation(w_data->target_ratio);
+ compensated_ratio = powerclamp_data.target_ratio +
+ get_compensation(powerclamp_data.target_ratio);
if (compensated_ratio <= 0)
compensated_ratio = 1;
- interval = w_data->duration_jiffies * 100 / compensated_ratio;
-
- /* align idle time */
- target_jiffies = roundup(jiffies, interval);
- sleeptime = target_jiffies - jiffies;
- if (sleeptime <= 0)
- sleeptime = 1;
-
- if (clamping && w_data->clamping && cpu_online(w_data->cpu))
- kthread_queue_delayed_work(w_data->worker,
- &w_data->idle_injection_work,
- sleeptime);
-}
-static void clamp_idle_injection_func(struct kthread_work *work)
-{
- struct powerclamp_worker_data *w_data;
+ runtime = duration * 100 / compensated_ratio - duration;
- w_data = container_of(work, struct powerclamp_worker_data,
- idle_injection_work.work);
-
- /*
- * only elected controlling cpu can collect stats and update
- * control parameters.
- */
- if (w_data->cpu == control_cpu &&
- !(w_data->count % w_data->window_size_now)) {
- should_skip =
- powerclamp_adjust_controls(w_data->target_ratio,
- w_data->guard,
- w_data->window_size_now);
- smp_mb();
- }
-
- if (should_skip)
- goto balance;
-
- play_idle(jiffies_to_usecs(w_data->duration_jiffies));
-
-balance:
- if (clamping && w_data->clamping && cpu_online(w_data->cpu))
- kthread_queue_work(w_data->worker, &w_data->balancing_work);
+ return runtime;
}
/*
@@ -456,127 +398,135 @@ static void poll_pkg_cstate(struct work_struct *dummy)
msr_last = msr_now;
tsc_last = tsc_now;
- if (true == clamping)
+ mutex_lock(&powerclamp_lock);
+ if (powerclamp_data.clamping)
schedule_delayed_work(&poll_pkg_cstate_work, HZ);
+ mutex_unlock(&powerclamp_lock);
}
-static void start_power_clamp_worker(unsigned long cpu)
-{
- struct powerclamp_worker_data *w_data = per_cpu_ptr(worker_data, cpu);
- struct kthread_worker *worker;
-
- worker = kthread_create_worker_on_cpu(cpu, 0, "kidle_inj/%ld", cpu);
- if (IS_ERR(worker))
- return;
-
- w_data->worker = worker;
- w_data->count = 0;
- w_data->cpu = cpu;
- w_data->clamping = true;
- set_bit(cpu, cpu_clamping_mask);
- sched_set_fifo(worker->task);
- kthread_init_work(&w_data->balancing_work, clamp_balancing_func);
- kthread_init_delayed_work(&w_data->idle_injection_work,
- clamp_idle_injection_func);
- kthread_queue_work(w_data->worker, &w_data->balancing_work);
-}
+static struct idle_inject_device *ii_dev;
-static void stop_power_clamp_worker(unsigned long cpu)
+/*
+ * This function is called from idle injection core on timer expiry
+ * for the run duration. This allows powerclamp to readjust or skip
+ * injecting idle for this cycle.
+ */
+static bool idle_inject_update(void)
{
- struct powerclamp_worker_data *w_data = per_cpu_ptr(worker_data, cpu);
+ bool update;
- if (!w_data->worker)
- return;
+ /* We can't sleep in this callback */
+ if (!mutex_trylock(&powerclamp_lock))
+ return true;
- w_data->clamping = false;
- /*
- * Make sure that all works that get queued after this point see
- * the clamping disabled. The counter part is not needed because
- * there is an implicit memory barrier when the queued work
- * is proceed.
- */
- smp_wmb();
- kthread_cancel_work_sync(&w_data->balancing_work);
- kthread_cancel_delayed_work_sync(&w_data->idle_injection_work);
- /*
- * The balancing work still might be queued here because
- * the handling of the "clapming" variable, cancel, and queue
- * operations are not synchronized via a lock. But it is not
- * a big deal. The balancing work is fast and destroy kthread
- * will wait for it.
- */
- clear_bit(w_data->cpu, cpu_clamping_mask);
- kthread_destroy_worker(w_data->worker);
+ if (!(powerclamp_data.count % powerclamp_data.window_size_now)) {
- w_data->worker = NULL;
-}
+ should_skip = powerclamp_adjust_controls(powerclamp_data.target_ratio,
+ powerclamp_data.guard,
+ powerclamp_data.window_size_now);
+ update = true;
+ }
-static int start_power_clamp(void)
-{
- unsigned long cpu;
+ if (update) {
+ unsigned int runtime = get_run_time();
- set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
- /* prevent cpu hotplug */
- cpus_read_lock();
+ idle_inject_set_duration(ii_dev, runtime, duration);
+ }
- /* prefer BSP */
- control_cpu = cpumask_first(cpu_online_mask);
+ powerclamp_data.count++;
- clamping = true;
- if (poll_pkg_cstate_enable)
- schedule_delayed_work(&poll_pkg_cstate_work, 0);
+ mutex_unlock(&powerclamp_lock);
- /* start one kthread worker per online cpu */
- for_each_online_cpu(cpu) {
- start_power_clamp_worker(cpu);
- }
- cpus_read_unlock();
+ if (should_skip)
+ return false;
- return 0;
+ return true;
}
-static void end_power_clamp(void)
+/* This function starts idle injection by calling idle_inject_start() */
+static void trigger_idle_injection(void)
{
- int i;
+ unsigned int runtime = get_run_time();
+ idle_inject_set_duration(ii_dev, runtime, duration);
+ idle_inject_start(ii_dev);
+ powerclamp_data.clamping = true;
+}
+
+/*
+ * This function is called from start_power_clamp() to register
+ * CPUS with powercap idle injection register and set default
+ * idle duration and latency.
+ */
+static int powerclamp_idle_injection_register(void)
+{
/*
- * Block requeuing in all the kthread workers. They will flush and
- * stop faster.
+ * The idle inject core will only inject for online CPUs,
+ * So we can register for all present CPUs. In this way
+ * if some CPU goes online/offline while idle inject
+ * is registered, nothing additional calls are required.
+ * The same runtime and idle time is applicable for
+ * newly onlined CPUs if any.
+ *
+ * Here cpu_present_mask can be used as is.
+ * cast to (struct cpumask *) is required as the
+ * cpu_present_mask is const struct cpumask *, otherwise
+ * there will be compiler warnings.
*/
- clamping = false;
- for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
- pr_debug("clamping worker for cpu %d alive, destroy\n", i);
- stop_power_clamp_worker(i);
+ ii_dev = idle_inject_register_full((struct cpumask *)cpu_present_mask,
+ idle_inject_update);
+ if (!ii_dev) {
+ pr_err("powerclamp: idle_inject_register failed\n");
+ return -EAGAIN;
}
+
+ idle_inject_set_duration(ii_dev, TICK_USEC, duration);
+ idle_inject_set_latency(ii_dev, UINT_MAX);
+
+ return 0;
}
-static int powerclamp_cpu_online(unsigned int cpu)
+/*
+ * This function is called from end_power_clamp() to stop idle injection
+ * and unregister CPUS from powercap idle injection core.
+ */
+static void remove_idle_injection(void)
{
- if (clamping == false)
- return 0;
- start_power_clamp_worker(cpu);
- /* prefer BSP as controlling CPU */
- if (cpu == 0) {
- control_cpu = 0;
- smp_mb();
- }
- return 0;
+ if (!powerclamp_data.clamping)
+ return;
+
+ powerclamp_data.clamping = false;
+ idle_inject_stop(ii_dev);
}
-static int powerclamp_cpu_predown(unsigned int cpu)
+/*
+ * This function is called when user change the cooling device
+ * state from zero to some other value.
+ */
+static int start_power_clamp(void)
{
- if (clamping == false)
- return 0;
+ int ret;
- stop_power_clamp_worker(cpu);
- if (cpu != control_cpu)
- return 0;
+ ret = powerclamp_idle_injection_register();
+ if (!ret) {
+ trigger_idle_injection();
+ if (poll_pkg_cstate_enable)
+ schedule_delayed_work(&poll_pkg_cstate_work, 0);
+ }
- control_cpu = cpumask_first(cpu_online_mask);
- if (control_cpu == cpu)
- control_cpu = cpumask_next(cpu, cpu_online_mask);
- smp_mb();
- return 0;
+ return ret;
+}
+
+/*
+ * This function is called when user change the cooling device
+ * state from non zero value zero.
+ */
+static void end_power_clamp(void)
+{
+ if (powerclamp_data.clamping) {
+ remove_idle_injection();
+ idle_inject_unregister(ii_dev);
+ }
}
static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
@@ -590,16 +540,20 @@ static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
unsigned long *state)
{
- if (true == clamping) {
+ mutex_lock(&powerclamp_lock);
+
+ if (powerclamp_data.clamping) {
if (poll_pkg_cstate_enable)
*state = pkg_cstate_ratio_cur;
else
- *state = set_target_ratio;
+ *state = powerclamp_data.target_ratio;
} else {
/* to save power, do not poll idle ratio while not clamping */
*state = -1; /* indicates invalid state */
}
+ mutex_unlock(&powerclamp_lock);
+
return 0;
}
@@ -608,24 +562,32 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
{
int ret = 0;
+ mutex_lock(&powerclamp_lock);
+
new_target_ratio = clamp(new_target_ratio, 0UL,
- (unsigned long) (MAX_TARGET_RATIO-1));
- if (set_target_ratio == 0 && new_target_ratio > 0) {
+ (unsigned long) (MAX_TARGET_RATIO - 1));
+ if (!powerclamp_data.target_ratio && new_target_ratio > 0) {
pr_info("Start idle injection to reduce power\n");
- set_target_ratio = new_target_ratio;
+ powerclamp_data.target_ratio = new_target_ratio;
ret = start_power_clamp();
+ if (ret)
+ powerclamp_data.target_ratio = 0;
goto exit_set;
- } else if (set_target_ratio > 0 && new_target_ratio == 0) {
+ } else if (powerclamp_data.target_ratio > 0 && new_target_ratio == 0) {
pr_info("Stop forced idle injection\n");
end_power_clamp();
- set_target_ratio = 0;
+ powerclamp_data.target_ratio = 0;
} else /* adjust currently running */ {
- set_target_ratio = new_target_ratio;
- /* make new set_target_ratio visible to other cpus */
- smp_mb();
+ unsigned int runtime;
+
+ powerclamp_data.target_ratio = new_target_ratio;
+ runtime = get_run_time();
+ idle_inject_set_duration(ii_dev, runtime, duration);
}
exit_set:
+ mutex_unlock(&powerclamp_lock);
+
return ret;
}
@@ -666,7 +628,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused)
{
int i = 0;
- seq_printf(m, "controlling cpu: %d\n", control_cpu);
seq_printf(m, "pct confidence steady dynamic (compensation)\n");
for (i = 0; i < MAX_TARGET_RATIO; i++) {
seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
@@ -689,78 +650,50 @@ static inline void powerclamp_create_debug_files(void)
&powerclamp_debug_fops);
}
-static enum cpuhp_state hp_state;
-
static int __init powerclamp_init(void)
{
int retval;
- cpu_clamping_mask = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
- if (!cpu_clamping_mask)
- return -ENOMEM;
-
/* probe cpu features and ids here */
retval = powerclamp_probe();
if (retval)
- goto exit_free;
+ return retval;
/* set default limit, maybe adjusted during runtime based on feedback */
window_size = 2;
- retval = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
- "thermal/intel_powerclamp:online",
- powerclamp_cpu_online,
- powerclamp_cpu_predown);
- if (retval < 0)
- goto exit_free;
-
- hp_state = retval;
-
- worker_data = alloc_percpu(struct powerclamp_worker_data);
- if (!worker_data) {
- retval = -ENOMEM;
- goto exit_unregister;
- }
if (topology_max_packages() == 1 && topology_max_die_per_package() == 1)
poll_pkg_cstate_enable = true;
cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
- &powerclamp_cooling_ops);
- if (IS_ERR(cooling_dev)) {
- retval = -ENODEV;
- goto exit_free_thread;
- }
+ &powerclamp_cooling_ops);
+ if (IS_ERR(cooling_dev))
+ return -ENODEV;
if (!duration)
- duration = jiffies_to_msecs(DEFAULT_DURATION_JIFFIES);
+ duration = jiffies_to_usecs(DEFAULT_DURATION_JIFFIES);
powerclamp_create_debug_files();
return 0;
-
-exit_free_thread:
- free_percpu(worker_data);
-exit_unregister:
- cpuhp_remove_state_nocalls(hp_state);
-exit_free:
- bitmap_free(cpu_clamping_mask);
- return retval;
}
module_init(powerclamp_init);
static void __exit powerclamp_exit(void)
{
+ mutex_lock(&powerclamp_lock);
end_power_clamp();
- cpuhp_remove_state_nocalls(hp_state);
- free_percpu(worker_data);
+ mutex_unlock(&powerclamp_lock);
+
thermal_cooling_device_unregister(cooling_dev);
- bitmap_free(cpu_clamping_mask);
cancel_delayed_work_sync(&poll_pkg_cstate_work);
debugfs_remove_recursive(debug_dir);
}
module_exit(powerclamp_exit);
+MODULE_IMPORT_NS(IDLE_INJECT);
+
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Arjan van de Ven <[email protected]>");
MODULE_AUTHOR("Jacob Pan <[email protected]>");
--
2.39.1
In some use cases, it is desirable to only inject idle on certain set
of CPUs. For example on Alder Lake systems, it is possible that we force
idle only on P-Cores for thermal reasons. Also the idle percent can be
more than 50% if we only choose partial set of CPUs in the system.
Introduce module parameters for setting cpumask and max_idle. They
can be only changed when the cooling device is inactive. This module
already have other module parameters. There is no change done for
those parameters.
cpumask (Read/Write): A bit mask of CPUs to inject idle. The format of
this bitmask is same as used in other subsystems like in
/proc/irq/*/smp_affinity. The mask is comma separated 32 bit groups.
Each CPU is one bit. For example for 256 CPU system the full mask is:
ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
The leftmost mask is for CPU 0-32.
max_idle (Read/Write): Maximum injected idle time to the total CPU time
ratio in percent range from 1 to 100. Even if the cooling device max_state
is always 100 (100%), this parameter allows to add a max idle percent
limit. The default is 50, to match the current implementation of powerclamp
driver. Also doesn't allow value more than 75, if the cpumask includes
every CPU present in the system.
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
v5
New patch
.../driver-api/thermal/intel_powerclamp.rst | 22 +++
drivers/thermal/intel/intel_powerclamp.c | 169 ++++++++++++++++--
2 files changed, 173 insertions(+), 18 deletions(-)
diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst b/Documentation/driver-api/thermal/intel_powerclamp.rst
index 3f6dfb0b3ea6..d805e28b7a45 100644
--- a/Documentation/driver-api/thermal/intel_powerclamp.rst
+++ b/Documentation/driver-api/thermal/intel_powerclamp.rst
@@ -26,6 +26,8 @@ By:
- Generic Thermal Layer (sysfs)
- Kernel APIs (TBD)
+ (*) Module Parameters
+
INTRODUCTION
============
@@ -318,3 +320,23 @@ device, a PID based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, a UltraBook user can compile the kernel under
certain temperature (below most active trip points).
+
+Module Parameters
+=================
+
+``cpumask`` (RW)
+ A bit mask of CPUs to inject idle. The format of the bitmask is same as
+ used in other subsystems like in /proc/irq/*/smp_affinity. The mask is
+ comma separated 32 bit groups. Each CPU is one bit. For example for a 256
+ CPU system the full mask is:
+ ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
+
+ The leftmost mask is for CPU 0-32.
+
+``max_idle`` (RW)
+ Maximum injected idle time to the total CPU time ratio in percent range
+ from 1 to 100. Even if the cooling device max_state is always 100 (100%),
+ this parameter allows to add a max idle percent limit. The default is 50,
+ to match the current implementation of powerclamp driver. Also doesn't
+ allow value more than 75, if the cpumask includes every CPU present in
+ the system.
diff --git a/drivers/thermal/intel/intel_powerclamp.c b/drivers/thermal/intel/intel_powerclamp.c
index 850195ebe5e0..68830b726da2 100644
--- a/drivers/thermal/intel/intel_powerclamp.c
+++ b/drivers/thermal/intel/intel_powerclamp.c
@@ -37,7 +37,7 @@
#include <asm/mwait.h>
#include <asm/cpu_device_id.h>
-#define MAX_TARGET_RATIO (50U)
+#define MAX_TARGET_RATIO (100U)
/* For each undisturbed clamping period (no extra wake ups during idle time),
* we increment the confidence counter for the given target ratio.
* CONFIDENCE_OK defines the level where runtime calibration results are
@@ -109,6 +109,135 @@ static const struct kernel_param_ops duration_ops = {
module_param_cb(duration, &duration_ops, &duration, 0644);
MODULE_PARM_DESC(duration, "forced idle time for each attempt in msec.");
+#define DEFAULT_MAX_IDLE 50
+#define MAX_ALL_CPU_IDLE 75
+
+static u8 max_idle = DEFAULT_MAX_IDLE;
+
+static cpumask_var_t idle_injection_cpu_mask;
+
+static int allocate_idle_injection_mask(void)
+{
+ /* This mask is allocated only one time and freed during module exit */
+ if (!idle_injection_cpu_mask) {
+ if (!zalloc_cpumask_var(&idle_injection_cpu_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ cpumask_copy(idle_injection_cpu_mask, cpu_present_mask);
+ }
+
+ return 0;
+}
+
+static int cpumask_set(const char *arg, const struct kernel_param *kp)
+{
+ int ret;
+
+ mutex_lock(&powerclamp_lock);
+
+ /* Can't set mask when cooling device is in use */
+ if (powerclamp_data.clamping) {
+ ret = -EAGAIN;
+ goto skip_cpumask_set;
+ }
+
+ /*
+ * When module parameters are passed from kernel command line
+ * during insmod, the module parameter callback is called
+ * before powerclamp_init(), so we can't assume that some
+ * cpumask can be allocated before here.
+ */
+ ret = allocate_idle_injection_mask();
+ if (ret)
+ goto skip_cpumask_set;
+
+ ret = bitmap_parse(arg, strlen(arg), cpumask_bits(idle_injection_cpu_mask),
+ nr_cpumask_bits);
+ if (ret)
+ goto skip_cpumask_set;
+
+ if (cpumask_empty(idle_injection_cpu_mask)) {
+ ret = -EINVAL;
+ goto skip_cpumask_set;
+ }
+
+ if (cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
+ max_idle > MAX_ALL_CPU_IDLE) {
+ ret = -EINVAL;
+ goto skip_cpumask_set;
+ }
+
+ mutex_unlock(&powerclamp_lock);
+
+ return 0;
+
+skip_cpumask_set:
+ mutex_unlock(&powerclamp_lock);
+
+ return ret;
+}
+
+static int cpumask_get(char *buf, const struct kernel_param *kp)
+{
+ if (!idle_injection_cpu_mask)
+ return -EINVAL;
+
+ return bitmap_print_to_pagebuf(false, buf, cpumask_bits(idle_injection_cpu_mask),
+ nr_cpumask_bits);
+}
+
+static const struct kernel_param_ops cpumask_ops = {
+ .set = cpumask_set,
+ .get = cpumask_get,
+};
+
+module_param_cb(cpumask, &cpumask_ops, NULL, 0644);
+MODULE_PARM_DESC(cpumask, "Mask of CPUs to use for idle injection.");
+
+static int max_idle_set(const char *arg, const struct kernel_param *kp)
+{
+ u8 _max_idle;
+ int ret = 0;
+
+ mutex_lock(&powerclamp_lock);
+
+ /* Can't set mask when cooling device is in use */
+ if (powerclamp_data.clamping) {
+ ret = -EAGAIN;
+ goto skip_limit_set;
+ }
+
+ ret = kstrtou8(arg, 10, &_max_idle);
+ if (ret)
+ goto skip_limit_set;
+
+ if (_max_idle > MAX_TARGET_RATIO) {
+ ret = -EINVAL;
+ goto skip_limit_set;
+ }
+
+ if (idle_injection_cpu_mask && cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
+ _max_idle > MAX_ALL_CPU_IDLE) {
+ ret = -EINVAL;
+ goto skip_limit_set;
+ }
+
+ max_idle = _max_idle;
+
+skip_limit_set:
+ mutex_unlock(&powerclamp_lock);
+
+ return ret;
+}
+
+static const struct kernel_param_ops max_idle_ops = {
+ .set = max_idle_set,
+ .get = param_get_int,
+};
+
+module_param_cb(max_idle, &max_idle_ops, &max_idle, 0644);
+MODULE_PARM_DESC(max_idle, "maximum injected idle time to the total CPU time ratio in percent range:1-100");
+
struct powerclamp_calibration_data {
unsigned long confidence; /* used for calibration, basically a counter
* gets incremented each time a clamping
@@ -342,6 +471,10 @@ static unsigned int get_run_time(void)
unsigned int compensated_ratio;
unsigned int runtime;
+ /* No compensation for non systemwide idle injection */
+ if (max_idle > MAX_ALL_CPU_IDLE)
+ return (duration * 100 / powerclamp_data.target_ratio - duration);
+
/*
* make sure user selected ratio does not take effect until
* the next round. adjust target_ratio if user has changed
@@ -460,21 +593,11 @@ static void trigger_idle_injection(void)
*/
static int powerclamp_idle_injection_register(void)
{
- /*
- * The idle inject core will only inject for online CPUs,
- * So we can register for all present CPUs. In this way
- * if some CPU goes online/offline while idle inject
- * is registered, nothing additional calls are required.
- * The same runtime and idle time is applicable for
- * newly onlined CPUs if any.
- *
- * Here cpu_present_mask can be used as is.
- * cast to (struct cpumask *) is required as the
- * cpu_present_mask is const struct cpumask *, otherwise
- * there will be compiler warnings.
- */
- ii_dev = idle_inject_register_full((struct cpumask *)cpu_present_mask,
- idle_inject_update);
+ if (cpumask_equal(cpu_present_mask, idle_injection_cpu_mask))
+ ii_dev = idle_inject_register_full(idle_injection_cpu_mask, idle_inject_update);
+ else
+ ii_dev = idle_inject_register(idle_injection_cpu_mask);
+
if (!ii_dev) {
pr_err("powerclamp: idle_inject_register failed\n");
return -EAGAIN;
@@ -510,7 +633,7 @@ static int start_power_clamp(void)
ret = powerclamp_idle_injection_register();
if (!ret) {
trigger_idle_injection();
- if (poll_pkg_cstate_enable)
+ if (poll_pkg_cstate_enable && max_idle < MAX_ALL_CPU_IDLE)
schedule_delayed_work(&poll_pkg_cstate_work, 0);
}
@@ -565,7 +688,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
mutex_lock(&powerclamp_lock);
new_target_ratio = clamp(new_target_ratio, 0UL,
- (unsigned long) (MAX_TARGET_RATIO - 1));
+ (unsigned long) (max_idle - 1));
if (!powerclamp_data.target_ratio && new_target_ratio > 0) {
pr_info("Start idle injection to reduce power\n");
powerclamp_data.target_ratio = new_target_ratio;
@@ -656,6 +779,13 @@ static int __init powerclamp_init(void)
/* probe cpu features and ids here */
retval = powerclamp_probe();
+ if (retval)
+ return retval;
+
+ mutex_lock(&powerclamp_lock);
+ retval = allocate_idle_injection_mask();
+ mutex_unlock(&powerclamp_lock);
+
if (retval)
return retval;
@@ -689,6 +819,9 @@ static void __exit powerclamp_exit(void)
cancel_delayed_work_sync(&poll_pkg_cstate_work);
debugfs_remove_recursive(debug_dir);
+
+ if (idle_injection_cpu_mask)
+ free_cpumask_var(idle_injection_cpu_mask);
}
module_exit(powerclamp_exit);
--
2.39.1
On Wed, Feb 1, 2023 at 7:35 PM Srinivas Pandruvada
<[email protected]> wrote:
>
> Dropped the per core idle injection patch from this series. Hence the
> subject of this series is changed. Now it is just for using
> idle inject framework for intel_powerclamp.
> Add additionl patch for module paramters to set cpumask and max idle percent.
>
> The old cover letter:
> This series introduces per CPU idle injection. In preparation for this
> enhance the existing powercap/idle_inject and modify intel_powerclamp
> to use this. Then add per core idle injection driver.
>
> v5:
> - additionl patch for module paramters to set cpumask and max idle percent
> - One fix for update callback
> - Rebased on top of patch
> "thermal: intel_powerclamp: Fix cur_state for multi package system"
Patches [1-3/4] have been applied as 6.3 material and I'll comment
patch [4/4] separately.
>
> v4:
> - Dropped the per core idle inject patch
>
> v3
> - Change callback from per CPU to per device and use in intel_powerclamp
> - Remove unused var in per cpu idle injection module
>
> v2
> - Update based on feedback from Rafael on patch 2/4
> - Kconfig dependency issue
> Reported-by: kernel test robot <[email protected]>
>
> Srinivas Pandruvada (4):
> powercap: idle_inject: Export symbols
> powercap: idle_inject: Add update callback
> thermal/drivers/intel_powerclamp: Use powercap idle-inject framework
> thermal/drivers/intel_powerclamp: Add additional module params
>
> .../driver-api/thermal/intel_powerclamp.rst | 22 +
> drivers/powercap/idle_inject.c | 57 +-
> drivers/thermal/intel/Kconfig | 2 +
> drivers/thermal/intel/intel_powerclamp.c | 512 ++++++++++--------
> include/linux/idle_inject.h | 3 +
> 5 files changed, 367 insertions(+), 229 deletions(-)
>
> --
> 2.39.1
>
First, I would say "Add two module parameters" in the subject.
On Wed, Feb 1, 2023 at 7:35 PM Srinivas Pandruvada
<[email protected]> wrote:
>
> In some use cases, it is desirable to only inject idle on certain set
> of CPUs. For example on Alder Lake systems, it is possible that we force
> idle only on P-Cores for thermal reasons. Also the idle percent can be
> more than 50% if we only choose partial set of CPUs in the system.
>
> Introduce module parameters for setting cpumask and max_idle.
"Introduce 2 new module parameters for this purpose."
> They can be only changed when the cooling device is inactive. This module
> already have other module parameters. There is no change done for
> those parameters.
s/have/has/
"... other parameters that are not affected by this change."
> cpumask (Read/Write): A bit mask of CPUs to inject idle. The format of
> this bitmask is same as used in other subsystems like in
> /proc/irq/*/smp_affinity. The mask is comma separated 32 bit groups.
> Each CPU is one bit. For example for 256 CPU system the full mask is:
> ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> The leftmost mask is for CPU 0-32.
>
> max_idle (Read/Write): Maximum injected idle time to the total CPU time
> ratio in percent range from 1 to 100. Even if the cooling device max_state
> is always 100 (100%), this parameter allows to add a max idle percent
> limit. The default is 50, to match the current implementation of powerclamp
> driver. Also doesn't allow value more than 75, if the cpumask includes
> every CPU present in the system.
>
> Signed-off-by: Srinivas Pandruvada <[email protected]>
> ---
> v5
> New patch
>
> .../driver-api/thermal/intel_powerclamp.rst | 22 +++
> drivers/thermal/intel/intel_powerclamp.c | 169 ++++++++++++++++--
> 2 files changed, 173 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst b/Documentation/driver-api/thermal/intel_powerclamp.rst
> index 3f6dfb0b3ea6..d805e28b7a45 100644
> --- a/Documentation/driver-api/thermal/intel_powerclamp.rst
> +++ b/Documentation/driver-api/thermal/intel_powerclamp.rst
> @@ -26,6 +26,8 @@ By:
> - Generic Thermal Layer (sysfs)
> - Kernel APIs (TBD)
>
> + (*) Module Parameters
> +
> INTRODUCTION
> ============
>
> @@ -318,3 +320,23 @@ device, a PID based userspace thermal controller can manage to
> control CPU temperature effectively, when no other thermal influence
> is added. For example, a UltraBook user can compile the kernel under
> certain temperature (below most active trip points).
> +
> +Module Parameters
> +=================
> +
> +``cpumask`` (RW)
> + A bit mask of CPUs to inject idle. The format of the bitmask is same as
> + used in other subsystems like in /proc/irq/*/smp_affinity. The mask is
> + comma separated 32 bit groups. Each CPU is one bit. For example for a 256
> + CPU system the full mask is:
> + ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> +
> + The leftmost mask is for CPU 0-32.
> +
> +``max_idle`` (RW)
> + Maximum injected idle time to the total CPU time ratio in percent range
> + from 1 to 100. Even if the cooling device max_state is always 100 (100%),
> + this parameter allows to add a max idle percent limit. The default is 50,
> + to match the current implementation of powerclamp driver. Also doesn't
> + allow value more than 75, if the cpumask includes every CPU present in
> + the system.
I'm not sure if this is driver-api. It's admin-guide rather IMO.
> diff --git a/drivers/thermal/intel/intel_powerclamp.c b/drivers/thermal/intel/intel_powerclamp.c
> index 850195ebe5e0..68830b726da2 100644
> --- a/drivers/thermal/intel/intel_powerclamp.c
> +++ b/drivers/thermal/intel/intel_powerclamp.c
> @@ -37,7 +37,7 @@
> #include <asm/mwait.h>
> #include <asm/cpu_device_id.h>
>
> -#define MAX_TARGET_RATIO (50U)
> +#define MAX_TARGET_RATIO (100U)
> /* For each undisturbed clamping period (no extra wake ups during idle time),
> * we increment the confidence counter for the given target ratio.
> * CONFIDENCE_OK defines the level where runtime calibration results are
> @@ -109,6 +109,135 @@ static const struct kernel_param_ops duration_ops = {
> module_param_cb(duration, &duration_ops, &duration, 0644);
> MODULE_PARM_DESC(duration, "forced idle time for each attempt in msec.");
>
> +#define DEFAULT_MAX_IDLE 50
> +#define MAX_ALL_CPU_IDLE 75
> +
> +static u8 max_idle = DEFAULT_MAX_IDLE;
> +
> +static cpumask_var_t idle_injection_cpu_mask;
> +
> +static int allocate_idle_injection_mask(void)
> +{
> + /* This mask is allocated only one time and freed during module exit */
> + if (!idle_injection_cpu_mask) {
I would do
if (idle_injection_cpu_mask)
return 0;
here and rearrange the rest of the function accordingly.
> + if (!zalloc_cpumask_var(&idle_injection_cpu_mask, GFP_KERNEL))
> + return -ENOMEM;
> +
> + cpumask_copy(idle_injection_cpu_mask, cpu_present_mask);
> + }
> +
> + return 0;
> +}
> +
> +static int cpumask_set(const char *arg, const struct kernel_param *kp)
> +{
> + int ret;
> +
> + mutex_lock(&powerclamp_lock);
> +
> + /* Can't set mask when cooling device is in use */
> + if (powerclamp_data.clamping) {
> + ret = -EAGAIN;
> + goto skip_cpumask_set;
> + }
> +
> + /*
> + * When module parameters are passed from kernel command line
> + * during insmod, the module parameter callback is called
> + * before powerclamp_init(), so we can't assume that some
> + * cpumask can be allocated before here.
> + */
> + ret = allocate_idle_injection_mask();
Could it be allocated by powerclamp_init(), though? It is not useful
before that function runs anyway.
> + if (ret)
> + goto skip_cpumask_set;
> +
> + ret = bitmap_parse(arg, strlen(arg), cpumask_bits(idle_injection_cpu_mask),
So this would replace the existing idle_injection_cpu_mask even if it
is going to fail later.
> + nr_cpumask_bits);
> + if (ret)
> + goto skip_cpumask_set;
> +
> + if (cpumask_empty(idle_injection_cpu_mask)) {
> + ret = -EINVAL;
> + goto skip_cpumask_set;
> + }
> +
> + if (cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
Should this check be against cpu_online_mask instead? Arguably
offline CPUs don't matter here.
> + max_idle > MAX_ALL_CPU_IDLE) {
> + ret = -EINVAL;
> + goto skip_cpumask_set;
> + }
> +
> + mutex_unlock(&powerclamp_lock);
> +
> + return 0;
> +
> +skip_cpumask_set:
> + mutex_unlock(&powerclamp_lock);
> +
> + return ret;
> +}
> +
> +static int cpumask_get(char *buf, const struct kernel_param *kp)
> +{
> + if (!idle_injection_cpu_mask)
> + return -EINVAL;
I would return -ENODEV here.
> +
> + return bitmap_print_to_pagebuf(false, buf, cpumask_bits(idle_injection_cpu_mask),
> + nr_cpumask_bits);
> +}
> +
> +static const struct kernel_param_ops cpumask_ops = {
> + .set = cpumask_set,
> + .get = cpumask_get,
> +};
> +
> +module_param_cb(cpumask, &cpumask_ops, NULL, 0644);
> +MODULE_PARM_DESC(cpumask, "Mask of CPUs to use for idle injection.");
> +
> +static int max_idle_set(const char *arg, const struct kernel_param *kp)
> +{
> + u8 _max_idle;
new_max_idle would be slightly better I think.
> + int ret = 0;
> +
> + mutex_lock(&powerclamp_lock);
> +
> + /* Can't set mask when cooling device is in use */
> + if (powerclamp_data.clamping) {
> + ret = -EAGAIN;
> + goto skip_limit_set;
> + }
> +
> + ret = kstrtou8(arg, 10, &_max_idle);
> + if (ret)
> + goto skip_limit_set;
> +
> + if (_max_idle > MAX_TARGET_RATIO) {
> + ret = -EINVAL;
> + goto skip_limit_set;
> + }
> +
> + if (idle_injection_cpu_mask && cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
> + _max_idle > MAX_ALL_CPU_IDLE) {
The same check is done here and in cpumask_set(). Could it be done in
a separate function called from here and from there?
> + ret = -EINVAL;
> + goto skip_limit_set;
> + }
> +
> + max_idle = _max_idle;
> +
> +skip_limit_set:
> + mutex_unlock(&powerclamp_lock);
> +
> + return ret;
> +}
> +
> +static const struct kernel_param_ops max_idle_ops = {
> + .set = max_idle_set,
> + .get = param_get_int,
> +};
> +
> +module_param_cb(max_idle, &max_idle_ops, &max_idle, 0644);
> +MODULE_PARM_DESC(max_idle, "maximum injected idle time to the total CPU time ratio in percent range:1-100");
> +
> struct powerclamp_calibration_data {
> unsigned long confidence; /* used for calibration, basically a counter
> * gets incremented each time a clamping
> @@ -342,6 +471,10 @@ static unsigned int get_run_time(void)
> unsigned int compensated_ratio;
> unsigned int runtime;
>
> + /* No compensation for non systemwide idle injection */
> + if (max_idle > MAX_ALL_CPU_IDLE)
> + return (duration * 100 / powerclamp_data.target_ratio - duration);
> +
> /*
> * make sure user selected ratio does not take effect until
> * the next round. adjust target_ratio if user has changed
> @@ -460,21 +593,11 @@ static void trigger_idle_injection(void)
> */
> static int powerclamp_idle_injection_register(void)
> {
> - /*
> - * The idle inject core will only inject for online CPUs,
> - * So we can register for all present CPUs. In this way
> - * if some CPU goes online/offline while idle inject
> - * is registered, nothing additional calls are required.
> - * The same runtime and idle time is applicable for
> - * newly onlined CPUs if any.
> - *
> - * Here cpu_present_mask can be used as is.
> - * cast to (struct cpumask *) is required as the
> - * cpu_present_mask is const struct cpumask *, otherwise
> - * there will be compiler warnings.
> - */
> - ii_dev = idle_inject_register_full((struct cpumask *)cpu_present_mask,
> - idle_inject_update);
> + if (cpumask_equal(cpu_present_mask, idle_injection_cpu_mask))
> + ii_dev = idle_inject_register_full(idle_injection_cpu_mask, idle_inject_update);
> + else
> + ii_dev = idle_inject_register(idle_injection_cpu_mask);
> +
> if (!ii_dev) {
> pr_err("powerclamp: idle_inject_register failed\n");
> return -EAGAIN;
> @@ -510,7 +633,7 @@ static int start_power_clamp(void)
> ret = powerclamp_idle_injection_register();
> if (!ret) {
> trigger_idle_injection();
> - if (poll_pkg_cstate_enable)
> + if (poll_pkg_cstate_enable && max_idle < MAX_ALL_CPU_IDLE)
Why is the additional check needed here?
> schedule_delayed_work(&poll_pkg_cstate_work, 0);
> }
>
> @@ -565,7 +688,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
> mutex_lock(&powerclamp_lock);
>
> new_target_ratio = clamp(new_target_ratio, 0UL,
> - (unsigned long) (MAX_TARGET_RATIO - 1));
> + (unsigned long) (max_idle - 1));
> if (!powerclamp_data.target_ratio && new_target_ratio > 0) {
> pr_info("Start idle injection to reduce power\n");
> powerclamp_data.target_ratio = new_target_ratio;
> @@ -656,6 +779,13 @@ static int __init powerclamp_init(void)
>
> /* probe cpu features and ids here */
> retval = powerclamp_probe();
> + if (retval)
> + return retval;
> +
> + mutex_lock(&powerclamp_lock);
> + retval = allocate_idle_injection_mask();
> + mutex_unlock(&powerclamp_lock);
> +
> if (retval)
> return retval;
>
> @@ -689,6 +819,9 @@ static void __exit powerclamp_exit(void)
>
> cancel_delayed_work_sync(&poll_pkg_cstate_work);
> debugfs_remove_recursive(debug_dir);
> +
> + if (idle_injection_cpu_mask)
> + free_cpumask_var(idle_injection_cpu_mask);
> }
> module_exit(powerclamp_exit);
>
> --
On Thu, 2023-02-02 at 17:41 +0100, Rafael J. Wysocki wrote:
> First, I would say "Add two module parameters" in the subject.
>
> On Wed, Feb 1, 2023 at 7:35 PM Srinivas Pandruvada
> <[email protected]> wrote:
> >
> > In some use cases, it is desirable to only inject idle on certain
> > set
> > of CPUs. For example on Alder Lake systems, it is possible that we
> > force
> > idle only on P-Cores for thermal reasons. Also the idle percent can
> > be
> > more than 50% if we only choose partial set of CPUs in the system.
> >
> > Introduce module parameters for setting cpumask and max_idle.
>
> "Introduce 2 new module parameters for this purpose."
>
> > They can be only changed when the cooling device is inactive. This
> > module
> > already have other module parameters. There is no change done for
> > those parameters.
>
> s/have/has/
OK
>
> "... other parameters that are not affected by this change."
>
> > cpumask (Read/Write): A bit mask of CPUs to inject idle. The format
> > of
> > this bitmask is same as used in other subsystems like in
> > /proc/irq/*/smp_affinity. The mask is comma separated 32 bit
> > groups.
> > Each CPU is one bit. For example for 256 CPU system the full mask
> > is:
> > ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffff
> > ffff
> > The leftmost mask is for CPU 0-32.
> >
> > max_idle (Read/Write): Maximum injected idle time to the total CPU
> > time
> > ratio in percent range from 1 to 100. Even if the cooling device
> > max_state
> > is always 100 (100%), this parameter allows to add a max idle
> > percent
> > limit. The default is 50, to match the current implementation of
> > powerclamp
> > driver. Also doesn't allow value more than 75, if the cpumask
> > includes
> > every CPU present in the system.
> >
> > Signed-off-by: Srinivas Pandruvada
> > <[email protected]>
> > ---
> > v5
> > New patch
> >
> > .../driver-api/thermal/intel_powerclamp.rst | 22 +++
> > drivers/thermal/intel/intel_powerclamp.c | 169
> > ++++++++++++++++--
> > 2 files changed, 173 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst
> > b/Documentation/driver-api/thermal/intel_powerclamp.rst
> > index 3f6dfb0b3ea6..d805e28b7a45 100644
> > --- a/Documentation/driver-api/thermal/intel_powerclamp.rst
> > +++ b/Documentation/driver-api/thermal/intel_powerclamp.rst
> > @@ -26,6 +26,8 @@ By:
> > - Generic Thermal Layer (sysfs)
> > - Kernel APIs (TBD)
> >
> > + (*) Module Parameters
> > +
> > INTRODUCTION
> > ============
> >
> > @@ -318,3 +320,23 @@ device, a PID based userspace thermal
> > controller can manage to
> > control CPU temperature effectively, when no other thermal
> > influence
> > is added. For example, a UltraBook user can compile the kernel
> > under
> > certain temperature (below most active trip points).
> > +
> > +Module Parameters
> > +=================
> > +
> > +``cpumask`` (RW)
> > + A bit mask of CPUs to inject idle. The format of the
> > bitmask is same as
> > + used in other subsystems like in /proc/irq/*/smp_affinity.
> > The mask is
> > + comma separated 32 bit groups. Each CPU is one bit. For
> > example for a 256
> > + CPU system the full mask is:
> > +
> > ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffff
> > ffff
> > +
> > + The leftmost mask is for CPU 0-32.
> > +
> > +``max_idle`` (RW)
> > + Maximum injected idle time to the total CPU time ratio in
> > percent range
> > + from 1 to 100. Even if the cooling device max_state is
> > always 100 (100%),
> > + this parameter allows to add a max idle percent limit. The
> > default is 50,
> > + to match the current implementation of powerclamp driver.
> > Also doesn't
> > + allow value more than 75, if the cpumask includes every CPU
> > present in
> > + the system.
>
> I'm not sure if this is driver-api. It's admin-guide rather IMO.
>
This whole file can be moved to admin_guide as there is no API.
> > diff --git a/drivers/thermal/intel/intel_powerclamp.c
> > b/drivers/thermal/intel/intel_powerclamp.c
> > index 850195ebe5e0..68830b726da2 100644
> > --- a/drivers/thermal/intel/intel_powerclamp.c
> > +++ b/drivers/thermal/intel/intel_powerclamp.c
> > @@ -37,7 +37,7 @@
> > #include <asm/mwait.h>
> > #include <asm/cpu_device_id.h>
> >
> > -#define MAX_TARGET_RATIO (50U)
> > +#define MAX_TARGET_RATIO (100U)
> > /* For each undisturbed clamping period (no extra wake ups during
> > idle time),
> > * we increment the confidence counter for the given target ratio.
> > * CONFIDENCE_OK defines the level where runtime calibration
> > results are
> > @@ -109,6 +109,135 @@ static const struct kernel_param_ops
> > duration_ops = {
> > module_param_cb(duration, &duration_ops, &duration, 0644);
> > MODULE_PARM_DESC(duration, "forced idle time for each attempt in
> > msec.");
> >
> > +#define DEFAULT_MAX_IDLE 50
> > +#define MAX_ALL_CPU_IDLE 75
> > +
> > +static u8 max_idle = DEFAULT_MAX_IDLE;
> > +
> > +static cpumask_var_t idle_injection_cpu_mask;
> > +
> > +static int allocate_idle_injection_mask(void)
> > +{
> > + /* This mask is allocated only one time and freed during
> > module exit */
> > + if (!idle_injection_cpu_mask) {
>
> I would do
>
> if (idle_injection_cpu_mask)
> return 0;
OK
>
> here and rearrange the rest of the function accordingly.
>
> > + if (!zalloc_cpumask_var(&idle_injection_cpu_mask,
> > GFP_KERNEL))
> > + return -ENOMEM;
> > +
> > + cpumask_copy(idle_injection_cpu_mask,
> > cpu_present_mask);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int cpumask_set(const char *arg, const struct kernel_param
> > *kp)
> > +{
> > + int ret;
> > +
> > + mutex_lock(&powerclamp_lock);
> > +
> > + /* Can't set mask when cooling device is in use */
> > + if (powerclamp_data.clamping) {
> > + ret = -EAGAIN;
> > + goto skip_cpumask_set;
> > + }
> > +
> > + /*
> > + * When module parameters are passed from kernel command
> > line
> > + * during insmod, the module parameter callback is called
> > + * before powerclamp_init(), so we can't assume that some
> > + * cpumask can be allocated before here.
> > + */
> > + ret = allocate_idle_injection_mask();
>
> Could it be allocated by powerclamp_init(), though? It is not useful
> before that function runs anyway.
>
It is required when passed as module param. For example:
#insmod drivers/thermal/intel/intel_powerclamp.ko cpumask=f
If you return error, module will not be loaded at all.
insmod: ERROR: could not insert
module drivers/thermal/intel/intel_powerclamp.ko: Invalid parameters
You have to return success then module parameters will be silently
ignored.
> > + if (ret)
> > + goto skip_cpumask_set;
> > +
> > + ret = bitmap_parse(arg, strlen(arg),
> > cpumask_bits(idle_injection_cpu_mask),
>
> So this would replace the existing idle_injection_cpu_mask even if it
> is going to fail later.
Yes, when user sets 0s in mask. Otherwise I have to keep a separate
cpumask and then copy. The size of cpumask type is 1024 bytes.
I will fix this.
>
> > + nr_cpumask_bits);
> > + if (ret)
> > + goto skip_cpumask_set;
> > +
> > + if (cpumask_empty(idle_injection_cpu_mask)) {
> > + ret = -EINVAL;
> > + goto skip_cpumask_set;
> > + }
> > +
> > + if (cpumask_equal(cpu_present_mask,
> > idle_injection_cpu_mask) &&
>
> Should this check be against cpu_online_mask instead? Arguably
> offline CPUs don't matter here.
It is present mask which takes care of case when you started idle
injection on CPUS when some CPUs were offline and then online while the
session is going on. The idle-inject core takes care of online during
session. Othewise online CPUs don't get idle injection and will run at
100%.
>
> > + max_idle > MAX_ALL_CPU_IDLE) {
> > + ret = -EINVAL;
> > + goto skip_cpumask_set;
> > + }
> > +
> > + mutex_unlock(&powerclamp_lock);
> > +
> > + return 0;
> > +
> > +skip_cpumask_set:
> > + mutex_unlock(&powerclamp_lock);
> > +
> > + return ret;
> > +}
> > +
> > +static int cpumask_get(char *buf, const struct kernel_param *kp)
> > +{
> > + if (!idle_injection_cpu_mask)
> > + return -EINVAL;
>
> I would return -ENODEV here.
OK
>
> > +
> > + return bitmap_print_to_pagebuf(false, buf,
> > cpumask_bits(idle_injection_cpu_mask),
> > + nr_cpumask_bits);
> > +}
> > +
> > +static const struct kernel_param_ops cpumask_ops = {
> > + .set = cpumask_set,
> > + .get = cpumask_get,
> > +};
> > +
> > +module_param_cb(cpumask, &cpumask_ops, NULL, 0644);
> > +MODULE_PARM_DESC(cpumask, "Mask of CPUs to use for idle
> > injection.");
> > +
> > +static int max_idle_set(const char *arg, const struct kernel_param
> > *kp)
> > +{
> > + u8 _max_idle;
>
> new_max_idle would be slightly better I think.
OK
>
> > + int ret = 0;
> > +
> > + mutex_lock(&powerclamp_lock);
> > +
> > + /* Can't set mask when cooling device is in use */
> > + if (powerclamp_data.clamping) {
> > + ret = -EAGAIN;
> > + goto skip_limit_set;
> > + }
> > +
> > + ret = kstrtou8(arg, 10, &_max_idle);
> > + if (ret)
> > + goto skip_limit_set;
> > +
> > + if (_max_idle > MAX_TARGET_RATIO) {
> > + ret = -EINVAL;
> > + goto skip_limit_set;
> > + }
> > +
> > + if (idle_injection_cpu_mask &&
> > cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
> > + _max_idle > MAX_ALL_CPU_IDLE) {
>
> The same check is done here and in cpumask_set(). Could it be done
> in
> a separate function called from here and from there?
OK.
>
> > + ret = -EINVAL;
> > + goto skip_limit_set;
> > + }
> > +
> > + max_idle = _max_idle;
> > +
> > +skip_limit_set:
> > + mutex_unlock(&powerclamp_lock);
> > +
> > + return ret;
> > +}
> > +
> > +static const struct kernel_param_ops max_idle_ops = {
> > + .set = max_idle_set,
> > + .get = param_get_int,
> > +};
> > +
> > +module_param_cb(max_idle, &max_idle_ops, &max_idle, 0644);
> > +MODULE_PARM_DESC(max_idle, "maximum injected idle time to the
> > total CPU time ratio in percent range:1-100");
> > +
> > struct powerclamp_calibration_data {
> > unsigned long confidence; /* used for calibration,
> > basically a counter
> > * gets incremented each time a
> > clamping
> > @@ -342,6 +471,10 @@ static unsigned int get_run_time(void)
> > unsigned int compensated_ratio;
> > unsigned int runtime;
> >
> > + /* No compensation for non systemwide idle injection */
> > + if (max_idle > MAX_ALL_CPU_IDLE)
> > + return (duration * 100 /
> > powerclamp_data.target_ratio - duration);
> > +
> > /*
> > * make sure user selected ratio does not take effect until
> > * the next round. adjust target_ratio if user has changed
> > @@ -460,21 +593,11 @@ static void trigger_idle_injection(void)
> > */
> > static int powerclamp_idle_injection_register(void)
> > {
> > - /*
> > - * The idle inject core will only inject for online CPUs,
> > - * So we can register for all present CPUs. In this way
> > - * if some CPU goes online/offline while idle inject
> > - * is registered, nothing additional calls are required.
> > - * The same runtime and idle time is applicable for
> > - * newly onlined CPUs if any.
> > - *
> > - * Here cpu_present_mask can be used as is.
> > - * cast to (struct cpumask *) is required as the
> > - * cpu_present_mask is const struct cpumask *, otherwise
> > - * there will be compiler warnings.
> > - */
> > - ii_dev = idle_inject_register_full((struct cpumask
> > *)cpu_present_mask,
> > - idle_inject_update);
> > + if (cpumask_equal(cpu_present_mask,
> > idle_injection_cpu_mask))
> > + ii_dev =
> > idle_inject_register_full(idle_injection_cpu_mask,
> > idle_inject_update);
> > + else
> > + ii_dev =
> > idle_inject_register(idle_injection_cpu_mask);
> > +
> > if (!ii_dev) {
> > pr_err("powerclamp: idle_inject_register
> > failed\n");
> > return -EAGAIN;
> > @@ -510,7 +633,7 @@ static int start_power_clamp(void)
> > ret = powerclamp_idle_injection_register();
> > if (!ret) {
> > trigger_idle_injection();
> > - if (poll_pkg_cstate_enable)
> > + if (poll_pkg_cstate_enable && max_idle <
> > MAX_ALL_CPU_IDLE)
>
> Why is the additional check needed here?
poll_pkg_cstate_enable is true for single package system, the
additional check prevents starting package c-state polling thread.
But poll_pkg_cstate_enable can be set to false to avoid this check.
Thanks,
Srinivas
>
> >
> > schedule_delayed_work(&poll_pkg_cstate_work, 0);
> > }
> >
> > @@ -565,7 +688,7 @@ static int powerclamp_set_cur_state(struct
> > thermal_cooling_device *cdev,
> > mutex_lock(&powerclamp_lock);
> >
> > new_target_ratio = clamp(new_target_ratio, 0UL,
> > - (unsigned long) (MAX_TARGET_RATIO -
> > 1));
> > + (unsigned long) (max_idle - 1));
> > if (!powerclamp_data.target_ratio && new_target_ratio > 0)
> > {
> > pr_info("Start idle injection to reduce power\n");
> > powerclamp_data.target_ratio = new_target_ratio;
> > @@ -656,6 +779,13 @@ static int __init powerclamp_init(void)
> >
> > /* probe cpu features and ids here */
> > retval = powerclamp_probe();
> > + if (retval)
> > + return retval;
> > +
> > + mutex_lock(&powerclamp_lock);
> > + retval = allocate_idle_injection_mask();
> > + mutex_unlock(&powerclamp_lock);
> > +
> > if (retval)
> > return retval;
> >
> > @@ -689,6 +819,9 @@ static void __exit powerclamp_exit(void)
> >
> > cancel_delayed_work_sync(&poll_pkg_cstate_work);
> > debugfs_remove_recursive(debug_dir);
> > +
> > + if (idle_injection_cpu_mask)
> > + free_cpumask_var(idle_injection_cpu_mask);
> > }
> > module_exit(powerclamp_exit);
> >
> > --
Hi, Srinivas,
On Wed, 2023-02-01 at 10:28 -0800, Srinivas Pandruvada wrote:
> In some use cases, it is desirable to only inject idle on certain set
> of CPUs. For example on Alder Lake systems, it is possible that we
> force
> idle only on P-Cores for thermal reasons. Also the idle percent can
> be
> more than 50% if we only choose partial set of CPUs in the system.
>
> Introduce module parameters for setting cpumask and max_idle. They
> can be only changed when the cooling device is inactive. This module
> already have other module parameters. There is no change done for
> those parameters.
>
> cpumask (Read/Write): A bit mask of CPUs to inject idle. The format
> of
> this bitmask is same as used in other subsystems like in
> /proc/irq/*/smp_affinity. The mask is comma separated 32 bit groups.
> Each CPU is one bit. For example for 256 CPU system the full mask is:
> ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffff
> ff
> The leftmost mask is for CPU 0-32.
>
> max_idle (Read/Write): Maximum injected idle time to the total CPU
> time
> ratio in percent range from 1 to 100. Even if the cooling device
> max_state
> is always 100 (100%), this parameter allows to add a max idle percent
> limit. The default is 50, to match the current implementation of
> powerclamp
> driver. Also doesn't allow value more than 75, if the cpumask
> includes
> every CPU present in the system.
>
> Signed-off-by: Srinivas Pandruvada <
> [email protected]>
> ---
> v5
> New patch
>
> .../driver-api/thermal/intel_powerclamp.rst | 22 +++
> drivers/thermal/intel/intel_powerclamp.c | 169
> ++++++++++++++++--
> 2 files changed, 173 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst
> b/Documentation/driver-api/thermal/intel_powerclamp.rst
> index 3f6dfb0b3ea6..d805e28b7a45 100644
> --- a/Documentation/driver-api/thermal/intel_powerclamp.rst
> +++ b/Documentation/driver-api/thermal/intel_powerclamp.rst
> @@ -26,6 +26,8 @@ By:
> - Generic Thermal Layer (sysfs)
> - Kernel APIs (TBD)
>
> + (*) Module Parameters
> +
> INTRODUCTION
> ============
>
> @@ -318,3 +320,23 @@ device, a PID based userspace thermal controller
> can manage to
> control CPU temperature effectively, when no other thermal influence
> is added. For example, a UltraBook user can compile the kernel under
> certain temperature (below most active trip points).
> +
> +Module Parameters
> +=================
> +
> +``cpumask`` (RW)
> + A bit mask of CPUs to inject idle. The format of the bitmask is
> same as
> + used in other subsystems like in /proc/irq/*/smp_affinity. The
> mask is
> + comma separated 32 bit groups. Each CPU is one bit. For example
> for a 256
> + CPU system the full mask is:
> + ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,
> ffffffff
> +
> + The leftmost mask is for CPU 0-32.
> +
> +``max_idle`` (RW)
> + Maximum injected idle time to the total CPU time ratio in
> percent range
> + from 1 to 100. Even if the cooling device max_state is always
> 100 (100%),
> + this parameter allows to add a max idle percent limit. The
> default is 50,
> + to match the current implementation of powerclamp driver. Also
> doesn't
> + allow value more than 75, if the cpumask includes every CPU
> present in
> + the system.
> diff --git a/drivers/thermal/intel/intel_powerclamp.c
> b/drivers/thermal/intel/intel_powerclamp.c
> index 850195ebe5e0..68830b726da2 100644
> --- a/drivers/thermal/intel/intel_powerclamp.c
> +++ b/drivers/thermal/intel/intel_powerclamp.c
> @@ -37,7 +37,7 @@
> #include <asm/mwait.h>
> #include <asm/cpu_device_id.h>
>
> -#define MAX_TARGET_RATIO (50U)
> +#define MAX_TARGET_RATIO (100U)
> /* For each undisturbed clamping period (no extra wake ups during
> idle time),
> * we increment the confidence counter for the given target ratio.
> * CONFIDENCE_OK defines the level where runtime calibration results
> are
> @@ -109,6 +109,135 @@ static const struct kernel_param_ops
> duration_ops = {
> module_param_cb(duration, &duration_ops, &duration, 0644);
> MODULE_PARM_DESC(duration, "forced idle time for each attempt in
> msec.");
>
> +#define DEFAULT_MAX_IDLE 50
> +#define MAX_ALL_CPU_IDLE 75
> +
> +static u8 max_idle = DEFAULT_MAX_IDLE;
> +
> +static cpumask_var_t idle_injection_cpu_mask;
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask *cpumask_var_t;
#else
typedef struct cpumask cpumask_var_t[1];
fi
I happened to build with CONFIG_CPUMASK_OFFSTACK cleared, and got
following errors
$ make
CALL scripts/checksyscalls.sh
DESCEND objtool
AR drivers/thermal/intel/built-in.a
CC [M] drivers/thermal/intel/intel_powerclamp.o
drivers/thermal/intel/intel_powerclamp.c: In function ‘allocate_idle_injection_mask’:
drivers/thermal/intel/intel_powerclamp.c:122:6: error: the address of ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-Werror=address]
122 | if (!idle_injection_cpu_mask) {
| ^
drivers/thermal/intel/intel_powerclamp.c: In function ‘cpumask_get’:
drivers/thermal/intel/intel_powerclamp.c:183:6: error: the address of ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-Werror=address]
183 | if (!idle_injection_cpu_mask)
| ^
drivers/thermal/intel/intel_powerclamp.c: In function ‘max_idle_set’:
drivers/thermal/intel/intel_powerclamp.c:220:6: error: the address of ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-Werror=address]
220 | if (idle_injection_cpu_mask && cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
| ^~~~~~~~~~~~~~~~~~~~~~~
drivers/thermal/intel/intel_powerclamp.c: In function ‘powerclamp_exit’:
drivers/thermal/intel/intel_powerclamp.c:825:6: error: the address of ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-Werror=address]
825 | if (idle_injection_cpu_mask)
| ^~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[4]: *** [scripts/Makefile.build:252: drivers/thermal/intel/intel_powerclamp.o] Error 1
make[3]: *** [scripts/Makefile.build:504: drivers/thermal/intel] Error 2
make[2]: *** [scripts/Makefile.build:504: drivers/thermal] Error 2
make[1]: *** [scripts/Makefile.build:504: drivers] Error 2
make: *** [Makefile:2021: .] Error 2
> +
> +static int allocate_idle_injection_mask(void)
> +{
> + /* This mask is allocated only one time and freed during module
> exit */
> + if (!idle_injection_cpu_mask) {
> + if (!zalloc_cpumask_var(&idle_injection_cpu_mask,
> GFP_KERNEL))
> + return -ENOMEM;
> +
> + cpumask_copy(idle_injection_cpu_mask,
> cpu_present_mask);
> + }
> +
> + return 0;
> +}
> +
> +static int cpumask_set(const char *arg, const struct kernel_param
> *kp)
> +{
> + int ret;
> +
> + mutex_lock(&powerclamp_lock);
> +
> + /* Can't set mask when cooling device is in use */
> + if (powerclamp_data.clamping) {
> + ret = -EAGAIN;
> + goto skip_cpumask_set;
> + }
> +
> + /*
> + * When module parameters are passed from kernel command line
> + * during insmod, the module parameter callback is called
> + * before powerclamp_init(), so we can't assume that some
> + * cpumask can be allocated before here.
> + */
> + ret = allocate_idle_injection_mask();
> + if (ret)
> + goto skip_cpumask_set;
> +
Just a suggestion, can we have a quick path for restoring to all cpu
mode?
Say, echo > /sys/module/intel_powerclamp/parameters/cpumask
> + ret = bitmap_parse(arg, strlen(arg),
> cpumask_bits(idle_injection_cpu_mask),
> + nr_cpumask_bits);
> + if (ret)
> + goto skip_cpumask_set;
> +
> + if (cpumask_empty(idle_injection_cpu_mask)) {
> + ret = -EINVAL;
> + goto skip_cpumask_set;
> + }
> +
> + if (cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
> + max_idle > MAX_ALL_CPU_IDLE) {
> + ret = -EINVAL;
> + goto skip_cpumask_set;
> + }
> +
> + mutex_unlock(&powerclamp_lock);
> +
> + return 0;
> +
> +skip_cpumask_set:
> + mutex_unlock(&powerclamp_lock);
> +
> + return ret;
> +}
> +
> +static int cpumask_get(char *buf, const struct kernel_param *kp)
> +{
> + if (!idle_injection_cpu_mask)
> + return -EINVAL;
> +
> + return bitmap_print_to_pagebuf(false, buf,
> cpumask_bits(idle_injection_cpu_mask),
> + nr_cpumask_bits);
> +}
> +
> +static const struct kernel_param_ops cpumask_ops = {
> + .set = cpumask_set,
> + .get = cpumask_get,
> +};
> +
> +module_param_cb(cpumask, &cpumask_ops, NULL, 0644);
> +MODULE_PARM_DESC(cpumask, "Mask of CPUs to use for idle
> injection.");
> +
> +static int max_idle_set(const char *arg, const struct kernel_param
> *kp)
> +{
> + u8 _max_idle;
> + int ret = 0;
> +
> + mutex_lock(&powerclamp_lock);
> +
> + /* Can't set mask when cooling device is in use */
> + if (powerclamp_data.clamping) {
> + ret = -EAGAIN;
> + goto skip_limit_set;
> + }
> +
> + ret = kstrtou8(arg, 10, &_max_idle);
> + if (ret)
> + goto skip_limit_set;
> +
> + if (_max_idle > MAX_TARGET_RATIO) {
> + ret = -EINVAL;
> + goto skip_limit_set;
> + }
> +
> + if (idle_injection_cpu_mask && cpumask_equal(cpu_present_mask,
> idle_injection_cpu_mask) &&
> + _max_idle > MAX_ALL_CPU_IDLE) {
> + ret = -EINVAL;
> + goto skip_limit_set;
> + }
> +
> + max_idle = _max_idle;
> +
> +skip_limit_set:
> + mutex_unlock(&powerclamp_lock);
> +
> + return ret;
> +}
> +
> +static const struct kernel_param_ops max_idle_ops = {
> + .set = max_idle_set,
> + .get = param_get_int,
> +};
> +
> +module_param_cb(max_idle, &max_idle_ops, &max_idle, 0644);
> +MODULE_PARM_DESC(max_idle, "maximum injected idle time to the total
> CPU time ratio in percent range:1-100");
> +
> struct powerclamp_calibration_data {
> unsigned long confidence; /* used for calibration, basically a
> counter
> * gets incremented each time a
> clamping
> @@ -342,6 +471,10 @@ static unsigned int get_run_time(void)
> unsigned int compensated_ratio;
> unsigned int runtime;
>
> + /* No compensation for non systemwide idle injection */
> + if (max_idle > MAX_ALL_CPU_IDLE)
> + return (duration * 100 / powerclamp_data.target_ratio -
> duration);
The comment and the code is not aligned.
we can still set max_idle to a value lower than MAX_ALL_CPU_IDLE for
non systemwide idle injection, right?
thanks,
rui
> +
> /*
> * make sure user selected ratio does not take effect until
> * the next round. adjust target_ratio if user has changed
> @@ -460,21 +593,11 @@ static void trigger_idle_injection(void)
> */
> static int powerclamp_idle_injection_register(void)
> {
> - /*
> - * The idle inject core will only inject for online CPUs,
> - * So we can register for all present CPUs. In this way
> - * if some CPU goes online/offline while idle inject
> - * is registered, nothing additional calls are required.
> - * The same runtime and idle time is applicable for
> - * newly onlined CPUs if any.
> - *
> - * Here cpu_present_mask can be used as is.
> - * cast to (struct cpumask *) is required as the
> - * cpu_present_mask is const struct cpumask *, otherwise
> - * there will be compiler warnings.
> - */
> - ii_dev = idle_inject_register_full((struct cpumask
> *)cpu_present_mask,
> - idle_inject_update);
> + if (cpumask_equal(cpu_present_mask, idle_injection_cpu_mask))
> + ii_dev =
> idle_inject_register_full(idle_injection_cpu_mask,
> idle_inject_update);
> + else
> + ii_dev = idle_inject_register(idle_injection_cpu_mask);
> +
> if (!ii_dev) {
> pr_err("powerclamp: idle_inject_register failed\n");
> return -EAGAIN;
> @@ -510,7 +633,7 @@ static int start_power_clamp(void)
> ret = powerclamp_idle_injection_register();
> if (!ret) {
> trigger_idle_injection();
> - if (poll_pkg_cstate_enable)
> + if (poll_pkg_cstate_enable && max_idle <
> MAX_ALL_CPU_IDLE)
> schedule_delayed_work(&poll_pkg_cstate_work,
> 0);
> }
>
> @@ -565,7 +688,7 @@ static int powerclamp_set_cur_state(struct
> thermal_cooling_device *cdev,
> mutex_lock(&powerclamp_lock);
>
> new_target_ratio = clamp(new_target_ratio, 0UL,
> - (unsigned long) (MAX_TARGET_RATIO -
> 1));
> + (unsigned long) (max_idle - 1));
> if (!powerclamp_data.target_ratio && new_target_ratio > 0) {
> pr_info("Start idle injection to reduce power\n");
> powerclamp_data.target_ratio = new_target_ratio;
> @@ -656,6 +779,13 @@ static int __init powerclamp_init(void)
>
> /* probe cpu features and ids here */
> retval = powerclamp_probe();
> + if (retval)
> + return retval;
> +
> + mutex_lock(&powerclamp_lock);
> + retval = allocate_idle_injection_mask();
> + mutex_unlock(&powerclamp_lock);
> +
> if (retval)
> return retval;
>
> @@ -689,6 +819,9 @@ static void __exit powerclamp_exit(void)
>
> cancel_delayed_work_sync(&poll_pkg_cstate_work);
> debugfs_remove_recursive(debug_dir);
> +
> + if (idle_injection_cpu_mask)
> + free_cpumask_var(idle_injection_cpu_mask);
> }
> module_exit(powerclamp_exit);
>
Hi Rui,
>
[...]
> #ifdef CONFIG_CPUMASK_OFFSTACK
> typedef struct cpumask *cpumask_var_t;
> #else
> typedef struct cpumask cpumask_var_t[1];
> fi
>
> I happened to build with CONFIG_CPUMASK_OFFSTACK cleared, and got
> following errors
>
>
Try v2 for the module parameters
Thanks,
Srinivas
> $ make
> CALL scripts/checksyscalls.sh
> DESCEND objtool
> AR drivers/thermal/intel/built-in.a
> CC [M] drivers/thermal/intel/intel_powerclamp.o
> drivers/thermal/intel/intel_powerclamp.c: In function
> ‘allocate_idle_injection_mask’:
> drivers/thermal/intel/intel_powerclamp.c:122:6: error: the address of
> ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-
> Werror=address]
> 122 | if (!idle_injection_cpu_mask) {
> | ^
> drivers/thermal/intel/intel_powerclamp.c: In function ‘cpumask_get’:
> drivers/thermal/intel/intel_powerclamp.c:183:6: error: the address of
> ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-
> Werror=address]
> 183 | if (!idle_injection_cpu_mask)
> | ^
> drivers/thermal/intel/intel_powerclamp.c: In function ‘max_idle_set’:
> drivers/thermal/intel/intel_powerclamp.c:220:6: error: the address of
> ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-
> Werror=address]
> 220 | if (idle_injection_cpu_mask &&
> cpumask_equal(cpu_present_mask, idle_injection_cpu_mask) &&
> | ^~~~~~~~~~~~~~~~~~~~~~~
> drivers/thermal/intel/intel_powerclamp.c: In function
> ‘powerclamp_exit’:
> drivers/thermal/intel/intel_powerclamp.c:825:6: error: the address of
> ‘idle_injection_cpu_mask’ will always evaluate as ‘true’ [-
> Werror=address]
> 825 | if (idle_injection_cpu_mask)
> | ^~~~~~~~~~~~~~~~~~~~~~~
> cc1: all warnings being treated as errors
> make[4]: *** [scripts/Makefile.build:252:
> drivers/thermal/intel/intel_powerclamp.o] Error 1
> make[3]: *** [scripts/Makefile.build:504: drivers/thermal/intel]
> Error 2
> make[2]: *** [scripts/Makefile.build:504: drivers/thermal] Error 2
> make[1]: *** [scripts/Makefile.build:504: drivers] Error 2
> make: *** [Makefile:2021: .] Error 2
>
> > +
> > +static int allocate_idle_injection_mask(void)
> > +{
> > + /* This mask is allocated only one time and freed during
> > module
> > exit */
> > + if (!idle_injection_cpu_mask) {
> > + if (!zalloc_cpumask_var(&idle_injection_cpu_mask,
> > GFP_KERNEL))
> > + return -ENOMEM;
> > +
> > + cpumask_copy(idle_injection_cpu_mask,
> > cpu_present_mask);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int cpumask_set(const char *arg, const struct kernel_param
> > *kp)
> > +{
> > + int ret;
> > +
> > + mutex_lock(&powerclamp_lock);
> > +
> > + /* Can't set mask when cooling device is in use */
> > + if (powerclamp_data.clamping) {
> > + ret = -EAGAIN;
> > + goto skip_cpumask_set;
> > + }
> > +
> > + /*
> > + * When module parameters are passed from kernel command
> > line
> > + * during insmod, the module parameter callback is called
> > + * before powerclamp_init(), so we can't assume that some
> > + * cpumask can be allocated before here.
> > + */
> > + ret = allocate_idle_injection_mask();
> > + if (ret)
> > + goto skip_cpumask_set;
> > +
>
> Just a suggestion, can we have a quick path for restoring to all cpu
> mode?
>
> Say, echo > /sys/module/intel_powerclamp/parameters/cpumask
>
> > + ret = bitmap_parse(arg, strlen(arg),
> > cpumask_bits(idle_injection_cpu_mask),
> > + nr_cpumask_bits);
> > + if (ret)
> > + goto skip_cpumask_set;
> > +
> > + if (cpumask_empty(idle_injection_cpu_mask)) {
> > + ret = -EINVAL;
> > + goto skip_cpumask_set;
> > + }
> > +
> > + if (cpumask_equal(cpu_present_mask,
> > idle_injection_cpu_mask) &&
> > + max_idle > MAX_ALL_CPU_IDLE) {
> > + ret = -EINVAL;
> > + goto skip_cpumask_set;
> > + }
> > +
> > + mutex_unlock(&powerclamp_lock);
> > +
> > + return 0;
> > +
> > +skip_cpumask_set:
> > + mutex_unlock(&powerclamp_lock);
> > +
> > + return ret;
> > +}
> > +
> > +static int cpumask_get(char *buf, const struct kernel_param *kp)
> > +{
> > + if (!idle_injection_cpu_mask)
> > + return -EINVAL;
> > +
> > + return bitmap_print_to_pagebuf(false, buf,
> > cpumask_bits(idle_injection_cpu_mask),
> > + nr_cpumask_bits);
> > +}
> > +
> > +static const struct kernel_param_ops cpumask_ops = {
> > + .set = cpumask_set,
> > + .get = cpumask_get,
> > +};
> > +
> > +module_param_cb(cpumask, &cpumask_ops, NULL, 0644);
> > +MODULE_PARM_DESC(cpumask, "Mask of CPUs to use for idle
> > injection.");
> > +
> > +static int max_idle_set(const char *arg, const struct kernel_param
> > *kp)
> > +{
> > + u8 _max_idle;
> > + int ret = 0;
> > +
> > + mutex_lock(&powerclamp_lock);
> > +
> > + /* Can't set mask when cooling device is in use */
> > + if (powerclamp_data.clamping) {
> > + ret = -EAGAIN;
> > + goto skip_limit_set;
> > + }
> > +
> > + ret = kstrtou8(arg, 10, &_max_idle);
> > + if (ret)
> > + goto skip_limit_set;
> > +
> > + if (_max_idle > MAX_TARGET_RATIO) {
> > + ret = -EINVAL;
> > + goto skip_limit_set;
> > + }
> > +
> > + if (idle_injection_cpu_mask &&
> > cpumask_equal(cpu_present_mask,
> > idle_injection_cpu_mask) &&
> > + _max_idle > MAX_ALL_CPU_IDLE) {
> > + ret = -EINVAL;
> > + goto skip_limit_set;
> > + }
> > +
> > + max_idle = _max_idle;
> > +
> > +skip_limit_set:
> > + mutex_unlock(&powerclamp_lock);
> > +
> > + return ret;
> > +}
> > +
> > +static const struct kernel_param_ops max_idle_ops = {
> > + .set = max_idle_set,
> > + .get = param_get_int,
> > +};
> > +
> > +module_param_cb(max_idle, &max_idle_ops, &max_idle, 0644);
> > +MODULE_PARM_DESC(max_idle, "maximum injected idle time to the
> > total
> > CPU time ratio in percent range:1-100");
> > +
> > struct powerclamp_calibration_data {
> > unsigned long confidence; /* used for calibration,
> > basically a
> > counter
> > * gets incremented each time a
> > clamping
> > @@ -342,6 +471,10 @@ static unsigned int get_run_time(void)
> > unsigned int compensated_ratio;
> > unsigned int runtime;
> >
> > + /* No compensation for non systemwide idle injection */
> > + if (max_idle > MAX_ALL_CPU_IDLE)
> > + return (duration * 100 /
> > powerclamp_data.target_ratio -
> > duration);
>
> The comment and the code is not aligned.
> we can still set max_idle to a value lower than MAX_ALL_CPU_IDLE for
> non systemwide idle injection, right?
>
> thanks,
> rui
>
> > +
> > /*
> > * make sure user selected ratio does not take effect until
> > * the next round. adjust target_ratio if user has changed
> > @@ -460,21 +593,11 @@ static void trigger_idle_injection(void)
> > */
> > static int powerclamp_idle_injection_register(void)
> > {
> > - /*
> > - * The idle inject core will only inject for online CPUs,
> > - * So we can register for all present CPUs. In this way
> > - * if some CPU goes online/offline while idle inject
> > - * is registered, nothing additional calls are required.
> > - * The same runtime and idle time is applicable for
> > - * newly onlined CPUs if any.
> > - *
> > - * Here cpu_present_mask can be used as is.
> > - * cast to (struct cpumask *) is required as the
> > - * cpu_present_mask is const struct cpumask *, otherwise
> > - * there will be compiler warnings.
> > - */
> > - ii_dev = idle_inject_register_full((struct cpumask
> > *)cpu_present_mask,
> > - idle_inject_update);
> > + if (cpumask_equal(cpu_present_mask,
> > idle_injection_cpu_mask))
> > + ii_dev =
> > idle_inject_register_full(idle_injection_cpu_mask,
> > idle_inject_update);
> > + else
> > + ii_dev =
> > idle_inject_register(idle_injection_cpu_mask);
> > +
> > if (!ii_dev) {
> > pr_err("powerclamp: idle_inject_register
> > failed\n");
> > return -EAGAIN;
> > @@ -510,7 +633,7 @@ static int start_power_clamp(void)
> > ret = powerclamp_idle_injection_register();
> > if (!ret) {
> > trigger_idle_injection();
> > - if (poll_pkg_cstate_enable)
> > + if (poll_pkg_cstate_enable && max_idle <
> > MAX_ALL_CPU_IDLE)
> > schedule_delayed_work(&poll_pkg_cstate_work
> > ,
> > 0);
> > }
> >
> > @@ -565,7 +688,7 @@ static int powerclamp_set_cur_state(struct
> > thermal_cooling_device *cdev,
> > mutex_lock(&powerclamp_lock);
> >
> > new_target_ratio = clamp(new_target_ratio, 0UL,
> > - (unsigned long) (MAX_TARGET_RATIO -
> > 1));
> > + (unsigned long) (max_idle - 1));
> > if (!powerclamp_data.target_ratio && new_target_ratio > 0)
> > {
> > pr_info("Start idle injection to reduce power\n");
> > powerclamp_data.target_ratio = new_target_ratio;
> > @@ -656,6 +779,13 @@ static int __init powerclamp_init(void)
> >
> > /* probe cpu features and ids here */
> > retval = powerclamp_probe();
> > + if (retval)
> > + return retval;
> > +
> > + mutex_lock(&powerclamp_lock);
> > + retval = allocate_idle_injection_mask();
> > + mutex_unlock(&powerclamp_lock);
> > +
> > if (retval)
> > return retval;
> >
> > @@ -689,6 +819,9 @@ static void __exit powerclamp_exit(void)
> >
> > cancel_delayed_work_sync(&poll_pkg_cstate_work);
> > debugfs_remove_recursive(debug_dir);
> > +
> > + if (idle_injection_cpu_mask)
> > + free_cpumask_var(idle_injection_cpu_mask);
> > }
> > module_exit(powerclamp_exit);
> >