2015-02-09 04:50:45

by Preeti U Murthy

Subject: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

The powerclamp driver injects idle periods to stay within thermal constraints.
The driver fakes idle by spawning per-cpu threads that call the mwait
instruction. This fake idle can confuse other kernel subsystems. For instance,
it calls into the nohz tick handlers, which are meant to be called only by the
idle thread. It sets the tick state on each cpu to idle and stops the tick even
when there are tasks on the runqueue. As a result, the callers of
idle_cpu()/tick_nohz_tick_stopped() see different states of the cpu: while the
former reports the cpu as busy, the latter reports it as idle. The outcome may
be inconsistent scheduler/nohz state, which can lead to serious consequences.
One of them was reported on this thread:
https://lkml.org/lkml/2014/12/11/365.
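
To illustrate the inconsistency (a sketch only, not code from the driver; both
helpers exist in the kernel, but this pairing and the function name are purely
illustrative):

	/*
	 * During a fake-idle period, these two back-to-back queries on the
	 * clamped cpu can disagree: the scheduler still sees a runnable
	 * task, while the nohz code believes the cpu has gone idle.
	 */
	static void check_clamp_consistency(void)
	{
		int busy = !idle_cpu(smp_processor_id());	/* scheduler view */
		int stopped = tick_nohz_tick_stopped();		/* nohz view */

		WARN_ON(busy && stopped);	/* fake idle can trip this */
	}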

Thomas posted a patch that stops the powerclamp driver from calling into the
tick nohz code, which takes care of the above regression for the moment. As a
result, however, the powerclamp driver can no longer inject idle periods, due
to the presence of periodic ticks. With the current design of fake idle, we
cannot move towards a better solution.
https://lkml.org/lkml/2014/12/18/169

This patch removes the concept of fake idle and instead makes the cpus truly
idle by throttling their runqueues during the idle injection periods. The
situation is in fact very similar to the throttling of cfs_rqs when they exceed
their bandwidth. The idle injection metrics can be mapped to the bandwidth
control metrics 'quota' and 'period' to achieve the same result. When
powerclamping begins, or when the clamping controls are modified, the bandwidth
of the root task group is set: 'quota' is the amount of time that the system
needs to be busy, and 'period' is the sum of this busy duration and the idle
duration calculated by the driver. This gets rid of the per-cpu kthreads, the
control cpu, the hotplug notifiers and the clamping mask, since the thread that
starts powerclamping sets the bandwidth and the throttling of all cpus falls
into place automatically. None of the other cpus need be bothered with this,
which simplifies the design of the driver.
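
For illustration, the mapping boils down to something like the sketch below. It
restates the arithmetic done in clamp_cpus() further down; the helper name is
made up:

	/*
	 * Sketch: translate the driver's busy/idle durations into CFS
	 * bandwidth knobs for the root task group.
	 */
	static void set_clamp_bandwidth(unsigned int busy_jiffies,
					unsigned int idle_jiffies)
	{
		u64 quota = jiffies_to_usecs(busy_jiffies);
		u64 period = jiffies_to_usecs(busy_jiffies + idle_jiffies);

		tg_set_cfs_quota(&root_task_group, quota);	/* busy time */
		tg_set_cfs_period(&root_task_group, period);	/* busy + idle */
	}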

Of course, this holds only if the idle injection metrics can be conveniently
transformed into bandwidth control metrics. There are two other primary
concerns about whether what this patch does is valid:
a. This patch exports the functions to set the quota and period of task groups.
b. This patch removes the constraint that the root task group's bandwidth
cannot be set.

Signed-off-by: Preeti U Murthy <[email protected]>
---
V1: https://lkml.org/lkml/2014/12/21/23

V1 tried to throttle cfs_rqs without modifying the available bandwidth. That is
not an appropriate solution, since the scheduler considers a runqueue throttled
only when its bandwidth has been exceeded; going around this path can lead to
unintended consequences. Hence V2 throttles runqueues through bandwidth
control.
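
For reference, the scheduler-side check that motivates this, paraphrased from
kernel/sched/fair.c of this era (a runqueue reads as throttled only through the
bandwidth path):

	static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
	{
		return cfs_rq->throttled;
	}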

drivers/thermal/intel_powerclamp.c | 289 +++++++++---------------------------
include/linux/sched.h | 6 +
kernel/sched/core.c | 5 -
3 files changed, 82 insertions(+), 218 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 6ceebd6..f43903c 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -78,20 +78,14 @@ static struct dentry *debug_dir;
static unsigned int set_target_ratio;
static unsigned int current_ratio;
static bool should_skip;
+static unsigned int count = 0;
static bool reduce_irq;
static atomic_t idle_wakeup_counter;
-static unsigned int control_cpu; /* The cpu assigned to collect stat and update
- * control parameters. default to BSP but BSP
- * can be offlined.
- */
static bool clamping;
+static void clamp_timer_fn(unsigned long foo);
+DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);

-
-static struct task_struct * __percpu *powerclamp_thread;
static struct thermal_cooling_device *cooling_dev;
-static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
- * clamping thread
- */

static unsigned int duration;
static unsigned int pkg_cstate_ratio_cur;
@@ -163,7 +157,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
smp_mb();

exit_win:
-
+ clamp_cpus();
return ret;
}

@@ -256,10 +250,6 @@ static u64 pkg_state_counter(void)
return count;
}

-static void noop_timer(unsigned long foo)
-{
- /* empty... just the fact that we get the interrupt wakes us up */
-}

static unsigned int get_compensation(int ratio)
{
@@ -362,102 +352,78 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
return set_target_ratio + guard <= current_ratio;
}

-static int clamp_thread(void *arg)
+static int clamp_cpus(void)
{
- int cpunr = (unsigned long)arg;
- DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
- static const struct sched_param param = {
- .sched_priority = MAX_USER_RT_PRIO/2,
- };
- unsigned int count = 0;
unsigned int target_ratio;
+ u64 quota = RUNTIME_INF, period;

- set_bit(cpunr, cpu_clamping_mask);
- set_freezable();
- init_timer_on_stack(&wakeup_timer);
- sched_setscheduler(current, SCHED_FIFO, &param);
-
- while (true == clamping && !kthread_should_stop() &&
- cpu_online(cpunr)) {
- int sleeptime;
- unsigned long target_jiffies;
- unsigned int guard;
- unsigned int compensation = 0;
- int interval; /* jiffies to sleep for each attempt */
- unsigned int duration_jiffies = msecs_to_jiffies(duration);
- unsigned int window_size_now;
-
- try_to_freeze();
- /*
- * make sure user selected ratio does not take effect until
- * the next round. adjust target_ratio if user has changed
- * target such that we can converge quickly.
- */
- target_ratio = set_target_ratio;
- guard = 1 + target_ratio/20;
- window_size_now = window_size;
- count++;
-
- /*
- * systems may have different ability to enter package level
- * c-states, thus we need to compensate the injected idle ratio
- * to achieve the actual target reported by the HW.
- */
- compensation = get_compensation(target_ratio);
- interval = duration_jiffies*100/(target_ratio+compensation);
-
- /* align idle time */
- target_jiffies = roundup(jiffies, interval);
- sleeptime = target_jiffies - jiffies;
- if (sleeptime <= 0)
- sleeptime = 1;
- schedule_timeout_interruptible(sleeptime);
- /*
- * only elected controlling cpu can collect stats and update
- * control parameters.
- */
- if (cpunr == control_cpu && !(count%window_size_now)) {
- should_skip =
- powerclamp_adjust_controls(target_ratio,
- guard, window_size_now);
- smp_mb();
- }
+ if (clamping == false)
+ goto out;

- if (should_skip)
- continue;
-
- target_jiffies = jiffies + duration_jiffies;
- mod_timer(&wakeup_timer, target_jiffies);
- if (unlikely(local_softirq_pending()))
- continue;
- /*
- * stop tick sched during idle time, interrupts are still
- * allowed. thus jiffies are updated properly.
- */
- preempt_disable();
- /* mwait until target jiffies is reached */
- while (time_before(jiffies, target_jiffies)) {
- unsigned long ecx = 1;
- unsigned long eax = target_mwait;
-
- /*
- * REVISIT: may call enter_idle() to notify drivers who
- * can save power during cpu idle. same for exit_idle()
- */
- local_touch_nmi();
- stop_critical_timings();
- mwait_idle_with_hints(eax, ecx);
- start_critical_timings();
- atomic_inc(&idle_wakeup_counter);
- }
- preempt_enable();
+again:
+ int sleeptime;
+ unsigned long target_jiffies;
+ unsigned int guard;
+ unsigned int compensation = 0;
+ int interval; /* jiffies to sleep for each attempt */
+ unsigned int duration_jiffies = msecs_to_jiffies(duration);
+ unsigned int window_size_now;
+
+ /*
+ * make sure user selected ratio does not take effect until
+ * the next round. adjust target_ratio if user has changed
+ * target such that we can converge quickly.
+ */
+ target_ratio = set_target_ratio;
+ guard = 1 + target_ratio/20;
+ window_size_now = window_size;
+
+ /*
+ * systems may have different ability to enter package level
+ * c-states, thus we need to compensate the injected idle ratio
+ * to achieve the actual target reported by the HW.
+ */
+ compensation = get_compensation(target_ratio);
+ interval = duration_jiffies*100/(target_ratio+compensation);
+
+ /* align idle time */
+ target_jiffies = roundup(jiffies, interval);
+ sleeptime = target_jiffies - jiffies;
+ if (sleeptime <= 0)
+ sleeptime = 1;
+
+ if (!(count%window_size_now)) {
+ should_skip =
+ powerclamp_adjust_controls(target_ratio,
+ guard, window_size_now);
+ smp_mb();
}
- del_timer_sync(&wakeup_timer);
- clear_bit(cpunr, cpu_clamping_mask);
+
+ if (should_skip)
+ goto again;
+
+ target_jiffies = jiffies + sleeptime + duration_jiffies;
+ mod_timer(&wakeup_timer, target_jiffies);
+ if (unlikely(local_softirq_pending()))
+ goto again;
+
+ quota = jiffies_to_usecs(sleeptime);
+ period = jiffies_to_usecs(sleeptime + duration_jiffies);
+
+out:
+ tg_set_cfs_quota(&root_task_group, quota);
+ tg_set_cfs_period(&root_task_group, period);

return 0;
}

+static void clamp_timer_fn(unsigned long foo)
+{
+ /* Evaluate to see if clamping controls need to be adjusted */
+ count++;
+ clamp_cpus();
+}
+
/*
* 1 HZ polling while clamping is active, useful for userspace
* to monitor actual idle ratio.
@@ -501,8 +467,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)

static int start_power_clamp(void)
{
- unsigned long cpu;
- struct task_struct *thread;
+ clamping = true;

/* check if pkg cstate counter is completely 0, abort in this case */
if (!has_pkg_state_counter()) {
@@ -511,108 +476,21 @@ static int start_power_clamp(void)
}

set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
- /* prevent cpu hotplug */
- get_online_cpus();
-
- /* prefer BSP */
- control_cpu = 0;
- if (!cpu_online(control_cpu))
- control_cpu = smp_processor_id();

- clamping = true;
schedule_delayed_work(&poll_pkg_cstate_work, 0);
-
- /* start one thread per online cpu */
- for_each_online_cpu(cpu) {
- struct task_struct **p =
- per_cpu_ptr(powerclamp_thread, cpu);
-
- thread = kthread_create_on_node(clamp_thread,
- (void *) cpu,
- cpu_to_node(cpu),
- "kidle_inject/%ld", cpu);
- /* bind to cpu here */
- if (likely(!IS_ERR(thread))) {
- kthread_bind(thread, cpu);
- wake_up_process(thread);
- *p = thread;
- }
-
- }
- put_online_cpus();
+ clamp_cpus();

return 0;
}

static void end_power_clamp(void)
{
- int i;
- struct task_struct *thread;
-
clamping = false;
- /*
- * make clamping visible to other cpus and give per cpu clamping threads
- * sometime to exit, or gets killed later.
- */
- smp_mb();
- msleep(20);
- if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
- for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
- pr_debug("clamping thread for cpu %d alive, kill\n", i);
- thread = *per_cpu_ptr(powerclamp_thread, i);
- kthread_stop(thread);
- }
- }
-}

-static int powerclamp_cpu_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
-{
- unsigned long cpu = (unsigned long)hcpu;
- struct task_struct *thread;
- struct task_struct **percpu_thread =
- per_cpu_ptr(powerclamp_thread, cpu);
-
- if (false == clamping)
- goto exit_ok;
-
- switch (action) {
- case CPU_ONLINE:
- thread = kthread_create_on_node(clamp_thread,
- (void *) cpu,
- cpu_to_node(cpu),
- "kidle_inject/%lu", cpu);
- if (likely(!IS_ERR(thread))) {
- kthread_bind(thread, cpu);
- wake_up_process(thread);
- *percpu_thread = thread;
- }
- /* prefer BSP as controlling CPU */
- if (cpu == 0) {
- control_cpu = 0;
- smp_mb();
- }
- break;
- case CPU_DEAD:
- if (test_bit(cpu, cpu_clamping_mask)) {
- pr_err("cpu %lu dead but powerclamping thread is not\n",
- cpu);
- kthread_stop(*percpu_thread);
- }
- if (cpu == control_cpu) {
- control_cpu = smp_processor_id();
- smp_mb();
- }
- }
-
-exit_ok:
- return NOTIFY_OK;
+ clamp_cpus();
+ del_timer_sync(&wakeup_timer);
}

-static struct notifier_block powerclamp_cpu_notifier = {
- .notifier_call = powerclamp_cpu_callback,
-};
-
static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
unsigned long *state)
{
@@ -656,6 +534,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
}

exit_set:
+ clamp_cpus();
return ret;
}

@@ -764,11 +643,6 @@ static int powerclamp_init(void)
int retval;
int bitmap_size;

- bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
- cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
- if (!cpu_clamping_mask)
- return -ENOMEM;
-
/* probe cpu features and ids here */
retval = powerclamp_probe();
if (retval)
@@ -776,19 +650,12 @@ static int powerclamp_init(void)

/* set default limit, maybe adjusted during runtime based on feedback */
window_size = 2;
- register_hotcpu_notifier(&powerclamp_cpu_notifier);
-
- powerclamp_thread = alloc_percpu(struct task_struct *);
- if (!powerclamp_thread) {
- retval = -ENOMEM;
- goto exit_unregister;
- }

cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
&powerclamp_cooling_ops);
if (IS_ERR(cooling_dev)) {
retval = -ENODEV;
- goto exit_free_thread;
+ goto exit;
}

if (!duration)
@@ -798,23 +665,15 @@ static int powerclamp_init(void)

return 0;

-exit_free_thread:
- free_percpu(powerclamp_thread);
-exit_unregister:
- unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
-exit_free:
- kfree(cpu_clamping_mask);
+exit:
return retval;
}
module_init(powerclamp_init);

static void powerclamp_exit(void)
{
- unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
end_power_clamp();
- free_percpu(powerclamp_thread);
thermal_cooling_device_unregister(cooling_dev);
- kfree(cpu_clamping_mask);

cancel_delayed_work_sync(&poll_pkg_cstate_work);
debugfs_remove_recursive(debug_dir);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..6a7ccb2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3002,6 +3002,12 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

#ifdef CONFIG_CGROUP_SCHED
extern struct task_group root_task_group;
+extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
+extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
+#else
+
+static inline int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
+static inline int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
#endif /* CONFIG_CGROUP_SCHED */

extern int task_can_switch_user(struct user_struct *up,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..f47621a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8059,9 +8059,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
int i, ret = 0, runtime_enabled, runtime_was_enabled;
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;

- if (tg == &root_task_group)
- return -EINVAL;
-
/*
* Ensure we have at some amount of bandwidth every period. This is
* to prevent reaching a state of large arrears when throttled via
@@ -8141,6 +8138,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)

return tg_set_cfs_bandwidth(tg, period, quota);
}
+EXPORT_SYMBOL_GPL(tg_set_cfs_quota);

long tg_get_cfs_quota(struct task_group *tg)
{
@@ -8164,6 +8162,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)

return tg_set_cfs_bandwidth(tg, period, quota);
}
+EXPORT_SYMBOL_GPL(tg_set_cfs_period);

long tg_get_cfs_period(struct task_group *tg)
{


2015-02-09 07:32:35

by Steven Noonan

Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
<[email protected]> wrote:
> The powerclamp driver injects idle periods to stay within thermal constraints.
> The driver fakes idle by spawning per-cpu threads that call the mwait
> instruction. This fake idle can confuse other kernel subsystems. For instance,
> it calls into the nohz tick handlers, which are meant to be called only by the
> idle thread. It sets the tick state on each cpu to idle and stops the tick even
> when there are tasks on the runqueue. As a result, the callers of
> idle_cpu()/tick_nohz_tick_stopped() see different states of the cpu: while the
> former reports the cpu as busy, the latter reports it as idle. The outcome may
> be inconsistent scheduler/nohz state, which can lead to serious consequences.
> One of them was reported on this thread:
> https://lkml.org/lkml/2014/12/11/365.
>
> Thomas posted a patch that stops the powerclamp driver from calling into the
> tick nohz code, which takes care of the above regression for the moment. As a
> result, however, the powerclamp driver can no longer inject idle periods, due
> to the presence of periodic ticks. With the current design of fake idle, we
> cannot move towards a better solution.
> https://lkml.org/lkml/2014/12/18/169
>
> This patch removes the concept of fake idle and instead makes the cpus truly
> idle by throttling their runqueues during the idle injection periods. The
> situation is in fact very similar to the throttling of cfs_rqs when they exceed
> their bandwidth. The idle injection metrics can be mapped to the bandwidth
> control metrics 'quota' and 'period' to achieve the same result. When
> powerclamping begins, or when the clamping controls are modified, the bandwidth
> of the root task group is set: 'quota' is the amount of time that the system
> needs to be busy, and 'period' is the sum of this busy duration and the idle
> duration calculated by the driver. This gets rid of the per-cpu kthreads, the
> control cpu, the hotplug notifiers and the clamping mask, since the thread that
> starts powerclamping sets the bandwidth and the throttling of all cpus falls
> into place automatically. None of the other cpus need be bothered with this,
> which simplifies the design of the driver.
>
> Of course, this holds only if the idle injection metrics can be conveniently
> transformed into bandwidth control metrics. There are two other primary
> concerns about whether what this patch does is valid:
> a. This patch exports the functions to set the quota and period of task groups.
> b. This patch removes the constraint that the root task group's bandwidth
> cannot be set.
>
> Signed-off-by: Preeti U Murthy <[email protected]>

This doesn't compile.

Missing forward declaration:

drivers/thermal/intel_powerclamp.c: In function ‘window_size_set’:
drivers/thermal/intel_powerclamp.c:160:2: error: implicit declaration
of function ‘clamp_cpus’ [-Werror=implicit-function-declaration]
clamp_cpus();
^
drivers/thermal/intel_powerclamp.c: At top level:
drivers/thermal/intel_powerclamp.c:355:12: error: static declaration
of ‘clamp_cpus’ follows non-static declaration
static int clamp_cpus(void)
^
drivers/thermal/intel_powerclamp.c:160:2: note: previous implicit
declaration of ‘clamp_cpus’ was here
clamp_cpus();
^


Missing RUNTIME_INF definition (only appears in kernel/sched/sched.h
from what I can see?):

drivers/thermal/intel_powerclamp.c: In function ‘clamp_cpus’:
drivers/thermal/intel_powerclamp.c:358:14: error: ‘RUNTIME_INF’
undeclared (first use in this function)
u64 quota = RUNTIME_INF, period;
^
drivers/thermal/intel_powerclamp.c:358:14: note: each undeclared
identifier is reported only once for each function it appears in


Weird label placement that doesn't make sense:

drivers/thermal/intel_powerclamp.c:364:2: error: a label can only be
part of a statement and a declaration is not a statement
int sleeptime;
^
drivers/thermal/intel_powerclamp.c:365:2: error: expected expression
before ‘unsigned’
unsigned long target_jiffies;
^
drivers/thermal/intel_powerclamp.c:366:2: warning: ISO C90 forbids
mixed declarations and code [-Wdeclaration-after-statement]
unsigned int guard;
^
drivers/thermal/intel_powerclamp.c:390:2: error: ‘target_jiffies’
undeclared (first use in this function)
target_jiffies = roundup(jiffies, interval);
^


Missing debug variable:

drivers/thermal/intel_powerclamp.c: In function ‘powerclamp_debug_show’:
drivers/thermal/intel_powerclamp.c:598:41: error: ‘control_cpu’
undeclared (first use in this function)
seq_printf(m, "controlling cpu: %d\n", control_cpu);
^

Missing label:

drivers/thermal/intel_powerclamp.c: In function ‘powerclamp_init’:
drivers/thermal/intel_powerclamp.c:649:3: error: label ‘exit_free’
used but not defined
goto exit_free;
^
drivers/thermal/intel_powerclamp.c:644:6: warning: unused variable
‘bitmap_size’ [-Wunused-variable]
int bitmap_size;
^

> ---
> V1: https://lkml.org/lkml/2014/12/21/23
>
> V1 tried to throttle cfs_rqs without modifying available bandwidth. This will not
> be an appropriate solution since the scheduler only considers a runqueue as
> throttled when its bandwidth has been exceeded. Going around this path can
> lead to unintended consequences. Hence V2 was designed to throttle runqueues
> through bandwidth control.
>
> drivers/thermal/intel_powerclamp.c | 289 +++++++++---------------------------
> include/linux/sched.h | 6 +
> kernel/sched/core.c | 5 -
> 3 files changed, 82 insertions(+), 218 deletions(-)
>
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 6ceebd6..f43903c 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -78,20 +78,14 @@ static struct dentry *debug_dir;
> static unsigned int set_target_ratio;
> static unsigned int current_ratio;
> static bool should_skip;
> +static unsigned int count = 0;
> static bool reduce_irq;
> static atomic_t idle_wakeup_counter;
> -static unsigned int control_cpu; /* The cpu assigned to collect stat and update
> - * control parameters. default to BSP but BSP
> - * can be offlined.
> - */
> static bool clamping;
> +static void clamp_timer_fn(unsigned long foo);
> +DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);
>
> -
> -static struct task_struct * __percpu *powerclamp_thread;
> static struct thermal_cooling_device *cooling_dev;
> -static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
> - * clamping thread
> - */
>
> static unsigned int duration;
> static unsigned int pkg_cstate_ratio_cur;
> @@ -163,7 +157,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
> smp_mb();
>
> exit_win:
> -
> + clamp_cpus();
> return ret;
> }
>
> @@ -256,10 +250,6 @@ static u64 pkg_state_counter(void)
> return count;
> }
>
> -static void noop_timer(unsigned long foo)
> -{
> - /* empty... just the fact that we get the interrupt wakes us up */
> -}
>
> static unsigned int get_compensation(int ratio)
> {
> @@ -362,102 +352,78 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
> return set_target_ratio + guard <= current_ratio;
> }
>
> -static int clamp_thread(void *arg)
> +static int clamp_cpus(void)
> {
> - int cpunr = (unsigned long)arg;
> - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
> - static const struct sched_param param = {
> - .sched_priority = MAX_USER_RT_PRIO/2,
> - };
> - unsigned int count = 0;
> unsigned int target_ratio;
> + u64 quota = RUNTIME_INF, period;
>
> - set_bit(cpunr, cpu_clamping_mask);
> - set_freezable();
> - init_timer_on_stack(&wakeup_timer);
> - sched_setscheduler(current, SCHED_FIFO, &param);
> -
> - while (true == clamping && !kthread_should_stop() &&
> - cpu_online(cpunr)) {
> - int sleeptime;
> - unsigned long target_jiffies;
> - unsigned int guard;
> - unsigned int compensation = 0;
> - int interval; /* jiffies to sleep for each attempt */
> - unsigned int duration_jiffies = msecs_to_jiffies(duration);
> - unsigned int window_size_now;
> -
> - try_to_freeze();
> - /*
> - * make sure user selected ratio does not take effect until
> - * the next round. adjust target_ratio if user has changed
> - * target such that we can converge quickly.
> - */
> - target_ratio = set_target_ratio;
> - guard = 1 + target_ratio/20;
> - window_size_now = window_size;
> - count++;
> -
> - /*
> - * systems may have different ability to enter package level
> - * c-states, thus we need to compensate the injected idle ratio
> - * to achieve the actual target reported by the HW.
> - */
> - compensation = get_compensation(target_ratio);
> - interval = duration_jiffies*100/(target_ratio+compensation);
> -
> - /* align idle time */
> - target_jiffies = roundup(jiffies, interval);
> - sleeptime = target_jiffies - jiffies;
> - if (sleeptime <= 0)
> - sleeptime = 1;
> - schedule_timeout_interruptible(sleeptime);
> - /*
> - * only elected controlling cpu can collect stats and update
> - * control parameters.
> - */
> - if (cpunr == control_cpu && !(count%window_size_now)) {
> - should_skip =
> - powerclamp_adjust_controls(target_ratio,
> - guard, window_size_now);
> - smp_mb();
> - }
> + if (clamping == false)
> + goto out;
>
> - if (should_skip)
> - continue;
> -
> - target_jiffies = jiffies + duration_jiffies;
> - mod_timer(&wakeup_timer, target_jiffies);
> - if (unlikely(local_softirq_pending()))
> - continue;
> - /*
> - * stop tick sched during idle time, interrupts are still
> - * allowed. thus jiffies are updated properly.
> - */
> - preempt_disable();
> - /* mwait until target jiffies is reached */
> - while (time_before(jiffies, target_jiffies)) {
> - unsigned long ecx = 1;
> - unsigned long eax = target_mwait;
> -
> - /*
> - * REVISIT: may call enter_idle() to notify drivers who
> - * can save power during cpu idle. same for exit_idle()
> - */
> - local_touch_nmi();
> - stop_critical_timings();
> - mwait_idle_with_hints(eax, ecx);
> - start_critical_timings();
> - atomic_inc(&idle_wakeup_counter);
> - }
> - preempt_enable();
> +again:
> + int sleeptime;
> + unsigned long target_jiffies;
> + unsigned int guard;
> + unsigned int compensation = 0;
> + int interval; /* jiffies to sleep for each attempt */
> + unsigned int duration_jiffies = msecs_to_jiffies(duration);
> + unsigned int window_size_now;
> +
> + /*
> + * make sure user selected ratio does not take effect until
> + * the next round. adjust target_ratio if user has changed
> + * target such that we can converge quickly.
> + */
> + target_ratio = set_target_ratio;
> + guard = 1 + target_ratio/20;
> + window_size_now = window_size;
> +
> + /*
> + * systems may have different ability to enter package level
> + * c-states, thus we need to compensate the injected idle ratio
> + * to achieve the actual target reported by the HW.
> + */
> + compensation = get_compensation(target_ratio);
> + interval = duration_jiffies*100/(target_ratio+compensation);
> +
> + /* align idle time */
> + target_jiffies = roundup(jiffies, interval);
> + sleeptime = target_jiffies - jiffies;
> + if (sleeptime <= 0)
> + sleeptime = 1;
> +
> + if (!(count%window_size_now)) {
> + should_skip =
> + powerclamp_adjust_controls(target_ratio,
> + guard, window_size_now);
> + smp_mb();
> }
> - del_timer_sync(&wakeup_timer);
> - clear_bit(cpunr, cpu_clamping_mask);
> +
> + if (should_skip)
> + goto again;
> +
> + target_jiffies = jiffies + sleeptime + duration_jiffies;
> + mod_timer(&wakeup_timer, target_jiffies);
> + if (unlikely(local_softirq_pending()))
> + goto again;
> +
> + quota = jiffies_to_usecs(sleeptime);
> + period = jiffies_to_usecs(sleeptime + duration_jiffies);
> +
> +out:
> + tg_set_cfs_quota(&root_task_group, quota);
> + tg_set_cfs_period(&root_task_group, period);
>
> return 0;
> }
>
> +static void clamp_timer_fn(unsigned long foo)
> +{
> + /* Evaluate to see if clamping controls need to be adjusted */
> + count++;
> + clamp_cpus();
> +}
> +
> /*
> * 1 HZ polling while clamping is active, useful for userspace
> * to monitor actual idle ratio.
> @@ -501,8 +467,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)
>
> static int start_power_clamp(void)
> {
> - unsigned long cpu;
> - struct task_struct *thread;
> + clamping = true;
>
> /* check if pkg cstate counter is completely 0, abort in this case */
> if (!has_pkg_state_counter()) {
> @@ -511,108 +476,21 @@ static int start_power_clamp(void)
> }
>
> set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
> - /* prevent cpu hotplug */
> - get_online_cpus();
> -
> - /* prefer BSP */
> - control_cpu = 0;
> - if (!cpu_online(control_cpu))
> - control_cpu = smp_processor_id();
>
> - clamping = true;
> schedule_delayed_work(&poll_pkg_cstate_work, 0);
> -
> - /* start one thread per online cpu */
> - for_each_online_cpu(cpu) {
> - struct task_struct **p =
> - per_cpu_ptr(powerclamp_thread, cpu);
> -
> - thread = kthread_create_on_node(clamp_thread,
> - (void *) cpu,
> - cpu_to_node(cpu),
> - "kidle_inject/%ld", cpu);
> - /* bind to cpu here */
> - if (likely(!IS_ERR(thread))) {
> - kthread_bind(thread, cpu);
> - wake_up_process(thread);
> - *p = thread;
> - }
> -
> - }
> - put_online_cpus();
> + clamp_cpus();
>
> return 0;
> }
>
> static void end_power_clamp(void)
> {
> - int i;
> - struct task_struct *thread;
> -
> clamping = false;
> - /*
> - * make clamping visible to other cpus and give per cpu clamping threads
> - * sometime to exit, or gets killed later.
> - */
> - smp_mb();
> - msleep(20);
> - if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
> - for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
> - pr_debug("clamping thread for cpu %d alive, kill\n", i);
> - thread = *per_cpu_ptr(powerclamp_thread, i);
> - kthread_stop(thread);
> - }
> - }
> -}
>
> -static int powerclamp_cpu_callback(struct notifier_block *nfb,
> - unsigned long action, void *hcpu)
> -{
> - unsigned long cpu = (unsigned long)hcpu;
> - struct task_struct *thread;
> - struct task_struct **percpu_thread =
> - per_cpu_ptr(powerclamp_thread, cpu);
> -
> - if (false == clamping)
> - goto exit_ok;
> -
> - switch (action) {
> - case CPU_ONLINE:
> - thread = kthread_create_on_node(clamp_thread,
> - (void *) cpu,
> - cpu_to_node(cpu),
> - "kidle_inject/%lu", cpu);
> - if (likely(!IS_ERR(thread))) {
> - kthread_bind(thread, cpu);
> - wake_up_process(thread);
> - *percpu_thread = thread;
> - }
> - /* prefer BSP as controlling CPU */
> - if (cpu == 0) {
> - control_cpu = 0;
> - smp_mb();
> - }
> - break;
> - case CPU_DEAD:
> - if (test_bit(cpu, cpu_clamping_mask)) {
> - pr_err("cpu %lu dead but powerclamping thread is not\n",
> - cpu);
> - kthread_stop(*percpu_thread);
> - }
> - if (cpu == control_cpu) {
> - control_cpu = smp_processor_id();
> - smp_mb();
> - }
> - }
> -
> -exit_ok:
> - return NOTIFY_OK;
> + clamp_cpus();
> + del_timer_sync(&wakeup_timer);
> }
>
> -static struct notifier_block powerclamp_cpu_notifier = {
> - .notifier_call = powerclamp_cpu_callback,
> -};
> -
> static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
> unsigned long *state)
> {
> @@ -656,6 +534,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
> }
>
> exit_set:
> + clamp_cpus();
> return ret;
> }
>
> @@ -764,11 +643,6 @@ static int powerclamp_init(void)
> int retval;
> int bitmap_size;
>
> - bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
> - cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
> - if (!cpu_clamping_mask)
> - return -ENOMEM;
> -
> /* probe cpu features and ids here */
> retval = powerclamp_probe();
> if (retval)
> @@ -776,19 +650,12 @@ static int powerclamp_init(void)
>
> /* set default limit, maybe adjusted during runtime based on feedback */
> window_size = 2;
> - register_hotcpu_notifier(&powerclamp_cpu_notifier);
> -
> - powerclamp_thread = alloc_percpu(struct task_struct *);
> - if (!powerclamp_thread) {
> - retval = -ENOMEM;
> - goto exit_unregister;
> - }
>
> cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
> &powerclamp_cooling_ops);
> if (IS_ERR(cooling_dev)) {
> retval = -ENODEV;
> - goto exit_free_thread;
> + goto exit;
> }
>
> if (!duration)
> @@ -798,23 +665,15 @@ static int powerclamp_init(void)
>
> return 0;
>
> -exit_free_thread:
> - free_percpu(powerclamp_thread);
> -exit_unregister:
> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> -exit_free:
> - kfree(cpu_clamping_mask);
> +exit:
> return retval;
> }
> module_init(powerclamp_init);
>
> static void powerclamp_exit(void)
> {
> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> end_power_clamp();
> - free_percpu(powerclamp_thread);
> thermal_cooling_device_unregister(cooling_dev);
> - kfree(cpu_clamping_mask);
>
> cancel_delayed_work_sync(&poll_pkg_cstate_work);
> debugfs_remove_recursive(debug_dir);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..6a7ccb2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3002,6 +3002,12 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>
> #ifdef CONFIG_CGROUP_SCHED
> extern struct task_group root_task_group;
> +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
> +#else
> +
> +static inline int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +static inline int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
> #endif /* CONFIG_CGROUP_SCHED */
>
> extern int task_can_switch_user(struct user_struct *up,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e628cb1..f47621a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8059,9 +8059,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> int i, ret = 0, runtime_enabled, runtime_was_enabled;
> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>
> - if (tg == &root_task_group)
> - return -EINVAL;
> -
> /*
> * Ensure we have at some amount of bandwidth every period. This is
> * to prevent reaching a state of large arrears when throttled via
> @@ -8141,6 +8138,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>
> return tg_set_cfs_bandwidth(tg, period, quota);
> }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_quota);
>
> long tg_get_cfs_quota(struct task_group *tg)
> {
> @@ -8164,6 +8162,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
>
> return tg_set_cfs_bandwidth(tg, period, quota);
> }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_period);
>
> long tg_get_cfs_period(struct task_group *tg)
> {
>

2015-02-09 11:52:37

by Preeti U Murthy

Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

Hi Steven,

On 02/09/2015 01:02 PM, Steven Noonan wrote:
> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
> <[email protected]> wrote:
>> The powerclamp driver injects idle periods to stay within thermal constraints.
>> The driver fakes idle by spawning per-cpu threads that call the mwait
>> instruction. This fake idle can confuse other kernel subsystems. For instance,
>> it calls into the nohz tick handlers, which are meant to be called only by the
>> idle thread. It sets the tick state on each cpu to idle and stops the tick even
>> when there are tasks on the runqueue. As a result, the callers of
>> idle_cpu()/tick_nohz_tick_stopped() see different states of the cpu: while the
>> former reports the cpu as busy, the latter reports it as idle. The outcome may
>> be inconsistent scheduler/nohz state, which can lead to serious consequences.
>> One of them was reported on this thread:
>> https://lkml.org/lkml/2014/12/11/365.
>>
>> Thomas posted a patch that stops the powerclamp driver from calling into the
>> tick nohz code, which takes care of the above regression for the moment. As a
>> result, however, the powerclamp driver can no longer inject idle periods, due
>> to the presence of periodic ticks. With the current design of fake idle, we
>> cannot move towards a better solution.
>> https://lkml.org/lkml/2014/12/18/169
>>
>> This patch removes the concept of fake idle and instead makes the cpus truly
>> idle by throttling their runqueues during the idle injection periods. The
>> situation is in fact very similar to the throttling of cfs_rqs when they exceed
>> their bandwidth. The idle injection metrics can be mapped to the bandwidth
>> control metrics 'quota' and 'period' to achieve the same result. When
>> powerclamping begins, or when the clamping controls are modified, the bandwidth
>> of the root task group is set: 'quota' is the amount of time that the system
>> needs to be busy, and 'period' is the sum of this busy duration and the idle
>> duration calculated by the driver. This gets rid of the per-cpu kthreads, the
>> control cpu, the hotplug notifiers and the clamping mask, since the thread that
>> starts powerclamping sets the bandwidth and the throttling of all cpus falls
>> into place automatically. None of the other cpus need be bothered with this,
>> which simplifies the design of the driver.
>>
>> Of course, this holds only if the idle injection metrics can be conveniently
>> transformed into bandwidth control metrics. There are two other primary
>> concerns about whether what this patch does is valid:
>> a. This patch exports the functions to set the quota and period of task groups.
>> b. This patch removes the constraint that the root task group's bandwidth
>> cannot be set.
>>
>> Signed-off-by: Preeti U Murthy <[email protected]>
>
> This doesn't compile.

Thanks for reporting this! I realized that I had not built the powerclamp driver
as a module while compile testing; I was focusing on the design issues and
failed to cross-verify this. Apologies for the inconvenience.

Find the compile-tested diff below.

I also realized that clamp_cpus(), which sets the bandwidth, should not be
called from multiple places. Currently I call it from end_power_clamp(), when
the user changes the idle clamping duration, and from a queued timer. That would
require synchronization between the callers, which is not really called for. The
queued wakeup_timer alone can re-evaluate the clamping metrics after every
throttle-unthrottle period, and as far as I can see this should suffice.
Thoughts?
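
A minimal sketch of that single-caller arrangement (the names come from the
diff below; no extra locking is needed only because the timer then remains the
sole caller):

	/*
	 * Let the timer be the only path that re-evaluates and resets the
	 * bandwidth, so clamp_cpus() never races with another caller.
	 */
	static void clamp_timer_fn(unsigned long foo)
	{
		count++;		/* window bookkeeping */
		clamp_cpus();		/* recompute quota/period */
	}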

Regards
Preeti U Murthy

----------------------------------------------------------------------------------

V2 of intel_powerclamp driver

From: Preeti U Murthy <[email protected]>


---
drivers/thermal/Kconfig | 1
drivers/thermal/intel_powerclamp.c | 301 ++++++++++--------------------------
include/linux/sched.h | 9 +
kernel/sched/core.c | 6 -
kernel/sched/sched.h | 5 -
5 files changed, 95 insertions(+), 227 deletions(-)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index af40db0..4b7cd02 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -233,6 +233,7 @@ config INTEL_POWERCLAMP
depends on THERMAL
depends on X86
depends on CPU_SUP_INTEL
+ select CFS_BANDWIDTH
help
Enable this to enable Intel PowerClamp idle injection driver. This
enforce idle time which results in more package C-state residency. The
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 6ceebd6..4bd07bb 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -51,6 +51,7 @@
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/sched/rt.h>
+#include <linux/sched.h>

#include <asm/nmi.h>
#include <asm/msr.h>
@@ -78,20 +79,18 @@ static struct dentry *debug_dir;
static unsigned int set_target_ratio;
static unsigned int current_ratio;
static bool should_skip;
+static unsigned int count = 0;
static bool reduce_irq;
static atomic_t idle_wakeup_counter;
-static unsigned int control_cpu; /* The cpu assigned to collect stat and update
- * control parameters. default to BSP but BSP
- * can be offlined.
- */
static bool clamping;

+/* Timer that evaluates bandwidth reset if clamping knobs have changed */
+static void clamp_timer_fn(unsigned long foo);
+DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);
+
+static void clamp_cpus(void);

-static struct task_struct * __percpu *powerclamp_thread;
static struct thermal_cooling_device *cooling_dev;
-static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
- * clamping thread
- */

static unsigned int duration;
static unsigned int pkg_cstate_ratio_cur;
@@ -163,7 +162,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
smp_mb();

exit_win:
-
+ clamp_cpus();
return ret;
}

@@ -256,10 +255,6 @@ static u64 pkg_state_counter(void)
return count;
}

-static void noop_timer(unsigned long foo)
-{
- /* empty... just the fact that we get the interrupt wakes us up */
-}

static unsigned int get_compensation(int ratio)
{
@@ -362,100 +357,77 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
return set_target_ratio + guard <= current_ratio;
}

-static int clamp_thread(void *arg)
+static void clamp_cpus(void)
{
- int cpunr = (unsigned long)arg;
- DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
- static const struct sched_param param = {
- .sched_priority = MAX_USER_RT_PRIO/2,
- };
- unsigned int count = 0;
unsigned int target_ratio;
+ u64 quota, period;
+ int sleeptime;
+ unsigned long target_jiffies;
+ unsigned int guard;
+ unsigned int compensation;
+ int interval; /* jiffies to sleep for each attempt */
+ unsigned int duration_jiffies;
+ unsigned int window_size_now;
+
+again:
+ if (clamping == false)
+ goto out;
+
+ quota = RUNTIME_INF;
+ period = NSEC_PER_SEC;
+ compensation = 0;
+ duration_jiffies = msecs_to_jiffies(duration);
+ /*
+ * make sure user selected ratio does not take effect until
+ * the next round. adjust target_ratio if user has changed
+ * target such that we can converge quickly.
+ */
+ target_ratio = set_target_ratio;
+ guard = 1 + target_ratio/20;
+ window_size_now = window_size;

- set_bit(cpunr, cpu_clamping_mask);
- set_freezable();
- init_timer_on_stack(&wakeup_timer);
- sched_setscheduler(current, SCHED_FIFO, &param);
-
- while (true == clamping && !kthread_should_stop() &&
- cpu_online(cpunr)) {
- int sleeptime;
- unsigned long target_jiffies;
- unsigned int guard;
- unsigned int compensation = 0;
- int interval; /* jiffies to sleep for each attempt */
- unsigned int duration_jiffies = msecs_to_jiffies(duration);
- unsigned int window_size_now;
-
- try_to_freeze();
- /*
- * make sure user selected ratio does not take effect until
- * the next round. adjust target_ratio if user has changed
- * target such that we can converge quickly.
- */
- target_ratio = set_target_ratio;
- guard = 1 + target_ratio/20;
- window_size_now = window_size;
- count++;
-
- /*
- * systems may have different ability to enter package level
- * c-states, thus we need to compensate the injected idle ratio
- * to achieve the actual target reported by the HW.
- */
- compensation = get_compensation(target_ratio);
- interval = duration_jiffies*100/(target_ratio+compensation);
-
- /* align idle time */
- target_jiffies = roundup(jiffies, interval);
- sleeptime = target_jiffies - jiffies;
- if (sleeptime <= 0)
- sleeptime = 1;
- schedule_timeout_interruptible(sleeptime);
- /*
- * only elected controlling cpu can collect stats and update
- * control parameters.
- */
- if (cpunr == control_cpu && !(count%window_size_now)) {
- should_skip =
- powerclamp_adjust_controls(target_ratio,
- guard, window_size_now);
- smp_mb();
- }
-
- if (should_skip)
- continue;
-
- target_jiffies = jiffies + duration_jiffies;
- mod_timer(&wakeup_timer, target_jiffies);
- if (unlikely(local_softirq_pending()))
- continue;
- /*
- * stop tick sched during idle time, interrupts are still
- * allowed. thus jiffies are updated properly.
- */
- preempt_disable();
- /* mwait until target jiffies is reached */
- while (time_before(jiffies, target_jiffies)) {
- unsigned long ecx = 1;
- unsigned long eax = target_mwait;
-
- /*
- * REVISIT: may call enter_idle() to notify drivers who
- * can save power during cpu idle. same for exit_idle()
- */
- local_touch_nmi();
- stop_critical_timings();
- mwait_idle_with_hints(eax, ecx);
- start_critical_timings();
- atomic_inc(&idle_wakeup_counter);
- }
- preempt_enable();
+ /*
+ * systems may have different ability to enter package level
+ * c-states, thus we need to compensate the injected idle ratio
+ * to achieve the actual target reported by the HW.
+ */
+ compensation = get_compensation(target_ratio);
+ interval = duration_jiffies*100/(target_ratio+compensation);
+
+ /* align idle time */
+ target_jiffies = roundup(jiffies, interval);
+ sleeptime = target_jiffies - jiffies;
+ if (sleeptime <= 0)
+ sleeptime = 1;
+
+ if (!(count%window_size_now)) {
+ should_skip =
+ powerclamp_adjust_controls(target_ratio,
+ guard, window_size_now);
+ smp_mb();
}
- del_timer_sync(&wakeup_timer);
- clear_bit(cpunr, cpu_clamping_mask);

- return 0;
+ if (should_skip)
+ goto again;
+
+ target_jiffies = jiffies + sleeptime + duration_jiffies;
+ mod_timer(&wakeup_timer, target_jiffies);
+ if (unlikely(local_softirq_pending()))
+ goto again;
+
+ quota = jiffies_to_usecs(sleeptime);
+ period = jiffies_to_usecs(sleeptime + duration_jiffies);
+
+out:
+ tg_set_cfs_quota(&root_task_group, quota);
+ tg_set_cfs_period(&root_task_group, period);
+}
+
+static void clamp_timer_fn(unsigned long foo)
+{
+ /* Evaluate to see if clamping controls need to be adjusted */
+ count++;
+ clamp_cpus();
}

/*
@@ -501,8 +473,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)

static int start_power_clamp(void)
{
- unsigned long cpu;
- struct task_struct *thread;
+ clamping = true;

/* check if pkg cstate counter is completely 0, abort in this case */
if (!has_pkg_state_counter()) {
@@ -511,108 +482,21 @@ static int start_power_clamp(void)
}

set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
- /* prevent cpu hotplug */
- get_online_cpus();
-
- /* prefer BSP */
- control_cpu = 0;
- if (!cpu_online(control_cpu))
- control_cpu = smp_processor_id();

- clamping = true;
schedule_delayed_work(&poll_pkg_cstate_work, 0);
-
- /* start one thread per online cpu */
- for_each_online_cpu(cpu) {
- struct task_struct **p =
- per_cpu_ptr(powerclamp_thread, cpu);
-
- thread = kthread_create_on_node(clamp_thread,
- (void *) cpu,
- cpu_to_node(cpu),
- "kidle_inject/%ld", cpu);
- /* bind to cpu here */
- if (likely(!IS_ERR(thread))) {
- kthread_bind(thread, cpu);
- wake_up_process(thread);
- *p = thread;
- }
-
- }
- put_online_cpus();
+ clamp_cpus();

return 0;
}

static void end_power_clamp(void)
{
- int i;
- struct task_struct *thread;
-
clamping = false;
- /*
- * make clamping visible to other cpus and give per cpu clamping threads
- * sometime to exit, or gets killed later.
- */
- smp_mb();
- msleep(20);
- if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
- for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
- pr_debug("clamping thread for cpu %d alive, kill\n", i);
- thread = *per_cpu_ptr(powerclamp_thread, i);
- kthread_stop(thread);
- }
- }
-}

-static int powerclamp_cpu_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
-{
- unsigned long cpu = (unsigned long)hcpu;
- struct task_struct *thread;
- struct task_struct **percpu_thread =
- per_cpu_ptr(powerclamp_thread, cpu);
-
- if (false == clamping)
- goto exit_ok;
-
- switch (action) {
- case CPU_ONLINE:
- thread = kthread_create_on_node(clamp_thread,
- (void *) cpu,
- cpu_to_node(cpu),
- "kidle_inject/%lu", cpu);
- if (likely(!IS_ERR(thread))) {
- kthread_bind(thread, cpu);
- wake_up_process(thread);
- *percpu_thread = thread;
- }
- /* prefer BSP as controlling CPU */
- if (cpu == 0) {
- control_cpu = 0;
- smp_mb();
- }
- break;
- case CPU_DEAD:
- if (test_bit(cpu, cpu_clamping_mask)) {
- pr_err("cpu %lu dead but powerclamping thread is not\n",
- cpu);
- kthread_stop(*percpu_thread);
- }
- if (cpu == control_cpu) {
- control_cpu = smp_processor_id();
- smp_mb();
- }
- }
-
-exit_ok:
- return NOTIFY_OK;
+ clamp_cpus();
+ del_timer_sync(&wakeup_timer);
}

-static struct notifier_block powerclamp_cpu_notifier = {
- .notifier_call = powerclamp_cpu_callback,
-};
-
static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
unsigned long *state)
{
@@ -656,6 +540,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
}

exit_set:
+ clamp_cpus();
return ret;
}

@@ -716,7 +601,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused)
{
int i = 0;

- seq_printf(m, "controlling cpu: %d\n", control_cpu);
seq_printf(m, "pct confidence steady dynamic (compensation)\n");
for (i = 0; i < MAX_TARGET_RATIO; i++) {
seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
@@ -762,33 +646,20 @@ file_error:
static int powerclamp_init(void)
{
int retval;
- int bitmap_size;
-
- bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
- cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
- if (!cpu_clamping_mask)
- return -ENOMEM;

/* probe cpu features and ids here */
retval = powerclamp_probe();
if (retval)
- goto exit_free;
+ goto exit;

/* set default limit, maybe adjusted during runtime based on feedback */
window_size = 2;
- register_hotcpu_notifier(&powerclamp_cpu_notifier);
-
- powerclamp_thread = alloc_percpu(struct task_struct *);
- if (!powerclamp_thread) {
- retval = -ENOMEM;
- goto exit_unregister;
- }

cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
&powerclamp_cooling_ops);
if (IS_ERR(cooling_dev)) {
retval = -ENODEV;
- goto exit_free_thread;
+ goto exit;
}

if (!duration)
@@ -798,23 +669,15 @@ static int powerclamp_init(void)

return 0;

-exit_free_thread:
- free_percpu(powerclamp_thread);
-exit_unregister:
- unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
-exit_free:
- kfree(cpu_clamping_mask);
+exit:
return retval;
}
module_init(powerclamp_init);

static void powerclamp_exit(void)
{
- unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
end_power_clamp();
- free_percpu(powerclamp_thread);
thermal_cooling_device_unregister(cooling_dev);
- kfree(cpu_clamping_mask);

cancel_delayed_work_sync(&poll_pkg_cstate_work);
debugfs_remove_recursive(debug_dir);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..2493942 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -163,6 +163,11 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
load += n*(FIXED_1-exp); \
load >>= FSHIFT;

+/*
+ * single value that denotes runtime == period, ie unlimited time.
+ */
+#define RUNTIME_INF ((u64)~0ULL)
+
extern unsigned long total_forks;
extern int nr_threads;
DECLARE_PER_CPU(unsigned long, process_counts);
@@ -3002,6 +3007,10 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

#ifdef CONFIG_CGROUP_SCHED
extern struct task_group root_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
+extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
+#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_CGROUP_SCHED */

extern int task_can_switch_user(struct user_struct *up,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..7471b06 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7097,6 +7097,7 @@ int in_sched_functions(unsigned long addr)
* Every task in system belongs to this group at bootup.
*/
struct task_group root_task_group;
+EXPORT_SYMBOL_GPL(root_task_group);
LIST_HEAD(task_groups);
#endif

@@ -8059,9 +8060,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
int i, ret = 0, runtime_enabled, runtime_was_enabled;
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;

- if (tg == &root_task_group)
- return -EINVAL;
-
/*
* Ensure we have at some amount of bandwidth every period. This is
* to prevent reaching a state of large arrears when throttled via
@@ -8141,6 +8139,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)

return tg_set_cfs_bandwidth(tg, period, quota);
}
+EXPORT_SYMBOL_GPL(tg_set_cfs_quota);

long tg_get_cfs_quota(struct task_group *tg)
{
@@ -8164,6 +8163,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)

return tg_set_cfs_bandwidth(tg, period, quota);
}
+EXPORT_SYMBOL_GPL(tg_set_cfs_period);

long tg_get_cfs_period(struct task_group *tg)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a2a45c..20493e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -72,11 +72,6 @@ extern void update_cpu_load_active(struct rq *this_rq);
* These are the 'tuning knobs' of the scheduler:
*/

-/*
- * single value that denotes runtime == period, ie unlimited time.
- */
-#define RUNTIME_INF ((u64)~0ULL)
-
static inline int fair_policy(int policy)
{
return policy == SCHED_NORMAL || policy == SCHED_BATCH;

2015-02-09 17:56:13

by Steven Noonan

Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

On Mon, Feb 9, 2015 at 3:51 AM, Preeti U Murthy
<[email protected]> wrote:
> Hi Steven,
>
> On 02/09/2015 01:02 PM, Steven Noonan wrote:
>> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
>> <[email protected]> wrote:
>>> The powerclamp driver injects idle periods to stay within thermal constraints.
>>> The driver fakes idle by spawning per-cpu threads that call the mwait
>>> instruction. This fake idle can confuse other kernel subsystems. For instance,
>>> it calls into the nohz tick handlers, which are meant to be called only by the
>>> idle thread. It sets the tick state on each cpu to idle and stops the tick even
>>> when there are tasks on the runqueue. As a result, the callers of
>>> idle_cpu()/tick_nohz_tick_stopped() see different states of the cpu: while the
>>> former reports the cpu as busy, the latter reports it as idle. The outcome may
>>> be inconsistent scheduler/nohz state, which can lead to serious consequences.
>>> One of them was reported on this thread:
>>> https://lkml.org/lkml/2014/12/11/365.
>>>
>>> Thomas posted a patch that stops the powerclamp driver from calling into the
>>> tick nohz code, which takes care of the above regression for the moment. As a
>>> result, however, the powerclamp driver can no longer inject idle periods, due
>>> to the presence of periodic ticks. With the current design of fake idle, we
>>> cannot move towards a better solution.
>>> https://lkml.org/lkml/2014/12/18/169
>>>
>>> This patch removes the concept of fake idle and instead makes the cpus truly
>>> idle by throttling their runqueues during the idle injection periods. The
>>> situation is in fact very similar to the throttling of cfs_rqs when they exceed
>>> their bandwidth. The idle injection metrics can be mapped to the bandwidth
>>> control metrics 'quota' and 'period' to achieve the same result. When
>>> powerclamping begins, or when the clamping controls are modified, the bandwidth
>>> of the root task group is set: 'quota' is the amount of time that the system
>>> needs to be busy, and 'period' is the sum of this busy duration and the idle
>>> duration calculated by the driver. This gets rid of the per-cpu kthreads, the
>>> control cpu, the hotplug notifiers and the clamping mask, since the thread that
>>> starts powerclamping sets the bandwidth and the throttling of all cpus falls
>>> into place automatically. None of the other cpus need be bothered with this,
>>> which simplifies the design of the driver.
>>>
>>> Of course, this holds only if the idle injection metrics can be conveniently
>>> transformed into bandwidth control metrics. There are two other primary
>>> concerns about whether what this patch does is valid:
>>> a. This patch exports the functions to set the quota and period of task groups.
>>> b. This patch removes the constraint that the root task group's bandwidth
>>> cannot be set.
>>>
>>> Signed-off-by: Preeti U Murthy <[email protected]>
>>
>> This doesn't compile.
>
> Thanks for reporting this! I realized that I had not built the powerclamp driver
> as a module while compile testing; I was focusing on the design issues and
> failed to cross-verify this. Apologies for the inconvenience.
>
> Find the compile-tested diff below.
>
> I also realized that clamp_cpus(), which sets the bandwidth, should not be
> called from multiple places. Currently I call it from end_power_clamp(), when
> the user changes the idle clamping duration, and from a queued timer. That would
> require synchronization between the callers, which is not really called for. The
> queued wakeup_timer alone can re-evaluate the clamping metrics after every
> throttle-unthrottle period, and as far as I can see this should suffice.
> Thoughts?

Hmm, I've had two system lockups so far while running a kernel with
intel_powerclamp loaded. Both times it slowly ground to a halt and
processes piled up...

> ----------------------------------------------------------------------------------
>
> V2 of intel_powerclamp driver
>
> From: Preeti U Murthy <[email protected]>
>
>
> ---
> drivers/thermal/Kconfig | 1
> drivers/thermal/intel_powerclamp.c | 301 ++++++++++--------------------------
> include/linux/sched.h | 9 +
> kernel/sched/core.c | 6 -
> kernel/sched/sched.h | 5 -
> 5 files changed, 95 insertions(+), 227 deletions(-)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index af40db0..4b7cd02 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -233,6 +233,7 @@ config INTEL_POWERCLAMP
> depends on THERMAL
> depends on X86
> depends on CPU_SUP_INTEL
> + select CFS_BANDWIDTH
> help
> Enable this to enable Intel PowerClamp idle injection driver. This
> enforce idle time which results in more package C-state residency. The
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 6ceebd6..4bd07bb 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -51,6 +51,7 @@
> #include <linux/debugfs.h>
> #include <linux/seq_file.h>
> #include <linux/sched/rt.h>
> +#include <linux/sched.h>
>
> #include <asm/nmi.h>
> #include <asm/msr.h>
> @@ -78,20 +79,18 @@ static struct dentry *debug_dir;
> static unsigned int set_target_ratio;
> static unsigned int current_ratio;
> static bool should_skip;
> +static unsigned int count = 0;
> static bool reduce_irq;
> static atomic_t idle_wakeup_counter;
> -static unsigned int control_cpu; /* The cpu assigned to collect stat and update
> - * control parameters. default to BSP but BSP
> - * can be offlined.
> - */
> static bool clamping;
>
> +/* Timer that evaluates bandwidth reset if clamping knobs have changed */
> +static void clamp_timer_fn(unsigned long foo);
> +DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);
> +
> +static void clamp_cpus(void);
>
> -static struct task_struct * __percpu *powerclamp_thread;
> static struct thermal_cooling_device *cooling_dev;
> -static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
> - * clamping thread
> - */
>
> static unsigned int duration;
> static unsigned int pkg_cstate_ratio_cur;
> @@ -163,7 +162,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
> smp_mb();
>
> exit_win:
> -
> + clamp_cpus();
> return ret;
> }
>
> @@ -256,10 +255,6 @@ static u64 pkg_state_counter(void)
> return count;
> }
>
> -static void noop_timer(unsigned long foo)
> -{
> - /* empty... just the fact that we get the interrupt wakes us up */
> -}
>
> static unsigned int get_compensation(int ratio)
> {
> @@ -362,100 +357,77 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
> return set_target_ratio + guard <= current_ratio;
> }
>
> -static int clamp_thread(void *arg)
> +static void clamp_cpus(void)
> {
> - int cpunr = (unsigned long)arg;
> - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
> - static const struct sched_param param = {
> - .sched_priority = MAX_USER_RT_PRIO/2,
> - };
> - unsigned int count = 0;
> unsigned int target_ratio;
> + u64 quota, period;
> + int sleeptime;
> + unsigned long target_jiffies;
> + unsigned int guard;
> + unsigned int compensation;
> + int interval; /* jiffies to sleep for each attempt */
> + unsigned int duration_jiffies;
> + unsigned int window_size_now;
> +
> +again:
> + if (clamping == false)
> + goto out;
> +
> + quota = RUNTIME_INF;
> + period = NSEC_PER_SEC;
> + compensation = 0;
> + duration_jiffies = msecs_to_jiffies(duration);
> + /*
> + * make sure user selected ratio does not take effect until
> + * the next round. adjust target_ratio if user has changed
> + * target such that we can converge quickly.
> + */
> + target_ratio = set_target_ratio;
> + guard = 1 + target_ratio/20;
> + window_size_now = window_size;
>
> - set_bit(cpunr, cpu_clamping_mask);
> - set_freezable();
> - init_timer_on_stack(&wakeup_timer);
> - sched_setscheduler(current, SCHED_FIFO, &param);
> -
> - while (true == clamping && !kthread_should_stop() &&
> - cpu_online(cpunr)) {
> - int sleeptime;
> - unsigned long target_jiffies;
> - unsigned int guard;
> - unsigned int compensation = 0;
> - int interval; /* jiffies to sleep for each attempt */
> - unsigned int duration_jiffies = msecs_to_jiffies(duration);
> - unsigned int window_size_now;
> -
> - try_to_freeze();
> - /*
> - * make sure user selected ratio does not take effect until
> - * the next round. adjust target_ratio if user has changed
> - * target such that we can converge quickly.
> - */
> - target_ratio = set_target_ratio;
> - guard = 1 + target_ratio/20;
> - window_size_now = window_size;
> - count++;
> -
> - /*
> - * systems may have different ability to enter package level
> - * c-states, thus we need to compensate the injected idle ratio
> - * to achieve the actual target reported by the HW.
> - */
> - compensation = get_compensation(target_ratio);
> - interval = duration_jiffies*100/(target_ratio+compensation);
> -
> - /* align idle time */
> - target_jiffies = roundup(jiffies, interval);
> - sleeptime = target_jiffies - jiffies;
> - if (sleeptime <= 0)
> - sleeptime = 1;
> - schedule_timeout_interruptible(sleeptime);
> - /*
> - * only elected controlling cpu can collect stats and update
> - * control parameters.
> - */
> - if (cpunr == control_cpu && !(count%window_size_now)) {
> - should_skip =
> - powerclamp_adjust_controls(target_ratio,
> - guard, window_size_now);
> - smp_mb();
> - }
> -
> - if (should_skip)
> - continue;
> -
> - target_jiffies = jiffies + duration_jiffies;
> - mod_timer(&wakeup_timer, target_jiffies);
> - if (unlikely(local_softirq_pending()))
> - continue;
> - /*
> - * stop tick sched during idle time, interrupts are still
> - * allowed. thus jiffies are updated properly.
> - */
> - preempt_disable();
> - /* mwait until target jiffies is reached */
> - while (time_before(jiffies, target_jiffies)) {
> - unsigned long ecx = 1;
> - unsigned long eax = target_mwait;
> -
> - /*
> - * REVISIT: may call enter_idle() to notify drivers who
> - * can save power during cpu idle. same for exit_idle()
> - */
> - local_touch_nmi();
> - stop_critical_timings();
> - mwait_idle_with_hints(eax, ecx);
> - start_critical_timings();
> - atomic_inc(&idle_wakeup_counter);
> - }
> - preempt_enable();
> + /*
> + * systems may have different ability to enter package level
> + * c-states, thus we need to compensate the injected idle ratio
> + * to achieve the actual target reported by the HW.
> + */
> + compensation = get_compensation(target_ratio);
> + interval = duration_jiffies*100/(target_ratio+compensation);
> +
> + /* align idle time */
> + target_jiffies = roundup(jiffies, interval);
> + sleeptime = target_jiffies - jiffies;
> + if (sleeptime <= 0)
> + sleeptime = 1;
> +
> + if (!(count%window_size_now)) {
> + should_skip =
> + powerclamp_adjust_controls(target_ratio,
> + guard, window_size_now);
> + smp_mb();
> }
> - del_timer_sync(&wakeup_timer);
> - clear_bit(cpunr, cpu_clamping_mask);
>
> - return 0;
> + if (should_skip)
> + goto again;
> +
> + target_jiffies = jiffies + sleeptime + duration_jiffies;
> + mod_timer(&wakeup_timer, target_jiffies);
> + if (unlikely(local_softirq_pending()))
> + goto again;
> +
> + quota = jiffies_to_usecs(sleeptime);
> + period = jiffies_to_usecs(sleeptime + duration_jiffies);
> +
> +out:
> + tg_set_cfs_quota(&root_task_group, quota);
> + tg_set_cfs_period(&root_task_group, period);
> +}
> +
> +static void clamp_timer_fn(unsigned long foo)
> +{
> + /* Evaluate to see if clamping controls need to be adjusted */
> + count++;
> + clamp_cpus();
> }
>
> /*
> @@ -501,8 +473,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)
>
> static int start_power_clamp(void)
> {
> - unsigned long cpu;
> - struct task_struct *thread;
> + clamping = true;
>
> /* check if pkg cstate counter is completely 0, abort in this case */
> if (!has_pkg_state_counter()) {
> @@ -511,108 +482,21 @@ static int start_power_clamp(void)
> }
>
> set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
> - /* prevent cpu hotplug */
> - get_online_cpus();
> -
> - /* prefer BSP */
> - control_cpu = 0;
> - if (!cpu_online(control_cpu))
> - control_cpu = smp_processor_id();
>
> - clamping = true;
> schedule_delayed_work(&poll_pkg_cstate_work, 0);
> -
> - /* start one thread per online cpu */
> - for_each_online_cpu(cpu) {
> - struct task_struct **p =
> - per_cpu_ptr(powerclamp_thread, cpu);
> -
> - thread = kthread_create_on_node(clamp_thread,
> - (void *) cpu,
> - cpu_to_node(cpu),
> - "kidle_inject/%ld", cpu);
> - /* bind to cpu here */
> - if (likely(!IS_ERR(thread))) {
> - kthread_bind(thread, cpu);
> - wake_up_process(thread);
> - *p = thread;
> - }
> -
> - }
> - put_online_cpus();
> + clamp_cpus();
>
> return 0;
> }
>
> static void end_power_clamp(void)
> {
> - int i;
> - struct task_struct *thread;
> -
> clamping = false;
> - /*
> - * make clamping visible to other cpus and give per cpu clamping threads
> - * sometime to exit, or gets killed later.
> - */
> - smp_mb();
> - msleep(20);
> - if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
> - for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
> - pr_debug("clamping thread for cpu %d alive, kill\n", i);
> - thread = *per_cpu_ptr(powerclamp_thread, i);
> - kthread_stop(thread);
> - }
> - }
> -}
>
> -static int powerclamp_cpu_callback(struct notifier_block *nfb,
> - unsigned long action, void *hcpu)
> -{
> - unsigned long cpu = (unsigned long)hcpu;
> - struct task_struct *thread;
> - struct task_struct **percpu_thread =
> - per_cpu_ptr(powerclamp_thread, cpu);
> -
> - if (false == clamping)
> - goto exit_ok;
> -
> - switch (action) {
> - case CPU_ONLINE:
> - thread = kthread_create_on_node(clamp_thread,
> - (void *) cpu,
> - cpu_to_node(cpu),
> - "kidle_inject/%lu", cpu);
> - if (likely(!IS_ERR(thread))) {
> - kthread_bind(thread, cpu);
> - wake_up_process(thread);
> - *percpu_thread = thread;
> - }
> - /* prefer BSP as controlling CPU */
> - if (cpu == 0) {
> - control_cpu = 0;
> - smp_mb();
> - }
> - break;
> - case CPU_DEAD:
> - if (test_bit(cpu, cpu_clamping_mask)) {
> - pr_err("cpu %lu dead but powerclamping thread is not\n",
> - cpu);
> - kthread_stop(*percpu_thread);
> - }
> - if (cpu == control_cpu) {
> - control_cpu = smp_processor_id();
> - smp_mb();
> - }
> - }
> -
> -exit_ok:
> - return NOTIFY_OK;
> + clamp_cpus();
> + del_timer_sync(&wakeup_timer);
> }
>
> -static struct notifier_block powerclamp_cpu_notifier = {
> - .notifier_call = powerclamp_cpu_callback,
> -};
> -
> static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
> unsigned long *state)
> {
> @@ -656,6 +540,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
> }
>
> exit_set:
> + clamp_cpus();
> return ret;
> }
>
> @@ -716,7 +601,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused)
> {
> int i = 0;
>
> - seq_printf(m, "controlling cpu: %d\n", control_cpu);
> seq_printf(m, "pct confidence steady dynamic (compensation)\n");
> for (i = 0; i < MAX_TARGET_RATIO; i++) {
> seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
> @@ -762,33 +646,20 @@ file_error:
> static int powerclamp_init(void)
> {
> int retval;
> - int bitmap_size;
> -
> - bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
> - cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
> - if (!cpu_clamping_mask)
> - return -ENOMEM;
>
> /* probe cpu features and ids here */
> retval = powerclamp_probe();
> if (retval)
> - goto exit_free;
> + goto exit;
>
> /* set default limit, maybe adjusted during runtime based on feedback */
> window_size = 2;
> - register_hotcpu_notifier(&powerclamp_cpu_notifier);
> -
> - powerclamp_thread = alloc_percpu(struct task_struct *);
> - if (!powerclamp_thread) {
> - retval = -ENOMEM;
> - goto exit_unregister;
> - }
>
> cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
> &powerclamp_cooling_ops);
> if (IS_ERR(cooling_dev)) {
> retval = -ENODEV;
> - goto exit_free_thread;
> + goto exit;
> }
>
> if (!duration)
> @@ -798,23 +669,15 @@ static int powerclamp_init(void)
>
> return 0;
>
> -exit_free_thread:
> - free_percpu(powerclamp_thread);
> -exit_unregister:
> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> -exit_free:
> - kfree(cpu_clamping_mask);
> +exit:
> return retval;
> }
> module_init(powerclamp_init);
>
> static void powerclamp_exit(void)
> {
> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> end_power_clamp();
> - free_percpu(powerclamp_thread);
> thermal_cooling_device_unregister(cooling_dev);
> - kfree(cpu_clamping_mask);
>
> cancel_delayed_work_sync(&poll_pkg_cstate_work);
> debugfs_remove_recursive(debug_dir);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..2493942 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -163,6 +163,11 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
> load += n*(FIXED_1-exp); \
> load >>= FSHIFT;
>
> +/*
> + * single value that denotes runtime == period, ie unlimited time.
> + */
> +#define RUNTIME_INF ((u64)~0ULL)
> +
> extern unsigned long total_forks;
> extern int nr_threads;
> DECLARE_PER_CPU(unsigned long, process_counts);
> @@ -3002,6 +3007,10 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>
> #ifdef CONFIG_CGROUP_SCHED
> extern struct task_group root_task_group;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
> +#endif /* CONFIG_CFS_BANDWIDTH */
> #endif /* CONFIG_CGROUP_SCHED */
>
> extern int task_can_switch_user(struct user_struct *up,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e628cb1..7471b06 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7097,6 +7097,7 @@ int in_sched_functions(unsigned long addr)
> * Every task in system belongs to this group at bootup.
> */
> struct task_group root_task_group;
> +EXPORT_SYMBOL_GPL(root_task_group);
> LIST_HEAD(task_groups);
> #endif
>
> @@ -8059,9 +8060,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> int i, ret = 0, runtime_enabled, runtime_was_enabled;
> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>
> - if (tg == &root_task_group)
> - return -EINVAL;
> -
> /*
> * Ensure we have at some amount of bandwidth every period. This is
> * to prevent reaching a state of large arrears when throttled via
> @@ -8141,6 +8139,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>
> return tg_set_cfs_bandwidth(tg, period, quota);
> }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_quota);
>
> long tg_get_cfs_quota(struct task_group *tg)
> {
> @@ -8164,6 +8163,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
>
> return tg_set_cfs_bandwidth(tg, period, quota);
> }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_period);
>
> long tg_get_cfs_period(struct task_group *tg)
> {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9a2a45c..20493e4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -72,11 +72,6 @@ extern void update_cpu_load_active(struct rq *this_rq);
> * These are the 'tuning knobs' of the scheduler:
> */
>
> -/*
> - * single value that denotes runtime == period, ie unlimited time.
> - */
> -#define RUNTIME_INF ((u64)~0ULL)
> -
> static inline int fair_policy(int policy)
> {
> return policy == SCHED_NORMAL || policy == SCHED_BATCH;
>

2015-02-09 18:14:46

by Steven Noonan

[permalink] [raw]
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

On Mon, Feb 9, 2015 at 9:56 AM, Steven Noonan <[email protected]> wrote:
> On Mon, Feb 9, 2015 at 3:51 AM, Preeti U Murthy
> <[email protected]> wrote:
>> Hi Steven,
>>
>> On 02/09/2015 01:02 PM, Steven Noonan wrote:
>>> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
>>> <[email protected]> wrote:
>>>> [patch changelog snipped]
>>>
>>> This doesn't compile.
>>
>> Thanks for reporting this! I realized that I had not enabled the powerclamp driver
>> as a module while compile-testing. I was focusing on the issues with the design and
>> failed to cross-verify this. Apologies for the inconvenience.
>>
>> Find the compile-tested diff below.
>>
>> I also realized that clamp_cpus(), which sets the bandwidth, cannot safely be called
>> from multiple places. Currently I am calling it from end_power_clamp(), when the user
>> changes the idle clamping duration, and from a queued timer. This would require
>> synchronization between the callers, which is not really warranted. The queued
>> wakeup_timer alone can re-evaluate the clamping metrics after every throttle-unthrottle
>> period, and this should suffice as far as I can see. Thoughts?
>
> Hmm, I've had two system lockups so far while running a kernel with
> intel_powerclamp loaded. Both times it slowly ground to a halt and
> processes piled up...

I let the hung system sit for a while and it eventually partially recovered and
came back with this in dmesg (9:24 is around system boot time, 9:51 is around
when it went out to lunch, and 10:01 is when it came back):

Feb 09 09:24:16 osprey kernel: wlp2s0: authenticate with e0:3f:49:28:3b:8c
Feb 09 09:24:16 osprey kernel: wlp2s0: send auth to e0:3f:49:28:3b:8c (try 1/3)
Feb 09 09:24:16 osprey kernel: wlp2s0: authenticated
Feb 09 09:24:16 osprey kernel: wlp2s0: associate with e0:3f:49:28:3b:8c (try 1/3)
Feb 09 09:24:16 osprey kernel: wlp2s0: RX AssocResp from e0:3f:49:28:3b:8c (capab=0x1011 status=0 aid=4)
Feb 09 09:24:16 osprey kernel: wlp2s0: associated
Feb 09 09:50:19 osprey kernel: intel_powerclamp: Start idle injection to reduce power
Feb 09 09:50:23 osprey kernel: intel_powerclamp: Stop forced idle injection
Feb 09 09:50:27 osprey kernel: intel_powerclamp: Start idle injection to reduce power
Feb 09 09:50:31 osprey kernel: intel_powerclamp: Stop forced idle injection
Feb 09 09:50:31 osprey kernel: intel_powerclamp: Start idle injection to reduce power
Feb 09 09:50:57 osprey kernel: Watchdog[1271]: segfault at 0 ip 00007f580734ad7b sp 00007f57f04175a0 error 6 in chromium[7f580310e000+66c6000]
Feb 09 09:51:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU
Feb 09 09:51:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/8abc, nonlazy_posted: 0, L.
Feb 09 09:51:46 osprey kernel: (t=60000 jiffies g=40078 c=40077 q=69418)
Feb 09 09:51:46 osprey kernel: Task dump for CPU 0:
Feb 09 09:51:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008
Feb 09 09:51:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000
Feb 09 09:51:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300
Feb 09 09:51:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300
Feb 09 09:51:46 osprey kernel: Call Trace:
Feb 09 09:51:46 osprey kernel: <IRQ> [<ffffffff81099946>] sched_show_task+0xb6/0x120
Feb 09 09:51:46 osprey kernel: [<ffffffff8109ce8d>] dump_cpu_task+0x3d/0x50
Feb 09 09:51:46 osprey kernel: [<ffffffff810cd391>] rcu_dump_cpu_stacks+0x91/0xd0
Feb 09 09:51:46 osprey kernel: [<ffffffff810d15c2>] rcu_check_callbacks+0x4d2/0x7e0
Feb 09 09:51:46 osprey kernel: [<ffffffff812c6e73>] ? __this_cpu_preempt_check+0x13/0x20
Feb 09 09:51:46 osprey kernel: [<ffffffff810d7f46>] ? hrtimer_run_queues+0x36/0x110
Feb 09 09:51:46 osprey kernel: [<ffffffff810d68cb>] update_process_times+0x4b/0x80
Feb 09 09:51:46 osprey kernel: [<ffffffff810e5e65>] tick_sched_handle.isra.13+0x25/0x60
Feb 09 09:51:46 osprey kernel: [<ffffffff810e5ee5>] tick_sched_timer+0x45/0x80
Feb 09 09:51:46 osprey kernel: [<ffffffff810d6e87>] __run_hrtimer+0x77/0x250
Feb 09 09:51:46 osprey kernel: [<ffffffff810e5ea0>] ? tick_sched_handle.isra.13+0x60/0x60
Feb 09 09:51:46 osprey kernel: [<ffffffff810d79d7>] hrtimer_interrupt+0x107/0x250
Feb 09 09:51:46 osprey kernel: [<ffffffff8104acbb>] local_apic_timer_interrupt+0x3b/0x70
Feb 09 09:51:46 osprey kernel: [<ffffffff8152a8a3>] smp_apic_timer_interrupt+0x43/0x60
Feb 09 09:51:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 09:51:46 osprey kernel: [<ffffffffa0b0728b>] ? 0xffffffffa0b0728b
Feb 09 09:51:46 osprey kernel: [<ffffffffa0b0718d>] ? 0xffffffffa0b0718d
Feb 09 09:51:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 09:51:46 osprey kernel: [<ffffffffa0b075e5>] 0xffffffffa0b075e5
Feb 09 09:51:46 osprey kernel: [<ffffffff810d54ca>] call_timer_fn+0x3a/0x160
Feb 09 09:51:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 09:51:46 osprey kernel: [<ffffffff810d5dc4>] run_timer_softirq+0x264/0x310
Feb 09 09:51:46 osprey kernel: [<ffffffff81073ab6>] __do_softirq+0xf6/0x2e0
Feb 09 09:51:46 osprey kernel: [<ffffffff81073efe>] irq_exit+0x8e/0xb0
Feb 09 09:51:46 osprey kernel: [<ffffffff8152a8a8>] smp_apic_timer_interrupt+0x48/0x60
Feb 09 09:51:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 09:51:46 osprey kernel: <EOI> [<ffffffff813e5db5>] ? cpuidle_enter_state+0x65/0x1a0
Feb 09 09:51:46 osprey kernel: [<ffffffff813e5da1>] ? cpuidle_enter_state+0x51/0x1a0
Feb 09 09:51:46 osprey kernel: [<ffffffff813e5fd7>] cpuidle_enter+0x17/0x20
Feb 09 09:51:46 osprey kernel: [<ffffffff810af014>] cpu_startup_entry+0x3f4/0x460
Feb 09 09:51:46 osprey kernel: [<ffffffff815187f5>] rest_init+0x85/0x90
Feb 09 09:51:46 osprey kernel: [<ffffffff81ba0f7e>] start_kernel+0x445/0x452
Feb 09 09:51:46 osprey kernel: [<ffffffff81ba0120>] ? early_idt_handlers+0x120/0x120
Feb 09 09:51:46 osprey kernel: [<ffffffff81ba04d7>] x86_64_start_reservations+0x2a/0x2c
Feb 09 09:51:46 osprey kernel: [<ffffffff81ba061c>] x86_64_start_kernel+0x143/0x152
Feb 09 09:54:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU
Feb 09 09:54:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/49df, nonlazy_posted: 0, L.
Feb 09 09:54:46 osprey kernel: (t=240003 jiffies g=40078 c=40077 q=72706)
Feb 09 09:54:46 osprey kernel: Task dump for CPU 0:
Feb 09 09:54:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008
Feb 09 09:54:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000
Feb 09 09:54:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300
Feb 09 09:54:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300
Feb 09 09:54:46 osprey kernel: Call Trace:
Feb 09 09:54:46 osprey kernel: <IRQ> [<ffffffff81099946>] sched_show_task+0xb6/0x120
Feb 09 09:54:46 osprey kernel: [<ffffffff8109ce8d>] dump_cpu_task+0x3d/0x50
Feb 09 09:54:46 osprey kernel: [<ffffffff810cd391>] rcu_dump_cpu_stacks+0x91/0xd0
Feb 09 09:54:46 osprey kernel: [<ffffffff810d15c2>] rcu_check_callbacks+0x4d2/0x7e0
Feb 09 09:54:46 osprey kernel: [<ffffffff812c6e73>] ? __this_cpu_preempt_check+0x13/0x20
Feb 09 09:54:46 osprey kernel: [<ffffffff810d7f46>] ? hrtimer_run_queues+0x36/0x110
Feb 09 09:54:46 osprey kernel: [<ffffffff810d68cb>] update_process_times+0x4b/0x80
Feb 09 09:54:46 osprey kernel: [<ffffffff810e5e65>] tick_sched_handle.isra.13+0x25/0x60
Feb 09 09:54:46 osprey kernel: [<ffffffff810e5ee5>] tick_sched_timer+0x45/0x80
Feb 09 09:54:46 osprey kernel: [<ffffffff810d6e87>] __run_hrtimer+0x77/0x250
Feb 09 09:54:46 osprey kernel: [<ffffffff810e5ea0>] ? tick_sched_handle.isra.13+0x60/0x60
Feb 09 09:54:46 osprey kernel: [<ffffffff810d79d7>] hrtimer_interrupt+0x107/0x250
Feb 09 09:54:46 osprey kernel: [<ffffffff8104acbb>] local_apic_timer_interrupt+0x3b/0x70
Feb 09 09:54:46 osprey kernel: [<ffffffff8152a8a3>] smp_apic_timer_interrupt+0x43/0x60
Feb 09 09:54:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 09:54:46 osprey kernel: [<ffffffffa0b0726d>] ? 0xffffffffa0b0726d
Feb 09 09:54:46 osprey kernel: [<ffffffffa0b0718d>] ? 0xffffffffa0b0718d
Feb 09 09:54:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 09:54:46 osprey kernel: [<ffffffffa0b075e5>] 0xffffffffa0b075e5
Feb 09 09:54:46 osprey kernel: [<ffffffff810d54ca>] call_timer_fn+0x3a/0x160
Feb 09 09:54:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 09:54:46 osprey kernel: [<ffffffff810d5dc4>] run_timer_softirq+0x264/0x310
Feb 09 09:54:46 osprey kernel: [<ffffffff81073ab6>] __do_softirq+0xf6/0x2e0
Feb 09 09:54:46 osprey kernel: [<ffffffff81073efe>] irq_exit+0x8e/0xb0
Feb 09 09:54:46 osprey kernel: [<ffffffff8152a8a8>] smp_apic_timer_interrupt+0x48/0x60
Feb 09 09:54:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 09:54:46 osprey kernel: <EOI> [<ffffffff813e5db5>] ? cpuidle_enter_state+0x65/0x1a0
Feb 09 09:54:46 osprey kernel: [<ffffffff813e5da1>] ? cpuidle_enter_state+0x51/0x1a0
Feb 09 09:54:46 osprey kernel: [<ffffffff813e5fd7>] cpuidle_enter+0x17/0x20
Feb 09 09:54:46 osprey kernel: [<ffffffff810af014>] cpu_startup_entry+0x3f4/0x460
Feb 09 09:54:46 osprey kernel: [<ffffffff815187f5>] rest_init+0x85/0x90
Feb 09 09:54:46 osprey kernel: [<ffffffff81ba0f7e>] start_kernel+0x445/0x452
Feb 09 09:54:46 osprey kernel: [<ffffffff81ba0120>] ? early_idt_handlers+0x120/0x120
Feb 09 09:54:46 osprey kernel: [<ffffffff81ba04d7>] x86_64_start_reservations+0x2a/0x2c
Feb 09 09:54:46 osprey kernel: [<ffffffff81ba061c>] x86_64_start_kernel+0x143/0x152
Feb 09 09:57:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU
Feb 09 09:57:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/0902, nonlazy_posted: 0, L.
Feb 09 09:57:46 osprey kernel: (t=420006 jiffies g=40078 c=40077 q=76104)
Feb 09 09:57:46 osprey kernel: Task dump for CPU 0:
Feb 09 09:57:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008
Feb 09 09:57:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000
Feb 09 09:57:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300
Feb 09 09:57:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300
Feb 09 09:57:46 osprey kernel: Call Trace:
Feb 09 09:57:46 osprey kernel: <IRQ> [<ffffffff81099946>] sched_show_task+0xb6/0x120
Feb 09 09:57:46 osprey kernel: [<ffffffff8109ce8d>] dump_cpu_task+0x3d/0x50
Feb 09 09:57:46 osprey kernel: [<ffffffff810cd391>] rcu_dump_cpu_stacks+0x91/0xd0
Feb 09 09:57:46 osprey kernel: [<ffffffff810d15c2>] rcu_check_callbacks+0x4d2/0x7e0
Feb 09 09:57:46 osprey kernel: [<ffffffff812c6e73>] ? __this_cpu_preempt_check+0x13/0x20
Feb 09 09:57:46 osprey kernel: [<ffffffff810d7f46>] ? hrtimer_run_queues+0x36/0x110
Feb 09 09:57:46 osprey kernel: [<ffffffff810d68cb>] update_process_times+0x4b/0x80
Feb 09 09:57:46 osprey kernel: [<ffffffff810e5e65>] tick_sched_handle.isra.13+0x25/0x60
Feb 09 09:57:46 osprey kernel: [<ffffffff810e5ee5>] tick_sched_timer+0x45/0x80
Feb 09 09:57:46 osprey kernel: [<ffffffff810d6e87>] __run_hrtimer+0x77/0x250
Feb 09 09:57:46 osprey kernel: [<ffffffff810e5ea0>] ? tick_sched_handle.isra.13+0x60/0x60
Feb 09 09:57:46 osprey kernel: [<ffffffff810d79d7>] hrtimer_interrupt+0x107/0x250
Feb 09 09:57:46 osprey kernel: [<ffffffff8104acbb>] local_apic_timer_interrupt+0x3b/0x70
Feb 09 09:57:46 osprey kernel: [<ffffffff8152a8a3>] smp_apic_timer_interrupt+0x43/0x60
Feb 09 09:57:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 09:57:46 osprey kernel: [<ffffffffa0b0726d>] ? 0xffffffffa0b0726d
Feb 09 09:57:46 osprey kernel: [<ffffffffa0b0718d>] ? 0xffffffffa0b0718d
Feb 09 09:57:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 09:57:46 osprey kernel: [<ffffffffa0b075e5>] 0xffffffffa0b075e5
Feb 09 09:57:46 osprey kernel: [<ffffffff810d54ca>] call_timer_fn+0x3a/0x160
Feb 09 09:57:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 09:57:46 osprey kernel: [<ffffffff810d5dc4>] run_timer_softirq+0x264/0x310
Feb 09 09:57:46 osprey kernel: [<ffffffff81073ab6>] __do_softirq+0xf6/0x2e0
Feb 09 09:57:46 osprey kernel: [<ffffffff81073efe>] irq_exit+0x8e/0xb0
Feb 09 09:57:46 osprey kernel: [<ffffffff8152a8a8>] smp_apic_timer_interrupt+0x48/0x60
Feb 09 09:57:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 09:57:46 osprey kernel: <EOI> [<ffffffff813e5db5>] ? cpuidle_enter_state+0x65/0x1a0
Feb 09 09:57:46 osprey kernel: [<ffffffff813e5da1>] ? cpuidle_enter_state+0x51/0x1a0
Feb 09 09:57:46 osprey kernel: [<ffffffff813e5fd7>] cpuidle_enter+0x17/0x20
Feb 09 09:57:46 osprey kernel: [<ffffffff810af014>] cpu_startup_entry+0x3f4/0x460
Feb 09 09:57:46 osprey kernel: [<ffffffff815187f5>] rest_init+0x85/0x90
Feb 09 09:57:46 osprey kernel: [<ffffffff81ba0f7e>] start_kernel+0x445/0x452
Feb 09 09:57:46 osprey kernel: [<ffffffff81ba0120>] ? early_idt_handlers+0x120/0x120
Feb 09 09:57:46 osprey kernel: [<ffffffff81ba04d7>] x86_64_start_reservations+0x2a/0x2c
Feb 09 09:57:46 osprey kernel: [<ffffffff81ba061c>] x86_64_start_kernel+0x143/0x152
Feb 09 10:00:46 osprey kernel: INFO: rcu_preempt self-detected stall on CPU
Feb 09 10:00:46 osprey kernel: 0: (1 GPs behind) idle=d1d/2/0 softirq=48032/48032 last_accelerate: a04e/c825, nonlazy_posted: 0, L.
Feb 09 10:00:46 osprey kernel: (t=600009 jiffies g=40078 c=40077 q=79433)
Feb 09 10:00:46 osprey kernel: Task dump for CPU 0:
Feb 09 10:00:46 osprey kernel: swapper/0 R running task 0 0 0 0x00000008
Feb 09 10:00:46 osprey kernel: 0000000000000000 ffff88041ea03ae0 ffffffff81099946 0000000000000000
Feb 09 10:00:46 osprey kernel: ffffffff81a56f00 ffff88041ea03b00 ffffffff8109ce8d ffff88041ea0d300
Feb 09 10:00:46 osprey kernel: 0000000000000001 ffff88041ea03b30 ffffffff810cd391 ffff88041ea0d300
Feb 09 10:00:46 osprey kernel: Call Trace:
Feb 09 10:00:46 osprey kernel: <IRQ> [<ffffffff81099946>] sched_show_task+0xb6/0x120
Feb 09 10:00:46 osprey kernel: [<ffffffff8109ce8d>] dump_cpu_task+0x3d/0x50
Feb 09 10:00:46 osprey kernel: [<ffffffff810cd391>] rcu_dump_cpu_stacks+0x91/0xd0
Feb 09 10:00:46 osprey kernel: [<ffffffff810d15c2>] rcu_check_callbacks+0x4d2/0x7e0
Feb 09 10:00:46 osprey kernel: [<ffffffff812c6e73>] ? __this_cpu_preempt_check+0x13/0x20
Feb 09 10:00:46 osprey kernel: [<ffffffff810d7f46>] ? hrtimer_run_queues+0x36/0x110
Feb 09 10:00:46 osprey kernel: [<ffffffff810d68cb>] update_process_times+0x4b/0x80
Feb 09 10:00:46 osprey kernel: [<ffffffff810e5e65>] tick_sched_handle.isra.13+0x25/0x60
Feb 09 10:00:46 osprey kernel: [<ffffffff810e5ee5>] tick_sched_timer+0x45/0x80
Feb 09 10:00:46 osprey kernel: [<ffffffff810d6e87>] __run_hrtimer+0x77/0x250
Feb 09 10:00:46 osprey kernel: [<ffffffff810e5ea0>] ? tick_sched_handle.isra.13+0x60/0x60
Feb 09 10:00:46 osprey kernel: [<ffffffff810d79d7>] hrtimer_interrupt+0x107/0x250
Feb 09 10:00:46 osprey kernel: [<ffffffff8104acbb>] local_apic_timer_interrupt+0x3b/0x70
Feb 09 10:00:46 osprey kernel: [<ffffffff8152a8a3>] smp_apic_timer_interrupt+0x43/0x60
Feb 09 10:00:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 10:00:46 osprey kernel: [<ffffffffa0b0726d>] ? 0xffffffffa0b0726d
Feb 09 10:00:46 osprey kernel: [<ffffffffa0b0718d>] ? 0xffffffffa0b0718d
Feb 09 10:00:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 10:00:46 osprey kernel: [<ffffffffa0b075e5>] 0xffffffffa0b075e5
Feb 09 10:00:46 osprey kernel: [<ffffffff810d54ca>] call_timer_fn+0x3a/0x160
Feb 09 10:00:46 osprey kernel: [<ffffffffa0b075d0>] ? 0xffffffffa0b075d0
Feb 09 10:00:46 osprey kernel: [<ffffffff810d5dc4>] run_timer_softirq+0x264/0x310
Feb 09 10:00:46 osprey kernel: [<ffffffff81073ab6>] __do_softirq+0xf6/0x2e0
Feb 09 10:00:46 osprey kernel: [<ffffffff81073efe>] irq_exit+0x8e/0xb0
Feb 09 10:00:46 osprey kernel: [<ffffffff8152a8a8>] smp_apic_timer_interrupt+0x48/0x60
Feb 09 10:00:46 osprey kernel: [<ffffffff81528aed>] apic_timer_interrupt+0x6d/0x80
Feb 09 10:00:46 osprey kernel: <EOI> [<ffffffff813e5db5>] ? cpuidle_enter_state+0x65/0x1a0
Feb 09 10:00:46 osprey kernel: [<ffffffff813e5da1>] ? cpuidle_enter_state+0x51/0x1a0
Feb 09 10:00:46 osprey kernel: [<ffffffff813e5fd7>] cpuidle_enter+0x17/0x20
Feb 09 10:00:46 osprey kernel: [<ffffffff810af014>] cpu_startup_entry+0x3f4/0x460
Feb 09 10:00:46 osprey kernel: [<ffffffff815187f5>] rest_init+0x85/0x90
Feb 09 10:00:46 osprey kernel: [<ffffffff81ba0f7e>] start_kernel+0x445/0x452
Feb 09 10:00:46 osprey kernel: [<ffffffff81ba0120>] ? early_idt_handlers+0x120/0x120
Feb 09 10:00:46 osprey kernel: [<ffffffff81ba04d7>] x86_64_start_reservations+0x2a/0x2c
Feb 09 10:00:46 osprey kernel: [<ffffffff81ba061c>] x86_64_start_kernel+0x143/0x152

I couldn't do much at this point. While I had some interactivity, processes
were failing to spawn, so I had to just 'echo s > /proc/sysrq-trigger; echo b >
/proc/sysrq-trigger'.

>> [compile-tested v2 diff snipped; identical to the diff quoted in full earlier in the thread]

2015-02-11 09:00:41

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

On 02/09/2015 11:44 PM, Steven Noonan wrote:
> On Mon, Feb 9, 2015 at 9:56 AM, Steven Noonan <[email protected]> wrote:
>> On Mon, Feb 9, 2015 at 3:51 AM, Preeti U Murthy
>> <[email protected]> wrote:
>>> Hi Steven,
>>>
>>> On 02/09/2015 01:02 PM, Steven Noonan wrote:
>>>> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
>>>> <[email protected]> wrote:
>>>>> [patch changelog snipped]
>>>>
>>>> This doesn't compile.
>>>
>>> Thanks for reporting this! I realized that I had not enabled the powerclamp driver
>>> as a module while compile-testing. I was focusing on the issues with the design and
>>> failed to cross-verify this. Apologies for the inconvenience.
>>>
>>> Find the compile-tested diff below.
>>>
>>> I also realized that clamp_cpus(), which sets the bandwidth, cannot safely be called
>>> from multiple places. Currently I am calling it from end_power_clamp(), when the user
>>> changes the idle clamping duration, and from a queued timer. This would require
>>> synchronization between the callers, which is not really warranted. The queued
>>> wakeup_timer alone can re-evaluate the clamping metrics after every throttle-unthrottle
>>> period, and this should suffice as far as I can see. Thoughts?
>>
>> Hmm, I've had two system lockups so far while running a kernel with
>> intel_powerclamp loaded. Both times it slowly ground to a halt and
>> processes piled up...

Hmm, I see. I am not surprised, because this patch is not complete yet.
The idea was to gather suggestions and review of the approach first,
before going ahead and making it robust. Let me iron out the creases I
found and repost the patch for you to test. Thanks a lot for the
testing efforts! :)

Regards
Preeti U Murthy

2015-02-13 13:32:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

On Mon, Feb 09, 2015 at 10:19:43AM +0530, Preeti U Murthy wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..6a7ccb2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3002,6 +3002,12 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>
> #ifdef CONFIG_CGROUP_SCHED
> extern struct task_group root_task_group;
> +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
> +#else
> +
> +static inline int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +static inline int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
> #endif /* CONFIG_CGROUP_SCHED */

Instead you might want to make the whole powerclamp thing depend on
CONFIG_CFS_BANDWIDTH.
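
I.e. something like this (untested) against drivers/thermal/Kconfig:

 config INTEL_POWERCLAMP
 	depends on THERMAL
 	depends on X86
 	depends on CPU_SUP_INTEL
-	select CFS_BANDWIDTH
+	depends on CFS_BANDWIDTH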

Also, exposing these and root_task_group is of course vile. Not to
mention you change the user (cgroup) interface without mention.

In any case, I cannot see how this could ever work. Bandwidth is shared
across CPUs; nothing will even attempt to get CPUs to idle at the same
time.
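
To make that concrete: each cpu's cfs_rq independently pulls runtime out
of the shared pool in slices (5ms by default) and throttles only once its
local slice is exhausted, so cpus run dry at different times. Grossly
simplified (grab_slice() is a made-up name; the real logic lives in
assign_cfs_rq_runtime() and runs under cfs_b->lock):

static u64 grab_slice(struct cfs_bandwidth *cfs_b)
{
	u64 slice = 5 * NSEC_PER_MSEC;	/* sched_cfs_bandwidth_slice() */
	u64 amount = min(slice, cfs_b->runtime);

	cfs_b->runtime -= amount;	/* each cpu drains the pool on its own */
	return amount;
}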

2015-02-13 14:04:02

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism


> Also, exposing these and root_task_group is of course vile. Not to
> mention you change the user (cgroup) interface without mention.
>
> In any case, I cannot see how this could ever work. Bandwidth is shared
> across CPUs; nothing will even attempt to get CPUs to idle at the same
> time.

idle injection is only worth it if you can get package C-states,
i.e. all cpus in the system are idle at the same time.

(and in powerclamp, the package C-state percentage is the target of the
control loop, since that's pretty much what determines the power saved)
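
(for reference, that residency is read from the package MSRs; a
simplified sketch of the sampling, where pkg_c6_residency() is a
made-up name for one of the counters that the driver's
pkg_state_counter() sums across all package C-states:)

static u64 pkg_c6_residency(void)
{
	u64 val = 0;

	rdmsrl_safe(MSR_PKG_C6_RESIDENCY, &val);
	return val;
}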
