Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933694AbbBIR4N (ORCPT ); Mon, 9 Feb 2015 12:56:13 -0500 Received: from mail-lb0-f176.google.com ([209.85.217.176]:39750 "EHLO mail-lb0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933111AbbBIR4K (ORCPT ); Mon, 9 Feb 2015 12:56:10 -0500 MIME-Version: 1.0 In-Reply-To: <54D89F46.4030803@linux.vnet.ibm.com> References: <20150209044852.6231.66456.stgit@preeti.in.ibm.com> <54D89F46.4030803@linux.vnet.ibm.com> Date: Mon, 9 Feb 2015 09:56:08 -0800 Message-ID: Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism From: Steven Noonan To: Preeti U Murthy Cc: Peter Zijlstra , Thomas Gleixner , jacob.jun.pan@intel.com, Arjan van de Ven , Linux Kernel mailing List , =?UTF-8?B?RnLDqWTDqXJpYyBXZWlzYmVja2Vy?= , frederic@kernel.org, Daniel Lezcano , Amit Kucheria , Eduardo Valentin , Viresh Kumar , rui.zhang@intel.com Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 23677 Lines: 605 On Mon, Feb 9, 2015 at 3:51 AM, Preeti U Murthy wrote: > Hi Steven, > > On 02/09/2015 01:02 PM, Steven Noonan wrote: >> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy >> wrote: >>> The powerclamp driver injects idle periods to stay within the thermal constraints. >>> The driver does a fake idle by spawning per-cpu threads that call the mwait >>> instruction. This behavior of fake idle can confuse the other kernel subsystems. >>> For instance it calls into the nohz tick handlers, which are meant to be called >>> only by the idle thread. It sets the state of the tick in each cpu to idle and >>> stops the tick, when there are tasks on the runqueue. As a result the callers of >>> idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the >>> former thinks that the cpu is busy, the latter thinks that it is idle. The outcome >>> may be inconsistency in the scheduler/nohz states which can lead to serious >>> consequences. One of them was reported on this thread: >>> https://lkml.org/lkml/2014/12/11/365. >>> >>> Thomas posted out a patch to disable the powerclamp driver from calling into the >>> tick nohz code which has taken care of the above regression for the moment. However >>> powerclamp driver as a result, will not be able to inject idle periods due to the >>> presence of periodic ticks. With the current design of fake idle, we cannot move >>> towards a better solution. >>> https://lkml.org/lkml/2014/12/18/169 >>> >>> This patch aims at removing the concept of fake idle and instead makes the cpus >>> truly idle by throttling the runqueues during the idle injection periods. The situation >>> is in fact very similar to throttling of cfs_rqs when they exceed their bandwidths. >>> The idle injection metrics can be mapped to the bandwidth control metrics 'quota' and >>> 'period' to achieve the same result. When the powerclamping is begun or when the >>> clamping controls have been modified, the bandwidth for the root task group is set. >>> The 'quota' will be the amount of time that the system needs to be busy and 'period' >>> will be the sum of this busy duration and the idle duration as calculated by the driver. >>> This gets rid of per-cpu kthreads, control cpu, hotplug notifiers and clamping mask since >>> the thread starting powerclamping will set the bandwidth and throttling of all cpus will >>> automatically fall in place. None of the other cpus need be bothered about this. This >>> simplifies the design of the driver. >>> >>> Of course this is only if the idle injection metrics can be conveniently transformed >>> into bandwidth control metrics. There are a couple of other primary concerns around if >>> doing the below two in this patch is valid. >>> a. This patch exports the functions to set the quota and period of task groups. >>> b. This patch removes the constraint of not being able to set the root task grp's bandwidth. >>> >>> Signed-off-by: Preeti U Murthy >> >> This doesn't compile. > > Thanks for reporting this! I realized that I had not compiled in the powerclamp driver > as a module while compile testing it. I was focusing on the issues with the design and > failed to cross verify this. Apologies for the inconvenience. > > Find the diff compile tested below. > > I also realized that clamp_cpus() that sets the bandwidth cannot be called from > multiple places. Currently I am calling it from end_powerclamp(), when the user changes the > idle clamping duration and from a queued timer. This will require synchronization between > callers which is not really called for. The queued wakeup_timer alone can re-evaluate the > clamping metrics after every throttle-unthrottle period and this should suffice as far > as I can see. Thoughts ? Hmm, I've had two system lockups so far while running a kernel with intel_powerclamp loaded. Both times it slowly ground to a halt and processes piled up... > ---------------------------------------------------------------------------------- > > V2 of intel_powerclamp driver > > From: Preeti U Murthy > > > --- > drivers/thermal/Kconfig | 1 > drivers/thermal/intel_powerclamp.c | 301 ++++++++++-------------------------- > include/linux/sched.h | 9 + > kernel/sched/core.c | 6 - > kernel/sched/sched.h | 5 - > 5 files changed, 95 insertions(+), 227 deletions(-) > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig > index af40db0..4b7cd02 100644 > --- a/drivers/thermal/Kconfig > +++ b/drivers/thermal/Kconfig > @@ -233,6 +233,7 @@ config INTEL_POWERCLAMP > depends on THERMAL > depends on X86 > depends on CPU_SUP_INTEL > + select CFS_BANDWIDTH > help > Enable this to enable Intel PowerClamp idle injection driver. This > enforce idle time which results in more package C-state residency. The > diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c > index 6ceebd6..4bd07bb 100644 > --- a/drivers/thermal/intel_powerclamp.c > +++ b/drivers/thermal/intel_powerclamp.c > @@ -51,6 +51,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -78,20 +79,18 @@ static struct dentry *debug_dir; > static unsigned int set_target_ratio; > static unsigned int current_ratio; > static bool should_skip; > +static unsigned int count = 0; > static bool reduce_irq; > static atomic_t idle_wakeup_counter; > -static unsigned int control_cpu; /* The cpu assigned to collect stat and update > - * control parameters. default to BSP but BSP > - * can be offlined. > - */ > static bool clamping; > > +/* Timer that evaluates bandwidth reset if clamping knobs have changed */ > +static void clamp_timer_fn(unsigned long foo); > +DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0); > + > +static void clamp_cpus(void); > > -static struct task_struct * __percpu *powerclamp_thread; > static struct thermal_cooling_device *cooling_dev; > -static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu > - * clamping thread > - */ > > static unsigned int duration; > static unsigned int pkg_cstate_ratio_cur; > @@ -163,7 +162,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp) > smp_mb(); > > exit_win: > - > + clamp_cpus(); > return ret; > } > > @@ -256,10 +255,6 @@ static u64 pkg_state_counter(void) > return count; > } > > -static void noop_timer(unsigned long foo) > -{ > - /* empty... just the fact that we get the interrupt wakes us up */ > -} > > static unsigned int get_compensation(int ratio) > { > @@ -362,100 +357,77 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio, > return set_target_ratio + guard <= current_ratio; > } > > -static int clamp_thread(void *arg) > +static void clamp_cpus(void) > { > - int cpunr = (unsigned long)arg; > - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0); > - static const struct sched_param param = { > - .sched_priority = MAX_USER_RT_PRIO/2, > - }; > - unsigned int count = 0; > unsigned int target_ratio; > + u64 quota, period; > + int sleeptime; > + unsigned long target_jiffies; > + unsigned int guard; > + unsigned int compensation; > + int interval; /* jiffies to sleep for each attempt */ > + unsigned int duration_jiffies; > + unsigned int window_size_now; > + > +again: > + if (clamping == false) > + goto out; > + > + quota = RUNTIME_INF; > + period = NSEC_PER_SEC; > + compensation = 0; > + duration_jiffies = msecs_to_jiffies(duration); > + /* > + * make sure user selected ratio does not take effect until > + * the next round. adjust target_ratio if user has changed > + * target such that we can converge quickly. > + */ > + target_ratio = set_target_ratio; > + guard = 1 + target_ratio/20; > + window_size_now = window_size; > > - set_bit(cpunr, cpu_clamping_mask); > - set_freezable(); > - init_timer_on_stack(&wakeup_timer); > - sched_setscheduler(current, SCHED_FIFO, ¶m); > - > - while (true == clamping && !kthread_should_stop() && > - cpu_online(cpunr)) { > - int sleeptime; > - unsigned long target_jiffies; > - unsigned int guard; > - unsigned int compensation = 0; > - int interval; /* jiffies to sleep for each attempt */ > - unsigned int duration_jiffies = msecs_to_jiffies(duration); > - unsigned int window_size_now; > - > - try_to_freeze(); > - /* > - * make sure user selected ratio does not take effect until > - * the next round. adjust target_ratio if user has changed > - * target such that we can converge quickly. > - */ > - target_ratio = set_target_ratio; > - guard = 1 + target_ratio/20; > - window_size_now = window_size; > - count++; > - > - /* > - * systems may have different ability to enter package level > - * c-states, thus we need to compensate the injected idle ratio > - * to achieve the actual target reported by the HW. > - */ > - compensation = get_compensation(target_ratio); > - interval = duration_jiffies*100/(target_ratio+compensation); > - > - /* align idle time */ > - target_jiffies = roundup(jiffies, interval); > - sleeptime = target_jiffies - jiffies; > - if (sleeptime <= 0) > - sleeptime = 1; > - schedule_timeout_interruptible(sleeptime); > - /* > - * only elected controlling cpu can collect stats and update > - * control parameters. > - */ > - if (cpunr == control_cpu && !(count%window_size_now)) { > - should_skip = > - powerclamp_adjust_controls(target_ratio, > - guard, window_size_now); > - smp_mb(); > - } > - > - if (should_skip) > - continue; > - > - target_jiffies = jiffies + duration_jiffies; > - mod_timer(&wakeup_timer, target_jiffies); > - if (unlikely(local_softirq_pending())) > - continue; > - /* > - * stop tick sched during idle time, interrupts are still > - * allowed. thus jiffies are updated properly. > - */ > - preempt_disable(); > - /* mwait until target jiffies is reached */ > - while (time_before(jiffies, target_jiffies)) { > - unsigned long ecx = 1; > - unsigned long eax = target_mwait; > - > - /* > - * REVISIT: may call enter_idle() to notify drivers who > - * can save power during cpu idle. same for exit_idle() > - */ > - local_touch_nmi(); > - stop_critical_timings(); > - mwait_idle_with_hints(eax, ecx); > - start_critical_timings(); > - atomic_inc(&idle_wakeup_counter); > - } > - preempt_enable(); > + /* > + * systems may have different ability to enter package level > + * c-states, thus we need to compensate the injected idle ratio > + * to achieve the actual target reported by the HW. > + */ > + compensation = get_compensation(target_ratio); > + interval = duration_jiffies*100/(target_ratio+compensation); > + > + /* align idle time */ > + target_jiffies = roundup(jiffies, interval); > + sleeptime = target_jiffies - jiffies; > + if (sleeptime <= 0) > + sleeptime = 1; > + > + if (!(count%window_size_now)) { > + should_skip = > + powerclamp_adjust_controls(target_ratio, > + guard, window_size_now); > + smp_mb(); > } > - del_timer_sync(&wakeup_timer); > - clear_bit(cpunr, cpu_clamping_mask); > > - return 0; > + if (should_skip) > + goto again; > + > + target_jiffies = jiffies + sleeptime + duration_jiffies; > + mod_timer(&wakeup_timer, target_jiffies); > + if (unlikely(local_softirq_pending())) > + goto again; > + > + quota = jiffies_to_usecs(sleeptime); > + period = jiffies_to_usecs(sleeptime + duration_jiffies); > + > +out: > + tg_set_cfs_quota(&root_task_group, quota); > + tg_set_cfs_period(&root_task_group, period); > +} > + > +static void clamp_timer_fn(unsigned long foo) > +{ > + /* Evaluate to see if clamping controls need to be adjusted */ > + count++; > + clamp_cpus(); > } > > /* > @@ -501,8 +473,7 @@ static void poll_pkg_cstate(struct work_struct *dummy) > > static int start_power_clamp(void) > { > - unsigned long cpu; > - struct task_struct *thread; > + clamping = true; > > /* check if pkg cstate counter is completely 0, abort in this case */ > if (!has_pkg_state_counter()) { > @@ -511,108 +482,21 @@ static int start_power_clamp(void) > } > > set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1); > - /* prevent cpu hotplug */ > - get_online_cpus(); > - > - /* prefer BSP */ > - control_cpu = 0; > - if (!cpu_online(control_cpu)) > - control_cpu = smp_processor_id(); > > - clamping = true; > schedule_delayed_work(&poll_pkg_cstate_work, 0); > - > - /* start one thread per online cpu */ > - for_each_online_cpu(cpu) { > - struct task_struct **p = > - per_cpu_ptr(powerclamp_thread, cpu); > - > - thread = kthread_create_on_node(clamp_thread, > - (void *) cpu, > - cpu_to_node(cpu), > - "kidle_inject/%ld", cpu); > - /* bind to cpu here */ > - if (likely(!IS_ERR(thread))) { > - kthread_bind(thread, cpu); > - wake_up_process(thread); > - *p = thread; > - } > - > - } > - put_online_cpus(); > + clamp_cpus(); > > return 0; > } > > static void end_power_clamp(void) > { > - int i; > - struct task_struct *thread; > - > clamping = false; > - /* > - * make clamping visible to other cpus and give per cpu clamping threads > - * sometime to exit, or gets killed later. > - */ > - smp_mb(); > - msleep(20); > - if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) { > - for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) { > - pr_debug("clamping thread for cpu %d alive, kill\n", i); > - thread = *per_cpu_ptr(powerclamp_thread, i); > - kthread_stop(thread); > - } > - } > -} > > -static int powerclamp_cpu_callback(struct notifier_block *nfb, > - unsigned long action, void *hcpu) > -{ > - unsigned long cpu = (unsigned long)hcpu; > - struct task_struct *thread; > - struct task_struct **percpu_thread = > - per_cpu_ptr(powerclamp_thread, cpu); > - > - if (false == clamping) > - goto exit_ok; > - > - switch (action) { > - case CPU_ONLINE: > - thread = kthread_create_on_node(clamp_thread, > - (void *) cpu, > - cpu_to_node(cpu), > - "kidle_inject/%lu", cpu); > - if (likely(!IS_ERR(thread))) { > - kthread_bind(thread, cpu); > - wake_up_process(thread); > - *percpu_thread = thread; > - } > - /* prefer BSP as controlling CPU */ > - if (cpu == 0) { > - control_cpu = 0; > - smp_mb(); > - } > - break; > - case CPU_DEAD: > - if (test_bit(cpu, cpu_clamping_mask)) { > - pr_err("cpu %lu dead but powerclamping thread is not\n", > - cpu); > - kthread_stop(*percpu_thread); > - } > - if (cpu == control_cpu) { > - control_cpu = smp_processor_id(); > - smp_mb(); > - } > - } > - > -exit_ok: > - return NOTIFY_OK; > + clamp_cpus(); > + del_timer_sync(&wakeup_timer); > } > > -static struct notifier_block powerclamp_cpu_notifier = { > - .notifier_call = powerclamp_cpu_callback, > -}; > - > static int powerclamp_get_max_state(struct thermal_cooling_device *cdev, > unsigned long *state) > { > @@ -656,6 +540,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev, > } > > exit_set: > + clamp_cpus(); > return ret; > } > > @@ -716,7 +601,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused) > { > int i = 0; > > - seq_printf(m, "controlling cpu: %d\n", control_cpu); > seq_printf(m, "pct confidence steady dynamic (compensation)\n"); > for (i = 0; i < MAX_TARGET_RATIO; i++) { > seq_printf(m, "%d\t%lu\t%lu\t%lu\n", > @@ -762,33 +646,20 @@ file_error: > static int powerclamp_init(void) > { > int retval; > - int bitmap_size; > - > - bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long); > - cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL); > - if (!cpu_clamping_mask) > - return -ENOMEM; > > /* probe cpu features and ids here */ > retval = powerclamp_probe(); > if (retval) > - goto exit_free; > + goto exit; > > /* set default limit, maybe adjusted during runtime based on feedback */ > window_size = 2; > - register_hotcpu_notifier(&powerclamp_cpu_notifier); > - > - powerclamp_thread = alloc_percpu(struct task_struct *); > - if (!powerclamp_thread) { > - retval = -ENOMEM; > - goto exit_unregister; > - } > > cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL, > &powerclamp_cooling_ops); > if (IS_ERR(cooling_dev)) { > retval = -ENODEV; > - goto exit_free_thread; > + goto exit; > } > > if (!duration) > @@ -798,23 +669,15 @@ static int powerclamp_init(void) > > return 0; > > -exit_free_thread: > - free_percpu(powerclamp_thread); > -exit_unregister: > - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); > -exit_free: > - kfree(cpu_clamping_mask); > +exit: > return retval; > } > module_init(powerclamp_init); > > static void powerclamp_exit(void) > { > - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); > end_power_clamp(); > - free_percpu(powerclamp_thread); > thermal_cooling_device_unregister(cooling_dev); > - kfree(cpu_clamping_mask); > > cancel_delayed_work_sync(&poll_pkg_cstate_work); > debugfs_remove_recursive(debug_dir); > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 8db31ef..2493942 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -163,6 +163,11 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift); > load += n*(FIXED_1-exp); \ > load >>= FSHIFT; > > +/* > + * single value that denotes runtime == period, ie unlimited time. > + */ > +#define RUNTIME_INF ((u64)~0ULL) > + > extern unsigned long total_forks; > extern int nr_threads; > DECLARE_PER_CPU(unsigned long, process_counts); > @@ -3002,6 +3007,10 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask); > > #ifdef CONFIG_CGROUP_SCHED > extern struct task_group root_task_group; > +#ifdef CONFIG_CFS_BANDWIDTH > +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us); > +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us); > +#endif /* CONFIG_CFS_BANDWIDTH */ > #endif /* CONFIG_CGROUP_SCHED */ > > extern int task_can_switch_user(struct user_struct *up, > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index e628cb1..7471b06 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -7097,6 +7097,7 @@ int in_sched_functions(unsigned long addr) > * Every task in system belongs to this group at bootup. > */ > struct task_group root_task_group; > +EXPORT_SYMBOL_GPL(root_task_group); > LIST_HEAD(task_groups); > #endif > > @@ -8059,9 +8060,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) > int i, ret = 0, runtime_enabled, runtime_was_enabled; > struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; > > - if (tg == &root_task_group) > - return -EINVAL; > - > /* > * Ensure we have at some amount of bandwidth every period. This is > * to prevent reaching a state of large arrears when throttled via > @@ -8141,6 +8139,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us) > > return tg_set_cfs_bandwidth(tg, period, quota); > } > +EXPORT_SYMBOL_GPL(tg_set_cfs_quota); > > long tg_get_cfs_quota(struct task_group *tg) > { > @@ -8164,6 +8163,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us) > > return tg_set_cfs_bandwidth(tg, period, quota); > } > +EXPORT_SYMBOL_GPL(tg_set_cfs_period); > > long tg_get_cfs_period(struct task_group *tg) > { > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 9a2a45c..20493e4 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -72,11 +72,6 @@ extern void update_cpu_load_active(struct rq *this_rq); > * These are the 'tuning knobs' of the scheduler: > */ > > -/* > - * single value that denotes runtime == period, ie unlimited time. > - */ > -#define RUNTIME_INF ((u64)~0ULL) > - > static inline int fair_policy(int policy) > { > return policy == SCHED_NORMAL || policy == SCHED_BATCH; > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/