Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759642AbbBIHcf (ORCPT ); Mon, 9 Feb 2015 02:32:35 -0500 Received: from mail-la0-f48.google.com ([209.85.215.48]:33123 "EHLO mail-la0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759478AbbBIHcd convert rfc822-to-8bit (ORCPT ); Mon, 9 Feb 2015 02:32:33 -0500 MIME-Version: 1.0 In-Reply-To: <20150209044852.6231.66456.stgit@preeti.in.ibm.com> References: <20150209044852.6231.66456.stgit@preeti.in.ibm.com> Date: Sun, 8 Feb 2015 23:32:31 -0800 Message-ID: Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism From: Steven Noonan To: Preeti U Murthy Cc: Peter Zijlstra , Thomas Gleixner , arjan@linux.intel.com, Linux Kernel mailing List , jacob.jun.pan@intel.com, fweisbec@gmail.com, frederic@kernel.org, daniel.lezcano@linaro.org, amit.kucheria@linaro.org, edubezval@gmail.com, viresh.kumar@linaro.org, rui.zhang@intel.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 22964 Lines: 588 On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy wrote: > The powerclamp driver injects idle periods to stay within the thermal constraints. > The driver does a fake idle by spawning per-cpu threads that call the mwait > instruction. This behavior of fake idle can confuse the other kernel subsystems. > For instance it calls into the nohz tick handlers, which are meant to be called > only by the idle thread. It sets the state of the tick in each cpu to idle and > stops the tick, when there are tasks on the runqueue. As a result the callers of > idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the > former thinks that the cpu is busy, the latter thinks that it is idle. The outcome > may be inconsistency in the scheduler/nohz states which can lead to serious > consequences. One of them was reported on this thread: > https://lkml.org/lkml/2014/12/11/365. > > Thomas posted out a patch to disable the powerclamp driver from calling into the > tick nohz code which has taken care of the above regression for the moment. However > powerclamp driver as a result, will not be able to inject idle periods due to the > presence of periodic ticks. With the current design of fake idle, we cannot move > towards a better solution. > https://lkml.org/lkml/2014/12/18/169 > > This patch aims at removing the concept of fake idle and instead makes the cpus > truly idle by throttling the runqueues during the idle injection periods. The situation > is in fact very similar to throttling of cfs_rqs when they exceed their bandwidths. > The idle injection metrics can be mapped to the bandwidth control metrics 'quota' and > 'period' to achieve the same result. When the powerclamping is begun or when the > clamping controls have been modified, the bandwidth for the root task group is set. > The 'quota' will be the amount of time that the system needs to be busy and 'period' > will be the sum of this busy duration and the idle duration as calculated by the driver. > This gets rid of per-cpu kthreads, control cpu, hotplug notifiers and clamping mask since > the thread starting powerclamping will set the bandwidth and throttling of all cpus will > automatically fall in place. None of the other cpus need be bothered about this. This > simplifies the design of the driver. > > Of course this is only if the idle injection metrics can be conveniently transformed > into bandwidth control metrics. There are a couple of other primary concerns around if > doing the below two in this patch is valid. > a. This patch exports the functions to set the quota and period of task groups. > b. This patch removes the constraint of not being able to set the root task grp's bandwidth. > > Signed-off-by: Preeti U Murthy This doesn't compile. Missing forward declaration: drivers/thermal/intel_powerclamp.c: In function ‘window_size_set’: drivers/thermal/intel_powerclamp.c:160:2: error: implicit declaration of function ‘clamp_cpus’ [-Werror=implicit-function-declaration] clamp_cpus(); ^ drivers/thermal/intel_powerclamp.c: At top level: drivers/thermal/intel_powerclamp.c:355:12: error: static declaration of ‘clamp_cpus’ follows non-static declaration static int clamp_cpus(void) ^ drivers/thermal/intel_powerclamp.c:160:2: note: previous implicit declaration of ‘clamp_cpus’ was here clamp_cpus(); ^ Missing RUNTIME_INF definition (only appears in kernel/sched/sched.h from what I can see?): drivers/thermal/intel_powerclamp.c: In function ‘clamp_cpus’: drivers/thermal/intel_powerclamp.c:358:14: error: ‘RUNTIME_INF’ undeclared (first use in this function) u64 quota = RUNTIME_INF, period; ^ drivers/thermal/intel_powerclamp.c:358:14: note: each undeclared identifier is reported only once for each function it appears in Weird label placement that doesn't make sense: drivers/thermal/intel_powerclamp.c:364:2: error: a label can only be part of a statement and a declaration is not a statement int sleeptime; ^ drivers/thermal/intel_powerclamp.c:365:2: error: expected expression before ‘unsigned’ unsigned long target_jiffies; ^ drivers/thermal/intel_powerclamp.c:366:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] unsigned int guard; ^ drivers/thermal/intel_powerclamp.c:390:2: error: ‘target_jiffies’ undeclared (first use in this function) target_jiffies = roundup(jiffies, interval); ^ Missing debug variable: drivers/thermal/intel_powerclamp.c: In function ‘powerclamp_debug_show’: drivers/thermal/intel_powerclamp.c:598:41: error: ‘control_cpu’ undeclared (first use in this function) seq_printf(m, "controlling cpu: %d\n", control_cpu); ^ Missing label: drivers/thermal/intel_powerclamp.c: In function ‘powerclamp_init’: drivers/thermal/intel_powerclamp.c:649:3: error: label ‘exit_free’ used but not defined goto exit_free; ^ drivers/thermal/intel_powerclamp.c:644:6: warning: unused variable ‘bitmap_size’ [-Wunused-variable] int bitmap_size; ^ > --- > V1: https://lkml.org/lkml/2014/12/21/23 > > V1 tried to throttle cfs_rqs without modifying available bandwidth. This will not > be an appropriate solution since the scheduler only considers a runqueue as > throttled when its bandwidth has been exceeded. Going around this path can > lead to unintended consequences. Hence V2 was designed to throttle runqueues > through bandwidth control. > > drivers/thermal/intel_powerclamp.c | 289 +++++++++--------------------------- > include/linux/sched.h | 6 + > kernel/sched/core.c | 5 - > 3 files changed, 82 insertions(+), 218 deletions(-) > > diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c > index 6ceebd6..f43903c 100644 > --- a/drivers/thermal/intel_powerclamp.c > +++ b/drivers/thermal/intel_powerclamp.c > @@ -78,20 +78,14 @@ static struct dentry *debug_dir; > static unsigned int set_target_ratio; > static unsigned int current_ratio; > static bool should_skip; > +static unsigned int count = 0; > static bool reduce_irq; > static atomic_t idle_wakeup_counter; > -static unsigned int control_cpu; /* The cpu assigned to collect stat and update > - * control parameters. default to BSP but BSP > - * can be offlined. > - */ > static bool clamping; > +static void clamp_timer_fn(unsigned long foo); > +DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0); > > - > -static struct task_struct * __percpu *powerclamp_thread; > static struct thermal_cooling_device *cooling_dev; > -static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu > - * clamping thread > - */ > > static unsigned int duration; > static unsigned int pkg_cstate_ratio_cur; > @@ -163,7 +157,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp) > smp_mb(); > > exit_win: > - > + clamp_cpus(); > return ret; > } > > @@ -256,10 +250,6 @@ static u64 pkg_state_counter(void) > return count; > } > > -static void noop_timer(unsigned long foo) > -{ > - /* empty... just the fact that we get the interrupt wakes us up */ > -} > > static unsigned int get_compensation(int ratio) > { > @@ -362,102 +352,78 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio, > return set_target_ratio + guard <= current_ratio; > } > > -static int clamp_thread(void *arg) > +static int clamp_cpus(void) > { > - int cpunr = (unsigned long)arg; > - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0); > - static const struct sched_param param = { > - .sched_priority = MAX_USER_RT_PRIO/2, > - }; > - unsigned int count = 0; > unsigned int target_ratio; > + u64 quota = RUNTIME_INF, period; > > - set_bit(cpunr, cpu_clamping_mask); > - set_freezable(); > - init_timer_on_stack(&wakeup_timer); > - sched_setscheduler(current, SCHED_FIFO, ¶m); > - > - while (true == clamping && !kthread_should_stop() && > - cpu_online(cpunr)) { > - int sleeptime; > - unsigned long target_jiffies; > - unsigned int guard; > - unsigned int compensation = 0; > - int interval; /* jiffies to sleep for each attempt */ > - unsigned int duration_jiffies = msecs_to_jiffies(duration); > - unsigned int window_size_now; > - > - try_to_freeze(); > - /* > - * make sure user selected ratio does not take effect until > - * the next round. adjust target_ratio if user has changed > - * target such that we can converge quickly. > - */ > - target_ratio = set_target_ratio; > - guard = 1 + target_ratio/20; > - window_size_now = window_size; > - count++; > - > - /* > - * systems may have different ability to enter package level > - * c-states, thus we need to compensate the injected idle ratio > - * to achieve the actual target reported by the HW. > - */ > - compensation = get_compensation(target_ratio); > - interval = duration_jiffies*100/(target_ratio+compensation); > - > - /* align idle time */ > - target_jiffies = roundup(jiffies, interval); > - sleeptime = target_jiffies - jiffies; > - if (sleeptime <= 0) > - sleeptime = 1; > - schedule_timeout_interruptible(sleeptime); > - /* > - * only elected controlling cpu can collect stats and update > - * control parameters. > - */ > - if (cpunr == control_cpu && !(count%window_size_now)) { > - should_skip = > - powerclamp_adjust_controls(target_ratio, > - guard, window_size_now); > - smp_mb(); > - } > + if (clamping == false) > + goto out; > > - if (should_skip) > - continue; > - > - target_jiffies = jiffies + duration_jiffies; > - mod_timer(&wakeup_timer, target_jiffies); > - if (unlikely(local_softirq_pending())) > - continue; > - /* > - * stop tick sched during idle time, interrupts are still > - * allowed. thus jiffies are updated properly. > - */ > - preempt_disable(); > - /* mwait until target jiffies is reached */ > - while (time_before(jiffies, target_jiffies)) { > - unsigned long ecx = 1; > - unsigned long eax = target_mwait; > - > - /* > - * REVISIT: may call enter_idle() to notify drivers who > - * can save power during cpu idle. same for exit_idle() > - */ > - local_touch_nmi(); > - stop_critical_timings(); > - mwait_idle_with_hints(eax, ecx); > - start_critical_timings(); > - atomic_inc(&idle_wakeup_counter); > - } > - preempt_enable(); > +again: > + int sleeptime; > + unsigned long target_jiffies; > + unsigned int guard; > + unsigned int compensation = 0; > + int interval; /* jiffies to sleep for each attempt */ > + unsigned int duration_jiffies = msecs_to_jiffies(duration); > + unsigned int window_size_now; > + > + /* > + * make sure user selected ratio does not take effect until > + * the next round. adjust target_ratio if user has changed > + * target such that we can converge quickly. > + */ > + target_ratio = set_target_ratio; > + guard = 1 + target_ratio/20; > + window_size_now = window_size; > + > + /* > + * systems may have different ability to enter package level > + * c-states, thus we need to compensate the injected idle ratio > + * to achieve the actual target reported by the HW. > + */ > + compensation = get_compensation(target_ratio); > + interval = duration_jiffies*100/(target_ratio+compensation); > + > + /* align idle time */ > + target_jiffies = roundup(jiffies, interval); > + sleeptime = target_jiffies - jiffies; > + if (sleeptime <= 0) > + sleeptime = 1; > + > + if (!(count%window_size_now)) { > + should_skip = > + powerclamp_adjust_controls(target_ratio, > + guard, window_size_now); > + smp_mb(); > } > - del_timer_sync(&wakeup_timer); > - clear_bit(cpunr, cpu_clamping_mask); > + > + if (should_skip) > + goto again; > + > + target_jiffies = jiffies + sleeptime + duration_jiffies; > + mod_timer(&wakeup_timer, target_jiffies); > + if (unlikely(local_softirq_pending())) > + goto again; > + > + quota = jiffies_to_usecs(sleeptime); > + period = jiffies_to_usecs(sleeptime + duration_jiffies); > + > +out: > + tg_set_cfs_quota(&root_task_group, quota); > + tg_set_cfs_period(&root_task_group, period); > > return 0; > } > > +static void clamp_timer_fn(unsigned long foo) > +{ > + /* Evaluate to see if clamping controls need to be adjusted */ > + count++; > + clamp_cpus(); > +} > + > /* > * 1 HZ polling while clamping is active, useful for userspace > * to monitor actual idle ratio. > @@ -501,8 +467,7 @@ static void poll_pkg_cstate(struct work_struct *dummy) > > static int start_power_clamp(void) > { > - unsigned long cpu; > - struct task_struct *thread; > + clamping = true; > > /* check if pkg cstate counter is completely 0, abort in this case */ > if (!has_pkg_state_counter()) { > @@ -511,108 +476,21 @@ static int start_power_clamp(void) > } > > set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1); > - /* prevent cpu hotplug */ > - get_online_cpus(); > - > - /* prefer BSP */ > - control_cpu = 0; > - if (!cpu_online(control_cpu)) > - control_cpu = smp_processor_id(); > > - clamping = true; > schedule_delayed_work(&poll_pkg_cstate_work, 0); > - > - /* start one thread per online cpu */ > - for_each_online_cpu(cpu) { > - struct task_struct **p = > - per_cpu_ptr(powerclamp_thread, cpu); > - > - thread = kthread_create_on_node(clamp_thread, > - (void *) cpu, > - cpu_to_node(cpu), > - "kidle_inject/%ld", cpu); > - /* bind to cpu here */ > - if (likely(!IS_ERR(thread))) { > - kthread_bind(thread, cpu); > - wake_up_process(thread); > - *p = thread; > - } > - > - } > - put_online_cpus(); > + clamp_cpus(); > > return 0; > } > > static void end_power_clamp(void) > { > - int i; > - struct task_struct *thread; > - > clamping = false; > - /* > - * make clamping visible to other cpus and give per cpu clamping threads > - * sometime to exit, or gets killed later. > - */ > - smp_mb(); > - msleep(20); > - if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) { > - for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) { > - pr_debug("clamping thread for cpu %d alive, kill\n", i); > - thread = *per_cpu_ptr(powerclamp_thread, i); > - kthread_stop(thread); > - } > - } > -} > > -static int powerclamp_cpu_callback(struct notifier_block *nfb, > - unsigned long action, void *hcpu) > -{ > - unsigned long cpu = (unsigned long)hcpu; > - struct task_struct *thread; > - struct task_struct **percpu_thread = > - per_cpu_ptr(powerclamp_thread, cpu); > - > - if (false == clamping) > - goto exit_ok; > - > - switch (action) { > - case CPU_ONLINE: > - thread = kthread_create_on_node(clamp_thread, > - (void *) cpu, > - cpu_to_node(cpu), > - "kidle_inject/%lu", cpu); > - if (likely(!IS_ERR(thread))) { > - kthread_bind(thread, cpu); > - wake_up_process(thread); > - *percpu_thread = thread; > - } > - /* prefer BSP as controlling CPU */ > - if (cpu == 0) { > - control_cpu = 0; > - smp_mb(); > - } > - break; > - case CPU_DEAD: > - if (test_bit(cpu, cpu_clamping_mask)) { > - pr_err("cpu %lu dead but powerclamping thread is not\n", > - cpu); > - kthread_stop(*percpu_thread); > - } > - if (cpu == control_cpu) { > - control_cpu = smp_processor_id(); > - smp_mb(); > - } > - } > - > -exit_ok: > - return NOTIFY_OK; > + clamp_cpus(); > + del_timer_sync(&wakeup_timer); > } > > -static struct notifier_block powerclamp_cpu_notifier = { > - .notifier_call = powerclamp_cpu_callback, > -}; > - > static int powerclamp_get_max_state(struct thermal_cooling_device *cdev, > unsigned long *state) > { > @@ -656,6 +534,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev, > } > > exit_set: > + clamp_cpus(); > return ret; > } > > @@ -764,11 +643,6 @@ static int powerclamp_init(void) > int retval; > int bitmap_size; > > - bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long); > - cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL); > - if (!cpu_clamping_mask) > - return -ENOMEM; > - > /* probe cpu features and ids here */ > retval = powerclamp_probe(); > if (retval) > @@ -776,19 +650,12 @@ static int powerclamp_init(void) > > /* set default limit, maybe adjusted during runtime based on feedback */ > window_size = 2; > - register_hotcpu_notifier(&powerclamp_cpu_notifier); > - > - powerclamp_thread = alloc_percpu(struct task_struct *); > - if (!powerclamp_thread) { > - retval = -ENOMEM; > - goto exit_unregister; > - } > > cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL, > &powerclamp_cooling_ops); > if (IS_ERR(cooling_dev)) { > retval = -ENODEV; > - goto exit_free_thread; > + goto exit; > } > > if (!duration) > @@ -798,23 +665,15 @@ static int powerclamp_init(void) > > return 0; > > -exit_free_thread: > - free_percpu(powerclamp_thread); > -exit_unregister: > - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); > -exit_free: > - kfree(cpu_clamping_mask); > +exit: > return retval; > } > module_init(powerclamp_init); > > static void powerclamp_exit(void) > { > - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); > end_power_clamp(); > - free_percpu(powerclamp_thread); > thermal_cooling_device_unregister(cooling_dev); > - kfree(cpu_clamping_mask); > > cancel_delayed_work_sync(&poll_pkg_cstate_work); > debugfs_remove_recursive(debug_dir); > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 8db31ef..6a7ccb2 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -3002,6 +3002,12 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask); > > #ifdef CONFIG_CGROUP_SCHED > extern struct task_group root_task_group; > +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us); > +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us); > +#else > + > +static inline int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us); > +static inline int tg_set_cfs_period(struct task_group *tg, long cfs_period_us); > #endif /* CONFIG_CGROUP_SCHED */ > > extern int task_can_switch_user(struct user_struct *up, > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index e628cb1..f47621a 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -8059,9 +8059,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) > int i, ret = 0, runtime_enabled, runtime_was_enabled; > struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; > > - if (tg == &root_task_group) > - return -EINVAL; > - > /* > * Ensure we have at some amount of bandwidth every period. This is > * to prevent reaching a state of large arrears when throttled via > @@ -8141,6 +8138,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us) > > return tg_set_cfs_bandwidth(tg, period, quota); > } > +EXPORT_SYMBOL_GPL(tg_set_cfs_quota); > > long tg_get_cfs_quota(struct task_group *tg) > { > @@ -8164,6 +8162,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us) > > return tg_set_cfs_bandwidth(tg, period, quota); > } > +EXPORT_SYMBOL_GPL(tg_set_cfs_period); > > long tg_get_cfs_period(struct task_group *tg) > { > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/