Date: Mon, 09 Feb 2015 17:21:34 +0530
From: Preeti U Murthy
To: Steven Noonan, Peter Zijlstra, Thomas Gleixner, jacob.jun.pan@intel.com
Cc: arjan@linux.intel.com, Linux Kernel Mailing List, fweisbec@gmail.com,
    frederic@kernel.org, daniel.lezcano@linaro.org, amit.kucheria@linaro.org,
    edubezval@gmail.com, viresh.kumar@linaro.org, rui.zhang@intel.com
Subject: Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism
Message-ID: <54D89F46.4030803@linux.vnet.ibm.com>
References: <20150209044852.6231.66456.stgit@preeti.in.ibm.com>

Hi Steven,

On 02/09/2015 01:02 PM, Steven Noonan wrote:
> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
> wrote:
>> The powerclamp driver injects idle periods to stay within thermal
>> constraints. It fakes idle by spawning per-cpu threads that call the
>> mwait instruction. This fake idle can confuse other kernel subsystems.
>> For instance, it calls into the nohz tick handlers, which are meant to
>> be called only by the idle thread: it marks the tick on each cpu as
>> idle and stops the tick although there are tasks on the runqueue. As a
>> result, callers of idle_cpu()/tick_nohz_tick_stopped() see conflicting
>> states of the cpu; the former thinks that the cpu is busy, while the
>> latter thinks that it is idle. The outcome is potentially inconsistent
>> scheduler/nohz state, which can have serious consequences. One such
>> case was reported in this thread:
>> https://lkml.org/lkml/2014/12/11/365.
>>
>> Thomas posted a patch that stops the powerclamp driver from calling
>> into the tick nohz code, which takes care of the above regression for
>> the moment. As a result, however, the powerclamp driver can no longer
>> inject idle periods, due to the presence of periodic ticks. With the
>> current design of fake idle we cannot move towards a better solution.
>> https://lkml.org/lkml/2014/12/18/169
>>
>> This patch removes the concept of fake idle and instead makes the cpus
>> truly idle by throttling the runqueues during the idle injection
>> periods. The situation is in fact very similar to the throttling of
>> cfs_rqs when they exceed their bandwidth. The idle injection metrics
>> can be mapped to the bandwidth control metrics 'quota' and 'period' to
>> achieve the same result. When powerclamping begins, or when the
>> clamping controls are modified, the bandwidth of the root task group
>> is set: 'quota' is the amount of time that the system needs to be
>> busy, and 'period' is the sum of this busy duration and the idle
>> duration calculated by the driver.
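(To make that mapping concrete -- illustrative numbers only, assuming
HZ=250, an injection duration of 6 jiffies and a 50% target idle ratio,
with compensation ignored; the formulas are the ones clamp_cpus() uses
in the patch below:

	interval  = duration_jiffies * 100 / target_ratio;
					/* 6 * 100 / 50 = 12 jiffies */
	sleeptime = roundup(jiffies, interval) - jiffies;
					/* busy span, here 6 jiffies  */
	quota  = jiffies_to_usecs(sleeptime);            /* ~24000us   */
	period = jiffies_to_usecs(sleeptime + duration_jiffies);
	                                                 /* ~48000us   */

i.e. the cpus are left runnable for 'quota' out of every 'period' and
are throttled for the remaining 50%, which is exactly the requested
idle injection.)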
>> This gets rid of the per-cpu kthreads, the control cpu, the hotplug
>> notifiers and the clamping mask, since the thread that starts
>> powerclamping sets the bandwidth and the throttling of all cpus
>> automatically falls into place. None of the other cpus need be
>> bothered about this, which simplifies the design of the driver.
>>
>> Of course, this holds only if the idle injection metrics can be
>> conveniently transformed into bandwidth control metrics. Beyond that,
>> there are two primary concerns about whether the following changes
>> made by this patch are valid:
>> a. It exports the functions that set the quota and period of task
>>    groups.
>> b. It lifts the restriction against setting the root task group's
>>    bandwidth.
>>
>> Signed-off-by: Preeti U Murthy
>
> This doesn't compile.

Thanks for reporting this! I realized that I had not enabled the
powerclamp driver as a module while compile testing it. I was focusing
on the design issues and failed to cross-verify this. Apologies for the
inconvenience. Please find the compile-tested diff below.

I also realized that clamp_cpus(), which sets the bandwidth, cannot be
called from multiple places. Currently I am calling it from
end_power_clamp(), when the user changes the idle clamping controls,
and from a queued timer. This would require synchronization between the
callers, which is not really called for. The queued wakeup_timer alone
can re-evaluate the clamping metrics after every throttle-unthrottle
period, and as far as I can see that should suffice. Thoughts?
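If several callers do turn out to be necessary, one option might be to
funnel every bandwidth update through a single work item. That would
also keep tg_set_cfs_quota()/tg_set_cfs_period(), which may sleep, out
of the timer's softirq context. A rough, untested sketch (the
clamp_work/clamp_update names are made up for illustration):

	static void clamp_update(struct work_struct *work)
	{
		/* process context: safe for the sleeping tg_set_cfs_*() */
		clamp_cpus();
	}
	static DECLARE_WORK(clamp_work, clamp_update);

	static void clamp_timer_fn(unsigned long foo)
	{
		count++;
		/* a given work item never runs concurrently with itself */
		schedule_work(&clamp_work);
	}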
Regards
Preeti U Murthy

----------------------------------------------------------------------------------

V2 of intel_powerclamp driver

From: Preeti U Murthy

---
 drivers/thermal/Kconfig            |    1 
 drivers/thermal/intel_powerclamp.c |  301 ++++++++++--------------------------
 include/linux/sched.h              |    9 +
 kernel/sched/core.c                |    6 -
 kernel/sched/sched.h               |    5 -
 5 files changed, 95 insertions(+), 227 deletions(-)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index af40db0..4b7cd02 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -233,6 +233,7 @@ config INTEL_POWERCLAMP
 	depends on THERMAL
 	depends on X86
 	depends on CPU_SUP_INTEL
+	select CFS_BANDWIDTH
 	help
 	  Enable this to enable Intel PowerClamp idle injection driver. This
 	  enforce idle time which results in more package C-state residency. The
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 6ceebd6..4bd07bb 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -51,6 +51,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -78,20 +79,18 @@
 static struct dentry *debug_dir;
 static unsigned int set_target_ratio;
 static unsigned int current_ratio;
 static bool should_skip;
+static unsigned int count = 0;
 static bool reduce_irq;
 static atomic_t idle_wakeup_counter;
-static unsigned int control_cpu; /* The cpu assigned to collect stat and update
-				  * control parameters. default to BSP but BSP
-				  * can be offlined.
-				  */
 static bool clamping;

+/* Timer that re-evaluates the bandwidth if the clamping knobs have changed */
+static void clamp_timer_fn(unsigned long foo);
+DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);
+
+static void clamp_cpus(void);

-static struct task_struct * __percpu *powerclamp_thread;
 static struct thermal_cooling_device *cooling_dev;
-static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
-					  * clamping thread
-					  */

 static unsigned int duration;
 static unsigned int pkg_cstate_ratio_cur;
@@ -163,7 +162,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
 	smp_mb();

 exit_win:
-
+	clamp_cpus();
 	return ret;
 }

@@ -256,10 +255,6 @@ static u64 pkg_state_counter(void)
 	return count;
 }

-static void noop_timer(unsigned long foo)
-{
-	/* empty... just the fact that we get the interrupt wakes us up */
-}

 static unsigned int get_compensation(int ratio)
 {
@@ -362,100 +357,77 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
 	return set_target_ratio + guard <= current_ratio;
 }

-static int clamp_thread(void *arg)
+static void clamp_cpus(void)
 {
-	int cpunr = (unsigned long)arg;
-	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
-	static const struct sched_param param = {
-		.sched_priority = MAX_USER_RT_PRIO/2,
-	};
-	unsigned int count = 0;
 	unsigned int target_ratio;
+	u64 quota, period;
+	int sleeptime;
+	unsigned long target_jiffies;
+	unsigned int guard;
+	unsigned int compensation;
+	int interval; /* jiffies to sleep for each attempt */
+	unsigned int duration_jiffies;
+	unsigned int window_size_now;
+
+again:
+	if (clamping == false)
+		goto out;
+
+	quota = RUNTIME_INF;
+	period = NSEC_PER_SEC;
+	compensation = 0;
+	duration_jiffies = msecs_to_jiffies(duration);
+	/*
+	 * make sure user selected ratio does not take effect until
+	 * the next round. adjust target_ratio if user has changed
+	 * target such that we can converge quickly.
+	 */
+	target_ratio = set_target_ratio;
+	guard = 1 + target_ratio/20;
+	window_size_now = window_size;

-	set_bit(cpunr, cpu_clamping_mask);
-	set_freezable();
-	init_timer_on_stack(&wakeup_timer);
-	sched_setscheduler(current, SCHED_FIFO, &param);
-
-	while (true == clamping && !kthread_should_stop() &&
-		cpu_online(cpunr)) {
-		int sleeptime;
-		unsigned long target_jiffies;
-		unsigned int guard;
-		unsigned int compensation = 0;
-		int interval; /* jiffies to sleep for each attempt */
-		unsigned int duration_jiffies = msecs_to_jiffies(duration);
-		unsigned int window_size_now;
-
-		try_to_freeze();
-		/*
-		 * make sure user selected ratio does not take effect until
-		 * the next round. adjust target_ratio if user has changed
-		 * target such that we can converge quickly.
-		 */
-		target_ratio = set_target_ratio;
-		guard = 1 + target_ratio/20;
-		window_size_now = window_size;
-		count++;
-
-		/*
-		 * systems may have different ability to enter package level
-		 * c-states, thus we need to compensate the injected idle ratio
-		 * to achieve the actual target reported by the HW.
-		 */
-		compensation = get_compensation(target_ratio);
-		interval = duration_jiffies*100/(target_ratio+compensation);
-
-		/* align idle time */
-		target_jiffies = roundup(jiffies, interval);
-		sleeptime = target_jiffies - jiffies;
-		if (sleeptime <= 0)
-			sleeptime = 1;
-		schedule_timeout_interruptible(sleeptime);
-		/*
-		 * only elected controlling cpu can collect stats and update
-		 * control parameters.
- */ - if (cpunr == control_cpu && !(count%window_size_now)) { - should_skip = - powerclamp_adjust_controls(target_ratio, - guard, window_size_now); - smp_mb(); - } - - if (should_skip) - continue; - - target_jiffies = jiffies + duration_jiffies; - mod_timer(&wakeup_timer, target_jiffies); - if (unlikely(local_softirq_pending())) - continue; - /* - * stop tick sched during idle time, interrupts are still - * allowed. thus jiffies are updated properly. - */ - preempt_disable(); - /* mwait until target jiffies is reached */ - while (time_before(jiffies, target_jiffies)) { - unsigned long ecx = 1; - unsigned long eax = target_mwait; - - /* - * REVISIT: may call enter_idle() to notify drivers who - * can save power during cpu idle. same for exit_idle() - */ - local_touch_nmi(); - stop_critical_timings(); - mwait_idle_with_hints(eax, ecx); - start_critical_timings(); - atomic_inc(&idle_wakeup_counter); - } - preempt_enable(); + /* + * systems may have different ability to enter package level + * c-states, thus we need to compensate the injected idle ratio + * to achieve the actual target reported by the HW. + */ + compensation = get_compensation(target_ratio); + interval = duration_jiffies*100/(target_ratio+compensation); + + /* align idle time */ + target_jiffies = roundup(jiffies, interval); + sleeptime = target_jiffies - jiffies; + if (sleeptime <= 0) + sleeptime = 1; + + if (!(count%window_size_now)) { + should_skip = + powerclamp_adjust_controls(target_ratio, + guard, window_size_now); + smp_mb(); } - del_timer_sync(&wakeup_timer); - clear_bit(cpunr, cpu_clamping_mask); - return 0; + if (should_skip) + goto again; + + target_jiffies = jiffies + sleeptime + duration_jiffies; + mod_timer(&wakeup_timer, target_jiffies); + if (unlikely(local_softirq_pending())) + goto again; + + quota = jiffies_to_usecs(sleeptime); + period = jiffies_to_usecs(sleeptime + duration_jiffies); + +out: + tg_set_cfs_quota(&root_task_group, quota); + tg_set_cfs_period(&root_task_group, period); +} + +static void clamp_timer_fn(unsigned long foo) +{ + /* Evaluate to see if clamping controls need to be adjusted */ + count++; + clamp_cpus(); } /* @@ -501,8 +473,7 @@ static void poll_pkg_cstate(struct work_struct *dummy) static int start_power_clamp(void) { - unsigned long cpu; - struct task_struct *thread; + clamping = true; /* check if pkg cstate counter is completely 0, abort in this case */ if (!has_pkg_state_counter()) { @@ -511,108 +482,21 @@ static int start_power_clamp(void) } set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1); - /* prevent cpu hotplug */ - get_online_cpus(); - - /* prefer BSP */ - control_cpu = 0; - if (!cpu_online(control_cpu)) - control_cpu = smp_processor_id(); - clamping = true; schedule_delayed_work(&poll_pkg_cstate_work, 0); - - /* start one thread per online cpu */ - for_each_online_cpu(cpu) { - struct task_struct **p = - per_cpu_ptr(powerclamp_thread, cpu); - - thread = kthread_create_on_node(clamp_thread, - (void *) cpu, - cpu_to_node(cpu), - "kidle_inject/%ld", cpu); - /* bind to cpu here */ - if (likely(!IS_ERR(thread))) { - kthread_bind(thread, cpu); - wake_up_process(thread); - *p = thread; - } - - } - put_online_cpus(); + clamp_cpus(); return 0; } static void end_power_clamp(void) { - int i; - struct task_struct *thread; - clamping = false; - /* - * make clamping visible to other cpus and give per cpu clamping threads - * sometime to exit, or gets killed later. 
- */ - smp_mb(); - msleep(20); - if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) { - for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) { - pr_debug("clamping thread for cpu %d alive, kill\n", i); - thread = *per_cpu_ptr(powerclamp_thread, i); - kthread_stop(thread); - } - } -} -static int powerclamp_cpu_callback(struct notifier_block *nfb, - unsigned long action, void *hcpu) -{ - unsigned long cpu = (unsigned long)hcpu; - struct task_struct *thread; - struct task_struct **percpu_thread = - per_cpu_ptr(powerclamp_thread, cpu); - - if (false == clamping) - goto exit_ok; - - switch (action) { - case CPU_ONLINE: - thread = kthread_create_on_node(clamp_thread, - (void *) cpu, - cpu_to_node(cpu), - "kidle_inject/%lu", cpu); - if (likely(!IS_ERR(thread))) { - kthread_bind(thread, cpu); - wake_up_process(thread); - *percpu_thread = thread; - } - /* prefer BSP as controlling CPU */ - if (cpu == 0) { - control_cpu = 0; - smp_mb(); - } - break; - case CPU_DEAD: - if (test_bit(cpu, cpu_clamping_mask)) { - pr_err("cpu %lu dead but powerclamping thread is not\n", - cpu); - kthread_stop(*percpu_thread); - } - if (cpu == control_cpu) { - control_cpu = smp_processor_id(); - smp_mb(); - } - } - -exit_ok: - return NOTIFY_OK; + clamp_cpus(); + del_timer_sync(&wakeup_timer); } -static struct notifier_block powerclamp_cpu_notifier = { - .notifier_call = powerclamp_cpu_callback, -}; - static int powerclamp_get_max_state(struct thermal_cooling_device *cdev, unsigned long *state) { @@ -656,6 +540,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev, } exit_set: + clamp_cpus(); return ret; } @@ -716,7 +601,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused) { int i = 0; - seq_printf(m, "controlling cpu: %d\n", control_cpu); seq_printf(m, "pct confidence steady dynamic (compensation)\n"); for (i = 0; i < MAX_TARGET_RATIO; i++) { seq_printf(m, "%d\t%lu\t%lu\t%lu\n", @@ -762,33 +646,20 @@ file_error: static int powerclamp_init(void) { int retval; - int bitmap_size; - - bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long); - cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL); - if (!cpu_clamping_mask) - return -ENOMEM; /* probe cpu features and ids here */ retval = powerclamp_probe(); if (retval) - goto exit_free; + goto exit; /* set default limit, maybe adjusted during runtime based on feedback */ window_size = 2; - register_hotcpu_notifier(&powerclamp_cpu_notifier); - - powerclamp_thread = alloc_percpu(struct task_struct *); - if (!powerclamp_thread) { - retval = -ENOMEM; - goto exit_unregister; - } cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL, &powerclamp_cooling_ops); if (IS_ERR(cooling_dev)) { retval = -ENODEV; - goto exit_free_thread; + goto exit; } if (!duration) @@ -798,23 +669,15 @@ static int powerclamp_init(void) return 0; -exit_free_thread: - free_percpu(powerclamp_thread); -exit_unregister: - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); -exit_free: - kfree(cpu_clamping_mask); +exit: return retval; } module_init(powerclamp_init); static void powerclamp_exit(void) { - unregister_hotcpu_notifier(&powerclamp_cpu_notifier); end_power_clamp(); - free_percpu(powerclamp_thread); thermal_cooling_device_unregister(cooling_dev); - kfree(cpu_clamping_mask); cancel_delayed_work_sync(&poll_pkg_cstate_work); debugfs_remove_recursive(debug_dir); diff --git a/include/linux/sched.h b/include/linux/sched.h index 8db31ef..2493942 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -163,6 +163,11 @@ 
extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
 	load += n*(FIXED_1-exp); \
 	load >>= FSHIFT;

+/*
+ * single value that denotes runtime == period, ie unlimited time.
+ */
+#define RUNTIME_INF ((u64)~0ULL)
+
 extern unsigned long total_forks;
 extern int nr_threads;
 DECLARE_PER_CPU(unsigned long, process_counts);
@@ -3002,6 +3007,10 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

 #ifdef CONFIG_CGROUP_SCHED
 extern struct task_group root_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
+extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_CGROUP_SCHED */

 extern int task_can_switch_user(struct user_struct *up,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..7471b06 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7097,6 +7097,7 @@ int in_sched_functions(unsigned long addr)
  * Every task in system belongs to this group at bootup.
  */
 struct task_group root_task_group;
+EXPORT_SYMBOL_GPL(root_task_group);
 LIST_HEAD(task_groups);
 #endif

@@ -8059,9 +8060,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;

-	if (tg == &root_task_group)
-		return -EINVAL;
-
 	/*
 	 * Ensure we have at some amount of bandwidth every period. This is
 	 * to prevent reaching a state of large arrears when throttled via
@@ -8141,6 +8139,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)

 	return tg_set_cfs_bandwidth(tg, period, quota);
 }
+EXPORT_SYMBOL_GPL(tg_set_cfs_quota);

 long tg_get_cfs_quota(struct task_group *tg)
 {
@@ -8164,6 +8163,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)

 	return tg_set_cfs_bandwidth(tg, period, quota);
 }
+EXPORT_SYMBOL_GPL(tg_set_cfs_period);

 long tg_get_cfs_period(struct task_group *tg)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a2a45c..20493e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -72,11 +72,6 @@ extern void update_cpu_load_active(struct rq *this_rq);
  * These are the 'tuning knobs' of the scheduler:
  */

-/*
- * single value that denotes runtime == period, ie unlimited time.
- */
-#define RUNTIME_INF ((u64)~0ULL)
-
 static inline int fair_policy(int policy)
 {
 	return policy == SCHED_NORMAL || policy == SCHED_BATCH;
--
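P.S. For anyone who wants to poke at the newly exported interface by
hand, here is a minimal, untested module sketch (the "clamp_demo" names
are made up; only tg_set_cfs_quota()/tg_set_cfs_period(),
root_task_group and the microsecond units come from the patch above):

	#include <linux/module.h>
	#include <linux/sched.h>

	static int __init clamp_demo_init(void)
	{
		/* allow 24ms of runtime per 48ms period: ~50% forced idle */
		tg_set_cfs_period(&root_task_group, 48000);
		tg_set_cfs_quota(&root_task_group, 24000);
		return 0;
	}

	static void __exit clamp_demo_exit(void)
	{
		/* a negative quota maps to RUNTIME_INF, i.e. unthrottled */
		tg_set_cfs_quota(&root_task_group, -1);
	}

	module_init(clamp_demo_init);
	module_exit(clamp_demo_exit);
	MODULE_LICENSE("GPL");	/* the new exports are EXPORT_SYMBOL_GPL */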