Date: Fri, 11 Dec 2015 11:04:43 +0000
From: Juri Lelli
To: Steve Muckle
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org,
	linux-pm@vger.kernel.org, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Patrick Bellasi, Michael Turquette, Ricky Liang
Subject: Re: [RFCv6 PATCH 03/10] sched: scheduler-driven cpu frequency selection
Message-ID: <20151211110443.GA6645@e106622-lin>
References: <1449641971-20827-1-git-send-email-smuckle@linaro.org>
 <1449641971-20827-4-git-send-email-smuckle@linaro.org>
In-Reply-To: <1449641971-20827-4-git-send-email-smuckle@linaro.org>

Hi Steve,

On 08/12/15 22:19, Steve Muckle wrote:
> From: Michael Turquette
>
> Scheduler-driven CPU frequency selection hopes to exploit both
> per-task and global information in the scheduler to improve frequency
> selection policy, achieving lower power consumption, improved
> responsiveness/performance, and less reliance on heuristics and
> tunables. For further discussion on the motivation of this integration
> see [0].
>
> This patch implements a shim layer between the Linux scheduler and the
> cpufreq subsystem. The interface accepts capacity requests from the
> CFS, RT and deadline sched classes. The requests from each sched class
> are summed on each CPU with a margin applied to the CFS and RT
> capacity requests to provide some headroom. Deadline requests are
> expected to be precise enough given their nature to not require
> headroom. The maximum total capacity request for a CPU in a frequency
> domain drives the requested frequency for that domain.
>
> Policy is determined by both the sched classes and this shim layer.
>
> Note that this algorithm is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler, aside from a throttling mechanism to ensure frequency
> changes are not attempted faster than the hardware can accommodate them.
>
> Thanks to Juri Lelli for contributing design ideas,
> code and test results, and to Ricky Liang
> for initialization and static key inc/dec fixes.
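
Just to make the aggregation described above concrete for anyone following
along, here is a tiny standalone sketch (a userspace toy of my own, not
kernel code from the patch; the per-CPU request values and policy->max are
made up) of how CFS+RT requests get the margin applied, DL is added as-is,
and the maximum across the frequency domain is turned into a frequency
request:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)
#define CAPACITY_MARGIN		1280	/* ~1.25x headroom, as in the patch */

struct reqs { unsigned long cfs, rt, dl; };

/* margin on CFS+RT, none on DL -- mirrors update_cpu_capacity_request() */
static unsigned long total_request(const struct reqs *r)
{
	return (r->cfs + r->rt) * CAPACITY_MARGIN / SCHED_CAPACITY_SCALE + r->dl;
}

int main(void)
{
	/* two CPUs sharing one frequency domain, policy->max = 1800000 kHz */
	struct reqs cpu[2] = { { 300, 0, 0 }, { 500, 100, 128 } };
	unsigned long policy_max = 1800000, max_cap = 0, freq;
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long total = total_request(&cpu[i]);

		if (total > max_cap)
			max_cap = total;
	}

	/* same conversion as update_fdomain_capacity_request() */
	freq = max_cap * policy_max >> SCHED_CAPACITY_SHIFT;
	printf("max capacity %lu -> %lu kHz (before the freq table lookup)\n",
	       max_cap, freq);
	return 0;
}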
>
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
>
> [smuckle@linaro.org: various additions and fixes, revised commit text]
>
> CC: Ricky Liang
> Signed-off-by: Michael Turquette
> Signed-off-by: Juri Lelli
> Signed-off-by: Steve Muckle
> ---
>  drivers/cpufreq/Kconfig      |  20 +++
>  include/linux/cpufreq.h      |   3 +
>  include/linux/sched.h        |   8 +
>  kernel/sched/Makefile        |   1 +
>  kernel/sched/cpufreq_sched.c | 364 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         |  51 ++++++
>  6 files changed, 447 insertions(+)
>  create mode 100644 kernel/sched/cpufreq_sched.c
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 659879a..6f2e96c 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -102,6 +102,14 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
>  	  Be aware that not all cpufreq drivers support the conservative
>  	  governor. If unsure have a look at the help section of the
>  	  driver. Fallback governor will be the performance governor.
> +
> +config CPU_FREQ_DEFAULT_GOV_SCHED
> +	bool "sched"
> +	select CPU_FREQ_GOV_SCHED
> +	help
> +	  Use the CPUfreq governor 'sched' as default. This scales
> +	  cpu frequency using CPU utilization estimates from the
> +	  scheduler.
>  endchoice
>
>  config CPU_FREQ_GOV_PERFORMANCE
> @@ -183,6 +191,18 @@ config CPU_FREQ_GOV_CONSERVATIVE
>
>  	  If in doubt, say N.
>
> +config CPU_FREQ_GOV_SCHED
> +	bool "'sched' cpufreq governor"
> +	depends on CPU_FREQ

We depend on IRQ_WORK as well, which in turn I think depends on SMP. As
briefly discussed with Peter on IRC, we might want to use
smp_call_function_single_async() instead, to break this dependency chain
(and be able to use this governor on UP as well). A rough sketch of what
that could look like follows the hunk below.

> +	select CPU_FREQ_GOV_COMMON
> +	help
> +	  'sched' - this governor scales cpu frequency from the
> +	  scheduler as a function of cpu capacity utilization. It does
> +	  not evaluate utilization on a periodic basis (as ondemand
> +	  does) but instead is event-driven by the scheduler.
> +
> +	  If in doubt, say N.
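
For reference, the smp_call_function_single_async() alternative mentioned
above could look roughly like this (untested sketch only; it assumes
struct gov_data grows a 'struct call_single_data csd' member in place of
the irq_work, and that linux/smp.h gets included):

static void cpufreq_sched_wake_worker(void *info)
{
	struct gov_data *gd = info;

	wake_up_process(gd->task);
}

static void cpufreq_sched_init_csd(struct gov_data *gd)
{
	/* would replace init_irq_work() in cpufreq_sched_policy_init() */
	gd->csd.func = cpufreq_sched_wake_worker;
	gd->csd.info = gd;
}

static void cpufreq_sched_kick_worker(struct gov_data *gd, int cpu)
{
	/* would replace irq_work_queue_on() in update_fdomain_capacity_request() */
	smp_call_function_single_async(cpu, &gd->csd);
}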
> +
>  comment "CPU frequency scaling drivers"
>
>  config CPUFREQ_DT
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 7f8c63d..7e4bde1 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -495,6 +495,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
>  #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
>  extern struct cpufreq_governor cpufreq_gov_conservative;
>  #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED)
> +extern struct cpufreq_governor cpufreq_gov_sched;
> +#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_sched)
>  #endif
>
>  /*********************************************************************
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3b0de68..d910a31 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -927,6 +927,14 @@ enum cpu_idle_type {
>  #define SCHED_CAPACITY_SHIFT	10
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>
> +struct sched_capacity_reqs {
> +	unsigned long cfs;
> +	unsigned long rt;
> +	unsigned long dl;
> +
> +	unsigned long total;
> +};
> +
>  /*
>   * Wake-queues are lists of tasks with a pending wakeup, whose
>   * callers have already marked the task as woken internally,
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..90ed832 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> new file mode 100644
> index 0000000..af8b5bc
> --- /dev/null
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -0,0 +1,364 @@
> +/*
> + * Copyright (C) 2015 Michael Turquette
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#include "sched.h"
> +
> +#define THROTTLE_NSEC		50000000 /* 50ms default */
> +
> +struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;
> +static bool __read_mostly cpufreq_driver_slow;
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static struct cpufreq_governor cpufreq_gov_sched;
> +#endif
> +
> +/*
> + * Capacity margin added to CFS and RT capacity requests to provide
> + * some head room if task utilization further increases.
> + */
> +unsigned int capacity_margin = 1280;
> +
> +static DEFINE_PER_CPU(unsigned long, enabled);
> +DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @requested_freq: last frequency requested by the sched governor
> + *
> + * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
> + * per-policy instance of it is created when the cpufreq_sched governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct gov_data {
> +	ktime_t throttle;
> +	unsigned int throttle_nsec;
> +	struct task_struct *task;
> +	struct irq_work irq_work;
> +	unsigned int requested_freq;
> +};
> +
> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy,
> +					    unsigned int freq)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +
> +	/* avoid race with cpufreq_sched_stop */
> +	if (!down_write_trylock(&policy->rwsem))
> +		return;
> +
> +	__cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> +
> +	gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);

As I think you proposed at Connect, we could use post frequency transition
notifiers to implement throttling. Is this something that you have already
tried implementing, or are planning to experiment with?

> +	up_write(&policy->rwsem);
> +}
> +
> +static bool finish_last_request(struct gov_data *gd)
> +{
> +	ktime_t now = ktime_get();
> +
> +	if (ktime_after(now, gd->throttle))
> +		return false;
> +
> +	while (1) {
> +		int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
> +
> +		usec_left /= NSEC_PER_USEC;
> +		usleep_range(usec_left, usec_left + 100);
> +		now = ktime_get();
> +		if (ktime_after(now, gd->throttle))
> +			return true;
> +	}
> +}
> +
> +/*
> + * we pass in struct cpufreq_policy. This is safe because changing out the
> + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
> + * which tears down all of the data structures and __cpufreq_governor(policy,
> + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
> + * new policy pointer
> + */
> +static int cpufreq_sched_thread(void *data)
> +{
> +	struct sched_param param;
> +	struct cpufreq_policy *policy;
> +	struct gov_data *gd;
> +	unsigned int new_request = 0;
> +	unsigned int last_request = 0;
> +	int ret;
> +
> +	policy = (struct cpufreq_policy *) data;
> +	gd = policy->governor_data;
> +
> +	param.sched_priority = 50;
> +	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> +	if (ret) {
> +		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> +		do_exit(-EINVAL);
> +	} else {
> +		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> +			 __func__, gd->task->pid);
> +	}
> +
> +	do {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		new_request = gd->requested_freq;
> +		if (new_request == last_request) {
> +			schedule();
> +		} else {

Shouldn't we do the following here?

@@ -125,9 +125,9 @@ static int cpufreq_sched_thread(void *data)
 	}
 
 	do {
-		set_current_state(TASK_INTERRUPTIBLE);
 		new_request = gd->requested_freq;
 		if (new_request == last_request) {
+			set_current_state(TASK_INTERRUPTIBLE);
 			schedule();
 		} else {
 			/*

Otherwise we set the task to INTERRUPTIBLE state right after it has been
woken up.
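
Going back to the throttling point above, the post-transition hook could
look roughly like the sketch below (untested, not something I have run; it
assumes struct gov_data grows a 'struct notifier_block freq_nb' member,
registered per policy at init time with
cpufreq_register_notifier(&gd->freq_nb, CPUFREQ_TRANSITION_NOTIFIER)):

static int cpufreq_sched_freq_notify(struct notifier_block *nb,
				     unsigned long val, void *data)
{
	struct gov_data *gd = container_of(nb, struct gov_data, freq_nb);

	/*
	 * data points to a struct cpufreq_freqs; transition notifiers fire
	 * for every policy, so a real implementation would also need to
	 * check that the transition belongs to this gd's policy before
	 * re-arming the throttle window.
	 */
	if (val == CPUFREQ_POSTCHANGE)
		gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);

	return NOTIFY_OK;
}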
Thanks,

- Juri

> +			/*
> +			 * if the frequency thread sleeps while waiting to be
> +			 * unthrottled, start over to check for a newer request
> +			 */
> +			if (finish_last_request(gd))
> +				continue;
> +			last_request = new_request;
> +			cpufreq_sched_try_driver_target(policy, new_request);
> +		}
> +	} while (!kthread_should_stop());
> +
> +	return 0;
> +}
> +
> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> +{
> +	struct gov_data *gd;
> +
> +	gd = container_of(irq_work, struct gov_data, irq_work);
> +	if (!gd)
> +		return;
> +
> +	wake_up_process(gd->task);
> +}
> +
> +static void update_fdomain_capacity_request(int cpu)
> +{
> +	unsigned int freq_new, index_new, cpu_tmp;
> +	struct cpufreq_policy *policy;
> +	struct gov_data *gd;
> +	unsigned long capacity = 0;
> +
> +	/*
> +	 * Avoid grabbing the policy if possible. A test is still
> +	 * required after locking the CPU's policy to avoid racing
> +	 * with the governor changing.
> +	 */
> +	if (!per_cpu(enabled, cpu))
> +		return;
> +
> +	policy = cpufreq_cpu_get(cpu);
> +	if (IS_ERR_OR_NULL(policy))
> +		return;
> +
> +	if (policy->governor != &cpufreq_gov_sched ||
> +	    !policy->governor_data)
> +		goto out;
> +
> +	gd = policy->governor_data;
> +
> +	/* find max capacity requested by cpus in this policy */
> +	for_each_cpu(cpu_tmp, policy->cpus) {
> +		struct sched_capacity_reqs *scr;
> +
> +		scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
> +		capacity = max(capacity, scr->total);
> +	}
> +
> +	/* Convert the new maximum capacity request into a cpu frequency */
> +	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> +	if (cpufreq_frequency_table_target(policy, policy->freq_table,
> +					   freq_new, CPUFREQ_RELATION_L,
> +					   &index_new))
> +		goto out;
> +	freq_new = policy->freq_table[index_new].frequency;
> +
> +	if (freq_new == gd->requested_freq)
> +		goto out;
> +
> +	gd->requested_freq = freq_new;
> +
> +	/*
> +	 * Throttling is not yet supported on platforms with fast cpufreq
> +	 * drivers.
> +	 */
> +	if (cpufreq_driver_slow)
> +		irq_work_queue_on(&gd->irq_work, cpu);
> +	else
> +		cpufreq_sched_try_driver_target(policy, freq_new);
> +
> +out:
> +	cpufreq_cpu_put(policy);
> +}
> +
> +void update_cpu_capacity_request(int cpu, bool request)
> +{
> +	unsigned long new_capacity;
> +	struct sched_capacity_reqs *scr;
> +
> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
> +
> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> +
> +	new_capacity = scr->cfs + scr->rt;
> +	new_capacity = new_capacity * capacity_margin
> +		/ SCHED_CAPACITY_SCALE;
> +	new_capacity += scr->dl;
> +
> +	if (new_capacity == scr->total)
> +		return;
> +
> +	scr->total = new_capacity;
> +	if (request)
> +		update_fdomain_capacity_request(cpu);
> +}
> +
> +static inline void set_sched_freq(void)
> +{
> +	static_key_slow_inc(&__sched_freq);
> +}
> +
> +static inline void clear_sched_freq(void)
> +{
> +	static_key_slow_dec(&__sched_freq);
> +}
> +
> +static int cpufreq_sched_policy_init(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd;
> +	int cpu;
> +
> +	for_each_cpu(cpu, policy->cpus)
> +		memset(&per_cpu(cpu_sched_capacity_reqs, cpu), 0,
> +		       sizeof(struct sched_capacity_reqs));
> +
> +	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
> +	if (!gd)
> +		return -ENOMEM;
> +
> +	gd->throttle_nsec = policy->cpuinfo.transition_latency ?
> +			    policy->cpuinfo.transition_latency :
> +			    THROTTLE_NSEC;
> +	pr_debug("%s: throttle threshold = %u [ns]\n",
> +		 __func__, gd->throttle_nsec);
> +
> +	if (cpufreq_driver_is_slow()) {
> +		cpufreq_driver_slow = true;
> +		gd->task = kthread_create(cpufreq_sched_thread, policy,
> +					  "kschedfreq:%d",
> +					  cpumask_first(policy->related_cpus));
> +		if (IS_ERR_OR_NULL(gd->task)) {
> +			pr_err("%s: failed to create kschedfreq thread\n",
> +			       __func__);
> +			goto err;
> +		}
> +		get_task_struct(gd->task);
> +		kthread_bind_mask(gd->task, policy->related_cpus);
> +		wake_up_process(gd->task);
> +		init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
> +	}
> +
> +	policy->governor_data = gd;
> +	set_sched_freq();
> +
> +	return 0;
> +
> +err:
> +	kfree(gd);
> +	return -ENOMEM;
> +}
> +
> +static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +
> +	clear_sched_freq();
> +	if (cpufreq_driver_slow) {
> +		kthread_stop(gd->task);
> +		put_task_struct(gd->task);
> +	}
> +
> +	policy->governor_data = NULL;
> +
> +	kfree(gd);
> +	return 0;
> +}
> +
> +static int cpufreq_sched_start(struct cpufreq_policy *policy)
> +{
> +	int cpu;
> +
> +	for_each_cpu(cpu, policy->cpus)
> +		per_cpu(enabled, cpu) = 1;
> +
> +	return 0;
> +}
> +
> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> +{
> +	int cpu;
> +
> +	for_each_cpu(cpu, policy->cpus)
> +		per_cpu(enabled, cpu) = 0;
> +
> +	return 0;
> +}
> +
> +static int cpufreq_sched_setup(struct cpufreq_policy *policy,
> +			       unsigned int event)
> +{
> +	switch (event) {
> +	case CPUFREQ_GOV_POLICY_INIT:
> +		return cpufreq_sched_policy_init(policy);
> +	case CPUFREQ_GOV_POLICY_EXIT:
> +		return cpufreq_sched_policy_exit(policy);
> +	case CPUFREQ_GOV_START:
> +		return cpufreq_sched_start(policy);
> +	case CPUFREQ_GOV_STOP:
> +		return cpufreq_sched_stop(policy);
> +	case CPUFREQ_GOV_LIMITS:
> +		break;
> +	}
> +	return 0;
> +}
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_sched = {
> +	.name		= "sched",
> +	.governor	= cpufreq_sched_setup,
> +	.owner		= THIS_MODULE,
> +};
> +
> +static int __init cpufreq_sched_init(void)
> +{
> +	int cpu;
> +
> +	for_each_cpu(cpu, cpu_possible_mask)
> +		per_cpu(enabled, cpu) = 0;
> +	return cpufreq_register_governor(&cpufreq_gov_sched);
> +}
> +
> +/* Try to make this the default governor */
> +fs_initcall(cpufreq_sched_init);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a5a6b3e..a88dbec 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1383,6 +1383,57 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  }
>  #endif
>
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHED
> +extern unsigned int capacity_margin;
> +extern struct static_key __sched_freq;
> +
> +static inline bool sched_freq(void)
> +{
> +	return static_key_false(&__sched_freq);
> +}
> +
> +DECLARE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
> +void update_cpu_capacity_request(int cpu, bool request);
> +
> +static inline void set_cfs_cpu_capacity(int cpu, bool request,
> +					unsigned long capacity)
> +{
> +	if (per_cpu(cpu_sched_capacity_reqs, cpu).cfs != capacity) {
> +		per_cpu(cpu_sched_capacity_reqs, cpu).cfs = capacity;
> +		update_cpu_capacity_request(cpu, request);
> +	}
> +}
> +
> +static inline void set_rt_cpu_capacity(int cpu, bool request,
> +				       unsigned long capacity)
> +{
> +	if (per_cpu(cpu_sched_capacity_reqs, cpu).rt != capacity) {
> +		per_cpu(cpu_sched_capacity_reqs, cpu).rt = capacity;
> +		update_cpu_capacity_request(cpu, request);
> +	}
> +}
> +
> +static inline void set_dl_cpu_capacity(int cpu, bool request,
> +				       unsigned long capacity)
> +{
> +	if (per_cpu(cpu_sched_capacity_reqs, cpu).dl != capacity) {
> +		per_cpu(cpu_sched_capacity_reqs, cpu).dl = capacity;
> +		update_cpu_capacity_request(cpu, request);
> +	}
> +}
> +#else
> +static inline bool sched_freq(void) { return false; }
> +static inline void set_cfs_cpu_capacity(int cpu, bool request,
> +					unsigned long capacity)
> +{ }
> +static inline void set_rt_cpu_capacity(int cpu, bool request,
> +				       unsigned long capacity)
> +{ }
> +static inline void set_dl_cpu_capacity(int cpu, bool request,
> +				       unsigned long capacity)
> +{ }
> +#endif
> +
>  static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>  {
>  	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
> --
> 2.4.10
>