Date: Mon, 29 Oct 2018 18:42:28 +0000
From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, "Rafael J.
Wysocki" , Vincent Guittot , Viresh Kumar , Paul Turner , Quentin Perret , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Todd Kjos , Joel Fernandes , Steve Muckle , Suren Baghdasaryan Subject: Re: [PATCH v5 04/15] sched/core: uclamp: add CPU's clamp groups accounting Message-ID: <20181029184228.GA14309@e110439-lin> References: <20181029183311.29175-1-patrick.bellasi@arm.com> <20181029183311.29175-5-patrick.bellasi@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20181029183311.29175-5-patrick.bellasi@arm.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Slightly older version posted by error along with the correct one. Please comment on: Message-ID: <20181029183311.29175-6-patrick.bellasi@arm.com> Sorry for the noise. On 29-Oct 18:32, Patrick Bellasi wrote: > Utilization clamping allows to clamp the utilization of a CPU within a > [util_min, util_max] range which depends on the set of currently > RUNNABLE tasks on that CPU. > Each task references two "clamp groups" defining the minimum and maximum > utilization clamp values to be considered for that task. These clamp > value are mapped by a clamp group which is enforced on a CPU only when > there is at least one RUNNABLE task referencing that clamp group. > > When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups > active on that CPU can change. Since each clamp group enforces a > different utilization clamp value, once the set of these groups changes > it's required to re-compute what is the new "aggregated" clamp value to > apply on that CPU. > > Clamp values are always MAX aggregated for both util_min and util_max. > This is to ensure that no tasks can affect the performance of other > co-scheduled tasks which are either more boosted (i.e. with higher > util_min clamp) or less capped (i.e. with higher util_max clamp). > > Here we introduce the required support to properly reference count clamp > groups at each task enqueue/dequeue time. > > Tasks have a: > task_struct::uclamp::group_id[clamp_idx] > indexing, for each clamp index (i.e. util_{min,max}), the clamp group > they have to refcount at enqueue time. > > CPUs rq have a: > rq::uclamp::group[clamp_idx][group_idx].tasks > which is used to reference count how many tasks are currently RUNNABLE on > that CPU for each clamp group of each clamp index. > > The clamp value of each clamp group is tracked by > rq::uclamp::group[][].value > thus making rq::uclamp::group[][] an unordered array of clamp values. > However, the MAX aggregation of the currently active clamp groups is > implemented to minimize the number of times we need to scan the complete > (unordered) clamp group array to figure out the new max value. This > operation indeed happens only when we dequeue the last task of the clamp > group corresponding to the current max clamp, and thus the CPU is either > entering IDLE or going to schedule a less boosted or more clamped task. > Moreover, the expected number of different clamp values, which can be > configured at build time, is usually so small that a more advanced > ordering algorithm is not needed. In real use-cases we expect less then > 10 different clamp values for each clamp index. 
>
> Signed-off-by: Patrick Bellasi
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Paul Turner
> Cc: Suren Baghdasaryan
> Cc: Todd Kjos
> Cc: Joel Fernandes
> Cc: Juri Lelli
> Cc: Quentin Perret
> Cc: Dietmar Eggemann
> Cc: Morten Rasmussen
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
>
> ---
> Changes in v5:
>  Message-ID: <20180914134128.GP1413@e110439-lin>
>  - remove the unnecessary check for (group_id == UCLAMP_NOT_VALID)
>    in uclamp_cpu_put_id()
>  Message-ID: <20180912174456.GJ1413@e110439-lin>
>  - use bitfields to compress uclamp_group
>  Others:
>  - consistently use "unsigned int" for both clamp_id and group_id
>  - fixup documentation
>  - reduced usage of inline comments
>  - rebased on v4.19.0
>
> Changes in v4:
>  Message-ID: <20180816133249.GA2964@e110439-lin>
>  - keep the WARN in uclamp_cpu_put_id() but beautify that code a bit
>  - add another WARN on the unexpected condition of releasing a refcount
>    from a CPU which has a lower clamp value active
>  Other:
>  - ensure (and check) that all tasks have a valid group_id at
>    uclamp_cpu_get_id()
>  - rework the uclamp_cpu layout to better fit into just 2x64B cache lines
>  - fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
>  - rebased on v4.19-rc1
>
> Changes in v3:
>  Message-ID:
>  - add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
>  - rename UCLAMP_NONE into UCLAMP_NOT_VALID
>  Message-ID:
>  - a few typos fixed
>  Other:
>  - rebased on tip/sched/core
>
> Changes in v2:
>  Message-ID: <20180413093822.GM4129@hirez.programming.kicks-ass.net>
>  - refactored struct rq::uclamp_cpu to be more cache efficient:
>    no more holes, vectors re-arranged to match cache lines with expected
>    data locality
>  Message-ID: <20180413094615.GT4043@hirez.programming.kicks-ass.net>
>  - use *rq as parameter whenever already available
>  - add the scheduling class's uclamp_enabled marker
>  - get rid of the "confusing" single callback uclamp_task_update()
>    and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
>  - fix/remove "bad" comments
>  Message-ID: <20180413113337.GU14248@e110439-lin>
>  - remove inline from init_uclamp(), flag it __init
>  Other:
>  - rebased on v4.18-rc4
>  - improved documentation to make some concepts more explicit
> ---
>  include/linux/sched.h |   5 ++
>  kernel/sched/core.c   | 185 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h  |  49 +++++++++++
>  3 files changed, 239 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index facace271ea1..3ab1cbd4e3b1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -604,11 +604,16 @@ struct sched_dl_entity {
>   * The mapped bit is set whenever a task has been mapped on a clamp group for
>   * the first time. When this bit is set, any clamp group get (for a new clamp
>   * value) will be matched by a clamp group put (for the old clamp value).
> + *
> + * The active bit is set whenever a task has got an effective clamp group
> + * and value assigned, which can be different from the user-requested ones.
> + * This lets us know that a task is actually refcounting a CPU's clamp group.
>   */
>  struct uclamp_se {
>  	unsigned int value : SCHED_CAPACITY_SHIFT + 1;
>  	unsigned int group_id : order_base_2(UCLAMP_GROUPS);
>  	unsigned int mapped : 1;
> +	unsigned int active : 1;
>  };
>  #endif /* CONFIG_UCLAMP_TASK */
>
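
Since v5 packs uclamp_se into bitfields, it may be worth spelling out the
bit budget. A stand-alone check, assuming SCHED_CAPACITY_SHIFT == 10
(capacity scale 1024) and, purely for illustration, a
CONFIG_UCLAMP_GROUPS_COUNT of 5 (so UCLAMP_GROUPS == 6):

/* Bit-budget check for the packed uclamp_se layout (assumed config). */
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define UCLAMP_GROUPS		6	/* CONFIG_UCLAMP_GROUPS_COUNT + 1 */

#define VALUE_BITS	(SCHED_CAPACITY_SHIFT + 1)	/* 0..1024 inclusive */
#define GROUP_ID_BITS	3	/* order_base_2(6): smallest n with 2^n >= 6 */

static_assert(VALUE_BITS + GROUP_ID_BITS + 1 /* mapped */ + 1 /* active */
	      <= 32, "uclamp_se bitfields must fit in one unsigned int");

int main(void) { return 0; }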
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 654327d7f212..a98a96a7d9f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -783,6 +783,159 @@ union uclamp_map {
>   */
>  static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
>
> +/**
> + * uclamp_cpu_update: updates the utilization clamp of a CPU
> + * @rq: the CPU's rq whose utilization clamp has to be updated
> + * @clamp_id: the clamp index to update
> + *
> + * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
> + * clamp groups can change. Since each clamp group enforces a different
> + * utilization clamp value, once the set of active groups changes it can be
> + * necessary to re-compute the clamp value to apply for that CPU.
> + *
> + * For the specified clamp index, this method computes the new CPU utilization
> + * clamp to use until the next change in the set of active clamp groups.
> + */
> +static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> +{
> +	unsigned int group_id;
> +	int max_value = 0;
> +
> +	for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> +		if (!rq->uclamp.group[clamp_id][group_id].tasks)
> +			continue;
> +		/* Both min and max clamps are MAX aggregated */
> +		if (max_value < rq->uclamp.group[clamp_id][group_id].value)
> +			max_value = rq->uclamp.group[clamp_id][group_id].value;
> +		if (max_value >= SCHED_CAPACITY_SCALE)
> +			break;
> +	}
> +	rq->uclamp.value[clamp_id] = max_value;
> +}
> +
> +/**
> + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> + * @p: the task being enqueued on a CPU
> + * @rq: the CPU's rq where the clamp group has to be reference counted
> + * @clamp_id: the clamp index to update
> + *
> + * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
> + * the task's uclamp::group_id is reference counted on that CPU.
> + */
> +static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
> +				     unsigned int clamp_id)
> +{
> +	unsigned int group_id;
> +
> +	if (unlikely(!p->uclamp[clamp_id].mapped))
> +		return;
> +
> +	group_id = p->uclamp[clamp_id].group_id;
> +	p->uclamp[clamp_id].active = true;
> +
> +	rq->uclamp.group[clamp_id][group_id].tasks += 1;
> +
> +	if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
> +		rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
> +}
> +
> +/**
> + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> + * @p: the task being dequeued from a CPU
> + * @rq: the CPU's rq from which the clamp group has to be released
> + * @clamp_id: the clamp index to update
> + *
> + * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
> + * counted by the task is released.
> + * If this was the last task reference counting the current max clamp group,
> + * then the CPU clamping is updated to find the new max for the specified
> + * clamp index.
> + */
> +static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
> +				     unsigned int clamp_id)
> +{
> +	unsigned int clamp_value;
> +	unsigned int group_id;
> +
> +	if (unlikely(!p->uclamp[clamp_id].mapped))
> +		return;
> +
> +	group_id = p->uclamp[clamp_id].group_id;
> +	p->uclamp[clamp_id].active = false;
> +
> +	if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> +		rq->uclamp.group[clamp_id][group_id].tasks -= 1;
> +#ifdef CONFIG_SCHED_DEBUG
> +	else {
> +		WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> +		     cpu_of(rq), clamp_id, group_id);
> +	}
> +#endif
> +
> +	if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> +		return;
> +
> +	clamp_value = rq->uclamp.group[clamp_id][group_id].value;
> +#ifdef CONFIG_SCHED_DEBUG
> +	if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
> +		WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
> +		     cpu_of(rq), clamp_id, group_id);
> +	}
> +#endif
> +	if (clamp_value >= rq->uclamp.value[clamp_id])
> +		uclamp_cpu_update(rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_get(): increase CPU's clamp group refcount
> + * @rq: the CPU's rq where the task is enqueued
> + * @p: the task being enqueued
> + *
> + * When a task is enqueued on a CPU's rq, all the clamp groups currently
> + * enforced on the task are reference counted on that rq. Since not all
> + * scheduling classes have utilization clamping support, their tasks will
> + * be silently ignored.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements of the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
> +{
> +	unsigned int clamp_id;
> +
> +	if (unlikely(!p->sched_class->uclamp_enabled))
> +		return;
> +
> +	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +		uclamp_cpu_get_id(p, rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_put(): decrease CPU's clamp group refcount
> + * @rq: the CPU's rq from which the task is dequeued
> + * @p: the task being dequeued
> + *
> + * When a task is dequeued from a CPU's rq, all the clamp groups the task
> + * reference counted at enqueue time are released.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements of the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
> +{
> +	unsigned int clamp_id;
> +
> +	if (unlikely(!p->sched_class->uclamp_enabled))
> +		return;
> +
> +	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +		uclamp_cpu_put_id(p, rq, clamp_id);
> +}
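
To convince yourself of the refcounting scheme above: the get side is O(1)
(bump the group counter and, at most, raise the CPU max), while the put side
rescans the group array only when the departing group defined the current
max. A toy user-space model of both paths, with a made-up 6-group config for
a single clamp index and none of the kernel types:

/* Toy model of the per-CPU clamp group refcounting above. */
#include <stdio.h>

#define TOY_GROUPS 6

static struct { unsigned int tasks, value; } group[TOY_GROUPS] = {
	[1] = { .value = 256 }, [2] = { .value = 512 },
};
static unsigned int cpu_value;	/* aggregated max, as rq->uclamp.value[] */

static void toy_get(unsigned int gid)	/* enqueue side: O(1) */
{
	group[gid].tasks++;
	if (cpu_value < group[gid].value)
		cpu_value = group[gid].value;
}

static void toy_put(unsigned int gid)	/* dequeue side */
{
	if (group[gid].tasks)
		group[gid].tasks--;
	if (group[gid].tasks || group[gid].value < cpu_value)
		return;		/* still referenced, or max unaffected */

	cpu_value = 0;		/* last ref of the max group: rescan */
	for (gid = 0; gid < TOY_GROUPS; gid++)
		if (group[gid].tasks && group[gid].value > cpu_value)
			cpu_value = group[gid].value;
}

int main(void)
{
	toy_get(2);		/* boosted task RUNNABLE: max -> 512 */
	toy_get(1);		/* smaller boost: max stays 512 */
	toy_put(2);		/* last 512 task gone: rescan -> 256 */
	printf("cpu_value = %u (expect 256)\n", cpu_value);
	toy_put(1);		/* CPU going idle: rescan -> 0 */
	printf("cpu_value = %u (expect 0)\n", cpu_value);
	return 0;
}

Running it prints 256 and then 0: the full rescan fires exactly when the
last 512-boosted task goes away, and again when the CPU becomes idle.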
> +
>  /**
>   * uclamp_group_put: decrease the reference count for a clamp group
>   * @clamp_id: the clamp index which was affected by a task group
> @@ -836,6 +989,7 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
>  	unsigned int free_group_id;
>  	unsigned int group_id;
>  	unsigned long res;
> +	int cpu;
>
>  retry:
>
> @@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
>  	if (res != uc_map_old.data)
>  		goto retry;
>
> +	/* Ensure each CPU tracks the correct value for this clamp group */
> +	if (likely(uc_map_new.se_count > 1))
> +		goto done;
> +	for_each_possible_cpu(cpu) {
> +		struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> +
> +		/* The refcount is expected to be 0 for free groups */
> +		if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
> +			uc_cpu->group[clamp_id][group_id].tasks = 0;
> +#ifdef CONFIG_SCHED_DEBUG
> +			WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> +			     cpu, clamp_id, group_id);
> +#endif
> +		}
> +
> +		if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
> +			continue;
> +		uc_cpu->group[clamp_id][group_id].value = clamp_value;
> +	}
> +
> +done:
> +
>  	/* Update SE's clamp values and attach it to new clamp group */
>  	uc_se->value = clamp_value;
>  	uc_se->group_id = group_id;
> @@ -948,6 +1124,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
>  			clamp_value = uclamp_none(clamp_id);
>
>  		p->uclamp[clamp_id].mapped = false;
> +		p->uclamp[clamp_id].active = false;
>  		uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
>  	}
>  }
> @@ -959,9 +1136,13 @@ static void __init init_uclamp(void)
>  {
>  	struct uclamp_se *uc_se;
>  	unsigned int clamp_id;
> +	int cpu;
>
>  	mutex_init(&uclamp_mutex);
>
> +	for_each_possible_cpu(cpu)
> +		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
> +
>  	memset(uclamp_maps, 0, sizeof(uclamp_maps));
>  	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>  		uc_se = &init_task.uclamp[clamp_id];
> @@ -970,6 +1151,8 @@ static void __init init_uclamp(void)
>  }
>
>  #else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
>  static inline int __setscheduler_uclamp(struct task_struct *p,
>  					const struct sched_attr *attr)
>  {
> @@ -987,6 +1170,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  	if (!(flags & ENQUEUE_RESTORE))
>  		sched_info_queued(rq, p);
>
> +	uclamp_cpu_get(rq, p);
>  	p->sched_class->enqueue_task(rq, p, flags);
>  }
>
> @@ -998,6 +1182,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  	if (!(flags & DEQUEUE_SAVE))
>  		sched_info_dequeued(rq, p);
>
> +	uclamp_cpu_put(rq, p);
>  	p->sched_class->dequeue_task(rq, p, flags);
>  }
>
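
One more note on the uclamp_group_get() hunk above: the mapping update is a
plain read/modify/cmpxchg retry loop on the union's data word. Here is the
shape of it in stand-alone C11 atomics, with made-up field widths; the
bitfields on a 64-bit word are a GCC/Clang extension, mirroring the kernel's
use of atomic_long_cmpxchg() on union uclamp_map:

/* Sketch of a lockless "get" on one slot of a toy clamp-group map. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

union map_bits {
	uint64_t data;
	struct { uint64_t value : 11, se_count : 53; };
};

static _Atomic uint64_t map;	/* one slot of a toy uclamp_maps[][] */

static void map_get(unsigned int new_value)
{
	union map_bits old, new;

	do {
		old.data = atomic_load(&map);
		new = old;
		new.value = new_value;	/* (re)set the group's clamp value */
		new.se_count++;		/* one more scheduling entity */
	} while (!atomic_compare_exchange_weak(&map, &old.data, new.data));
}

int main(void)
{
	map_get(512);
	map_get(512);

	union map_bits cur = { .data = atomic_load(&map) };
	printf("value=%u se_count=%u\n",
	       (unsigned int)cur.value, (unsigned int)cur.se_count);
	return 0;
}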
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 947ab14d3d5b..1755c9c9f4f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -766,6 +766,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
>  #endif
>  #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * struct uclamp_group - Utilization clamp group
> + * @value: utilization clamp value for tasks in this clamp group
> + * @tasks: number of RUNNABLE tasks in this clamp group
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_group {
> +	unsigned long value : SCHED_CAPACITY_SHIFT + 1;
> +	unsigned long tasks : BITS_PER_LONG - SCHED_CAPACITY_SHIFT - 1;
> +};
> +
> +/**
> + * struct uclamp_cpu - CPU's utilization clamp
> + * @value: currently active clamp values for a CPU
> + * @group: utilization clamp groups affecting a CPU
> + *
> + * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
> + * A clamp value affects a CPU when there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * We have up to UCLAMP_CNT possible different clamp values, currently
> + * only two: minimum utilization and maximum utilization.
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + *   utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + *   maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
> + * array to track the metrics required to compute all the per-CPU utilization
> + * clamp values. The additional slot is used to track the default clamp
> + * values, i.e. no min/max clamping at all.
> + */
> +struct uclamp_cpu {
> +	struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
> +	int value[UCLAMP_CNT];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -804,6 +848,11 @@ struct rq {
>  	unsigned long nr_load_updates;
>  	u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +	/* Utilization clamp values based on CPU's RUNNABLE tasks */
> +	struct uclamp_cpu uclamp ____cacheline_aligned;
> +#endif
> +
>  	struct cfs_rq cfs;
>  	struct rt_rq rt;
>  	struct dl_rq dl;
> --
> 2.18.0

-- 
#include
Patrick Bellasi