Hi,
This is v2 of the RFC set implementing frequency/cpu invariance and OPP
selection for SCHED_DEADLINE [1]. The set is based on mainline as of today
(ae64f9bd1d36).
High-level description of the patches:
o [01-02]/08 add the necessary links to start accounting the DEADLINE
contribution to OPP selection
o 03/08 is a temporary solution to make it possible (on ARM) to change
frequency for DEADLINE tasks (which could otherwise delay the SCHED_FIFO
worker kthread); the proper solution would be to be able to issue frequency
transitions from an atomic context
o [04-05]/08 are schedutil changes that cope with the fact that DEADLINE
doesn't require a periodic OPP selection triggering point
o [06-07]/08 make arch_scale_{freq,cpu}_capacity() available on !CONFIG_SMP
configurations too
o 08/08 implements frequency/cpu invariance for tasks' reservation
parameters, which basically means that we implement GRUB-PA [2] (see the
short sketch below)
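In a nutshell, for tasks not using GRUB reclaiming, 08/08 ends up doing
something like the following in update_curr_dl() (simplified sketch of the
actual hunk in the last patch):

    scale_freq = arch_scale_freq_capacity(cpu);
    scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
    scaled_delta_exec = cap_scale(delta_exec, scale_freq);
    scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
    dl_se->runtime -= scaled_delta_exec;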
Changes w.r.t. RFCv1:
- rebase on mainline
- return -EINVAL for user trying to use the new flag (Peter)
- s/SPECIAL/SUGOV/ in the flag name (several people asked for better
naming; Steve thinks SUGOV is more greppable than the alternatives)
- give the worker kthread a fake (unused) bandwidth, so that if priority
inheritance is triggered we don't BUG_ON on zero runtime
- filter out fake bandwidth when computing SCHED_DEADLINE bandwidth (fix by
Claudio Scordino)
Please have a look. Feedback and comments are, as usual, more than welcome.
In case you would like to test this out:
https://github.com/jlelli/linux.git upstream/deadline/freq-rfc-v2
Best,
- Juri
[1] v0: https://lkml.org/lkml/2017/5/23/249
v1: https://lkml.org/lkml/2017/7/5/139
[2] C. Scordino, G. Lipari, A Resource Reservation Algorithm for
Power-Aware Scheduling of Periodic and Aperiodic Real-Time Tasks,
IEEE Transactions on Computers, December 2006
Juri Lelli (8):
sched/cpufreq_schedutil: make use of DEADLINE utilization signal
sched/deadline: move cpu frequency selection triggering points
sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE
sched/cpufreq_schedutil: split utilization signals
sched/cpufreq_schedutil: always consider all CPUs when deciding next
freq
sched/sched.h: remove sd arch_scale_freq_capacity parameter
sched/sched.h: move arch_scale_{freq,cpu}_capacity outside CONFIG_SMP
sched/deadline: make bandwidth enforcement scale-invariant
include/linux/arch_topology.h | 2 +-
include/linux/sched.h | 1 +
include/linux/sched/cpufreq.h | 2 -
include/linux/sched/topology.h | 12 ++--
kernel/sched/core.c | 15 ++++-
kernel/sched/cpufreq_schedutil.c | 84 +++++++++++++++---------
kernel/sched/deadline.c | 136 ++++++++++++++++++++++++++++-----------
kernel/sched/fair.c | 4 +-
kernel/sched/sched.h | 53 +++++++++++----
9 files changed, 217 insertions(+), 92 deletions(-)
--
2.14.3
From: Juri Lelli <[email protected]>
Since SCHED_DEADLINE doesn't track a utilization signal (but instead
reserves a fraction of CPU bandwidth for the tasks admitted to the system),
there is no point in evaluating frequency changes at every tick event.
Move the frequency selection triggering points to where running_bw changes.
Co-authored-by: Claudio Scordino <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
---
kernel/sched/deadline.c | 7 ++++---
kernel/sched/sched.h | 12 ++++++------
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2473736c7616..7e4038bf9954 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -86,6 +86,8 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_DL);
}
static inline
@@ -98,6 +100,8 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
if (dl_rq->running_bw > old)
dl_rq->running_bw = 0;
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_DL);
}
static inline
@@ -1134,9 +1138,6 @@ static void update_curr_dl(struct rq *rq)
return;
}
- /* kick cpufreq (see the comment in kernel/sched/sched.h). */
- cpufreq_update_util(rq, SCHED_CPUFREQ_DL);
-
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a212de..a1730e39cbc6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2096,14 +2096,14 @@ DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
* The way cpufreq is currently arranged requires it to evaluate the CPU
* performance state (frequency/voltage) on a regular basis to prevent it from
* being stuck in a completely inadequate performance level for too long.
- * That is not guaranteed to happen if the updates are only triggered from CFS,
- * though, because they may not be coming in if RT or deadline tasks are active
- * all the time (or there are RT and DL tasks only).
+ * That is not guaranteed to happen if the updates are only triggered from CFS
+ * and DL, though, because they may not be coming in if only RT tasks are
+ * active all the time (or there are RT tasks only).
*
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * As a workaround for that issue, this function is called periodically by the
+ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
* but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
+ * solutions targeted more specifically at RT tasks.
*/
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
--
2.14.3
From: Juri Lelli <[email protected]>
To be able to treat the utilization signals of different scheduling classes
in different ways (e.g., the CFS signal might be stale while the DEADLINE
signal is never stale by design), we need to split the sugov_cpu::util
signal in two: util_cfs and util_dl.
This patch does that, also changing sugov_get_util()'s parameter list.
After this change, aggregation of the different signals has to be performed
by sugov_get_util()'s callers (so that they can decide what to do with the
different signals).
Suggested-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c22457868ee6..a3072f24dc16 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -60,7 +60,8 @@ struct sugov_cpu {
u64 last_update;
/* The fields below are only needed when sharing a policy. */
- unsigned long util;
+ unsigned long util_cfs;
+ unsigned long util_dl;
unsigned long max;
unsigned int flags;
@@ -176,20 +177,25 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}
-static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
- struct rq *rq = cpu_rq(cpu);
- unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
- >> BW_SHIFT;
+ struct rq *rq = cpu_rq(sg_cpu->cpu);
- *max = arch_scale_cpu_capacity(NULL, cpu);
+ sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ sg_cpu->util_cfs = rq->cfs.avg.util_avg;
+ sg_cpu->util_dl = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
+ >> BW_SHIFT;
+}
+
+static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
+{
/*
* Ideally we would like to set util_dl as min/guaranteed freq and
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- *util = min(rq->cfs.avg.util_avg + dl_util, *max);
+ return min(sg_cpu->util_cfs + sg_cpu->util_dl, sg_cpu->max);
}
static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
@@ -280,7 +286,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
if (flags & SCHED_CPUFREQ_RT) {
next_f = policy->cpuinfo.max_freq;
} else {
- sugov_get_util(&util, &max, sg_cpu->cpu);
+ sugov_get_util(sg_cpu);
+ max = sg_cpu->max;
+ util = sugov_aggregate_util(sg_cpu);
sugov_iowait_boost(sg_cpu, &util, &max);
next_f = get_next_freq(sg_policy, util, max);
/*
@@ -325,8 +333,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;
- j_util = j_sg_cpu->util;
j_max = j_sg_cpu->max;
+ j_util = sugov_aggregate_util(j_sg_cpu);
if (j_util * max > j_max * util) {
util = j_util;
max = j_max;
@@ -343,15 +351,11 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
- unsigned long util, max;
unsigned int next_f;
- sugov_get_util(&util, &max, sg_cpu->cpu);
-
raw_spin_lock(&sg_policy->update_lock);
- sg_cpu->util = util;
- sg_cpu->max = max;
+ sugov_get_util(sg_cpu);
sg_cpu->flags = flags;
sugov_set_iowait_boost(sg_cpu, time, flags);
--
2.14.3
From: Juri Lelli <[email protected]>
No assumption can be made about the rate at which frequency updates get
triggered, as there are scheduling policies (like SCHED_DEADLINE) which
don't trigger them that frequently.
Remove such an assumption from the code by always considering the
SCHED_DEADLINE utilization signal as not stale.
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index a3072f24dc16..b7a576c8dcaa 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -318,17 +318,21 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
s64 delta_ns;
/*
- * If the CPU utilization was last updated before the previous
- * frequency update and the time elapsed between the last update
- * of the CPU utilization and the last frequency update is long
- * enough, don't take the CPU into account as it probably is
- * idle now (and clear iowait_boost for it).
+ * If the CFS CPU utilization was last updated before the
+ * previous frequency update and the time elapsed between the
+ * last update of the CPU utilization and the last frequency
+ * update is long enough, reset iowait_boost and util_cfs, as
+ * they are now probably stale. However, still consider the
+ * CPU contribution if it has some DEADLINE utilization
+ * (util_dl).
*/
delta_ns = time - j_sg_cpu->last_update;
if (delta_ns > TICK_NSEC) {
j_sg_cpu->iowait_boost = 0;
j_sg_cpu->iowait_boost_pending = false;
- continue;
+ j_sg_cpu->util_cfs = 0;
+ if (j_sg_cpu->util_dl == 0)
+ continue;
}
if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;
--
2.14.3
From: Juri Lelli <[email protected]>
The sd parameter is never used in arch_scale_freq_capacity() (and it's hard
to see how information coming from scheduling domains might help with
frequency invariance scaling).
Remove it, also in anticipation of moving arch_scale_freq_capacity()
outside CONFIG_SMP.
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/arch_topology.h | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 304511267c82..2b709416de05 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -27,7 +27,7 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
DECLARE_PER_CPU(unsigned long, freq_scale);
static inline
-unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
+unsigned long topology_get_freq_scale(int cpu)
{
return per_cpu(freq_scale, cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4037e19bbca2..535d9409f4af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3122,7 +3122,7 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
u64 periods;
- scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ scale_freq = arch_scale_freq_capacity(cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
delta += sa->period_contrib;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 280b421a82e8..b64207d54a55 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1712,7 +1712,7 @@ extern void sched_avg_update(struct rq *rq);
#ifndef arch_scale_freq_capacity
static __always_inline
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_freq_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}
@@ -1731,7 +1731,7 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
- rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
+ rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
sched_avg_update(rq);
}
#else
--
2.14.3
From: Juri Lelli <[email protected]>
Apply frequency and cpu scale-invariance correction factors to bandwidth
enforcement (similar to what we already do for fair utilization tracking).
Each delta_exec gets scaled considering the current frequency and the
maximum cpu capacity; this means that the reservation runtime parameter
(which needs to be specified by profiling the task's execution at max
frequency on the biggest-capacity core) gets scaled accordingly.
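As an illustrative example (made-up numbers): on a CPU currently running at
half of its max frequency (scale_freq = 512/1024) which also has half the
capacity of the biggest core (scale_cpu = 512/1024), 8ms of measured
delta_exec get accounted as 8 * 0.5 * 0.5 = 2ms of runtime, consistent with
the runtime parameter having been profiled at max frequency on the biggest
core.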
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
---
kernel/sched/deadline.c | 26 ++++++++++++++++++++++----
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 2 ++
3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 40f12aab9250..741d2fe26f88 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1151,7 +1151,8 @@ static void update_curr_dl(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec;
+ u64 delta_exec, scaled_delta_exec;
+ int cpu = cpu_of(rq);
if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1185,9 +1186,26 @@ static void update_curr_dl(struct rq *rq)
if (unlikely(dl_entity_is_special(dl_se)))
return;
- if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
- delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
- dl_se->runtime -= delta_exec;
+ /*
+ * For tasks that participate in GRUB, we implement GRUB-PA: the
+ * spare reclaimed bandwidth is used to clock down frequency.
+ *
+ * For the others, we still need to scale reservation parameters
+ * according to current frequency and CPU maximum capacity.
+ */
+ if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+ scaled_delta_exec = grub_reclaim(delta_exec,
+ rq,
+ &curr->dl);
+ } else {
+ unsigned long scale_freq = arch_scale_freq_capacity(cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+ scaled_delta_exec = cap_scale(delta_exec, scale_freq);
+ scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
+ }
+
+ dl_se->runtime -= scaled_delta_exec;
throttle:
if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 535d9409f4af..5bc3273a5c1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3091,8 +3091,6 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
return c1 + c2 + c3;
}
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
/*
* Accumulate the three separate parts of the sum; d1 the remainder
* of the last (incomplete) period, d2 the span of full periods and d3
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0022c649fabb..6d9d55e764fa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -156,6 +156,8 @@ static inline int task_has_dl_policy(struct task_struct *p)
return dl_policy(p->policy);
}
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
/*
* !! For sched_setattr_nocheck() (kernel) only !!
*
--
2.14.3
From: Juri Lelli <[email protected]>
Currently, frequency and cpu capacity scaling is only performed on
CONFIG_SMP systems (as CFS PELT signals are only present on such
systems). However, other scheduling classes want to do freq/cpu scaling,
and for !CONFIG_SMP configurations as well.
arch_scale_freq_capacity() is useful for implementing frequency scaling
even on !CONFIG_SMP platforms, so we simply move it outside the
CONFIG_SMP ifdeffery.
Even though arch_scale_cpu_capacity() is not useful on !CONFIG_SMP
platforms, we make a default implementation available for such
configurations anyway, to simplify the scheduler code doing CPU scale
invariance.
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
---
include/linux/sched/topology.h | 12 ++++++------
kernel/sched/sched.h | 13 ++++++++++---
2 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2e728d..26347741ba50 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,6 +6,12 @@
#include <linux/sched/idle.h>
+/*
+ * Increase resolution of cpu_capacity calculations
+ */
+#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
/*
* sched-domains (multiprocessor balancing) declarations:
*/
@@ -27,12 +33,6 @@
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
#define SD_NUMA 0x4000 /* cross-node balancing */
-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b64207d54a55..0022c649fabb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1707,9 +1707,6 @@ static inline int hrtick_enabled(struct rq *rq)
#endif /* CONFIG_SCHED_HRTICK */
-#ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
#ifndef arch_scale_freq_capacity
static __always_inline
unsigned long arch_scale_freq_capacity(int cpu)
@@ -1718,6 +1715,9 @@ unsigned long arch_scale_freq_capacity(int cpu)
}
#endif
+#ifdef CONFIG_SMP
+extern void sched_avg_update(struct rq *rq);
+
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1735,6 +1735,13 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
sched_avg_update(rq);
}
#else
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
--
2.14.3
From: Juri Lelli <[email protected]>
SCHED_DEADLINE tracks its active utilization signal with a per-dl_rq
variable named running_bw.
Make use of that to drive cpu frequency selection: add up the FAIR and
DEADLINE contributions to get the CPU capacity required to handle both
(while RT still selects max frequency).
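The DEADLINE contribution is obtained by converting running_bw (a
fixed-point value with BW_SHIFT fractional bits) to capacity units, i.e.
(as per the hunk below):

    dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;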
Co-authored-by: Claudio Scordino <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
include/linux/sched/cpufreq.h | 2 --
kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++----------
2 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d1ad3d825561..0b55834efd46 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -12,8 +12,6 @@
#define SCHED_CPUFREQ_DL (1U << 1)
#define SCHED_CPUFREQ_IOWAIT (1U << 2)
-#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
-
#ifdef CONFIG_CPU_FREQ
struct update_util_data {
void (*func)(struct update_util_data *data, u64 time, unsigned int flags);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2f52ec0f1539..de1ad1fffbdc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -179,12 +179,17 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long cfs_max;
+ unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
+ >> BW_SHIFT;
- cfs_max = arch_scale_cpu_capacity(NULL, cpu);
+ *max = arch_scale_cpu_capacity(NULL, cpu);
- *util = min(rq->cfs.avg.util_avg, cfs_max);
- *max = cfs_max;
+ /*
+ * Ideally we would like to set util_dl as min/guaranteed freq and
+ * util_cfs + util_dl as requested freq. However, cpufreq is not yet
+ * ready for such an interface. So, we only do the latter for now.
+ */
+ *util = min(rq->cfs.avg.util_avg + dl_util, *max);
}
static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
@@ -272,7 +277,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
busy = sugov_cpu_is_busy(sg_cpu);
- if (flags & SCHED_CPUFREQ_RT_DL) {
+ if (flags & SCHED_CPUFREQ_RT) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max, sg_cpu->cpu);
@@ -317,7 +322,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
j_sg_cpu->iowait_boost_pending = false;
continue;
}
- if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+ if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;
j_util = j_sg_cpu->util;
@@ -353,7 +358,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sg_cpu->last_update = time;
if (sugov_should_update_freq(sg_policy, time)) {
- if (flags & SCHED_CPUFREQ_RT_DL)
+ if (flags & SCHED_CPUFREQ_RT)
next_f = sg_policy->policy->cpuinfo.max_freq;
else
next_f = sugov_next_freq_shared(sg_cpu, time);
@@ -383,9 +388,9 @@ static void sugov_irq_work(struct irq_work *irq_work)
sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
/*
- * For RT and deadline tasks, the schedutil governor shoots the
- * frequency to maximum. Special care must be taken to ensure that this
- * kthread doesn't result in the same behavior.
+ * For RT tasks, the schedutil governor shoots the frequency to maximum.
+ * Special care must be taken to ensure that this kthread doesn't result
+ * in the same behavior.
*
* This is (mostly) guaranteed by the work_in_progress flag. The flag is
* updated only at the end of the sugov_work() function and before that
--
2.14.3
From: Juri Lelli <[email protected]>
The worker kthread needs to be able to change frequency on behalf of all
other threads.
Make it special, just below the STOP class.
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
---
Changes from RFCv1:
- return -EINVAL for user trying to use the new flag (Peter)
- s/SPECIAL/SUGOV/ in the flag name (several people asked for better
naming; Steve thinks SUGOV is more greppable than the alternatives)
- give the worker kthread a fake (unused) bandwidth, so that if priority
inheritance is triggered we don't BUG_ON on zero runtime
- filter out fake bandwidth when computing SCHED_DEADLINE bandwidth (fix by
Claudio Scordino)
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 15 +++++-
kernel/sched/cpufreq_schedutil.c | 19 ++++++--
kernel/sched/deadline.c | 103 +++++++++++++++++++++++++++------------
kernel/sched/sched.h | 22 ++++++++-
5 files changed, 124 insertions(+), 36 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21991d668d35..c4b2d4a5cfab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1438,6 +1438,7 @@ extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
extern int sched_setattr(struct task_struct *, const struct sched_attr *);
+extern int sched_setattr_nocheck(struct task_struct *, const struct sched_attr *);
extern struct task_struct *idle_task(int cpu);
/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 75554f366fd3..5be52a3c1c1b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4041,7 +4041,9 @@ static int __sched_setscheduler(struct task_struct *p,
}
if (attr->sched_flags &
- ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
+ ~(SCHED_FLAG_RESET_ON_FORK |
+ SCHED_FLAG_RECLAIM |
+ SCHED_FLAG_SUGOV))
return -EINVAL;
/*
@@ -4108,6 +4110,9 @@ static int __sched_setscheduler(struct task_struct *p,
}
if (user) {
+ if (attr->sched_flags & SCHED_FLAG_SUGOV)
+ return -EINVAL;
+
retval = security_task_setscheduler(p);
if (retval)
return retval;
@@ -4163,7 +4168,8 @@ static int __sched_setscheduler(struct task_struct *p,
}
#endif
#ifdef CONFIG_SMP
- if (dl_bandwidth_enabled() && dl_policy(policy)) {
+ if (dl_bandwidth_enabled() && dl_policy(policy) &&
+ !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
cpumask_t *span = rq->rd->span;
/*
@@ -4293,6 +4299,11 @@ int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
}
EXPORT_SYMBOL_GPL(sched_setattr);
+int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
+{
+ return __sched_setscheduler(p, attr, false, true);
+}
+
/**
* sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
* @p: the task in question.
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index de1ad1fffbdc..c22457868ee6 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -475,7 +475,20 @@ static void sugov_policy_free(struct sugov_policy *sg_policy)
static int sugov_kthread_create(struct sugov_policy *sg_policy)
{
struct task_struct *thread;
- struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
+ struct sched_attr attr = {
+ .size = sizeof(struct sched_attr),
+ .sched_policy = SCHED_DEADLINE,
+ .sched_flags = SCHED_FLAG_SUGOV,
+ .sched_nice = 0,
+ .sched_priority = 0,
+ /*
+ * Fake (unused) bandwidth; workaround to "fix"
+ * priority inheritance.
+ */
+ .sched_runtime = 1000000,
+ .sched_deadline = 10000000,
+ .sched_period = 10000000,
+ };
struct cpufreq_policy *policy = sg_policy->policy;
int ret;
@@ -493,10 +506,10 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
return PTR_ERR(thread);
}
- ret = sched_setscheduler_nocheck(thread, SCHED_FIFO, ¶m);
+ ret = sched_setattr_nocheck(thread, &attr);
if (ret) {
kthread_stop(thread);
- pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
return ret;
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7e4038bf9954..40f12aab9250 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -78,7 +78,7 @@ static inline int dl_bw_cpus(int i)
#endif
static inline
-void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;
@@ -91,7 +91,7 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
}
static inline
-void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;
@@ -105,7 +105,7 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
}
static inline
-void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;
@@ -115,7 +115,7 @@ void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
}
static inline
-void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;
@@ -127,16 +127,46 @@ void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
}
+static inline
+void add_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!(dl_se->flags & SCHED_FLAG_SUGOV))
+ __add_rq_bw(dl_se->dl_bw, dl_rq);
+}
+
+static inline
+void sub_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!(dl_se->flags & SCHED_FLAG_SUGOV))
+ __sub_rq_bw(dl_se->dl_bw, dl_rq);
+}
+
+static inline
+void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!(dl_se->flags & SCHED_FLAG_SUGOV))
+ __add_running_bw(dl_se->dl_bw, dl_rq);
+}
+
+static inline
+void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!(dl_se->flags & SCHED_FLAG_SUGOV))
+ __sub_running_bw(dl_se->dl_bw, dl_rq);
+}
+
void dl_change_utilization(struct task_struct *p, u64 new_bw)
{
struct rq *rq;
+ BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV);
+
if (task_on_rq_queued(p))
return;
rq = task_rq(p);
if (p->dl.dl_non_contending) {
- sub_running_bw(p->dl.dl_bw, &rq->dl);
+ sub_running_bw(&p->dl, &rq->dl);
p->dl.dl_non_contending = 0;
/*
* If the timer handler is currently running and the
@@ -148,8 +178,8 @@ void dl_change_utilization(struct task_struct *p, u64 new_bw)
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
put_task_struct(p);
}
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
- add_rq_bw(new_bw, &rq->dl);
+ __sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ __add_rq_bw(new_bw, &rq->dl);
}
/*
@@ -221,6 +251,9 @@ static void task_non_contending(struct task_struct *p)
if (dl_se->dl_runtime == 0)
return;
+ if (unlikely(dl_entity_is_special(dl_se)))
+ return;
+
WARN_ON(hrtimer_active(&dl_se->inactive_timer));
WARN_ON(dl_se->dl_non_contending);
@@ -240,12 +273,12 @@ static void task_non_contending(struct task_struct *p)
*/
if (zerolag_time < 0) {
if (dl_task(p))
- sub_running_bw(dl_se->dl_bw, dl_rq);
+ sub_running_bw(dl_se, dl_rq);
if (!dl_task(p) || p->state == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
if (p->state == TASK_DEAD)
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
raw_spin_lock(&dl_b->lock);
__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
__dl_clear_params(p);
@@ -272,7 +305,7 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
return;
if (flags & ENQUEUE_MIGRATED)
- add_rq_bw(dl_se->dl_bw, dl_rq);
+ add_rq_bw(dl_se, dl_rq);
if (dl_se->dl_non_contending) {
dl_se->dl_non_contending = 0;
@@ -293,7 +326,7 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
* when the "inactive timer" fired).
* So, add it back.
*/
- add_running_bw(dl_se->dl_bw, dl_rq);
+ add_running_bw(dl_se, dl_rq);
}
}
@@ -1149,6 +1182,9 @@ static void update_curr_dl(struct rq *rq)
sched_rt_avg_update(rq, delta_exec);
+ if (unlikely(dl_entity_is_special(dl_se)))
+ return;
+
if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
dl_se->runtime -= delta_exec;
@@ -1205,8 +1241,8 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
- sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
- sub_rq_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+ sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
+ sub_rq_bw(&p->dl, dl_rq_of_se(&p->dl));
dl_se->dl_non_contending = 0;
}
@@ -1223,7 +1259,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
sched_clock_tick();
update_rq_clock(rq);
- sub_running_bw(dl_se->dl_bw, &rq->dl);
+ sub_running_bw(dl_se, &rq->dl);
dl_se->dl_non_contending = 0;
unlock:
task_rq_unlock(rq, p, &rf);
@@ -1417,8 +1453,8 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
dl_check_constrained_dl(&p->dl);
if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
- add_rq_bw(p->dl.dl_bw, &rq->dl);
- add_running_bw(p->dl.dl_bw, &rq->dl);
+ add_rq_bw(&p->dl, &rq->dl);
+ add_running_bw(&p->dl, &rq->dl);
}
/*
@@ -1458,8 +1494,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
__dequeue_task_dl(rq, p, flags);
if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
- sub_running_bw(p->dl.dl_bw, &rq->dl);
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_running_bw(&p->dl, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
}
/*
@@ -1565,7 +1601,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
*/
raw_spin_lock(&rq->lock);
if (p->dl.dl_non_contending) {
- sub_running_bw(p->dl.dl_bw, &rq->dl);
+ sub_running_bw(&p->dl, &rq->dl);
p->dl.dl_non_contending = 0;
/*
* If the timer handler is currently running and the
@@ -1577,7 +1613,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
put_task_struct(p);
}
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
raw_spin_unlock(&rq->lock);
}
@@ -2020,11 +2056,11 @@ static int push_dl_task(struct rq *rq)
}
deactivate_task(rq, next_task, 0);
- sub_running_bw(next_task->dl.dl_bw, &rq->dl);
- sub_rq_bw(next_task->dl.dl_bw, &rq->dl);
+ sub_running_bw(&next_task->dl, &rq->dl);
+ sub_rq_bw(&next_task->dl, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
- add_rq_bw(next_task->dl.dl_bw, &later_rq->dl);
- add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
+ add_rq_bw(&next_task->dl, &later_rq->dl);
+ add_running_bw(&next_task->dl, &later_rq->dl);
activate_task(later_rq, next_task, 0);
ret = 1;
@@ -2112,11 +2148,11 @@ static void pull_dl_task(struct rq *this_rq)
resched = true;
deactivate_task(src_rq, p, 0);
- sub_running_bw(p->dl.dl_bw, &src_rq->dl);
- sub_rq_bw(p->dl.dl_bw, &src_rq->dl);
+ sub_running_bw(&p->dl, &src_rq->dl);
+ sub_rq_bw(&p->dl, &src_rq->dl);
set_task_cpu(p, this_cpu);
- add_rq_bw(p->dl.dl_bw, &this_rq->dl);
- add_running_bw(p->dl.dl_bw, &this_rq->dl);
+ add_rq_bw(&p->dl, &this_rq->dl);
+ add_running_bw(&p->dl, &this_rq->dl);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
@@ -2225,7 +2261,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
task_non_contending(p);
if (!task_on_rq_queued(p))
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
/*
* We cannot use inactive_task_timer() to invoke sub_running_bw()
@@ -2257,7 +2293,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
/* If p is not queued we will update its parameters at next wakeup. */
if (!task_on_rq_queued(p)) {
- add_rq_bw(p->dl.dl_bw, &rq->dl);
+ add_rq_bw(&p->dl, &rq->dl);
return;
}
@@ -2436,6 +2472,9 @@ int sched_dl_overflow(struct task_struct *p, int policy,
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
int cpus, err = -1;
+ if (attr->sched_flags & SCHED_FLAG_SUGOV)
+ return 0;
+
/* !deadline task may carry old deadline bandwidth */
if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
return 0;
@@ -2522,6 +2561,10 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
*/
bool __checkparam_dl(const struct sched_attr *attr)
{
+ /* special dl tasks don't actually use any parameter */
+ if (attr->sched_flags & SCHED_FLAG_SUGOV)
+ return true;
+
/* deadline != 0 */
if (attr->sched_deadline == 0)
return false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a1730e39cbc6..280b421a82e8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -156,13 +156,33 @@ static inline int task_has_dl_policy(struct task_struct *p)
return dl_policy(p->policy);
}
+/*
+ * !! For sched_setattr_nocheck() (kernel) only !!
+ *
+ * This is actually gross. :(
+ *
+ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
+ * tasks, but still be able to sleep. We need this on platforms that cannot
+ * atomically change clock frequency. Remove once fast switching will be
+ * available on such platforms.
+ *
+ * SUGOV stands for SchedUtil GOVernor.
+ */
+#define SCHED_FLAG_SUGOV 0x10000000
+
+static inline int dl_entity_is_special(struct sched_dl_entity *dl_se)
+{
+ return dl_se->flags & SCHED_FLAG_SUGOV;
+}
+
/*
* Tells if entity @a should preempt entity @b.
*/
static inline bool
dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
{
- return dl_time_before(a->deadline, b->deadline);
+ return dl_entity_is_special(a) ||
+ dl_time_before(a->deadline, b->deadline);
}
/*
--
2.14.3
Hi Juri,
On 04-Dec 11:23, Juri Lelli wrote:
[...]
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index de1ad1fffbdc..c22457868ee6 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -475,7 +475,20 @@ static void sugov_policy_free(struct sugov_policy *sg_policy)
> static int sugov_kthread_create(struct sugov_policy *sg_policy)
> {
> struct task_struct *thread;
> - struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
> + struct sched_attr attr = {
> + .size = sizeof(struct sched_attr),
> + .sched_policy = SCHED_DEADLINE,
> + .sched_flags = SCHED_FLAG_SUGOV,
> + .sched_nice = 0,
> + .sched_priority = 0,
> + /*
> + * Fake (unused) bandwidth; workaround to "fix"
> + * priority inheritance.
> + */
> + .sched_runtime = 1000000,
> + .sched_deadline = 10000000,
> + .sched_period = 10000000,
Why not assign a minimal (but still CBS-accounted) bandwidth to
this DL task?
I understand that it should be a minimal task whose bandwidth
requirement is likely in the "noise".
Is there any other more specific reason?
On the other hand, the advantage of having a minimal bandwidth would
be to remove most of the following "special" code on bandwidth
accounting, while the flag could still be used to favour this DL task
over others. Wouldn't it?
> + };
> struct cpufreq_policy *policy = sg_policy->policy;
> int ret;
>
[...]
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 7e4038bf9954..40f12aab9250 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -78,7 +78,7 @@ static inline int dl_bw_cpus(int i)
> #endif
>
> static inline
> -void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> +void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> {
> u64 old = dl_rq->running_bw;
>
> @@ -91,7 +91,7 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> }
>
> static inline
> -void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> +void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> {
> u64 old = dl_rq->running_bw;
>
> @@ -105,7 +105,7 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> }
>
> static inline
> -void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
> +void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
> {
> u64 old = dl_rq->this_bw;
>
> @@ -115,7 +115,7 @@ void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
> }
>
> static inline
> -void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
> +void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
> {
> u64 old = dl_rq->this_bw;
>
> @@ -127,16 +127,46 @@ void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
> SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> }
>
> +static inline
> +void add_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + if (!(dl_se->flags & SCHED_FLAG_SUGOV))
> + __add_rq_bw(dl_se->dl_bw, dl_rq);
What about using, for all these wrappers, the same utility function you
already use in this source file? I.e.
if (unlikely(dl_entity_is_special(dl_se)))
return;
__add_rq_bw(dl_se->dl_bw, dl_rq);
A further optimization based on that is described hereafter.
> +}
> +
> +static inline
> +void sub_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + if (!(dl_se->flags & SCHED_FLAG_SUGOV))
> + __sub_rq_bw(dl_se->dl_bw, dl_rq);
> +}
> +
> +static inline
> +void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + if (!(dl_se->flags & SCHED_FLAG_SUGOV))
> + __add_running_bw(dl_se->dl_bw, dl_rq);
> +}
> +
> +static inline
> +void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + if (!(dl_se->flags & SCHED_FLAG_SUGOV))
> + __sub_running_bw(dl_se->dl_bw, dl_rq);
> +}
> +
> void dl_change_utilization(struct task_struct *p, u64 new_bw)
> {
> struct rq *rq;
>
> + BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV);
> +
> if (task_on_rq_queued(p))
> return;
>
> rq = task_rq(p);
> if (p->dl.dl_non_contending) {
> - sub_running_bw(p->dl.dl_bw, &rq->dl);
> + sub_running_bw(&p->dl, &rq->dl);
> p->dl.dl_non_contending = 0;
> /*
> * If the timer handler is currently running and the
> @@ -148,8 +178,8 @@ void dl_change_utilization(struct task_struct *p, u64 new_bw)
> if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> put_task_struct(p);
> }
> - sub_rq_bw(p->dl.dl_bw, &rq->dl);
> - add_rq_bw(new_bw, &rq->dl);
> + __sub_rq_bw(p->dl.dl_bw, &rq->dl);
> + __add_rq_bw(new_bw, &rq->dl);
> }
>
> /*
> @@ -221,6 +251,9 @@ static void task_non_contending(struct task_struct *p)
> if (dl_se->dl_runtime == 0)
> return;
>
> + if (unlikely(dl_entity_is_special(dl_se)))
> + return;
> +
> WARN_ON(hrtimer_active(&dl_se->inactive_timer));
> WARN_ON(dl_se->dl_non_contending);
>
> @@ -240,12 +273,12 @@ static void task_non_contending(struct task_struct *p)
> */
> if (zerolag_time < 0) {
> if (dl_task(p))
> - sub_running_bw(dl_se->dl_bw, dl_rq);
> + sub_running_bw(dl_se, dl_rq);
> if (!dl_task(p) || p->state == TASK_DEAD) {
> struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
>
> if (p->state == TASK_DEAD)
> - sub_rq_bw(p->dl.dl_bw, &rq->dl);
> + sub_rq_bw(&p->dl, &rq->dl);
> raw_spin_lock(&dl_b->lock);
> __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
> __dl_clear_params(p);
> @@ -272,7 +305,7 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
> return;
>
> if (flags & ENQUEUE_MIGRATED)
> - add_rq_bw(dl_se->dl_bw, dl_rq);
> + add_rq_bw(dl_se, dl_rq);
>
> if (dl_se->dl_non_contending) {
> dl_se->dl_non_contending = 0;
> @@ -293,7 +326,7 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
> * when the "inactive timer" fired).
> * So, add it back.
> */
> - add_running_bw(dl_se->dl_bw, dl_rq);
> + add_running_bw(dl_se, dl_rq);
> }
> }
>
> @@ -1149,6 +1182,9 @@ static void update_curr_dl(struct rq *rq)
>
> sched_rt_avg_update(rq, delta_exec);
>
> + if (unlikely(dl_entity_is_special(dl_se)))
> + return;
> +
> if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
> dl_se->runtime -= delta_exec;
> @@ -1205,8 +1241,8 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
>
> if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
> - sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
> - sub_rq_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
> + sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
> + sub_rq_bw(&p->dl, dl_rq_of_se(&p->dl));
> dl_se->dl_non_contending = 0;
> }
>
> @@ -1223,7 +1259,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> sched_clock_tick();
> update_rq_clock(rq);
>
> - sub_running_bw(dl_se->dl_bw, &rq->dl);
> + sub_running_bw(dl_se, &rq->dl);
> dl_se->dl_non_contending = 0;
> unlock:
> task_rq_unlock(rq, p, &rf);
> @@ -1417,8 +1453,8 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> dl_check_constrained_dl(&p->dl);
>
> if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
> - add_rq_bw(p->dl.dl_bw, &rq->dl);
> - add_running_bw(p->dl.dl_bw, &rq->dl);
> + add_rq_bw(&p->dl, &rq->dl);
> + add_running_bw(&p->dl, &rq->dl);
> }
>
> /*
> @@ -1458,8 +1494,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> __dequeue_task_dl(rq, p, flags);
>
> if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
> - sub_running_bw(p->dl.dl_bw, &rq->dl);
> - sub_rq_bw(p->dl.dl_bw, &rq->dl);
> + sub_running_bw(&p->dl, &rq->dl);
> + sub_rq_bw(&p->dl, &rq->dl);
> }
>
> /*
> @@ -1565,7 +1601,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
> */
> raw_spin_lock(&rq->lock);
> if (p->dl.dl_non_contending) {
> - sub_running_bw(p->dl.dl_bw, &rq->dl);
> + sub_running_bw(&p->dl, &rq->dl);
> p->dl.dl_non_contending = 0;
> /*
> * If the timer handler is currently running and the
> @@ -1577,7 +1613,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
> if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> put_task_struct(p);
> }
> - sub_rq_bw(p->dl.dl_bw, &rq->dl);
> + sub_rq_bw(&p->dl, &rq->dl);
> raw_spin_unlock(&rq->lock);
> }
>
> @@ -2020,11 +2056,11 @@ static int push_dl_task(struct rq *rq)
> }
>
> deactivate_task(rq, next_task, 0);
> - sub_running_bw(next_task->dl.dl_bw, &rq->dl);
> - sub_rq_bw(next_task->dl.dl_bw, &rq->dl);
> + sub_running_bw(&next_task->dl, &rq->dl);
> + sub_rq_bw(&next_task->dl, &rq->dl);
> set_task_cpu(next_task, later_rq->cpu);
> - add_rq_bw(next_task->dl.dl_bw, &later_rq->dl);
> - add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
> + add_rq_bw(&next_task->dl, &later_rq->dl);
> + add_running_bw(&next_task->dl, &later_rq->dl);
> activate_task(later_rq, next_task, 0);
> ret = 1;
>
> @@ -2112,11 +2148,11 @@ static void pull_dl_task(struct rq *this_rq)
> resched = true;
>
> deactivate_task(src_rq, p, 0);
> - sub_running_bw(p->dl.dl_bw, &src_rq->dl);
> - sub_rq_bw(p->dl.dl_bw, &src_rq->dl);
> + sub_running_bw(&p->dl, &src_rq->dl);
> + sub_rq_bw(&p->dl, &src_rq->dl);
> set_task_cpu(p, this_cpu);
> - add_rq_bw(p->dl.dl_bw, &this_rq->dl);
> - add_running_bw(p->dl.dl_bw, &this_rq->dl);
> + add_rq_bw(&p->dl, &this_rq->dl);
> + add_running_bw(&p->dl, &this_rq->dl);
> activate_task(this_rq, p, 0);
> dmin = p->dl.deadline;
>
> @@ -2225,7 +2261,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> task_non_contending(p);
>
> if (!task_on_rq_queued(p))
> - sub_rq_bw(p->dl.dl_bw, &rq->dl);
> + sub_rq_bw(&p->dl, &rq->dl);
>
> /*
> * We cannot use inactive_task_timer() to invoke sub_running_bw()
> @@ -2257,7 +2293,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>
> /* If p is not queued we will update its parameters at next wakeup. */
> if (!task_on_rq_queued(p)) {
> - add_rq_bw(p->dl.dl_bw, &rq->dl);
> + add_rq_bw(&p->dl, &rq->dl);
>
> return;
> }
> @@ -2436,6 +2472,9 @@ int sched_dl_overflow(struct task_struct *p, int policy,
> u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> int cpus, err = -1;
>
> + if (attr->sched_flags & SCHED_FLAG_SUGOV)
> + return 0;
> +
Same note on using:
if (unlikely(dl_entity_is_special(dl_se)))
here and in the next chunk too.
> /* !deadline task may carry old deadline bandwidth */
> if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
> return 0;
> @@ -2522,6 +2561,10 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
> */
> bool __checkparam_dl(const struct sched_attr *attr)
> {
> + /* special dl tasks don't actually use any parameter */
> + if (attr->sched_flags & SCHED_FLAG_SUGOV)
> + return true;
> +
> /* deadline != 0 */
> if (attr->sched_deadline == 0)
> return false;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a1730e39cbc6..280b421a82e8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -156,13 +156,33 @@ static inline int task_has_dl_policy(struct task_struct *p)
> return dl_policy(p->policy);
> }
>
> +/*
> + * !! For sched_setattr_nocheck() (kernel) only !!
> + *
> + * This is actually gross. :(
> + *
> + * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
> + * tasks, but still be able to sleep. We need this on platforms that cannot
> + * atomically change clock frequency. Remove once fast switching will be
> + * available on such platforms.
> + *
> + * SUGOV stands for SchedUtil GOVernor.
> + */
> +#define SCHED_FLAG_SUGOV 0x10000000
> +
> +static inline int dl_entity_is_special(struct sched_dl_entity *dl_se)
This should probably return a bool...
> +{
... and maybe it can optimize some builds via constant propagation if we add:
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> + return dl_se->flags & SCHED_FLAG_SUGOV;
#else
return false;
#endif
> +}
> +
> /*
> * Tells if entity @a should preempt entity @b.
> */
> static inline bool
> dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
> {
> - return dl_time_before(a->deadline, b->deadline);
> + return dl_entity_is_special(a) ||
> + dl_time_before(a->deadline, b->deadline);
Given that being special is less likely, perhaps better to have:
return dl_time_before(a->deadline, b->deadline) ||
dl_entity_is_special(a);
> }
>
> /*
> --
> 2.14.3
>
--
#include <best/regards.h>
Patrick Bellasi
Hi,
On 05/12/17 11:55, Patrick Bellasi wrote:
> Hi Juri,
>
> On 04-Dec 11:23, Juri Lelli wrote:
> [...]
>
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index de1ad1fffbdc..c22457868ee6 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -475,7 +475,20 @@ static void sugov_policy_free(struct sugov_policy *sg_policy)
> > static int sugov_kthread_create(struct sugov_policy *sg_policy)
> > {
> > struct task_struct *thread;
> > - struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
> > + struct sched_attr attr = {
> > + .size = sizeof(struct sched_attr),
> > + .sched_policy = SCHED_DEADLINE,
> > + .sched_flags = SCHED_FLAG_SUGOV,
> > + .sched_nice = 0,
> > + .sched_priority = 0,
> > + /*
> > + * Fake (unused) bandwidth; workaround to "fix"
> > + * priority inheritance.
> > + */
> > + .sched_runtime = 1000000,
> > + .sched_deadline = 10000000,
> > + .sched_period = 10000000,
>
> Why not assign a minimal (but still CBS-accounted) bandwidth to
> this DL task?
>
> I understand that it should be a minimal task whose bandwidth
> requirement is likely in the "noise".
> Is there any other more specific reason?
>
At least two, IMHO.
1. Throttling: assigning any sort of bandwidth is difficult (every
platform is different), and if it is too small the task responsible for
changing frequency might be throttled and delayed; if it is too big you
are wasting resources.
2. Affinity: some platforms affine these kthreads to related_cpus, and it
is something you might want to do to save power anyway. The problem with DL
is that (at least currently) you are not free to change a task's
affinity mask without creating an exclusive cpuset.
[...]
> > +static inline
> > +void add_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > +{
> > + if (!(dl_se->flags & SCHED_FLAG_SUGOV))
> > + __add_rq_bw(dl_se->dl_bw, dl_rq);
>
> What about using, for all these wrappers, the same utility function you
> already use in this source file? I.e.
>
> if (unlikely(dl_entity_is_special(dl_se)))
> return;
> __add_rq_bw(dl_se->dl_bw, dl_rq);
Should work. I'll try to do the change.
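Something along these lines, I guess (untested sketch):

static inline
void add_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
	if (unlikely(dl_entity_is_special(dl_se)))
		return;
	__add_rq_bw(dl_se->dl_bw, dl_rq);
}

(and similarly for the other three wrappers)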
[...]
> > @@ -2436,6 +2472,9 @@ int sched_dl_overflow(struct task_struct *p, int policy,
> > u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> > int cpus, err = -1;
> >
> > + if (attr->sched_flags & SCHED_FLAG_SUGOV)
> > + return 0;
> > +
>
> Same note on using:
>
> if (unlikely(dl_entity_is_special(dl_se)))
>
> here and in the next chunk too.
OK.
>
> > /* !deadline task may carry old deadline bandwidth */
> > if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
> > return 0;
> > @@ -2522,6 +2561,10 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
> > */
> > bool __checkparam_dl(const struct sched_attr *attr)
> > {
> > + /* special dl tasks don't actually use any parameter */
> > + if (attr->sched_flags & SCHED_FLAG_SUGOV)
> > + return true;
> > +
> > /* deadline != 0 */
> > if (attr->sched_deadline == 0)
> > return false;
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index a1730e39cbc6..280b421a82e8 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -156,13 +156,33 @@ static inline int task_has_dl_policy(struct task_struct *p)
> > return dl_policy(p->policy);
> > }
> >
> > +/*
> > + * !! For sched_setattr_nocheck() (kernel) only !!
> > + *
> > + * This is actually gross. :(
> > + *
> > + * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
> > + * tasks, but still be able to sleep. We need this on platforms that cannot
> > + * atomically change clock frequency. Remove once fast switching will be
> > + * available on such platforms.
> > + *
> > + * SUGOV stands for SchedUtil GOVernor.
> > + */
> > +#define SCHED_FLAG_SUGOV 0x10000000
> > +
> > +static inline int dl_entity_is_special(struct sched_dl_entity *dl_se)
>
> This should probably return a bool...
>
>
> > +{
>
> ... and maybe it can optimize some builds via constant propagation if we add:
>
> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > + return dl_se->flags & SCHED_FLAG_SUGOV;
> #else
> return false;
> #endif
Sure.
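So, something like (sketch):

static inline bool dl_entity_is_special(struct sched_dl_entity *dl_se)
{
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
	return dl_se->flags & SCHED_FLAG_SUGOV;
#else
	return false;
#endif
}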
>
> > +}
> > +
> > /*
> > * Tells if entity @a should preempt entity @b.
> > */
> > static inline bool
> > dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
> > {
> > - return dl_time_before(a->deadline, b->deadline);
> > + return dl_entity_is_special(a) ||
> > + dl_time_before(a->deadline, b->deadline);
>
> Given that being special is less likely, perhaps better to have:
>
> return dl_time_before(a->deadline, b->deadline) ||
> dl_entity_is_special(a);
OK.
Thanks for the review!
Best,
- Juri
Hi Juri,
On 04-Dec 11:23, Juri Lelli wrote:
> From: Juri Lelli <[email protected]>
>
> SCHED_DEADLINE tracks active utilization signal with a per dl_rq
> variable named running_bw.
>
> Make use of that to drive cpu frequency selection: add up FAIR and
> DEADLINE contribution to get the required CPU capacity to handle both
> requirements (while RT still selects max frequency).
>
> Co-authored-by: Claudio Scordino <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> Acked-by: Viresh Kumar <[email protected]>
> ---
> include/linux/sched/cpufreq.h | 2 --
> kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++----------
> 2 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d1ad3d825561..0b55834efd46 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -12,8 +12,6 @@
> #define SCHED_CPUFREQ_DL (1U << 1)
> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
>
> -#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> -
> #ifdef CONFIG_CPU_FREQ
> struct update_util_data {
> void (*func)(struct update_util_data *data, u64 time, unsigned int flags);
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 2f52ec0f1539..de1ad1fffbdc 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -179,12 +179,17 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> - unsigned long cfs_max;
> + unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> + >> BW_SHIFT;
What about using a pair of getter methods (e.g. cpu_util_{cfs,dl}) to
be defined in kernel/sched/sched.h?
This would help to hide class-specific signal mangling from cpufreq.
And here we can have something "more abstract" like:
unsigned long util_cfs = cpu_util_cfs(rq);
unsigned long util_dl = cpu_util_dl(rq);
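Where the getters could be something like (just a sketch, reusing the
expressions you already have in sugov_get_util):

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
	return rq->cfs.avg.util_avg;
}

static inline unsigned long cpu_util_dl(struct rq *rq)
{
	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
}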
>
> - cfs_max = arch_scale_cpu_capacity(NULL, cpu);
> + *max = arch_scale_cpu_capacity(NULL, cpu);
>
> - *util = min(rq->cfs.avg.util_avg, cfs_max);
> - *max = cfs_max;
> + /*
> + * Ideally we would like to set util_dl as min/guaranteed freq and
> + * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> + * ready for such an interface. So, we only do the latter for now.
> + */
Maybe I don't completely get the above comment, but to me it is not
really required.
When you say that "util_dl" should be set to a min/guaranteed freq
are you not actually talking about a DL implementation detail?
From the cpufreq standpoint instead, we should always set a capacity
which can accommodate util_dl + util_cfs.
We don't care about the meaning of util_dl and we should always assume
(by default) that the signal is properly updated by the scheduling
class... which unfortunately does not always happen for CFS.
> + *util = min(rq->cfs.avg.util_avg + dl_util, *max);
With the above proposal, here also we will have:
*util = min(util_cfs + util_dl, *max);
> }
>
> static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> @@ -272,7 +277,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>
> busy = sugov_cpu_is_busy(sg_cpu);
>
> - if (flags & SCHED_CPUFREQ_RT_DL) {
> + if (flags & SCHED_CPUFREQ_RT) {
> next_f = policy->cpuinfo.max_freq;
> } else {
> sugov_get_util(&util, &max, sg_cpu->cpu);
> @@ -317,7 +322,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> j_sg_cpu->iowait_boost_pending = false;
> continue;
> }
> - if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
> + if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
> return policy->cpuinfo.max_freq;
>
> j_util = j_sg_cpu->util;
> @@ -353,7 +358,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> sg_cpu->last_update = time;
>
> if (sugov_should_update_freq(sg_policy, time)) {
> - if (flags & SCHED_CPUFREQ_RT_DL)
> + if (flags & SCHED_CPUFREQ_RT)
> next_f = sg_policy->policy->cpuinfo.max_freq;
> else
> next_f = sugov_next_freq_shared(sg_cpu, time);
> @@ -383,9 +388,9 @@ static void sugov_irq_work(struct irq_work *irq_work)
> sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
>
> /*
> - * For RT and deadline tasks, the schedutil governor shoots the
> - * frequency to maximum. Special care must be taken to ensure that this
> - * kthread doesn't result in the same behavior.
> + * For RT tasks, the schedutil governor shoots the frequency to maximum.
> + * Special care must be taken to ensure that this kthread doesn't result
> + * in the same behavior.
> *
> * This is (mostly) guaranteed by the work_in_progress flag. The flag is
> * updated only at the end of the sugov_work() function and before that
> --
> 2.14.3
>
--
#include <best/regards.h>
Patrick Bellasi
Hi Juri,
On 04-Dec 11:23, Juri Lelli wrote:
> From: Juri Lelli <[email protected]>
>
> To be able to treat utilization signals of different scheduling classes
> in different ways (e.g., CFS signal might be stale while DEADLINE signal
> is never stale by design) we need to split sugov_cpu::util signal in two:
> util_cfs and util_dl.
>
> This patch does that by also changing sugov_get_util() parameter list.
> After this change, aggregation of the different signals has to be performed
> by sugov_get_util() users (so that they can decide what to do with the
> different signals).
Didn't try it myself, but to me it would be nice to have this patch
squashed with the first one of this series. After all, looking at this
one it seems that [RFC PATCH 1/8] is just adding util_dl but it's not
really using it the proper way.
Here instead is where you better introduce two separate signals,
tracked by struct sugov_cpu, and properly aggregate them.
But perhaps that's just me being picky ;-)
> Suggested-by: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> Cc: Claudio Scordino <[email protected]>
> Acked-by: Viresh Kumar <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 32 ++++++++++++++++++--------------
> 1 file changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index c22457868ee6..a3072f24dc16 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -60,7 +60,8 @@ struct sugov_cpu {
> u64 last_update;
>
> /* The fields below are only needed when sharing a policy. */
> - unsigned long util;
> + unsigned long util_cfs;
> + unsigned long util_dl;
> unsigned long max;
> unsigned int flags;
>
> @@ -176,20 +177,25 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> return cpufreq_driver_resolve_freq(policy, freq);
> }
>
> -static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
> +static void sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> - struct rq *rq = cpu_rq(cpu);
> - unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> - >> BW_SHIFT;
> + struct rq *rq = cpu_rq(sg_cpu->cpu);
>
> - *max = arch_scale_cpu_capacity(NULL, cpu);
> + sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> + sg_cpu->util_cfs = rq->cfs.avg.util_avg;
> + sg_cpu->util_dl = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> + >> BW_SHIFT;
> +}
> +
> +static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> +{
>
> /*
> * Ideally we would like to set util_dl as min/guaranteed freq and
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> - *util = min(rq->cfs.avg.util_avg + dl_util, *max);
> + return min(sg_cpu->util_cfs + sg_cpu->util_dl, sg_cpu->max);
> }
>
> static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> @@ -280,7 +286,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> if (flags & SCHED_CPUFREQ_RT) {
> next_f = policy->cpuinfo.max_freq;
> } else {
> - sugov_get_util(&util, &max, sg_cpu->cpu);
> + sugov_get_util(sg_cpu);
> + max = sg_cpu->max;
> + util = sugov_aggregate_util(sg_cpu);
> sugov_iowait_boost(sg_cpu, &util, &max);
> next_f = get_next_freq(sg_policy, util, max);
> /*
> @@ -325,8 +333,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
> return policy->cpuinfo.max_freq;
>
> - j_util = j_sg_cpu->util;
> j_max = j_sg_cpu->max;
> + j_util = sugov_aggregate_util(j_sg_cpu);
> if (j_util * max > j_max * util) {
> util = j_util;
> max = j_max;
> @@ -343,15 +351,11 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> {
> struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> - unsigned long util, max;
> unsigned int next_f;
>
> - sugov_get_util(&util, &max, sg_cpu->cpu);
> -
> raw_spin_lock(&sg_policy->update_lock);
>
> - sg_cpu->util = util;
> - sg_cpu->max = max;
> + sugov_get_util(sg_cpu);
> sg_cpu->flags = flags;
>
> sugov_set_iowait_boost(sg_cpu, time, flags);
> --
> 2.14.3
>
--
#include <best/regards.h>
Patrick Bellasi
Hi,
On 05/12/17 15:09, Patrick Bellasi wrote:
> Hi Juri,
>
[...]
> > static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
> > {
> > struct rq *rq = cpu_rq(cpu);
> > - unsigned long cfs_max;
> > + unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> > + >> BW_SHIFT;
>
> What about using a pair of getter methods (e.g. cpu_util_{cfs,dl}) to
> be defined in kernel/sched/sched.h?
>
> This would help to hide class-specific signals mangling from cpufreq.
> And here we can have something "more abstract" like:
>
> unsigned long util_cfs = cpu_util_cfs(rq);
> unsigned long util_dl = cpu_util_dl(rq);
LGTM. I'll cook something for next spin.
>
> >
> > - cfs_max = arch_scale_cpu_capacity(NULL, cpu);
> > + *max = arch_scale_cpu_capacity(NULL, cpu);
> >
> > - *util = min(rq->cfs.avg.util_avg, cfs_max);
> > - *max = cfs_max;
> > + /*
> > + * Ideally we would like to set util_dl as min/guaranteed freq and
> > + * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > + * ready for such an interface. So, we only do the latter for now.
> > + */
>
> Maybe I don't completely get the above comment, but to me it is not
> really required.
>
> When you say that "util_dl" should be set to a min/guaranteed freq
> are you not actually talking about a DL implementation detail?
>
> From the cpufreq standpoint instead, we should always set a capacity
> which can accommodate util_dl + util_cfs.
It's more for platforms which support such a combination of values for
frequency requests (CPPC like, AFAIU). The idea being that util_dl is
what the system has to always guarantee, but it could go up to the sum
if feasible.
>
> We don't care about the meaning of util_dl and we should always assume
> (by default) that the signal is properly updated by the scheduling
> class... which unfortunately does not always happen for CFS.
>
>
> > + *util = min(rq->cfs.avg.util_avg + dl_util, *max);
>
> With the above proposal, here also we will have:
>
> *util = min(util_cfs + util_dl, *max);
Looks cleaner.
Thanks,
- Juri
Hi,
On 05/12/17 15:17, Patrick Bellasi wrote:
> Hi Juri,
>
> On 04-Dec 11:23, Juri Lelli wrote:
> > From: Juri Lelli <[email protected]>
> >
> > To be able to treat utilization signals of different scheduling classes
> > in different ways (e.g., CFS signal might be stale while DEADLINE signal
> > is never stale by design) we need to split sugov_cpu::util signal in two:
> > util_cfs and util_dl.
> >
> > This patch does that by also changing sugov_get_util() parameter list.
> > After this change, aggregation of the different signals has to be performed
> > by sugov_get_util() users (so that they can decide what to do with the
> > different signals).
>
> Didn't try it myself, but to me it would be nice to have this patch
> squashed with the first one of this series. After all, looking at this
> one it seems that [RFC PATCH 1/8] is just adding util_dl but it's not
> really using it the proper way.
>
> Here instead is where you better introduce two separate signals,
> tracked by struct sugov_cpu, and properly aggregate them.
>
> But perhaps that's just me being picky ;-)
>
Sure. It looked too invasive as a single patch to me. Also, I was trying
to follow the "one change one patch" rule. So, I'd keep them separate.
What do others think?
Best,
- Juri
On 05-Dec 16:24, Juri Lelli wrote:
> On 05/12/17 15:09, Patrick Bellasi wrote:
> > > + /*
> > > + * Ideally we would like to set util_dl as min/guaranteed freq and
> > > + * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > + * ready for such an interface. So, we only do the latter for now.
> > > + */
> >
> > Maybe I don't completely get the above comment, but to me it is not
> > really required.
> >
> > When you say that "util_dl" should be set to a min/guaranteed freq
> > are you not actually talking about a DL implementation detail?
> >
> > From the cpufreq standpoint instead, we should always set a capacity
> > which can accommodate util_dl + util_cfs.
>
> It's more for platforms which support such a combination of values for
> frequency requests (CPPC like, AFAIU). The idea being that util_dl is
> what the system has to always guarantee, but it could go up to the sum
> if feasible.
I see, you mean for systems where you can specify both values at the
same time, i.e.
- please give me util_dl...
- ... but if you have more beer, I would like util_dl + util_cfs
However, I'm not an expert, on those systems can we really set a
minimum guaranteed performance level?
I was rather under the impression that the "minimum guaranteed" is
something we can only read from "firmware", while whatever we ask for is
never "guaranteed".
> > We don't care about the meaning of util_dl and we should always assume
> > (by default) that the signal is properly updated by the scheduling
> > class... which unfortunately does not always happen for CFS.
> >
--
#include <best/regards.h>
Patrick Bellasi
On 05/12/17 16:34, Patrick Bellasi wrote:
> On 05-Dec 16:24, Juri Lelli wrote:
> > On 05/12/17 15:09, Patrick Bellasi wrote:
> > > > + /*
> > > > + * Ideally we would like to set util_dl as min/guaranteed freq and
> > > > + * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > > + * ready for such an interface. So, we only do the latter for now.
> > > > + */
> > >
> > > Maybe I don't completely get the above comment, but to me it is not
> > > really required.
> > >
> > > When you say that "util_dl" should be set to a min/guaranteed freq
> > > are you not actually talking about a DL implementation detail?
> > >
> > > From the cpufreq standpoint instead, we should always set a capacity
> > > which can accommodate util_dl + util_cfs.
> >
> > It's more for platforms which support such a combination of values for
> > frequency requests (CPPC like, AFAIU). The idea being that util_dl is
> > what the system has to always guarantee, but it could go up to the sum
> > if feasible.
>
> I see, you mean for systems where you can specify both values at the
> same time, i.e.
> - please give me util_dl...
> - ... but if you have more beer, I would like util_dl + util_cfs
>
> However, I'm not an expert, on those systems can we really set a
> minimum guaranteed performance level?
This is my current understanding. Never played with such platforms
myself. :)
Best,
- Juri
On Tue, Dec 05, 2017 at 04:34:10PM +0000, Patrick Bellasi wrote:
> On 05-Dec 16:24, Juri Lelli wrote:
> However, I'm not an expert, on those systems can we really set a
> minimum guaranteed performance level?
If you look at the Intel SDM, Volume 3, 14.4 Hardware-Controlled
Performance States (HWP), which is the Intel implementation of ACPI
CPPC, you'll see that IA32_HWP_CAPABILITIES has a Guaranteed_Performance
field and that we will receive notifications (Interrupts) upon changes
to this frequency.
If you then look at IA32_HWP_REQUEST, you'll see a Minimum_Performance
field, which we can raise up-to the guaranteed level, and would/should
contain the DEADLINE stuff.
HWP_REQUEST also includes a Desired_Performance field, which is where we
want to be for DL+CFS.
Trouble is that cpufreq doesn't yet support the various CPPC fields. So
we have this comment here at the input side stating what we'd want to do
once cpufreq itself grows the interface bits.
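To make that concrete, here is a tiny standalone sketch (not kernel code,
all numbers invented) of how the two signals would map onto such a
CPPC/HWP-style request once cpufreq grows the interface; today schedutil
can only ask for the single aggregated value. The field names refer to
IA32_HWP_REQUEST as described above.

#include <stdio.h>

int main(void)
{
	unsigned long max = 1024;	/* arch_scale_cpu_capacity() */
	unsigned long util_dl = 205;	/* ~20% worth of DL reservations */
	unsigned long util_cfs = 300;	/* CFS estimated utilization */

	/* min_perf would feed something like Minimum_Performance (up to
	 * the guaranteed level), des_perf the Desired_Performance field,
	 * both rescaled to the platform's performance range. */
	unsigned long min_perf = util_dl;
	unsigned long des_perf = util_cfs + util_dl;

	if (des_perf > max)
		des_perf = max;

	printf("min_perf=%lu des_perf=%lu (today: one request for %lu)\n",
	       min_perf, des_perf, des_perf);
	return 0;
}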
On Tue, Dec 05, 2017 at 01:34:00PM +0100, Juri Lelli wrote:
> > What about using for all these wrappers the same utility function you
> > already use in this source file? I.e.
> >
> > if (unlikely(dl_entity_is_special(dl_se)))
> > return;
> > __add_rq_bw(dl_se->dl_bw, dl_rq);
> > > +static inline int dl_entity_is_special(struct sched_dl_entity *dl_se)
> > > +{
> > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > + return dl_se->flags & SCHED_FLAG_SUGOV;
> > #else
> > return false;
> > #endif
> > > +}
Move the unlikely in here, saves on typing.
Commit-ID: d4edd662ac1657126df7ffd74a278958b133a77d
Gitweb: https://git.kernel.org/tip/d4edd662ac1657126df7ffd74a278958b133a77d
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:18 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 11:30:32 +0100
sched/cpufreq: Use the DEADLINE utilization signal
SCHED_DEADLINE tracks active utilization signal with a per dl_rq
variable named running_bw.
Make use of that to drive CPU frequency selection: add up FAIR and
DEADLINE contribution to get the required CPU capacity to handle both
requirements (while RT still selects max frequency).
Co-authored-by: Claudio Scordino <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched/cpufreq.h | 2 --
kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++----------
kernel/sched/sched.h | 10 ++++++++++
3 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d1ad3d8..0b55834 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -12,8 +12,6 @@
#define SCHED_CPUFREQ_DL (1U << 1)
#define SCHED_CPUFREQ_IOWAIT (1U << 2)
-#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
-
#ifdef CONFIG_CPU_FREQ
struct update_util_data {
void (*func)(struct update_util_data *data, u64 time, unsigned int flags);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6dd1ec9..8d266bc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -179,12 +179,17 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long cfs_max;
+ unsigned long util_cfs = cpu_util_cfs(rq);
+ unsigned long util_dl = cpu_util_dl(rq);
- cfs_max = arch_scale_cpu_capacity(NULL, cpu);
+ *max = arch_scale_cpu_capacity(NULL, cpu);
- *util = min(rq->cfs.avg.util_avg, cfs_max);
- *max = cfs_max;
+ /*
+ * Ideally we would like to set util_dl as min/guaranteed freq and
+ * util_cfs + util_dl as requested freq. However, cpufreq is not yet
+ * ready for such an interface. So, we only do the latter for now.
+ */
+ *util = min(util_cfs + util_dl, *max);
}
static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time)
@@ -271,7 +276,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
busy = sugov_cpu_is_busy(sg_cpu);
- if (flags & SCHED_CPUFREQ_RT_DL) {
+ if (flags & SCHED_CPUFREQ_RT) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max, sg_cpu->cpu);
@@ -316,7 +321,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
j_sg_cpu->iowait_boost_pending = false;
continue;
}
- if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+ if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;
j_util = j_sg_cpu->util;
@@ -352,7 +357,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sg_cpu->last_update = time;
if (sugov_should_update_freq(sg_policy, time)) {
- if (flags & SCHED_CPUFREQ_RT_DL)
+ if (flags & SCHED_CPUFREQ_RT)
next_f = sg_policy->policy->cpuinfo.max_freq;
else
next_f = sugov_next_freq_shared(sg_cpu, time);
@@ -382,9 +387,9 @@ static void sugov_irq_work(struct irq_work *irq_work)
sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
/*
- * For RT and deadline tasks, the schedutil governor shoots the
- * frequency to maximum. Special care must be taken to ensure that this
- * kthread doesn't result in the same behavior.
+ * For RT tasks, the schedutil governor shoots the frequency to maximum.
+ * Special care must be taken to ensure that this kthread doesn't result
+ * in the same behavior.
*
* This is (mostly) guaranteed by the work_in_progress flag. The flag is
* updated only at the end of the sugov_work() function and before that
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43f5d6e..136ab50 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2084,3 +2084,13 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#else /* arch_scale_freq_capacity */
#define arch_scale_freq_invariant() (false)
#endif
+
+static inline unsigned long cpu_util_dl(struct rq *rq)
+{
+ return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+}
+
+static inline unsigned long cpu_util_cfs(struct rq *rq)
+{
+ return rq->cfs.avg.util_avg;
+}
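For reference, a quick standalone illustration of the cpu_util_dl()
fixed-point conversion added above, assuming the in-kernel values
BW_SHIFT = 20 and SCHED_CAPACITY_SCALE = 1024: a 10ms/100ms reservation
shows up as roughly 10% of the capacity scale.

#include <stdio.h>

#define BW_SHIFT		20
#define SCHED_CAPACITY_SCALE	1024ULL

/* same encoding admission control uses for runtime/period */
static unsigned long long to_ratio(unsigned long long period,
				   unsigned long long runtime)
{
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	/* one queued DL task with 10ms runtime every 100ms */
	unsigned long long running_bw = to_ratio(100000000ULL, 10000000ULL);
	unsigned long long util_dl =
		(running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* prints: running_bw=104857 util_dl=102/1024 */
	printf("running_bw=%llu util_dl=%llu/%llu\n",
	       running_bw, util_dl, SCHED_CAPACITY_SCALE);
	return 0;
}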
Commit-ID: e0367b12674bf4420870cd0237e3ebafb2ec9593
Gitweb: https://git.kernel.org/tip/e0367b12674bf4420870cd0237e3ebafb2ec9593
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:19 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 11:30:32 +0100
sched/deadline: Move CPU frequency selection triggering points
Since SCHED_DEADLINE doesn't track utilization signal (but reserves a
fraction of CPU bandwidth to tasks admitted to the system), there is no
point in evaluating frequency changes during each tick event.
Move frequency selection triggering points to where running_bw changes.
Co-authored-by: Claudio Scordino <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/deadline.c | 7 ++++---
kernel/sched/sched.h | 12 ++++++------
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4c666db..f584837 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -86,6 +86,8 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_DL);
}
static inline
@@ -98,6 +100,8 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
if (dl_rq->running_bw > old)
dl_rq->running_bw = 0;
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_DL);
}
static inline
@@ -1134,9 +1138,6 @@ static void update_curr_dl(struct rq *rq)
return;
}
- /* kick cpufreq (see the comment in kernel/sched/sched.h). */
- cpufreq_update_util(rq, SCHED_CPUFREQ_DL);
-
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 136ab50..863964f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2055,14 +2055,14 @@ DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
* The way cpufreq is currently arranged requires it to evaluate the CPU
* performance state (frequency/voltage) on a regular basis to prevent it from
* being stuck in a completely inadequate performance level for too long.
- * That is not guaranteed to happen if the updates are only triggered from CFS,
- * though, because they may not be coming in if RT or deadline tasks are active
- * all the time (or there are RT and DL tasks only).
+ * That is not guaranteed to happen if the updates are only triggered from CFS
+ * and DL, though, because they may not be coming in if only RT tasks are
+ * active all the time (or there are RT tasks only).
*
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * As a workaround for that issue, this function is called periodically by the
+ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
* but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
+ * solutions targeted more specifically at RT tasks.
*/
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
Commit-ID: 7673c8a4c75d1cac2cd47156b9768f462683a09d
Gitweb: https://git.kernel.org/tip/7673c8a4c75d1cac2cd47156b9768f462683a09d
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:23 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 12:53:34 +0100
sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to
see where information coming from scheduling domains might help doing
frequency invariance scaling).
Remove it; also in anticipation of moving arch_scale_freq_capacity()
outside CONFIG_SMP.
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/arch_topology.h | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 3045112..2b70941 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -27,7 +27,7 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
DECLARE_PER_CPU(unsigned long, freq_scale);
static inline
-unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
+unsigned long topology_get_freq_scale(int cpu)
{
return per_cpu(freq_scale, cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9fec992..1485975 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3120,7 +3120,7 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
u64 periods;
- scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ scale_freq = arch_scale_freq_capacity(cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
delta += sa->period_contrib;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c519733..b710019 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1675,7 +1675,7 @@ extern void sched_avg_update(struct rq *rq);
#ifndef arch_scale_freq_capacity
static __always_inline
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_freq_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}
@@ -1694,7 +1694,7 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
- rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
+ rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
sched_avg_update(rq);
}
#else
Commit-ID: d18be45dbfef2e0bb12b9696c21aeae92f83b1ea
Gitweb: https://git.kernel.org/tip/d18be45dbfef2e0bb12b9696c21aeae92f83b1ea
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:21 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 12:53:34 +0100
sched/cpufreq: Split utilization signals
To be able to treat utilization signals of different scheduling classes
in different ways (e.g., CFS signal might be stale while DEADLINE signal
is never stale by design) we need to split sugov_cpu::util signal in two:
util_cfs and util_dl.
This patch does that by also changing sugov_get_util() parameter list.
After this change, aggregation of the different signals has to be performed
by sugov_get_util() users (so that they can decide what to do with the
different signals).
Suggested-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Claudio Scordino <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index bd5f997..e9e0713 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -60,7 +60,8 @@ struct sugov_cpu {
u64 last_update;
/* The fields below are only needed when sharing a policy. */
- unsigned long util;
+ unsigned long util_cfs;
+ unsigned long util_dl;
unsigned long max;
unsigned int flags;
@@ -176,20 +177,23 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}
-static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
- struct rq *rq = cpu_rq(cpu);
- unsigned long util_cfs = cpu_util_cfs(rq);
- unsigned long util_dl = cpu_util_dl(rq);
+ struct rq *rq = cpu_rq(sg_cpu->cpu);
- *max = arch_scale_cpu_capacity(NULL, cpu);
+ sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ sg_cpu->util_cfs = cpu_util_cfs(rq);
+ sg_cpu->util_dl = cpu_util_dl(rq);
+}
+static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
+{
/*
* Ideally we would like to set util_dl as min/guaranteed freq and
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- *util = min(util_cfs + util_dl, *max);
+ return min(sg_cpu->util_cfs + sg_cpu->util_dl, sg_cpu->max);
}
static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time)
@@ -279,7 +283,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
if (flags & SCHED_CPUFREQ_RT) {
next_f = policy->cpuinfo.max_freq;
} else {
- sugov_get_util(&util, &max, sg_cpu->cpu);
+ sugov_get_util(sg_cpu);
+ max = sg_cpu->max;
+ util = sugov_aggregate_util(sg_cpu);
sugov_iowait_boost(sg_cpu, &util, &max);
next_f = get_next_freq(sg_policy, util, max);
/*
@@ -324,8 +330,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;
- j_util = j_sg_cpu->util;
j_max = j_sg_cpu->max;
+ j_util = sugov_aggregate_util(j_sg_cpu);
if (j_util * max > j_max * util) {
util = j_util;
max = j_max;
@@ -342,15 +348,11 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
- unsigned long util, max;
unsigned int next_f;
- sugov_get_util(&util, &max, sg_cpu->cpu);
-
raw_spin_lock(&sg_policy->update_lock);
- sg_cpu->util = util;
- sg_cpu->max = max;
+ sugov_get_util(sg_cpu);
sg_cpu->flags = flags;
sugov_set_iowait_boost(sg_cpu, time);
Commit-ID: 0fa7d181f1a60149061632266bb432b4b61acdac
Gitweb: https://git.kernel.org/tip/0fa7d181f1a60149061632266bb432b4b61acdac
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:22 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 12:53:34 +0100
sched/cpufreq: Always consider all CPUs when deciding next freq
No assumption can be made upon the rate at which frequency updates get
triggered, as there are scheduling policies (like SCHED_DEADLINE) which
don't trigger them so frequently.
Remove such assumption from the code, by always considering
SCHED_DEADLINE utilization signal as not stale.
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Claudio Scordino <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e9e0713..dd062a1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -315,17 +315,21 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
s64 delta_ns;
/*
- * If the CPU utilization was last updated before the previous
- * frequency update and the time elapsed between the last update
- * of the CPU utilization and the last frequency update is long
- * enough, don't take the CPU into account as it probably is
- * idle now (and clear iowait_boost for it).
+ * If the CFS CPU utilization was last updated before the
+ * previous frequency update and the time elapsed between the
+ * last update of the CPU utilization and the last frequency
+ * update is long enough, reset iowait_boost and util_cfs, as
+ * they are now probably stale. However, still consider the
+ * CPU contribution if it has some DEADLINE utilization
+ * (util_dl).
*/
delta_ns = time - j_sg_cpu->last_update;
if (delta_ns > TICK_NSEC) {
j_sg_cpu->iowait_boost = 0;
j_sg_cpu->iowait_boost_pending = false;
- continue;
+ j_sg_cpu->util_cfs = 0;
+ if (j_sg_cpu->util_dl == 0)
+ continue;
}
if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;
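A toy standalone example of the behavioral change, for a two-CPU shared
policy with invented numbers: CPU1 has been quiet for more than a tick,
so its CFS utilization is dropped as stale, but its DEADLINE reservation
still drives the policy-wide request; before this patch the whole CPU
would have been skipped.

#include <stdio.h>

int main(void)
{
	unsigned long max = 1024;
	unsigned long cpu0_util = 100;		/* fresh util_cfs + util_dl */
	unsigned long cpu1_util_cfs = 0;	/* stale, reset by the hunk above */
	unsigned long cpu1_util_dl = 205;	/* ~20% reservation, never stale */
	unsigned long cpu1_util = cpu1_util_cfs + cpu1_util_dl;
	/* sugov_next_freq_shared() keeps the highest utilization across the
	 * policy (same max assumed for both CPUs here) */
	unsigned long util = cpu0_util > cpu1_util ? cpu0_util : cpu1_util;

	/* prints 205/1024 instead of the 100/1024 the old code would use */
	printf("policy-wide util = %lu/%lu\n", util, max);
	return 0;
}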
Commit-ID: 794a56ebd9a57db12abaec63f038c6eb073461f7
Gitweb: https://git.kernel.org/tip/794a56ebd9a57db12abaec63f038c6eb073461f7
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:20 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 12:53:29 +0100
sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
Worker kthread needs to be able to change frequency for all other
threads.
Make it special, just under STOP class.
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Claudio Scordino <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 13 ++++-
kernel/sched/cpufreq_schedutil.c | 19 ++++++--
kernel/sched/deadline.c | 103 +++++++++++++++++++++++++++------------
kernel/sched/sched.h | 30 +++++++++++-
5 files changed, 130 insertions(+), 36 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 274a449..f750671 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1431,6 +1431,7 @@ extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
extern int sched_setattr(struct task_struct *, const struct sched_attr *);
+extern int sched_setattr_nocheck(struct task_struct *, const struct sched_attr *);
extern struct task_struct *idle_task(int cpu);
/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e28391b..402ef4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4085,7 +4085,7 @@ recheck:
return -EINVAL;
}
- if (attr->sched_flags & ~SCHED_FLAG_ALL)
+ if (attr->sched_flags & ~(SCHED_FLAG_ALL | SCHED_FLAG_SUGOV))
return -EINVAL;
/*
@@ -4152,6 +4152,9 @@ recheck:
}
if (user) {
+ if (attr->sched_flags & SCHED_FLAG_SUGOV)
+ return -EINVAL;
+
retval = security_task_setscheduler(p);
if (retval)
return retval;
@@ -4207,7 +4210,8 @@ change:
}
#endif
#ifdef CONFIG_SMP
- if (dl_bandwidth_enabled() && dl_policy(policy)) {
+ if (dl_bandwidth_enabled() && dl_policy(policy) &&
+ !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
cpumask_t *span = rq->rd->span;
/*
@@ -4337,6 +4341,11 @@ int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
}
EXPORT_SYMBOL_GPL(sched_setattr);
+int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
+{
+ return __sched_setscheduler(p, attr, false, true);
+}
+
/**
* sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
* @p: the task in question.
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 8d266bc..bd5f997 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -474,7 +474,20 @@ static void sugov_policy_free(struct sugov_policy *sg_policy)
static int sugov_kthread_create(struct sugov_policy *sg_policy)
{
struct task_struct *thread;
- struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
+ struct sched_attr attr = {
+ .size = sizeof(struct sched_attr),
+ .sched_policy = SCHED_DEADLINE,
+ .sched_flags = SCHED_FLAG_SUGOV,
+ .sched_nice = 0,
+ .sched_priority = 0,
+ /*
+ * Fake (unused) bandwidth; workaround to "fix"
+ * priority inheritance.
+ */
+ .sched_runtime = 1000000,
+ .sched_deadline = 10000000,
+ .sched_period = 10000000,
+ };
struct cpufreq_policy *policy = sg_policy->policy;
int ret;
@@ -492,10 +505,10 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
return PTR_ERR(thread);
}
- ret = sched_setscheduler_nocheck(thread, SCHED_FIFO, &param);
+ ret = sched_setattr_nocheck(thread, &attr);
if (ret) {
kthread_stop(thread);
- pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
return ret;
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f584837..54a0dc1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -78,7 +78,7 @@ static inline int dl_bw_cpus(int i)
#endif
static inline
-void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;
@@ -91,7 +91,7 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
}
static inline
-void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->running_bw;
@@ -105,7 +105,7 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
}
static inline
-void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;
@@ -115,7 +115,7 @@ void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
}
static inline
-void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
u64 old = dl_rq->this_bw;
@@ -127,16 +127,46 @@ void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
}
+static inline
+void add_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!dl_entity_is_special(dl_se))
+ __add_rq_bw(dl_se->dl_bw, dl_rq);
+}
+
+static inline
+void sub_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!dl_entity_is_special(dl_se))
+ __sub_rq_bw(dl_se->dl_bw, dl_rq);
+}
+
+static inline
+void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!dl_entity_is_special(dl_se))
+ __add_running_bw(dl_se->dl_bw, dl_rq);
+}
+
+static inline
+void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ if (!dl_entity_is_special(dl_se))
+ __sub_running_bw(dl_se->dl_bw, dl_rq);
+}
+
void dl_change_utilization(struct task_struct *p, u64 new_bw)
{
struct rq *rq;
+ BUG_ON(p->dl.flags & SCHED_FLAG_SUGOV);
+
if (task_on_rq_queued(p))
return;
rq = task_rq(p);
if (p->dl.dl_non_contending) {
- sub_running_bw(p->dl.dl_bw, &rq->dl);
+ sub_running_bw(&p->dl, &rq->dl);
p->dl.dl_non_contending = 0;
/*
* If the timer handler is currently running and the
@@ -148,8 +178,8 @@ void dl_change_utilization(struct task_struct *p, u64 new_bw)
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
put_task_struct(p);
}
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
- add_rq_bw(new_bw, &rq->dl);
+ __sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ __add_rq_bw(new_bw, &rq->dl);
}
/*
@@ -221,6 +251,9 @@ static void task_non_contending(struct task_struct *p)
if (dl_se->dl_runtime == 0)
return;
+ if (dl_entity_is_special(dl_se))
+ return;
+
WARN_ON(hrtimer_active(&dl_se->inactive_timer));
WARN_ON(dl_se->dl_non_contending);
@@ -240,12 +273,12 @@ static void task_non_contending(struct task_struct *p)
*/
if (zerolag_time < 0) {
if (dl_task(p))
- sub_running_bw(dl_se->dl_bw, dl_rq);
+ sub_running_bw(dl_se, dl_rq);
if (!dl_task(p) || p->state == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
if (p->state == TASK_DEAD)
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
raw_spin_lock(&dl_b->lock);
__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
__dl_clear_params(p);
@@ -272,7 +305,7 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
return;
if (flags & ENQUEUE_MIGRATED)
- add_rq_bw(dl_se->dl_bw, dl_rq);
+ add_rq_bw(dl_se, dl_rq);
if (dl_se->dl_non_contending) {
dl_se->dl_non_contending = 0;
@@ -293,7 +326,7 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
* when the "inactive timer" fired).
* So, add it back.
*/
- add_running_bw(dl_se->dl_bw, dl_rq);
+ add_running_bw(dl_se, dl_rq);
}
}
@@ -1149,6 +1182,9 @@ static void update_curr_dl(struct rq *rq)
sched_rt_avg_update(rq, delta_exec);
+ if (dl_entity_is_special(dl_se))
+ return;
+
if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
dl_se->runtime -= delta_exec;
@@ -1211,8 +1247,8 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
- sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
- sub_rq_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+ sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
+ sub_rq_bw(&p->dl, dl_rq_of_se(&p->dl));
dl_se->dl_non_contending = 0;
}
@@ -1229,7 +1265,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
sched_clock_tick();
update_rq_clock(rq);
- sub_running_bw(dl_se->dl_bw, &rq->dl);
+ sub_running_bw(dl_se, &rq->dl);
dl_se->dl_non_contending = 0;
unlock:
task_rq_unlock(rq, p, &rf);
@@ -1423,8 +1459,8 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
dl_check_constrained_dl(&p->dl);
if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
- add_rq_bw(p->dl.dl_bw, &rq->dl);
- add_running_bw(p->dl.dl_bw, &rq->dl);
+ add_rq_bw(&p->dl, &rq->dl);
+ add_running_bw(&p->dl, &rq->dl);
}
/*
@@ -1464,8 +1500,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
__dequeue_task_dl(rq, p, flags);
if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
- sub_running_bw(p->dl.dl_bw, &rq->dl);
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_running_bw(&p->dl, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
}
/*
@@ -1571,7 +1607,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
*/
raw_spin_lock(&rq->lock);
if (p->dl.dl_non_contending) {
- sub_running_bw(p->dl.dl_bw, &rq->dl);
+ sub_running_bw(&p->dl, &rq->dl);
p->dl.dl_non_contending = 0;
/*
* If the timer handler is currently running and the
@@ -1583,7 +1619,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
put_task_struct(p);
}
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
raw_spin_unlock(&rq->lock);
}
@@ -2026,11 +2062,11 @@ retry:
}
deactivate_task(rq, next_task, 0);
- sub_running_bw(next_task->dl.dl_bw, &rq->dl);
- sub_rq_bw(next_task->dl.dl_bw, &rq->dl);
+ sub_running_bw(&next_task->dl, &rq->dl);
+ sub_rq_bw(&next_task->dl, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
- add_rq_bw(next_task->dl.dl_bw, &later_rq->dl);
- add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
+ add_rq_bw(&next_task->dl, &later_rq->dl);
+ add_running_bw(&next_task->dl, &later_rq->dl);
activate_task(later_rq, next_task, 0);
ret = 1;
@@ -2118,11 +2154,11 @@ static void pull_dl_task(struct rq *this_rq)
resched = true;
deactivate_task(src_rq, p, 0);
- sub_running_bw(p->dl.dl_bw, &src_rq->dl);
- sub_rq_bw(p->dl.dl_bw, &src_rq->dl);
+ sub_running_bw(&p->dl, &src_rq->dl);
+ sub_rq_bw(&p->dl, &src_rq->dl);
set_task_cpu(p, this_cpu);
- add_rq_bw(p->dl.dl_bw, &this_rq->dl);
- add_running_bw(p->dl.dl_bw, &this_rq->dl);
+ add_rq_bw(&p->dl, &this_rq->dl);
+ add_running_bw(&p->dl, &this_rq->dl);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
@@ -2231,7 +2267,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
task_non_contending(p);
if (!task_on_rq_queued(p))
- sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ sub_rq_bw(&p->dl, &rq->dl);
/*
* We cannot use inactive_task_timer() to invoke sub_running_bw()
@@ -2263,7 +2299,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
/* If p is not queued we will update its parameters at next wakeup. */
if (!task_on_rq_queued(p)) {
- add_rq_bw(p->dl.dl_bw, &rq->dl);
+ add_rq_bw(&p->dl, &rq->dl);
return;
}
@@ -2442,6 +2478,9 @@ int sched_dl_overflow(struct task_struct *p, int policy,
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
int cpus, err = -1;
+ if (attr->sched_flags & SCHED_FLAG_SUGOV)
+ return 0;
+
/* !deadline task may carry old deadline bandwidth */
if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
return 0;
@@ -2528,6 +2567,10 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
*/
bool __checkparam_dl(const struct sched_attr *attr)
{
+ /* special dl tasks don't actually use any parameter */
+ if (attr->sched_flags & SCHED_FLAG_SUGOV)
+ return true;
+
/* deadline != 0 */
if (attr->sched_deadline == 0)
return false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 863964f..c519733 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -157,12 +157,36 @@ static inline int task_has_dl_policy(struct task_struct *p)
}
/*
+ * !! For sched_setattr_nocheck() (kernel) only !!
+ *
+ * This is actually gross. :(
+ *
+ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
+ * tasks, but still be able to sleep. We need this on platforms that cannot
+ * atomically change clock frequency. Remove once fast switching will be
+ * available on such platforms.
+ *
+ * SUGOV stands for SchedUtil GOVernor.
+ */
+#define SCHED_FLAG_SUGOV 0x10000000
+
+static inline bool dl_entity_is_special(struct sched_dl_entity *dl_se)
+{
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+ return unlikely(dl_se->flags & SCHED_FLAG_SUGOV);
+#else
+ return false;
+#endif
+}
+
+/*
* Tells if entity @a should preempt entity @b.
*/
static inline bool
dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
{
- return dl_time_before(a->deadline, b->deadline);
+ return dl_entity_is_special(a) ||
+ dl_time_before(a->deadline, b->deadline);
}
/*
@@ -2085,6 +2109,8 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#define arch_scale_freq_invariant() (false)
#endif
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+
static inline unsigned long cpu_util_dl(struct rq *rq)
{
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
@@ -2094,3 +2120,5 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
{
return rq->cfs.avg.util_avg;
}
+
+#endif
Commit-ID: 07881166a892fa4908ac4924660a7793f75d6544
Gitweb: https://git.kernel.org/tip/07881166a892fa4908ac4924660a7793f75d6544
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:25 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 12:53:35 +0100
sched/deadline: Make bandwidth enforcement scale-invariant
Apply frequency and CPU scale-invariance correction factor to bandwidth
enforcement (similar to what we already do to fair utilization tracking).
Each delta_exec gets scaled considering current frequency and maximum
CPU capacity; which means that the reservation runtime parameter (that
need to be specified profiling the task execution at max frequency on
biggest capacity core) gets thus scaled accordingly.
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Claudio Scordino <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/deadline.c | 26 ++++++++++++++++++++++----
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 2 ++
3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 54a0dc1..9bb0e0c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1151,7 +1151,8 @@ static void update_curr_dl(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec;
+ u64 delta_exec, scaled_delta_exec;
+ int cpu = cpu_of(rq);
if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1185,9 +1186,26 @@ static void update_curr_dl(struct rq *rq)
if (dl_entity_is_special(dl_se))
return;
- if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
- delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
- dl_se->runtime -= delta_exec;
+ /*
+ * For tasks that participate in GRUB, we implement GRUB-PA: the
+ * spare reclaimed bandwidth is used to clock down frequency.
+ *
+ * For the others, we still need to scale reservation parameters
+ * according to current frequency and CPU maximum capacity.
+ */
+ if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+ scaled_delta_exec = grub_reclaim(delta_exec,
+ rq,
+ &curr->dl);
+ } else {
+ unsigned long scale_freq = arch_scale_freq_capacity(cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+ scaled_delta_exec = cap_scale(delta_exec, scale_freq);
+ scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
+ }
+
+ dl_se->runtime -= scaled_delta_exec;
throttle:
if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1485975..1070803 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3089,8 +3089,6 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
return c1 + c2 + c3;
}
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
/*
* Accumulate the three separate parts of the sum; d1 the remainder
* of the last (incomplete) period, d2 the span of full periods and d3
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e122c89..2e95505 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -156,6 +156,8 @@ static inline int task_has_dl_policy(struct task_struct *p)
return dl_policy(p->policy);
}
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
/*
* !! For sched_setattr_nocheck() (kernel) only !!
*
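As a quick sanity check of the scaling above (standalone, assuming
SCHED_CAPACITY_SHIFT = 10): on a CPU running at half of its maximum
frequency and with half of the biggest CPU's capacity, a 10ms wall-clock
slice depletes only 2.5ms of runtime, matching the fact that the task got
a quarter of the work done compared to the conditions the reservation was
profiled under.

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10

#define cap_scale(v, s)	((v) * (s) >> SCHED_CAPACITY_SHIFT)

int main(void)
{
	unsigned long long delta_exec = 10000000ULL;	/* 10ms in ns */
	unsigned long scale_freq = 512;			/* curr freq = max/2 */
	unsigned long scale_cpu = 512;			/* capacity = biggest/2 */
	unsigned long long scaled;

	scaled = cap_scale(delta_exec, scale_freq);
	scaled = cap_scale(scaled, scale_cpu);

	/* prints: consumed 2500000 ns of runtime for 10000000 ns of wall-clock */
	printf("consumed %llu ns of runtime for %llu ns of wall-clock\n",
	       scaled, delta_exec);
	return 0;
}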
Commit-ID: 7e1a9208f6c7e66bb4e5d2ed18dfd191230f431b
Gitweb: https://git.kernel.org/tip/7e1a9208f6c7e66bb4e5d2ed18dfd191230f431b
Author: Juri Lelli <[email protected]>
AuthorDate: Mon, 4 Dec 2017 11:23:24 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 12:53:35 +0100
sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
Currently, frequency and cpu capacity scaling is only performed on
CONFIG_SMP systems (as CFS PELT signals are only present for such
systems). However, other scheduling classes want to do freq/cpu scaling,
and for !CONFIG_SMP configurations as well.
arch_scale_freq_capacity() is useful to implement frequency scaling even
on !CONFIG_SMP platforms, so we simply move it outside CONFIG_SMP
ifdeffery.
Even if arch_scale_cpu_capacity() is not useful on !CONFIG_SMP platforms,
we make a default implementation available for such configurations anyway
to simplify scheduler code doing CPU scale invariance.
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched/topology.h | 12 ++++++------
kernel/sched/sched.h | 13 ++++++++++---
2 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2..2634774 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -7,6 +7,12 @@
#include <linux/sched/idle.h>
/*
+ * Increase resolution of cpu_capacity calculations
+ */
+#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
+/*
* sched-domains (multiprocessor balancing) declarations:
*/
#ifdef CONFIG_SMP
@@ -27,12 +33,6 @@
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
#define SD_NUMA 0x4000 /* cross-node balancing */
-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b710019..e122c89 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1670,9 +1670,6 @@ static inline int hrtick_enabled(struct rq *rq)
#endif /* CONFIG_SCHED_HRTICK */
-#ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
#ifndef arch_scale_freq_capacity
static __always_inline
unsigned long arch_scale_freq_capacity(int cpu)
@@ -1681,6 +1678,9 @@ unsigned long arch_scale_freq_capacity(int cpu)
}
#endif
+#ifdef CONFIG_SMP
+extern void sched_avg_update(struct rq *rq);
+
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1698,6 +1698,13 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
sched_avg_update(rq);
}
#else
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif