2017-07-05 09:00:15

by Juri Lelli

Subject: [RFC PATCH v1 0/8] SCHED_DEADLINE freq/cpu invariance and OPP selection

Hi,

v1 of the RFC set implementing frequency/cpu invariance and OPP selection for
SCHED_DEADLINE [1]. The set is based on tip/sched/core as of today
(72298e5c92c5), which now already includes Luca's "CPU reclaiming for
SCHED_DEADLINE".

Thanks a lot for reviewing RFCv0!

Patches high level description:

o [01-02]/08 add the necessary links to start accounting DEADLINE contribution
to OPP selection
o 03/08 is a temporary solution to make it possible (on ARM) to change
frequency for DEADLINE tasks (which could otherwise delay the SCHED_FIFO
worker kthread); the proper solution would be the ability to issue frequency
transitions from an atomic context
o [04-05]/08 are schedutil changes that cope with the fact that DEADLINE
doesn't require a periodic OPP selection triggering point
o [06-07]/08 make the arch_scale_{freq,cpu}_capacity() functions available on
!CONFIG_SMP configurations too
o 08/08 implements frequency/cpu invariance for tasks' reservation
parameters, which basically means that we implement GRUB-PA [2]

Changes w.r.t. RFCv0:

- rebase on tip/sched/core
- make use of BW_SHIFT for calculations (Peter)
- added a comment about guaranteed/requested frequency (Peter)
- use a high bit for sched_flags SCHED_FLAG_SPECIAL hack; don't expose it to
userspace and add comments (Peter)
- refactor aggregation of utilization from scheduling classes

Please have a look. Feedback and comments are, as usual, more than welcome.

In case you would like to test this out:

git://linux-arm.org/linux-jl.git upstream/deadline/freq-rfc-v1

Best,

- Juri


Juri Lelli (8):
sched/cpufreq_schedutil: make use of DEADLINE utilization signal
sched/deadline: move cpu frequency selection triggering points
sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE
sched/cpufreq_schedutil: split utilization signals
sched/cpufreq_schedutil: always consider all CPUs when deciding next
freq
sched/sched.h: remove sd arch_scale_freq_capacity parameter
sched/sched.h: move arch_scale_{freq,cpu}_capacity outside CONFIG_SMP
sched/deadline: make bandwidth enforcement scale-invariant

include/linux/sched.h | 1 +
include/linux/sched/cpufreq.h | 2 --
include/linux/sched/topology.h | 12 +++----
kernel/sched/core.c | 15 ++++++--
kernel/sched/cpufreq_schedutil.c | 77 ++++++++++++++++++++++++++--------------
kernel/sched/deadline.c | 46 ++++++++++++++++++++----
kernel/sched/fair.c | 4 +--
kernel/sched/sched.h | 51 +++++++++++++++++++-------
8 files changed, 149 insertions(+), 59 deletions(-)

--
2.11.0


2017-07-05 09:00:22

by Juri Lelli

Subject: [RFC PATCH v1 2/8] sched/deadline: move cpu frequency selection triggering points

Since SCHED_DEADLINE doesn't track a utilization signal (but reserves a
fraction of CPU bandwidth for tasks admitted to the system), there is no
point in evaluating frequency changes at each tick event.

Move frequency selection triggering points to where running_bw changes.

Co-authored-by: Claudio Scordino <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
---
Changes from RFCv0:

- modify comment regarding periodic RT updates (Claudio)
---
kernel/sched/deadline.c | 7 ++++---
kernel/sched/sched.h | 12 ++++++------
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a84299f44b5d..6912f7f35f9b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -85,6 +85,8 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_DL);
}

static inline
@@ -97,6 +99,8 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
if (dl_rq->running_bw > old)
dl_rq->running_bw = 0;
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_DL);
}

static inline
@@ -1135,9 +1139,6 @@ static void update_curr_dl(struct rq *rq)
return;
}

- /* kick cpufreq (see the comment in kernel/sched/sched.h). */
- cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_DL);
-
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3086d1..d8798bb54ace 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2057,14 +2057,14 @@ DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
* The way cpufreq is currently arranged requires it to evaluate the CPU
* performance state (frequency/voltage) on a regular basis to prevent it from
* being stuck in a completely inadequate performance level for too long.
- * That is not guaranteed to happen if the updates are only triggered from CFS,
- * though, because they may not be coming in if RT or deadline tasks are active
- * all the time (or there are RT and DL tasks only).
+ * That is not guaranteed to happen if the updates are only triggered from CFS
+ * and DL, though, because they may not be coming in if only RT tasks are
+ * active all the time (or there are RT tasks only).
*
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
+ * As a workaround for that issue, this function is called periodically by the
+ * RT sched class to trigger extra cpufreq updates to prevent it from stalling,
* but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
+ * solutions targeted more specifically at RT tasks.
*/
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
--
2.11.0

2017-07-05 09:00:29

by Juri Lelli

Subject: [RFC PATCH v1 1/8] sched/cpufreq_schedutil: make use of DEADLINE utilization signal

SCHED_DEADLINE tracks the active utilization signal with a per-dl_rq
variable named running_bw.

Make use of that to drive cpu frequency selection: add up the FAIR and
DEADLINE contributions to get the required CPU capacity to handle both
requirements (while RT still selects the max frequency).

Co-authored-by: Claudio Scordino <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
---
Changes from RFCv0:

- use BW_SHIFT (Peter)
- add comment about guaranteed and requested freq (Peter)
- modify comment about go to max behaviour (Claudio)
---
include/linux/sched/cpufreq.h | 2 --
kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++----------
2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d2be2ccbb372..39640bb3a8ee 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -11,8 +11,6 @@
#define SCHED_CPUFREQ_DL (1U << 1)
#define SCHED_CPUFREQ_IOWAIT (1U << 2)

-#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
-
#ifdef CONFIG_CPU_FREQ
struct update_util_data {
void (*func)(struct update_util_data *data, u64 time, unsigned int flags);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 076a2e31951c..f2494d1fc8ef 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -157,12 +157,17 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
static void sugov_get_util(unsigned long *util, unsigned long *max)
{
struct rq *rq = this_rq();
- unsigned long cfs_max;
+ unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
+ >> BW_SHIFT;

- cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
+ *max = arch_scale_cpu_capacity(NULL, smp_processor_id());

- *util = min(rq->cfs.avg.util_avg, cfs_max);
- *max = cfs_max;
+ /*
+ * Ideally we would like to set util_dl as min/guaranteed freq and
+ * util_cfs + util_dl as requested freq. However, cpufreq is not yet
+ * ready for such an interface. So, we only do the latter for now.
+ */
+ *util = min(rq->cfs.avg.util_avg + dl_util, *max);
}

static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
@@ -226,7 +231,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,

busy = sugov_cpu_is_busy(sg_cpu);

- if (flags & SCHED_CPUFREQ_RT_DL) {
+ if (flags & SCHED_CPUFREQ_RT) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -266,7 +271,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
j_sg_cpu->iowait_boost = 0;
continue;
}
- if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+ if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;

j_util = j_sg_cpu->util;
@@ -302,7 +307,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sg_cpu->last_update = time;

if (sugov_should_update_freq(sg_policy, time)) {
- if (flags & SCHED_CPUFREQ_RT_DL)
+ if (flags & SCHED_CPUFREQ_RT)
next_f = sg_policy->policy->cpuinfo.max_freq;
else
next_f = sugov_next_freq_shared(sg_cpu, time);
@@ -332,9 +337,9 @@ static void sugov_irq_work(struct irq_work *irq_work)
sg_policy = container_of(irq_work, struct sugov_policy, irq_work);

/*
- * For RT and deadline tasks, the schedutil governor shoots the
- * frequency to maximum. Special care must be taken to ensure that this
- * kthread doesn't result in the same behavior.
+ * For RT tasks, the schedutil governor shoots the frequency to maximum.
+ * Special care must be taken to ensure that this kthread doesn't result
+ * in the same behavior.
*
* This is (mostly) guaranteed by the work_in_progress flag. The flag is
* updated only at the end of the sugov_work() function and before that
--
2.11.0
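
For illustration, here is a small standalone userspace model (not kernel code)
of the arithmetic this patch introduces. The constants mirror the kernel's
BW_SHIFT (20) and SCHED_CAPACITY_SCALE (1024) of this era, and the 1.25x
headroom factor follows schedutil's get_next_freq(); the task parameters and
the 2 GHz max frequency are made-up example values.

/*
 * Standalone model of the DEADLINE contribution to OPP selection:
 * running_bw (fixed point, BW_SHIFT fractional bits) is turned into a
 * capacity-scale utilization and added to the CFS utilization before
 * schedutil picks a frequency.
 */
#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT		20		/* dl bandwidth fixed point */
#define SCHED_CAPACITY_SCALE	1024UL		/* max capacity / utilization */

static unsigned long next_freq(unsigned long max_freq, unsigned long util,
			       unsigned long max)
{
	/* schedutil's 1.25x headroom: freq = 1.25 * max_freq * util / max */
	unsigned long freq = max_freq + (max_freq >> 2);

	return freq * util / max;
}

int main(void)
{
	/* one DL task with runtime 10ms every 100ms -> bandwidth 0.1 */
	uint64_t running_bw = (10ULL << BW_SHIFT) / 100;
	unsigned long util_dl = (running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
	unsigned long util_cfs = 300;			/* example CFS util_avg */
	unsigned long max = SCHED_CAPACITY_SCALE;	/* big core, full capacity */
	unsigned long util = util_cfs + util_dl;

	if (util > max)
		util = max;

	printf("util_dl=%lu util=%lu max_freq=2000000kHz -> next_f=%lukHz\n",
	       util_dl, util, next_freq(2000000, util, max));
	return 0;
}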

2017-07-05 09:00:31

by Juri Lelli

Subject: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

The worker kthread needs to be able to change frequency on behalf of all
other threads.

Make it special, just below the STOP class.

Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
---
Changes from RFCv0:

- use a high bit of sched_flags and don't expose the new flag to
userspace (also add derisory comments) (Peter)
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 15 +++++++++++++--
kernel/sched/cpufreq_schedutil.c | 15 ++++++++++++---
kernel/sched/deadline.c | 13 +++++++++++++
kernel/sched/sched.h | 20 +++++++++++++++++++-
5 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f0f427e0292..0ba767fdedae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,6 +1359,7 @@ extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
extern int sched_setattr(struct task_struct *, const struct sched_attr *);
+extern int sched_setattr_nocheck(struct task_struct *, const struct sched_attr *);
extern struct task_struct *idle_task(int cpu);

/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5186797908dc..0e40c7ff2bf4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3998,7 +3998,9 @@ static int __sched_setscheduler(struct task_struct *p,
}

if (attr->sched_flags &
- ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
+ ~(SCHED_FLAG_RESET_ON_FORK |
+ SCHED_FLAG_RECLAIM |
+ SCHED_FLAG_SPECIAL))
return -EINVAL;

/*
@@ -4065,6 +4067,9 @@ static int __sched_setscheduler(struct task_struct *p,
}

if (user) {
+ if (attr->sched_flags & SCHED_FLAG_SPECIAL)
+ return -EPERM;
+
retval = security_task_setscheduler(p);
if (retval)
return retval;
@@ -4120,7 +4125,8 @@ static int __sched_setscheduler(struct task_struct *p,
}
#endif
#ifdef CONFIG_SMP
- if (dl_bandwidth_enabled() && dl_policy(policy)) {
+ if (dl_bandwidth_enabled() && dl_policy(policy) &&
+ !(attr->sched_flags & SCHED_FLAG_SPECIAL)) {
cpumask_t *span = rq->rd->span;

/*
@@ -4250,6 +4256,11 @@ int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
}
EXPORT_SYMBOL_GPL(sched_setattr);

+int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr)
+{
+ return __sched_setscheduler(p, attr, false, true);
+}
+
/**
* sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
* @p: the task in question.
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index f2494d1fc8ef..ba6227625f24 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -424,7 +424,16 @@ static void sugov_policy_free(struct sugov_policy *sg_policy)
static int sugov_kthread_create(struct sugov_policy *sg_policy)
{
struct task_struct *thread;
- struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
+ struct sched_attr attr = {
+ .size = sizeof(struct sched_attr),
+ .sched_policy = SCHED_DEADLINE,
+ .sched_flags = SCHED_FLAG_SPECIAL,
+ .sched_nice = 0,
+ .sched_priority = 0,
+ .sched_runtime = 0,
+ .sched_deadline = 0,
+ .sched_period = 0,
+ };
struct cpufreq_policy *policy = sg_policy->policy;
int ret;

@@ -442,10 +451,10 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
return PTR_ERR(thread);
}

- ret = sched_setscheduler_nocheck(thread, SCHED_FIFO, &param);
+ ret = sched_setattr_nocheck(thread, &attr);
if (ret) {
kthread_stop(thread);
- pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
return ret;
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6912f7f35f9b..4ec82fa56b04 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -220,6 +220,9 @@ static void task_non_contending(struct task_struct *p)
if (dl_se->dl_runtime == 0)
return;

+ if (unlikely(dl_entity_is_special(dl_se)))
+ return;
+
WARN_ON(hrtimer_active(&dl_se->inactive_timer));
WARN_ON(dl_se->dl_non_contending);

@@ -1150,6 +1153,9 @@ static void update_curr_dl(struct rq *rq)

sched_rt_avg_update(rq, delta_exec);

+ if (unlikely(dl_entity_is_special(dl_se)))
+ return;
+
if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
dl_se->runtime -= delta_exec;
@@ -2447,6 +2453,9 @@ int sched_dl_overflow(struct task_struct *p, int policy,
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
int cpus, err = -1;

+ if (attr->sched_flags & SCHED_FLAG_SPECIAL)
+ return 0;
+
/* !deadline task may carry old deadline bandwidth */
if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
return 0;
@@ -2533,6 +2542,10 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
*/
bool __checkparam_dl(const struct sched_attr *attr)
{
+ /* special dl tasks don't actually use any parameter */
+ if (attr->sched_flags & SCHED_FLAG_SPECIAL)
+ return true;
+
/* deadline != 0 */
if (attr->sched_deadline == 0)
return false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d8798bb54ace..1be5dac728a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -156,12 +156,30 @@ static inline int task_has_dl_policy(struct task_struct *p)
}

/*
+ * !! For sched_setattr_nocheck() (kernel) only !!
+ *
+ * This is actually gross. :(
+ *
+ * It is used to make schedutil kworker(s) higher priority than SCHED_DEADLINE
+ * tasks, but still be able to sleep. We need this on platforms that cannot
+ * atomically change clock frequency. Remove once fast switching will be
+ * available on such platforms.
+ */
+#define SCHED_FLAG_SPECIAL 0x10000000
+
+static inline int dl_entity_is_special(struct sched_dl_entity *dl_se)
+{
+ return dl_se->flags & SCHED_FLAG_SPECIAL;
+}
+
+/*
* Tells if entity @a should preempt entity @b.
*/
static inline bool
dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
{
- return dl_time_before(a->deadline, b->deadline);
+ return dl_entity_is_special(a) ||
+ dl_time_before(a->deadline, b->deadline);
}

/*
--
2.11.0
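
A minimal standalone sketch (not kernel code) of what the new flag changes in
the preemption check: a special entity is admitted with zeroed parameters and
wins dl_entity_preempt() regardless of deadlines. Names mirror the patch, but
the structures, the zero parameters and the deadline values are simplified
for illustration (wraparound handling via dl_time_before() is omitted).

#include <stdbool.h>
#include <stdio.h>

#define SCHED_FLAG_SPECIAL	0x10000000

struct dl_entity {
	unsigned long long deadline;	/* absolute deadline */
	unsigned int flags;
};

static bool dl_entity_is_special(const struct dl_entity *dl_se)
{
	return dl_se->flags & SCHED_FLAG_SPECIAL;
}

/* mirrors dl_entity_preempt() after this patch */
static bool dl_entity_preempt(const struct dl_entity *a, const struct dl_entity *b)
{
	return dl_entity_is_special(a) || a->deadline < b->deadline;
}

int main(void)
{
	/* sugov kthread: zeroed parameters, as in sugov_kthread_create() */
	struct dl_entity sugov = { .deadline = 0, .flags = SCHED_FLAG_SPECIAL };
	struct dl_entity task  = { .deadline = 100, .flags = 0 };

	/* the special entity wins regardless of deadlines */
	printf("sugov preempts task: %d\n", dl_entity_preempt(&sugov, &task));
	printf("task preempts sugov: %d\n", dl_entity_preempt(&task, &sugov));
	return 0;
}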

2017-07-05 09:00:44

by Juri Lelli

Subject: [RFC PATCH v1 5/8] sched/cpufreq_schedutil: always consider all CPUs when deciding next freq

No assumption can be made about the rate at which frequency updates get
triggered, as there are scheduling policies (like SCHED_DEADLINE) which
don't trigger them as frequently.

Remove such an assumption from the code by always considering the
SCHED_DEADLINE utilization signal as not stale.

Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e835fa886225..066b876d81e7 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -267,17 +267,22 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
s64 delta_ns;

/*
- * If the CPU utilization was last updated before the previous
- * frequency update and the time elapsed between the last update
- * of the CPU utilization and the last frequency update is long
- * enough, don't take the CPU into account as it probably is
- * idle now (and clear iowait_boost for it).
+ * If the CFS CPU utilization was last updated before the
+ * previous frequency update and the time elapsed between the
+ * last update of the CPU utilization and the last frequency
+ * update is long enough, reset iowait_boost and util_cfs, as
+ * they are now probably stale. However, still consider the
+ * CPU contribution if it has some DEADLINE utilization
+ * (util_dl).
*/
delta_ns = time - j_sg_cpu->last_update;
if (delta_ns > TICK_NSEC) {
j_sg_cpu->iowait_boost = 0;
- continue;
+ j_sg_cpu->util_cfs = 0;
+ if (j_sg_cpu->util_dl == 0)
+ continue;
}
+
if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;

--
2.11.0
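
To make the new per-CPU decision easier to follow, here is a small standalone
model (not kernel code) of the check above: a CPU whose last update is older
than a tick has its iowait boost and CFS utilization cleared, but is still
taken into account if it carries DEADLINE bandwidth. The TICK_NSEC value and
the utilization numbers are just examples.

#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

#define TICK_NSEC	4000000LL	/* e.g. HZ=250 */

struct sugov_cpu {
	int64_t last_update;
	unsigned long iowait_boost;
	unsigned long util_cfs;
	unsigned long util_dl;
};

/* returns false when the CPU can be skipped entirely */
static bool cpu_contributes(struct sugov_cpu *sg_cpu, int64_t now)
{
	if (now - sg_cpu->last_update > TICK_NSEC) {
		sg_cpu->iowait_boost = 0;
		sg_cpu->util_cfs = 0;		/* probably stale */
		if (sg_cpu->util_dl == 0)
			return false;		/* likely idle, nothing to account */
	}
	return true;	/* DL bandwidth (or fresh CFS util) still counts */
}

int main(void)
{
	struct sugov_cpu cfs_only = { .last_update = 0, .util_cfs = 400, .util_dl = 0 };
	struct sugov_cpu with_dl  = { .last_update = 0, .util_cfs = 400, .util_dl = 102 };
	int64_t now = 10 * TICK_NSEC;

	printf("stale CFS-only CPU contributes: %d\n", cpu_contributes(&cfs_only, now));
	printf("stale CPU with DL bandwidth contributes: %d (util_cfs reset to %lu)\n",
	       cpu_contributes(&with_dl, now), with_dl.util_cfs);
	return 0;
}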

2017-07-05 09:00:37

by Juri Lelli

Subject: [RFC PATCH v1 6/8] sched/sched.h: remove sd arch_scale_freq_capacity parameter

The sd parameter is never used in arch_scale_freq_capacity() (and it's hard
to see how information coming from scheduling domains might help with
frequency-invariance scaling).

Remove it, also in anticipation of moving arch_scale_freq_capacity()
outside CONFIG_SMP.

Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 008c514dc241..97bd5aae285d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2875,7 +2875,7 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
u64 periods;

- scale_freq = arch_scale_freq_capacity(NULL, cpu);
+ scale_freq = arch_scale_freq_capacity(cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

delta += sa->period_contrib;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1be5dac728a4..ec5769288927 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1673,7 +1673,7 @@ extern void sched_avg_update(struct rq *rq);

#ifndef arch_scale_freq_capacity
static __always_inline
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_freq_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}
@@ -1692,7 +1692,7 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)

static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
- rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
+ rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
sched_avg_update(rq);
}
#else
--
2.11.0

2017-07-05 09:00:51

by Juri Lelli

Subject: [RFC PATCH v1 7/8] sched/sched.h: move arch_scale_{freq,cpu}_capacity outside CONFIG_SMP

Currently, frequency and cpu capacity scaling is only performed on
CONFIG_SMP systems (as CFS PELT signals are only present for such
systems). However, other scheduling classes want to do freq/cpu scaling,
and for !CONFIG_SMP configurations as well.

arch_scale_freq_capacity is useful for implementing frequency scaling even
on !CONFIG_SMP platforms, so we simply move it outside the CONFIG_SMP
ifdeffery.

Even though arch_scale_cpu_capacity is not useful on !CONFIG_SMP platforms,
we make a default implementation available for such configurations anyway
to simplify scheduler code doing CPU scale invariance.

Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/sched/topology.h | 12 ++++++------
kernel/sched/sched.h | 13 ++++++++++---
2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7d065abc7a47..953cf4f889ec 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,6 +6,12 @@
#include <linux/sched/idle.h>

/*
+ * Increase resolution of cpu_capacity calculations
+ */
+#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
+/*
* sched-domains (multiprocessor balancing) declarations:
*/
#ifdef CONFIG_SMP
@@ -26,12 +32,6 @@
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
#define SD_NUMA 0x4000 /* cross-node balancing */

-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
#ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec5769288927..178f4a5df2fa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1668,9 +1668,6 @@ static inline int hrtick_enabled(struct rq *rq)

#endif /* CONFIG_SCHED_HRTICK */

-#ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
#ifndef arch_scale_freq_capacity
static __always_inline
unsigned long arch_scale_freq_capacity(int cpu)
@@ -1679,6 +1676,9 @@ unsigned long arch_scale_freq_capacity(int cpu)
}
#endif

+#ifdef CONFIG_SMP
+extern void sched_avg_update(struct rq *rq);
+
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1696,6 +1696,13 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
sched_avg_update(rq);
}
#else
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
--
2.11.0

2017-07-05 09:01:03

by Juri Lelli

Subject: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

Apply frequency and cpu scale-invariance correction factors to bandwidth
enforcement (similar to what we already do for fair utilization tracking).

Each delta_exec gets scaled according to the current frequency and the
maximum cpu capacity; this means that the reservation runtime parameter
(which needs to be specified by profiling the task's execution at max
frequency on the biggest-capacity core) is scaled accordingly.

Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
---
kernel/sched/deadline.c | 26 ++++++++++++++++++++++----
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 2 ++
3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4ec82fa56b04..5bed1a2fae06 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1122,7 +1122,8 @@ static void update_curr_dl(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec;
+ u64 delta_exec, scaled_delta_exec;
+ int cpu = cpu_of(rq);

if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)
if (unlikely(dl_entity_is_special(dl_se)))
return;

- if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
- delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
- dl_se->runtime -= delta_exec;
+ /*
+ * For tasks that participate in GRUB, we implement GRUB-PA: the
+ * spare reclaimed bandwidth is used to clock down frequency.
+ *
+ * For the others, we still need to scale reservation parameters
+ * according to current frequency and CPU maximum capacity.
+ */
+ if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+ scaled_delta_exec = grub_reclaim(delta_exec,
+ rq,
+ &curr->dl);
+ } else {
+ unsigned long scale_freq = arch_scale_freq_capacity(cpu);
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+ scaled_delta_exec = cap_scale(delta_exec, scale_freq);
+ scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
+ }
+
+ dl_se->runtime -= scaled_delta_exec;

throttle:
if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97bd5aae285d..a6e8a4f5786f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2844,8 +2844,6 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
return c1 + c2 + c3;
}

-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
/*
* Accumulate the three separate parts of the sum; d1 the remainder
* of the last (incomplete) period, d2 the span of full periods and d3
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 178f4a5df2fa..688185761eaf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -155,6 +155,8 @@ static inline int task_has_dl_policy(struct task_struct *p)
return dl_policy(p->policy);
}

+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
/*
* !! For sched_setattr_nocheck() (kernel) only !!
*
--
2.11.0
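
A short standalone model (not kernel code) of the non-GRUB branch above:
delta_exec is scaled twice with cap_scale(), once by the frequency scale
factor and once by the CPU capacity scale factor, before being charged to the
reservation. The 512/512 scale factors (half frequency, half capacity) are
made-up example values.

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)
#define cap_scale(v, s)		((v) * (s) >> SCHED_CAPACITY_SHIFT)

int main(void)
{
	uint64_t delta_exec = 10000000;		/* ran for 10ms (in ns) */
	unsigned long scale_freq = 512;		/* running at half of max freq */
	unsigned long scale_cpu = 512;		/* little core, half capacity */
	uint64_t scaled;

	scaled = cap_scale(delta_exec, scale_freq);
	scaled = cap_scale(scaled, scale_cpu);

	/* 10ms of wall clock at half freq on a half-capacity CPU charges 2.5ms */
	printf("delta_exec=%lluns -> runtime charged %lluns\n",
	       (unsigned long long)delta_exec, (unsigned long long)scaled);
	return 0;
}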

2017-07-05 09:01:50

by Juri Lelli

Subject: [RFC PATCH v1 4/8] sched/cpufreq_schedutil: split utilization signals

To be able to treat the utilization signals of different scheduling classes
in different ways (e.g., the CFS signal might be stale while the DEADLINE
signal is never stale by design), we need to split the sugov_cpu::util
signal in two: util_cfs and util_dl.

This patch does that, also changing sugov_get_util()'s parameter list.
After this change, aggregation of the different signals has to be performed
by sugov_get_util() users (so that they can decide what to do with the
different signals).

Suggested-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Claudio Scordino <[email protected]>
---
Changes from RFCv0:

- refactor aggregation of utilization in sugov_aggregate_util()
---
kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index ba6227625f24..e835fa886225 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -58,7 +58,8 @@ struct sugov_cpu {
u64 last_update;

/* The fields below are only needed when sharing a policy. */
- unsigned long util;
+ unsigned long util_cfs;
+ unsigned long util_dl;
unsigned long max;
unsigned int flags;

@@ -154,20 +155,24 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}

-static void sugov_get_util(unsigned long *util, unsigned long *max)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = this_rq();
- unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
- >> BW_SHIFT;

- *max = arch_scale_cpu_capacity(NULL, smp_processor_id());
+ sg_cpu->max = arch_scale_cpu_capacity(NULL, smp_processor_id());
+ sg_cpu->util_cfs = rq->cfs.avg.util_avg;
+ sg_cpu->util_dl = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
+ >> BW_SHIFT;
+}

+static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
+{
/*
* Ideally we would like to set util_dl as min/guaranteed freq and
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- *util = min(rq->cfs.avg.util_avg + dl_util, *max);
+ return min(sg_cpu->util_cfs + sg_cpu->util_dl, sg_cpu->max);
}

static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
@@ -234,7 +239,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
if (flags & SCHED_CPUFREQ_RT) {
next_f = policy->cpuinfo.max_freq;
} else {
- sugov_get_util(&util, &max);
+ sugov_get_util(sg_cpu);
+ max = sg_cpu->max;
+ util = sugov_aggregate_util(sg_cpu);
sugov_iowait_boost(sg_cpu, &util, &max);
next_f = get_next_freq(sg_policy, util, max);
/*
@@ -274,8 +281,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
return policy->cpuinfo.max_freq;

- j_util = j_sg_cpu->util;
j_max = j_sg_cpu->max;
+ j_util = sugov_aggregate_util(j_sg_cpu);
if (j_util * max > j_max * util) {
util = j_util;
max = j_max;
@@ -292,15 +299,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
- unsigned long util, max;
unsigned int next_f;

- sugov_get_util(&util, &max);

raw_spin_lock(&sg_policy->update_lock);

- sg_cpu->util = util;
- sg_cpu->max = max;
+ sugov_get_util(sg_cpu);
sg_cpu->flags = flags;

sugov_set_iowait_boost(sg_cpu, time, flags);
--
2.11.0
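
For reference, a tiny standalone sketch (not kernel code) of the two-step
pattern introduced here: sugov_get_util() only snapshots the per-class
signals into struct sugov_cpu, and sugov_aggregate_util() combines them
(clamped to capacity) whenever a frequency decision is actually made. The
helper signatures and the numbers are simplified for illustration.

#include <stdio.h>

struct sugov_cpu {
	unsigned long util_cfs;
	unsigned long util_dl;
	unsigned long max;
};

/* step 1: record the raw signals (done on every update, per CPU) */
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long cfs,
			   unsigned long dl, unsigned long max)
{
	sg_cpu->util_cfs = cfs;
	sg_cpu->util_dl = dl;
	sg_cpu->max = max;
}

/* step 2: combine them only when picking a frequency */
static unsigned long sugov_aggregate_util(const struct sugov_cpu *sg_cpu)
{
	unsigned long util = sg_cpu->util_cfs + sg_cpu->util_dl;

	return util < sg_cpu->max ? util : sg_cpu->max;
}

int main(void)
{
	struct sugov_cpu sg_cpu;

	sugov_get_util(&sg_cpu, 300, 102, 1024);
	printf("aggregated util = %lu\n", sugov_aggregate_util(&sg_cpu));
	return 0;
}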

2017-07-06 15:57:22

by Steven Rostedt

Subject: Re: [RFC PATCH v1 0/8] SCHED_DEADLINE freq/cpu invariance and OPP selection

On Wed, 5 Jul 2017 09:58:57 +0100
Juri Lelli <[email protected]> wrote:

> Hi,
>
> v1 of the RFC set implementing frequency/cpu invariance and OPP selection for

It would be nice if you specify what "OPP" stands for. A quick google
search shows "Other Peoples Privates", which isn't the type of
selection I would be looking for.

-- Steve


> SCHED_DEADLINE [1]. The set is based on tip/sched/core as of today
> (72298e5c92c5), which now already includes Luca's "CPU reclaiming for
> SCHED_DEADLINE".
>

2017-07-06 16:08:30

by Juri Lelli

Subject: Re: [RFC PATCH v1 0/8] SCHED_DEADLINE freq/cpu invariance and OPP selection

On 06/07/17 11:57, Steven Rostedt wrote:
> On Wed, 5 Jul 2017 09:58:57 +0100
> Juri Lelli <[email protected]> wrote:
>
> > Hi,
> >
> > v1 of the RFC set implementing frequency/cpu invariance and OPP selection for
>
> It would be nice if you specify what "OPP" stands for. A quick google
> search shows "Other Peoples Privates", which isn't the type of
> selection I would be looking for.
>

Not something we could influence from the scheduler anyway. :)

Just to clarify, OPP in this context stands for "Operating Performance
Point": a (frequency, voltage) tuple [1], used here as a synonym for
clock frequency.

I'll add it in the next version.

Thanks,

- Juri

[1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CHDGICAJ.html

2017-07-06 16:15:27

by Peter Zijlstra

Subject: Re: [RFC PATCH v1 0/8] SCHED_DEADLINE freq/cpu invariance and OPP selection

On Thu, Jul 06, 2017 at 11:57:15AM -0400, Steven Rostedt wrote:
> On Wed, 5 Jul 2017 09:58:57 +0100
> Juri Lelli <[email protected]> wrote:
>
> > Hi,
> >
> > v1 of the RFC set implementing frequency/cpu invariance and OPP selection for
>
> It would be nice if you specify what "OPP" stands for. A quick google
> search shows "Other Peoples Privates", which isn't the type of
> selection I would be looking for.

It's ARM speak for P-state. It stands for OPerating Point or something
like that. But yes, I too always play the Naughty by Nature song in my
head when I read that.

2017-07-06 21:15:50

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v1 0/8] SCHED_DEADLINE freq/cpu invariance and OPP selection

On Wednesday, July 05, 2017 09:58:57 AM Juri Lelli wrote:
> Hi,
>
> v1 of the RFC set implementing frequency/cpu invariance and OPP selection for
> SCHED_DEADLINE [1]. The set is based on tip/sched/core as of today
> (72298e5c92c5), which now already includes Luca's "CPU reclaiming for
> SCHED_DEADLINE".
>
> Thanks a lot for reviewing RFCv0!
>
> Patches high level description:
>
> o [01-02]/08 add the necessary links to start accounting DEADLINE contribution
> to OPP selection
> o 03/08 it's a temporary solution to make possible (on ARM) to change
> frequency for DEADLINE tasks (that would possibly delay the SCHED_FIFO
> worker kthread); proper solution would be to be able to issue frequency
> transition from an atomic ctx
> o [04-05]/08 it's a schedutil change that copes with the fact that DEADLINE
> doesn't require periodic OPP selection triggering point
> o [06-07]/08 make arch_scale_{freq,cpu}_capacity() function available on !CONFIG_SMP
> configurations too
> o 08/08 implements frequency/cpu invariance for tasks' reservation
> parameters; which basically means that we implement GRUB-PA [2]
>
> Changes w.r.t. RFCv0:
>
> - rebase on tip/sched/core
> - make use of BW_SHIFT for calculations (Peter)
> - added a comment about guaranteed/requested frequency (Peter)
> - use a high bit for sched_flags SCHED_FLAG_SPECIAL hack; don't expose it to
> userspace and add comments (Peter)
> - refactor aggregation of utilization from scheduling classes
>
> Please have a look. Feedback and comments are, as usual, more than welcome.
>
> In case you would like to test this out:
>
> git://linux-arm.org/linux-jl.git upstream/deadline/freq-rfc-v1
>
> Best,
>
> - Juri
>
>
> Juri Lelli (8):
> sched/cpufreq_schedutil: make use of DEADLINE utilization signal
> sched/deadline: move cpu frequency selection triggering points
> sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE
> sched/cpufreq_schedutil: split utilization signals
> sched/cpufreq_schedutil: always consider all CPUs when deciding next
> freq
> sched/sched.h: remove sd arch_scale_freq_capacity parameter
> sched/sched.h: move arch_scale_{freq,cpu}_capacity outside CONFIG_SMP
> sched/deadline: make bandwidth enforcement scale-invariant
>
> include/linux/sched.h | 1 +
> include/linux/sched/cpufreq.h | 2 --
> include/linux/sched/topology.h | 12 +++----
> kernel/sched/core.c | 15 ++++++--
> kernel/sched/cpufreq_schedutil.c | 77 ++++++++++++++++++++++++++--------------
> kernel/sched/deadline.c | 46 ++++++++++++++++++++----
> kernel/sched/fair.c | 4 +--
> kernel/sched/sched.h | 51 +++++++++++++++++++-------
> 8 files changed, 149 insertions(+), 59 deletions(-)

The schedutil changes in this series look OK to me, so please feel free to add

Acked-by: Rafael J. Wysocki <[email protected]>

to these patches.

Thanks,
Rafael

2017-07-07 03:26:03

by Joel Fernandes

Subject: Re: [RFC PATCH v1 4/8] sched/cpufreq_schedutil: split utilization signals

Hi Juri,

On Wed, Jul 5, 2017 at 1:59 AM, Juri Lelli <[email protected]> wrote:
> To be able to treat utilization signals of different scheduling classes
> in different ways (e.g., CFS signal might be stale while DEADLINE signal
> is never stale by design) we need to split sugov_cpu::util signal in two:
> util_cfs and util_dl.
>
> This patch does that by also changing sugov_get_util() parameter list.
> After this change, aggregation of the different signals has to be performed
> by sugov_get_util() users (so that they can decide what to do with the
> different signals).
>
> Suggested-by: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> Cc: Claudio Scordino <[email protected]>
> ---
> Changes from RFCv0:
>
> - refactor aggregation of utilization in sugov_aggregate_util()
> ---
> kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index ba6227625f24..e835fa886225 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -58,7 +58,8 @@ struct sugov_cpu {
> u64 last_update;
>
> /* The fields below are only needed when sharing a policy. */
> - unsigned long util;
> + unsigned long util_cfs;
> + unsigned long util_dl;
> unsigned long max;
> unsigned int flags;
>
> @@ -154,20 +155,24 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> return cpufreq_driver_resolve_freq(policy, freq);
> }
>
> -static void sugov_get_util(unsigned long *util, unsigned long *max)
> +static void sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = this_rq();
> - unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> - >> BW_SHIFT;
>
> - *max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> + sg_cpu->max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> + sg_cpu->util_cfs = rq->cfs.avg.util_avg;
> + sg_cpu->util_dl = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> + >> BW_SHIFT;
> +}
>
> +static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> +{
> /*
> * Ideally we would like to set util_dl as min/guaranteed freq and
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> - *util = min(rq->cfs.avg.util_avg + dl_util, *max);
> + return min(sg_cpu->util_cfs + sg_cpu->util_dl, sg_cpu->max);
> }

I am wondering why the need for a separate aggregation API. To me, it
looks like using sugov_get_util to set the sg_cpu util elements and
then doing the aggregation at the same time would have the same effect
(without changing the existing parameter list). Is this to handle a
future use case where aggregation may need to be done differently? For
all the users of sugov_get_util, aggregation is done in the same way.
Anyway, if I missed something, sorry for the noise.

thanks,

-Joel

2017-07-07 03:56:35

by Joel Fernandes

Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

Hi Juri,

On Wed, Jul 5, 2017 at 1:59 AM, Juri Lelli <[email protected]> wrote:
> Worker kthread needs to be able to change frequency for all other
> threads.
>
> Make it special, just under STOP class.
>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> Cc: Claudio Scordino <[email protected]>
> ---
> Changes from RFCv0:
>
> - use a high bit of sched_flags and don't expose the new flag to
> userspace (also add derisory comments) (Peter)
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 15 +++++++++++++--
> kernel/sched/cpufreq_schedutil.c | 15 ++++++++++++---
> kernel/sched/deadline.c | 13 +++++++++++++
> kernel/sched/sched.h | 20 +++++++++++++++++++-
> 5 files changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1f0f427e0292..0ba767fdedae 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1359,6 +1359,7 @@ extern int idle_cpu(int cpu);
> extern int sched_setscheduler(struct task_struct *, int, const struct sched_param *);
> extern int sched_setscheduler_nocheck(struct task_struct *, int, const struct sched_param *);
> extern int sched_setattr(struct task_struct *, const struct sched_attr *);
> +extern int sched_setattr_nocheck(struct task_struct *, const struct sched_attr *);
> extern struct task_struct *idle_task(int cpu);
>
> /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5186797908dc..0e40c7ff2bf4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3998,7 +3998,9 @@ static int __sched_setscheduler(struct task_struct *p,
> }
>
> if (attr->sched_flags &
> - ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
> + ~(SCHED_FLAG_RESET_ON_FORK |
> + SCHED_FLAG_RECLAIM |
> + SCHED_FLAG_SPECIAL))
> return -EINVAL;

I was wondering if it's better to name SCHED_FLAG_SPECIAL something more
descriptive than "special", since it will not be clear, looking at
other parts of the code, that this is used for the frequency scaling
thread or that it's a DL task. Since this flag is specifically set up
for the sugov thread's frequency scaling purpose, a better flag name
than SPECIAL might be SCHED_FLAG_DL_FREQ_SCALE or
SCHED_FLAG_FREQ_SCALE. But I am OK with whatever makes sense to do
here. Thanks,

Regards,

-Joel

2017-07-07 07:20:09

by Viresh Kumar

Subject: Re: [RFC PATCH v1 1/8] sched/cpufreq_schedutil: make use of DEADLINE utilization signal

On 05-07-17, 09:58, Juri Lelli wrote:
> SCHED_DEADLINE tracks active utilization signal with a per dl_rq
> variable named running_bw.
>
> Make use of that to drive cpu frequency selection: add up FAIR and
> DEADLINE contribution to get the required CPU capacity to handle both
> requirements (while RT still selects max frequency).
>
> Co-authored-by: Claudio Scordino <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> ---
> Changes from RFCv0:
>
> - use BW_SHIFT (Peter)
> - add comment about guaranteed and requested freq (Peter)
> - modify comment about go to max behaviour (Claudio)
> ---
> include/linux/sched/cpufreq.h | 2 --
> kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++----------
> 2 files changed, 15 insertions(+), 12 deletions(-)

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2017-07-07 07:21:10

by Viresh Kumar

Subject: Re: [RFC PATCH v1 2/8] sched/deadline: move cpu frequency selection triggering points

On 05-07-17, 09:58, Juri Lelli wrote:
> Since SCHED_DEADLINE doesn't track utilization signal (but reserves a
> fraction of CPU bandwidth to tasks admitted to the system), there is no
> point in evaluating frequency changes during each tick event.
>
> Move frequency selection triggering points to where running_bw changes.
>
> Co-authored-by: Claudio Scordino <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> ---
> Changes from RFCv0:
>
> - modify comment regarding periodic RT updates (Claudio)
> ---
> kernel/sched/deadline.c | 7 ++++---
> kernel/sched/sched.h | 12 ++++++------
> 2 files changed, 10 insertions(+), 9 deletions(-)

Reviewed-by: Viresh Kumar <[email protected]>

--
viresh

2017-07-07 07:21:52

by Viresh Kumar

Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On 05-07-17, 09:59, Juri Lelli wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index f2494d1fc8ef..ba6227625f24 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -424,7 +424,16 @@ static void sugov_policy_free(struct sugov_policy *sg_policy)
> static int sugov_kthread_create(struct sugov_policy *sg_policy)
> {
> struct task_struct *thread;
> - struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
> + struct sched_attr attr = {
> + .size = sizeof(struct sched_attr),
> + .sched_policy = SCHED_DEADLINE,
> + .sched_flags = SCHED_FLAG_SPECIAL,
> + .sched_nice = 0,
> + .sched_priority = 0,
> + .sched_runtime = 0,
> + .sched_deadline = 0,
> + .sched_period = 0,
> + };
> struct cpufreq_policy *policy = sg_policy->policy;
> int ret;
>
> @@ -442,10 +451,10 @@ static int sugov_kthread_create(struct sugov_policy *sg_policy)
> return PTR_ERR(thread);
> }
>
> - ret = sched_setscheduler_nocheck(thread, SCHED_FIFO, &param);
> + ret = sched_setattr_nocheck(thread, &attr);
> if (ret) {
> kthread_stop(thread);
> - pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + pr_warn("%s: failed to set SCHED_DEADLINE\n", __func__);
> return ret;
> }

Acked-by: Viresh Kumar <[email protected]> (schedutil)

--
viresh

2017-07-07 08:58:54

by Viresh Kumar

Subject: Re: [RFC PATCH v1 4/8] sched/cpufreq_schedutil: split utilization signals

On 05-07-17, 09:59, Juri Lelli wrote:
> To be able to treat utilization signals of different scheduling classes
> in different ways (e.g., CFS signal might be stale while DEADLINE signal
> is never stale by design) we need to split sugov_cpu::util signal in two:
> util_cfs and util_dl.
>
> This patch does that by also changing sugov_get_util() parameter list.
> After this change, aggregation of the different signals has to be performed
> by sugov_get_util() users (so that they can decide what to do with the
> different signals).
>
> Suggested-by: Rafael J. Wysocki <[email protected]>

Are you referring to this response here ?

https://marc.info/?l=linux-kernel&m=149095102600847&w=2

If yes, then I don't think it was about having separate APIs, but just storing
util_cfs/dl separately.

> -static void sugov_get_util(unsigned long *util, unsigned long *max)
> +static void sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = this_rq();
> - unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> - >> BW_SHIFT;
>
> - *max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> + sg_cpu->max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> + sg_cpu->util_cfs = rq->cfs.avg.util_avg;
> + sg_cpu->util_dl = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> + >> BW_SHIFT;
> +}
>
> +static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)

As Joel already mentioned, I don't think we should create two separate routines
here.

--
viresh

2017-07-07 08:59:16

by Viresh Kumar

Subject: Re: [RFC PATCH v1 5/8] sched/cpufreq_schedutil: always consider all CPUs when deciding next freq

On 05-07-17, 09:59, Juri Lelli wrote:
> No assumption can be made upon the rate at which frequency updates get
> triggered, as there are scheduling policies (like SCHED_DEADLINE) which
> don't trigger them so frequently.
>
> Remove such assumption from the code, by always considering
> SCHED_DEADLINE utilization signal as not stale.
>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> Cc: Claudio Scordino <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 17 +++++++++++------
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index e835fa886225..066b876d81e7 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -267,17 +267,22 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> s64 delta_ns;
>
> /*
> - * If the CPU utilization was last updated before the previous
> - * frequency update and the time elapsed between the last update
> - * of the CPU utilization and the last frequency update is long
> - * enough, don't take the CPU into account as it probably is
> - * idle now (and clear iowait_boost for it).
> + * If the CFS CPU utilization was last updated before the
> + * previous frequency update and the time elapsed between the
> + * last update of the CPU utilization and the last frequency
> + * update is long enough, reset iowait_boost and util_cfs, as
> + * they are now probably stale. However, still consider the
> + * CPU contribution if it has some DEADLINE utilization
> + * (util_dl).
> */
> delta_ns = time - j_sg_cpu->last_update;
> if (delta_ns > TICK_NSEC) {
> j_sg_cpu->iowait_boost = 0;
> - continue;
> + j_sg_cpu->util_cfs = 0;
> + if (j_sg_cpu->util_dl == 0)
> + continue;
> }
> +
> if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
> return policy->cpuinfo.max_freq;
>

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2017-07-07 10:43:50

by Juri Lelli

Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

Hi,

On 06/07/17 20:56, Joel Fernandes wrote:
> Hi Juri,
>
> On Wed, Jul 5, 2017 at 1:59 AM, Juri Lelli <[email protected]> wrote:

[...]

> > if (attr->sched_flags &
> > - ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
> > + ~(SCHED_FLAG_RESET_ON_FORK |
> > + SCHED_FLAG_RECLAIM |
> > + SCHED_FLAG_SPECIAL))
> > return -EINVAL;
>
> I was thinking if its better to name SCHED_FLAG_SPECIAL something more
> descriptive than "special", since it will not be clear looking at
> other parts of the code that this is used for the frequency scaling
> thread or that its a DL task. I was thinking since this flag is
> specifically setup for the sugov thread frequency scaling purpose, a
> better flag name than SPECIAL might be SCHED_FLAG_DL_FREQ_SCALE or
> SCHED_FLAG_FREQ_SCALE. But I am Ok with whatever makes sense to do
> here. Thanks,
>

Sure.

How about SCHED_FLAG_SUGOV then?

Thanks,

- Juri

2017-07-07 10:46:15

by Thomas Gleixner

Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On Fri, 7 Jul 2017, Juri Lelli wrote:
> How about SCHED_FLAG_SUGOV then?

I know sugo della carne, but what's sugo V?

Thanks,

tglx

2017-07-07 10:53:24

by Juri Lelli

Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On 07/07/17 12:46, Thomas Gleixner wrote:
> On Fri, 7 Jul 2017, Juri Lelli wrote:
> > How about SCHED_FLAG_SUGOV then?
>
> I know sugo della carne, but what's sugo V?
>

Right.. can't really help not thinking about the same (especially close
to lunch) :)

But the abbreviation stands for SchedUtil GOVernor (AFAIK). We already
have a SUGOV_KTHREAD_PRIORITY (that I just noticed I need to remove BTW)
in sched/cpufreq_schedutil.c.

Thanks,

- Juri

2017-07-07 10:59:40

by Juri Lelli

Subject: Re: [RFC PATCH v1 4/8] sched/cpufreq_schedutil: split utilization signals

Hi,

On 07/07/17 14:28, Viresh Kumar wrote:
> On 05-07-17, 09:59, Juri Lelli wrote:
> > To be able to treat utilization signals of different scheduling classes
> > in different ways (e.g., CFS signal might be stale while DEADLINE signal
> > is never stale by design) we need to split sugov_cpu::util signal in two:
> > util_cfs and util_dl.
> >
> > This patch does that by also changing sugov_get_util() parameter list.
> > After this change, aggregation of the different signals has to be performed
> > by sugov_get_util() users (so that they can decide what to do with the
> > different signals).
> >
> > Suggested-by: Rafael J. Wysocki <[email protected]>
>
> Are you referring to this response here ?
>
> https://marc.info/?l=linux-kernel&m=149095102600847&w=2
>

Yep.

> If yes, then I don't think it was about having separate APIs, but just storing
> util_cfs/dl separately.
>
> > -static void sugov_get_util(unsigned long *util, unsigned long *max)
> > +static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > {
> > struct rq *rq = this_rq();
> > - unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> > - >> BW_SHIFT;
> >
> > - *max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> > + sg_cpu->max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> > + sg_cpu->util_cfs = rq->cfs.avg.util_avg;
> > + sg_cpu->util_dl = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
> > + >> BW_SHIFT;
> > +}
> >
> > +static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>
> As Joel already mentioned, I don't think we should create two separate routines
> here.
>

Mmm, it makes retrieving of utilization in sugov_update_shared and
aggregating values for the domain in sugov_next_freq_shared cleaner,
IMHO.

Thanks,

- Juri

2017-07-07 13:19:24

by Rafael J. Wysocki

Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On Friday, July 07, 2017 11:53:16 AM Juri Lelli wrote:
> On 07/07/17 12:46, Thomas Gleixner wrote:
> > On Fri, 7 Jul 2017, Juri Lelli wrote:
> > > How about SCHED_FLAG_SUGOV then?
> >
> > I know sugo della carne, but what's sugo V?
> >
>
> Right.. can't really help not thinking about the same (especially close
> to lunch) :)
>
> But the abbreviation stands for SchedUtil GOVernor (AFAIK). We already
> have a SUGOV_KTHREAD_PRIORITY (that I just noticed I need to remove BTW)
> in sched/cpufreq_schedutil.c.

SCHED_FLAG_CPUFREQ_WORKER comes to mind, but it's somewhat long.

Thanks,
Rafael

2017-07-07 21:58:11

by Steven Rostedt

Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On Fri, 07 Jul 2017 15:11:45 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> On Friday, July 07, 2017 11:53:16 AM Juri Lelli wrote:
> > On 07/07/17 12:46, Thomas Gleixner wrote:
> > > On Fri, 7 Jul 2017, Juri Lelli wrote:
> > > > How about SCHED_FLAG_SUGOV then?
> > >
> > > I know sugo della carne, but what's sugo V?
> > >
> >
> > Right.. can't really help not thinking about the same (especially close
> > to lunch) :)
> >
> > But the abbreviation stands for SchedUtil GOVernor (AFAIK). We already
> > have a SUGOV_KTHREAD_PRIORITY (that I just noticed I need to remove BTW)
> > in sched/cpufreq_schedutil.c.
>
> SCHED_FLAG_CPUFREQ_WORKER comes to mind, but it's somewhat long.
>

It is rather long. Although I actually hate the SUGOV, it is easily
grepable. Just comment what it stands for. We can always change it
later.

-- Steve

2017-07-07 22:05:04

by Steven Rostedt

Subject: Re: [RFC PATCH v1 7/8] sched/sched.h: move arch_scale_{freq,cpu}_capacity outside CONFIG_SMP

On Wed, 5 Jul 2017 09:59:04 +0100
Juri Lelli <[email protected]> wrote:

> Currently, frequency and cpu capacity scaling is only performed on
> CONFIG_SMP systems (as CFS PELT signals are only present for such
> systems). However, other scheduling classes want to do freq/cpu scaling,
> and for !CONFIG_SMP configurations as well.
>
> arch_scale_freq_capacity is useful to implement frequency scaling even
> on !CONFIG_SMP platforms, so we simply move it outside CONFIG_SMP
> ifdeffery.
>
> Even if arch_scale_cpu_capacity is not useful on !CONFIG_SMP platforms,
> we make a default implementation available for such configurations anyway
> to simplify scheduler code doing CPU scale invariance.
>

Reviewed-by: Steven Rostedt (VMware) <[email protected]>

-- Steve

> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> include/linux/sched/topology.h | 12 ++++++------
> kernel/sched/sched.h | 13 ++++++++++---
> 2 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 7d065abc7a47..953cf4f889ec 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -6,6 +6,12 @@
> #include <linux/sched/idle.h>
>
> /*
> + * Increase resolution of cpu_capacity calculations
> + */
> +#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> +#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> +
> +/*
> * sched-domains (multiprocessor balancing) declarations:
> */
> #ifdef CONFIG_SMP
> @@ -26,12 +32,6 @@
> #define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
> #define SD_NUMA 0x4000 /* cross-node balancing */
>
> -/*
> - * Increase resolution of cpu_capacity calculations
> - */
> -#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> -#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> -
> #ifdef CONFIG_SCHED_SMT
> static inline int cpu_smt_flags(void)
> {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec5769288927..178f4a5df2fa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1668,9 +1668,6 @@ static inline int hrtick_enabled(struct rq *rq)
>
> #endif /* CONFIG_SCHED_HRTICK */
>
> -#ifdef CONFIG_SMP
> -extern void sched_avg_update(struct rq *rq);
> -
> #ifndef arch_scale_freq_capacity
> static __always_inline
> unsigned long arch_scale_freq_capacity(int cpu)
> @@ -1679,6 +1676,9 @@ unsigned long arch_scale_freq_capacity(int cpu)
> }
> #endif
>
> +#ifdef CONFIG_SMP
> +extern void sched_avg_update(struct rq *rq);
> +
> #ifndef arch_scale_cpu_capacity
> static __always_inline
> unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> @@ -1696,6 +1696,13 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> sched_avg_update(rq);
> }
> #else
> +#ifndef arch_scale_cpu_capacity
> +static __always_inline
> +unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
> +{
> + return SCHED_CAPACITY_SCALE;
> +}
> +#endif
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
> static inline void sched_avg_update(struct rq *rq) { }
> #endif

2017-07-07 22:07:13

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

Hi,

On Fri, Jul 7, 2017 at 2:58 PM, Steven Rostedt <[email protected]> wrote:
> On Fri, 07 Jul 2017 15:11:45 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
>> On Friday, July 07, 2017 11:53:16 AM Juri Lelli wrote:
>> > On 07/07/17 12:46, Thomas Gleixner wrote:
>> > > On Fri, 7 Jul 2017, Juri Lelli wrote:
>> > > > How about SCHED_FLAG_SUGOV then?
>> > >
>> > > I know sugo della carne, but what's sugo V?
>> > >
>> >
>> > Right.. can't really help not thinking about the same (especially close
>> > to lunch) :)
>> >
>> > But the abbreviation stands for SchedUtil GOVernor (AFAIK). We already
>> > have a SUGOV_KTHREAD_PRIORITY (that I just noticed I need to remove BTW)
>> > in sched/cpufreq_schedutil.c.
>>
>> SCHED_FLAG_CPUFREQ_WORKER comes to mind, but it's somewhat long.
>>
>
> It is rather long. Although I actually hate the SUGOV, it is easily
> grepable. Just comment what it stands for. We can always change it
> later.

I was thinking why not just SCHED_FLAG_CPUFREQ. That says it's for
cpufreq purposes, and is a bit self-documenting. "WORKER" is a bit
redundant and can be dropped in my opinion.

thanks,

-Joel

2017-07-07 22:15:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On Fri, 7 Jul 2017 15:07:09 -0700
Joel Fernandes <[email protected]> wrote:


> > It is rather long. Although I actually hate the SUGOV, it is easily
> > grepable. Just comment what it stands for. We can always change it
> > later.
>
> I was thinking why not just SCHED_FLAG_CPUFREQ. That says it's for
> cpufreq purposes, and is a bit self-documenting. "WORKER" is a bit
> redundant and can be dropped in my opinion.

I was thinking that too, but was wondering: how tightly coupled is this
with SCHED_DEADLINE? I like the searchability of SUGOV, whereas
CPUFREQ is still quite broad.

-- Steve

2017-07-07 22:57:32

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On Fri, Jul 7, 2017 at 3:15 PM, Steven Rostedt <[email protected]> wrote:
> On Fri, 7 Jul 2017 15:07:09 -0700
> Joel Fernandes <[email protected]> wrote:
>
>
>> > It is rather long. Although I actually hate the SUGOV, it is easily
>> > grepable. Just comment what it stands for. We can always change it
>> > later.
>>
>> I was thinking why not just SCHED_FLAG_CPUFREQ. That says it's for
>> cpufreq purposes, and is a bit self-documenting. "WORKER" is a bit
>> redundant and can be dropped in my opinion.
>
> I was thinking that too, but was wondering: how tightly coupled is this
> with SCHED_DEADLINE? I like the searchability of SUGOV, whereas
> CPUFREQ is still quite broad.

Yes this is tightly coupled with DL so in that case probably a more
specific name is better as you mentioned.

thanks,

-Joel

2017-07-10 07:05:41

by Viresh Kumar

[permalink] [raw]
Subject: Re: [RFC PATCH v1 4/8] sched/cpufreq_schedutil: split utilization signals

On 05-07-17, 09:59, Juri Lelli wrote:
> To be able to treat utilization signals of different scheduling classes
> in different ways (e.g., CFS signal might be stale while DEADLINE signal
> is never stale by design) we need to split sugov_cpu::util signal in two:
> util_cfs and util_dl.
>
> This patch does that by also changing sugov_get_util() parameter list.
> After this change, aggregation of the different signals has to be performed
> by sugov_get_util() users (so that they can decide what to do with the
> different signals).
>
> Suggested-by: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Luca Abeni <[email protected]>
> Cc: Claudio Scordino <[email protected]>
> ---
> Changes from RFCv0:
>
> - refactor aggregation of utilization in sugov_aggregate_util()
> ---
> kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2017-07-10 07:46:50

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH v1 4/8] sched/cpufreq_schedutil: split utilization signals

On Fri, Jul 7, 2017 at 3:59 AM, Juri Lelli <[email protected]> wrote:
[..]
>
>> If yes, then I don't think it was about having separate APIs, but just storing
>> util_cfs/dl separately.
>>
>> > -static void sugov_get_util(unsigned long *util, unsigned long *max)
>> > +static void sugov_get_util(struct sugov_cpu *sg_cpu)
>> > {
>> > struct rq *rq = this_rq();
>> > - unsigned long dl_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
>> > - >> BW_SHIFT;
>> >
>> > - *max = arch_scale_cpu_capacity(NULL, smp_processor_id());
>> > + sg_cpu->max = arch_scale_cpu_capacity(NULL, smp_processor_id());
>> > + sg_cpu->util_cfs = rq->cfs.avg.util_avg;
>> > + sg_cpu->util_dl = (rq->dl.running_bw * SCHED_CAPACITY_SCALE)
>> > + >> BW_SHIFT;
>> > +}
>> >
>> > +static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>>
>> As Joel already mentioned, I don't think we should create two separate routines
>> here.
>>
>
> Mmm, it makes retrieving of utilization in sugov_update_shared and
> aggregating values for the domain in sugov_next_freq_shared cleaner,
> IMHO.
>

I agree, thanks.

-Joel

2017-07-11 12:37:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On Thu, Jul 06, 2017 at 08:56:32PM -0700, Joel Fernandes wrote:

> > @@ -3998,7 +3998,9 @@ static int __sched_setscheduler(struct task_struct *p,
> > }
> >
> > if (attr->sched_flags &
> > - ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
> > + ~(SCHED_FLAG_RESET_ON_FORK |
> > + SCHED_FLAG_RECLAIM |
> > + SCHED_FLAG_SPECIAL))
> > return -EINVAL;
>
> I was thinking if its better to name SCHED_FLAG_SPECIAL something more
> descriptive than "special", since it will not be clear looking at
> other parts of the code that this is used for the frequency scaling
> thread or that its a DL task. I was thinking since this flag is
> specifically setup for the sugov thread frequency scaling purpose, a
> better flag name than SPECIAL might be SCHED_FLAG_DL_FREQ_SCALE or
> SCHED_FLAG_FREQ_SCALE. But I am Ok with whatever makes sense to do
> here. Thanks,

SCHED_FLAG_IM_DOING_IT_WRONG ;-)

2017-07-11 16:17:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v1 5/8] sched/cpufreq_schedutil: always consider all CPUs when deciding next freq

On Wed, Jul 05, 2017 at 09:59:02AM +0100, Juri Lelli wrote:
> delta_ns = time - j_sg_cpu->last_update;
> if (delta_ns > TICK_NSEC) {
> j_sg_cpu->iowait_boost = 0;
> - continue;
> + j_sg_cpu->util_cfs = 0;

this is slightly confusing. Is this because we might not 'continue' with
the new code?

> + if (j_sg_cpu->util_dl == 0)
> + continue;
> }
> +
> if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
> return policy->cpuinfo.max_freq;
>
> --
> 2.11.0
>

2017-07-11 16:18:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On Wed, Jul 05, 2017 at 09:59:00AM +0100, Juri Lelli wrote:
> @@ -4065,6 +4067,9 @@ static int __sched_setscheduler(struct task_struct *p,
> }
>
> if (user) {
> + if (attr->sched_flags & SCHED_FLAG_SPECIAL)
> + return -EPERM;

Should be -EINVAL I think, as if the bit didn't exist at all (it
doesn't, from a userspace perspective).

> +
> retval = security_task_setscheduler(p);
> if (retval)
> return retval;

2017-07-11 17:02:19

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/8] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

On 11/07/17 18:18, Peter Zijlstra wrote:
> On Wed, Jul 05, 2017 at 09:59:00AM +0100, Juri Lelli wrote:
> > @@ -4065,6 +4067,9 @@ static int __sched_setscheduler(struct task_struct *p,
> > }
> >
> > if (user) {
> > + if (attr->sched_flags & SCHED_FLAG_SPECIAL)
> > + return -EPERM;
>
> Should be -EINVAL I think, as if the bit didn't exist at all (it
> doesn't, from a userspace perspective).
>

Makes sense. I'll change it.
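
I.e. (keeping the current flag name for the moment):

	if (user) {
		/*
		 * The flag is kernel internal: from userspace it has to
		 * look like any other unknown bit, hence -EINVAL.
		 */
		if (attr->sched_flags & SCHED_FLAG_SPECIAL)
			return -EINVAL;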

Thanks,

- Juri

2017-07-11 17:18:42

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH v1 5/8] sched/cpufreq_schedutil: always consider all CPUs when deciding next freq

On 11/07/17 18:17, Peter Zijlstra wrote:
> On Wed, Jul 05, 2017 at 09:59:02AM +0100, Juri Lelli wrote:
> > delta_ns = time - j_sg_cpu->last_update;
> > if (delta_ns > TICK_NSEC) {
> > j_sg_cpu->iowait_boost = 0;
> > - continue;
> > + j_sg_cpu->util_cfs = 0;
>
> this is slightly confusing. Is this because we might not 'continue' with
> the new code?
>

This is because, after TICK_NSEC, we only want to discard the CFS
contribution; we then (yes) continue (i.e., don't take j_sg_cpu's
contribution into account at all) only if the DEADLINE contribution is
zero as well.

> > + if (j_sg_cpu->util_dl == 0)
> > + continue;
> > }
> > +

With this change we might not continue if some DEADLINE utilization is
present for j_sg_cpu.
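
Putting the two together, the per-CPU part of the loop ends up reading
more or less like this (sketch of the resulting flow):

	delta_ns = time - j_sg_cpu->last_update;
	if (delta_ns > TICK_NSEC) {
		/* CFS utilization may well be stale by now: drop it. */
		j_sg_cpu->iowait_boost = 0;
		j_sg_cpu->util_cfs = 0;
		/*
		 * DEADLINE utilization is never stale; skip this CPU only
		 * if it has no DEADLINE contribution either.
		 */
		if (j_sg_cpu->util_dl == 0)
			continue;
	}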

> > if (j_sg_cpu->flags & SCHED_CPUFREQ_RT)
> > return policy->cpuinfo.max_freq;
> >
> > --
> > 2.11.0
> >

2017-07-19 07:22:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:
> @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)
> if (unlikely(dl_entity_is_special(dl_se)))
> return;
>
> - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> - delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
> - dl_se->runtime -= delta_exec;
> + /*
> + * For tasks that participate in GRUB, we implement GRUB-PA: the
> + * spare reclaimed bandwidth is used to clock down frequency.
> + *
> + * For the others, we still need to scale reservation parameters
> + * according to current frequency and CPU maximum capacity.
> + */
> + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> + scaled_delta_exec = grub_reclaim(delta_exec,
> + rq,
> + &curr->dl);
> + } else {
> + unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +
> + scaled_delta_exec = cap_scale(delta_exec, scale_freq);
> + scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
> + }
> +
> + dl_se->runtime -= scaled_delta_exec;
>

This I don't get...

2017-07-19 09:20:39

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

On 19/07/17 09:21, Peter Zijlstra wrote:
> On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:
> > @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)
> > if (unlikely(dl_entity_is_special(dl_se)))
> > return;
> >
> > - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> > - delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
> > - dl_se->runtime -= delta_exec;
> > + /*
> > + * For tasks that participate in GRUB, we implement GRUB-PA: the
> > + * spare reclaimed bandwidth is used to clock down frequency.
> > + *
> > + * For the others, we still need to scale reservation parameters
> > + * according to current frequency and CPU maximum capacity.
> > + */
> > + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> > + scaled_delta_exec = grub_reclaim(delta_exec,
> > + rq,
> > + &curr->dl);
> > + } else {
> > + unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> > + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +
> > + scaled_delta_exec = cap_scale(delta_exec, scale_freq);
> > + scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
> > + }
> > +
> > + dl_se->runtime -= scaled_delta_exec;
> >
>
> This I don't get...


Considering that we use GRUB's active utilization to drive clock
frequency selection, rationale is that GRUB tasks don't need any special
scaling, as their delta_exec is already scaled according to GRUB rules.
OTOH, normal tasks need to have their runtime (delta_exec) explicitly
scaled considering current frequency (and CPU max capacity), otherwise
they are going to receive less runtime than granted at AC, when
frequency is reduced.
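
To put numbers on it (cap_scale() is the usual fixed-point helper, i.e.
something like ((x) * (y)) >> SCHED_CAPACITY_SHIFT): on a CPU with
scale_cpu = 1024 running at half speed (scale_freq = 512), a task that
ran for delta_exec = 1ms is accounted

	scaled_delta_exec = 1ms * 512/1024 * 1024/1024 = 0.5ms

so a 10ms runtime reservation buys 20ms of wall-clock service at that
frequency, which is the same amount of work admission control granted
assuming max frequency.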

2017-07-19 11:00:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

On Wed, Jul 19, 2017 at 10:20:29AM +0100, Juri Lelli wrote:
> On 19/07/17 09:21, Peter Zijlstra wrote:
> > On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:
> > > @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)
> > > if (unlikely(dl_entity_is_special(dl_se)))
> > > return;
> > >
> > > - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> > > - delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
> > > - dl_se->runtime -= delta_exec;
> > > + /*
> > > + * For tasks that participate in GRUB, we implement GRUB-PA: the
> > > + * spare reclaimed bandwidth is used to clock down frequency.
> > > + *
> > > + * For the others, we still need to scale reservation parameters
> > > + * according to current frequency and CPU maximum capacity.
> > > + */
> > > + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> > > + scaled_delta_exec = grub_reclaim(delta_exec,
> > > + rq,
> > > + &curr->dl);
> > > + } else {
> > > + unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> > > + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > > +
> > > + scaled_delta_exec = cap_scale(delta_exec, scale_freq);
> > > + scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
> > > + }
> > > +
> > > + dl_se->runtime -= scaled_delta_exec;
> > >
> >
> > This I don't get...
>
>
> Considering that we use GRUB's active utilization to drive clock
> frequency selection, rationale is that GRUB tasks don't need any special
> scaling, as their delta_exec is already scaled according to GRUB rules.
> OTOH, normal tasks need to have their runtime (delta_exec) explicitly
> scaled considering current frequency (and CPU max capacity), otherwise
> they are going to receive less runtime than granted at AC, when
> frequency is reduced.

I don't think that quite works out. Given that the frequency selection
will never quite end up at exactly the same fraction (if the hardware
listens to your requests at all).

Also, by not scaling the GRUB stuff, don't you run the risk of
attempting to hand out more idle time than there actually is?

2017-07-19 11:16:36

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

On 19/07/17 13:00, Peter Zijlstra wrote:
> On Wed, Jul 19, 2017 at 10:20:29AM +0100, Juri Lelli wrote:
> > On 19/07/17 09:21, Peter Zijlstra wrote:
> > > On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:
> > > > @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)
> > > > if (unlikely(dl_entity_is_special(dl_se)))
> > > > return;
> > > >
> > > > - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> > > > - delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
> > > > - dl_se->runtime -= delta_exec;
> > > > + /*
> > > > + * For tasks that participate in GRUB, we implement GRUB-PA: the
> > > > + * spare reclaimed bandwidth is used to clock down frequency.
> > > > + *
> > > > + * For the others, we still need to scale reservation parameters
> > > > + * according to current frequency and CPU maximum capacity.
> > > > + */
> > > > + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> > > > + scaled_delta_exec = grub_reclaim(delta_exec,
> > > > + rq,
> > > > + &curr->dl);
> > > > + } else {
> > > > + unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> > > > + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > > > +
> > > > + scaled_delta_exec = cap_scale(delta_exec, scale_freq);
> > > > + scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
> > > > + }
> > > > +
> > > > + dl_se->runtime -= scaled_delta_exec;
> > > >
> > >
> > > This I don't get...
> >
> >
> > Considering that we use GRUB's active utilization to drive clock
> > frequency selection, rationale is that GRUB tasks don't need any special
> > scaling, as their delta_exec is already scaled according to GRUB rules.
> > OTOH, normal tasks need to have their runtime (delta_exec) explicitly
> > scaled considering current frequency (and CPU max capacity), otherwise
> > they are going to receive less runtime than granted at AC, when
> > frequency is reduced.
>
> I don't think that quite works out. Given that the frequency selection
> will never quite end up at exactly the same fraction (if the hardware
> listens to your requests at all).
>

It's an approximation yes (how big it depends on the granularity of the
available frequencies). But, for the !GRUB tasks it should be OK, as we
always select a frequency (among the available ones) bigger than the
current active utilization.
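
(For reference, in the frequency-invariant case schedutil computes the
request roughly as

	freq = 1.25 * max_freq * util / max;
	freq = cpufreq_driver_resolve_freq(policy, freq);

so there is a ~25% margin on top of the utilization before resolving to
one of the available OPPs.)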

Also, for platforms/archs that don't redefine arch_scale_* this is not
used. In case they are defined instead the assumption is that either hw
listens to requests or scaling factors can be derived in some other ways
(avgs?).

> Also, by not scaling the GRUB stuff, don't you run the risk of
> attempting to hand out more idle time than there actually is?

The way I understand it is that for GRUB tasks we always scale
considering the "correct" factor. Then frequency could be higher, but
this spare idle time will be reclaimed by other GRUB tasks.

2017-07-24 16:44:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

On Wed, Jul 19, 2017 at 12:16:24PM +0100, Juri Lelli wrote:
> On 19/07/17 13:00, Peter Zijlstra wrote:
> > On Wed, Jul 19, 2017 at 10:20:29AM +0100, Juri Lelli wrote:
> > > On 19/07/17 09:21, Peter Zijlstra wrote:
> > > > On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:
> > > > > @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)
> > > > > if (unlikely(dl_entity_is_special(dl_se)))
> > > > > return;
> > > > >
> > > > > - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> > > > > - delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
> > > > > - dl_se->runtime -= delta_exec;
> > > > > + /*
> > > > > + * For tasks that participate in GRUB, we implement GRUB-PA: the
> > > > > + * spare reclaimed bandwidth is used to clock down frequency.
> > > > > + *
> > > > > + * For the others, we still need to scale reservation parameters
> > > > > + * according to current frequency and CPU maximum capacity.
> > > > > + */
> > > > > + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> > > > > + scaled_delta_exec = grub_reclaim(delta_exec,
> > > > > + rq,
> > > > > + &curr->dl);
> > > > > + } else {
> > > > > + unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> > > > > + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > > > > +
> > > > > + scaled_delta_exec = cap_scale(delta_exec, scale_freq);
> > > > > + scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
> > > > > + }
> > > > > +
> > > > > + dl_se->runtime -= scaled_delta_exec;
> > > > >
> > > >
> > > > This I don't get...
> > >
> > >
> > > Considering that we use GRUB's active utilization to drive clock
> > > frequency selection, rationale is that GRUB tasks don't need any special
> > > scaling, as their delta_exec is already scaled according to GRUB rules.
> > > OTOH, normal tasks need to have their runtime (delta_exec) explicitly
> > > scaled considering current frequency (and CPU max capacity), otherwise
> > > they are going to receive less runtime than granted at AC, when
> > > frequency is reduced.
> >
> > I don't think that quite works out. Given that the frequency selection
> > will never quite end up at exactly the same fraction (if the hardware
> > listens to your requests at all).
> >
>
> It's an approximation yes (how big it depends on the granularity of the
> available frequencies). But, for the !GRUB tasks it should be OK, as we
> always select a frequency (among the available ones) bigger than the
> current active utilization.
>
> Also, for platforms/archs that don't redefine arch_scale_* this is not
> used. In case they are defined instead the assumption is that either hw
> listens to requests or scaling factors can be derived in some other ways
> (avgs?).
>
> > Also, by not scaling the GRUB stuff, don't you run the risk of
> > attempting to hand out more idle time than there actually is?
>
> The way I understand it is that for GRUB tasks we always scale
> considering the "correct" factor. Then frequency could be higher, but
> this spare idle time will be reclaimed by other GRUB tasks.

I'm still confused..

So GRUB does:

dq = -Uact dt

right? (yeah, I know, the code does something a little more complicated,
but it should still be more or less the same if you take out the 'extra'
bits).

Now, you do DVFS using that same Uact. If we lower the clock, we need
more time, so would we then not end up with something like:

dq = -1/Uact dt

After all; our budget assignment is such that we're able to complete
our work at max freq. Therefore, when we lower the frequency, we'll have
to increase budget pro rata, otherwise we'll not complete our work and
badness happens.

Say we have a 1 Ghz part and Uact=0.5 we'd select 500 Mhz and need
double the time to complete.

Now, if we fold these two together, you'd get:

dq = -Uact/Uact dt = -dt

Because, after all, if we lowered the clock to consume all idle time,
there's no further idle time to reclaim.

Now, of course, our DVFS assignment isn't as precise nor as
deterministic as this, so we'll get a slightly different ratio, let's
call that Udvfs.

So would then not GRUB change into something like:

dq = -Uact/Udvfs dt

Allowing it to only reclaim that idle time that exists because our DVFS
level is strictly higher than required?

This way, on our 1 GHz part, with Uact=.5 but Udvfs=.6, we'll allow it
to reclaim just the additional 100Mhz of idle time.


Or am I completely off the rails now?

2017-07-25 07:03:17

by luca abeni

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

Hi Peter,

On Mon, 24 Jul 2017 18:43:49 +0200
Peter Zijlstra <[email protected]> wrote:

> On Wed, Jul 19, 2017 at 12:16:24PM +0100, Juri Lelli wrote:
> > On 19/07/17 13:00, Peter Zijlstra wrote:
> > > On Wed, Jul 19, 2017 at 10:20:29AM +0100, Juri Lelli wrote:
> > > > On 19/07/17 09:21, Peter Zijlstra wrote:
> > > > > On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:
> > > > > > @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)
> > > > > > if (unlikely(dl_entity_is_special(dl_se)))
> > > > > > return;
> > > > > >
> > > > > > - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> > > > > > - delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
> > > > > > - dl_se->runtime -= delta_exec;
> > > > > > + /*
> > > > > > + * For tasks that participate in GRUB, we implement GRUB-PA: the
> > > > > > + * spare reclaimed bandwidth is used to clock down frequency.
> > > > > > + *
> > > > > > + * For the others, we still need to scale reservation parameters
> > > > > > + * according to current frequency and CPU maximum capacity.
> > > > > > + */
> > > > > > + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> > > > > > + scaled_delta_exec = grub_reclaim(delta_exec,
> > > > > > + rq,
> > > > > > + &curr->dl);
> > > > > > + } else {
> > > > > > + unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> > > > > > + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > > > > > +
> > > > > > + scaled_delta_exec = cap_scale(delta_exec, scale_freq);
> > > > > > + scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
> > > > > > + }
> > > > > > +
> > > > > > + dl_se->runtime -= scaled_delta_exec;
> > > > > >
> > > > >
> > > > > This I don't get...
> > > >
> > > >
> > > > Considering that we use GRUB's active utilization to drive clock
> > > > frequency selection, rationale is that GRUB tasks don't need any special
> > > > scaling, as their delta_exec is already scaled according to GRUB rules.
> > > > OTOH, normal tasks need to have their runtime (delta_exec) explicitly
> > > > scaled considering current frequency (and CPU max capacity), otherwise
> > > > they are going to receive less runtime than granted at AC, when
> > > > frequency is reduced.
> > >
> > > I don't think that quite works out. Given that the frequency selection
> > > will never quite end up at exactly the same fraction (if the hardware
> > > listens to your requests at all).
> > >
> >
> > It's an approximation yes (how big it depends on the granularity of the
> > available frequencies). But, for the !GRUB tasks it should be OK, as we
> > always select a frequency (among the available ones) bigger than the
> > current active utilization.
> >
> > Also, for platforms/archs that don't redefine arch_scale_* this is not
> > used. In case they are defined instead the assumption is that either hw
> > listens to requests or scaling factors can be derived in some other ways
> > (avgs?).
> >
> > > Also, by not scaling the GRUB stuff, don't you run the risk of
> > > attempting to hand out more idle time than there actually is?
> >
> > The way I understand it is that for GRUB tasks we always scale
> > considering the "correct" factor. Then frequency could be higher, but
> > this spare idle time will be reclaimed by other GRUB tasks.
>
> I'm still confused..
>
> So GRUB does:
>
> dq = -Uact dt
>
> right?

Right. This is what the original (single processor) GRUB did. And this
was used by the "GRUB-PA" algorithm:
https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf

(basically, GRUB-PA uses GRUB for reclaiming, and scales the CPU
frequency based on Uact)


> Now, you do DVFS using that same Uact. If we lower the clock, we need
> more time, so would we then not end up with something like:
>
> dq = -1/Uact dt

Well, in the GRUB-PA algorithm GRUB reclaiming is the mechanism used to
give more runtime to the task... Since Uact is < 1, doing
dq = - Uact * dt
means that we decrease the current runtime by a smaller amount of time.
And so we end up giving more runtime to the task: instead of giving
dl_runtime every dl_period, we give "dl_runtime / Uact" every
dl_period... And since the CPU is slower (by a ratio Uact), this is
equivalent to giving dl_runtime at the maximum CPU speed / frequency
(at least, in theory :).


> After all; our budget assignment is such that we're able to complete
> our work at max freq. Therefore, when we lower the frequency, we'll have
> to increase budget pro rata, otherwise we'll not complete our work and
> badness happens.

Right. But instead of increasing dl_runtime, GRUB-PA decreases the
amount of time accounted to the current runtime.


> Say we have a 1 Ghz part and Uact=0.5 we'd select 500 Mhz and need
> double the time to complete.
>
> Now, if we fold these two together, you'd get:
>
> dq = -Uact/Uact dt = -dt

Not sure why " / Uact"... According to the GRUB-PA algorithm, you just
do
dq = - Uact * dt = -0.5dt
and you end up giving the CPU to the task for 2 * dl_runtime every
dl_period (as expected)

> Because, after all, if we lowered the clock to consume all idle time,
> there's no further idle time to reclaim.

The idea of GRUB-PA is that you do not change dl_runtime... So, the
task is still "nominally reserved" dl_runtime every dl_period (in
this example, 1/2*dl_period every dl_period)... It is the reclaiming
mechanism that allows the task to execute for dl_runtime/Uact every
dl_period (in this example, for dl_period every dl_period, so it
reclaims 1/2dl_period every dl_period).
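
Concretely (assuming DVFS grants exactly the requested frequency):
dl_runtime = 10ms, dl_period = 20ms, so Uact = 0.5 and the CPU runs at
0.5 * f_max. The job now needs 20ms of wall-clock time, and GRUB
accounts

	dq = -Uact * dt = -0.5 * 20ms = -10ms

so the 10ms budget runs out exactly when the job finishes: the same
service as 10ms at f_max, without touching dl_runtime.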


> Now, of course, our DVFS assignment isn't as precise nor as
> deterministic as this, so we'll get a slightly different ratio, let's
> call that Udvfs.

This is where GRUB-PA starts to have issues... :)
The implicit assumption is (I think) that if the DVFS mechanism cannot
assign exactly the requested frequency then it makes a "conservative"
assignment (assigning a frequency higher than the requested one).


> So would then not GRUB change into something like:
>
> dq = -Uact/Udvfs dt
>
> Allowing it to only reclaim that idle time that exists because our DVFS
> level is strictly higher than required?

I think GRUB should still do "dq = -Uact * dt", trying to reclaim all
the idle CPU time (up to the configured limit, of course).



Luca

>
> This way, on our 1 GHz part, with Uact=.5 but Udvfs=.6, we'll allow it
> to reclaim just the additional 100Mhz of idle time.
>
>
> Or am I completely off the rails now?

2017-07-25 13:51:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

On Tue, Jul 25, 2017 at 09:03:08AM +0200, Luca Abeni wrote:

> > I'm still confused..
> >
> > So GRUB does:
> >
> > dq = -Uact dt
> >
> > right?
>
> Right. This is what the original (single processor) GRUB did. And this
> was used by the "GRUB-PA" algorithm:
> https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf
>
> (basically, GRUB-PA uses GRUB for reclaiming, and scales the CPU
> frequency based on Uact)
>
>
> > Now, you do DVFS using that same Uact. If we lower the clock, we need
> > more time, so would we then not end up with something like:
> >
> > dq = -1/Uact dt
>
> Well, in the GRUB-PA algorithm GRUB reclaiming is the mechanism used to
> give more runtime to the task... Since Uact is < 1, doing
> dq = - Uact * dt
> means that we decrease the current runtime by a smaller amount of time.
> And so we end up giving more runtime to the task: instead of giving
> dl_runtime every dl_period, we give "dl_runtime / Uact" every
> dl_period... And since the CPU is slower (by a ratio Uact), this is
> equivalent to giving dl_runtime at the maximum CPU speed / frequency
> (at least, in theory :).
>
>
> > After all; our budget assignment is such that we're able to complete
> > our work at max freq. Therefore, when we lower the frequency, we'll have
> > to increase budget pro rata, otherwise we'll not complete our work and
> > badness happens.
>
> Right. But instead of increasing dl_runtime, GRUB-PA decreases the
> amount of time accounted to the current runtime.
>
>
> > Say we have a 1 Ghz part and Uact=0.5 we'd select 500 Mhz and need
> > double the time to complete.
> >
> > Now, if we fold these two together, you'd get:
> >
> > dq = -Uact/Uact dt = -dt
>
> Not sure why " / Uact"... According to the GRUB-PA algorithm, you just
> do
> dq = - Uact * dt = -0.5dt
> and you end up giving the CPU to the task for 2 * dl_runtime every
> dl_period (as expected)

Yeah, I seem to have gone off the rails there... Bah I'm terminally
confused now. Let me try and get my brain the right way up.


2017-07-26 13:50:40

by luca abeni

[permalink] [raw]
Subject: Re: [RFC PATCH v1 8/8] sched/deadline: make bandwidth enforcement scale-invariant

On Tue, 25 Jul 2017 15:51:05 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Jul 25, 2017 at 09:03:08AM +0200, Luca Abeni wrote:
>
> > > I'm still confused..
> > >
> > > So GRUB does:
> > >
> > > dq = -Uact dt
> > >
> > > right?
> >
> > Right. This is what the original (single processor) GRUB did. And
> > this was used by the "GRUB-PA" algorithm:
> > https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf
> >
> > (basically, GRUB-PA uses GRUB for reclaiming, and scales the CPU
> > frequency based on Uact)
> >
> >
> > > Now, you do DVFS using that same Uact. If we lower the clock, we
> > > need more time, so would we then not end up with something like:
> > >
> > > dq = -1/Uact dt
> >
> > Well, in the GRUB-PA algorithm GRUB reclaiming is the mechanism
> > used to give more runtime to the task... Since Uact is < 1, doing
> > dq = - Uact * dt
> > means that we decrease the current runtime by a smaller amount of
> > time. And so we end up giving more runtime to the task: instead of
> > giving dl_runtime every dl_period, we give "dl_runtime / Uact" every
> > dl_period... And since the CPU is slower (by a ratio Uact), this is
> > equivalent to giving dl_runtime at the maximum CPU speed / frequency
> > (at least, in theory :).
> >
> >
> > > After all; our budget assignment is such that we're able to
> > > complete our work at max freq. Therefore, when we lower the
> > > frequency, we'll have to increase budget pro rata, otherwise
> > > we'll not complete our work and badness happens.
> >
> > Right. But instead of increasing dl_runtime, GRUB-PA decreases the
> > amount of time accounted to the current runtime.
> >
> >
> > > Say we have a 1 Ghz part and Uact=0.5 we'd select 500 Mhz and need
> > > double the time to complete.
> > >
> > > Now, if we fold these two together, you'd get:
> > >
> > > dq = -Uact/Uact dt = -dt
> >
> > Not sure why " / Uact"... According to the GRUB-PA algorithm, you
> > just do
> > dq = - Uact * dt = -0.5dt
> > and you end up giving the CPU to the task for 2 * dl_runtime every
> > dl_period (as expected)
>
> Yeah, I seem to have gone off the rails there... Bah I'm terminally
> confused now. Let me try and get my brain the right way up.

This stuff always confuses me too... :)

The parts that gives me more headaches is how to combine GRUB-PA with
non-reclaiming tasks, and how to cope with "real world issues" (such as
an actual DVFS frequency different from the theoretical one, GRUB
reclaiming less than 100% of the CPU time, etc...)

Anyway, Claudio is running some experiments with this patchset,
measuring power saving and missed deadlines for various sets of
periodic real-time tasks... We hope to present the results at RTLWS.


Luca