2017-03-02 15:49:26

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH 0/6] cpufreq: schedutil: fixes for flags updates

The current version of schedutil has some issues in the management of the
update flags used by systems whose frequency domains span multiple CPUs.

Each time a CPU utilisation update is issued by the scheduler, a set of flags
is configured to indicate (mainly) which scheduling class is requesting the
update. These flags are then used by the frequency selection policy to
identify the OPP to choose.

In the current implementation, a CPU's flags are overwritten each time the
scheduler calls schedutil for an update. This behaviour causes problems in
the following scenarios, where we assume CPU1 and CPU2 share the same
frequency domain:
a) an RT task which executed on CPU1 can keep the domain at a high frequency
for a long period of time, even when there are no longer RT tasks on
any CPU in that domain
b) a FAIR task co-scheduled on the same CPU as an RT task can overwrite the
flags configured by the RT task, potentially causing an unwanted
frequency drop

These misbehaviours have been verified using a set of simple rt-app based
synthetic workloads running on an ARM Juno board; the results are
available in this notebook [1].

This series proposes fixes for the aforementioned issues, as well as a
small improvement to speed up the selection of the maximum frequency
when RT tasks enter a CPU.

This series is based on top of today's tip/sched/core and is publicly
available from this repository:

git://www.linux-arm.com/linux-pb eas/schedutil/flags_fixes

Cheers Patrick

[1] https://gist.github.com/d6a21b459a18091b2b058668a550010d

Patrick Bellasi (6):
cpufreq: schedutil: reset sg_cpus's flags at IDLE enter
cpufreq: schedutil: ignore the sugov kthread for frequencies
selections
cpufreq: schedutil: ensure max frequency while running RT/DL tasks
cpufreq: schedutil: relax rate-limiting while running RT/DL tasks
cpufreq: schedutil: avoid utilisation update when not necessary
sched/rt: fast switch to maximum frequency when RT tasks are scheduled

include/linux/sched.h | 1 +
kernel/sched/cpufreq_schedutil.c | 59 ++++++++++++++++++++++++++++++++++------
kernel/sched/idle_task.c | 4 +++
kernel/sched/rt.c | 15 ++++++++--
4 files changed, 68 insertions(+), 11 deletions(-)

--
2.7.4


2017-03-02 15:49:10

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH 4/6] cpufreq: schedutil: relax rate-limiting while running RT/DL tasks

The policy in use for RT/DL tasks selects the maximum frequency whenever a
task in these classes calls cpufreq_update_this_cpu(). However, the
current implementation still enforces frequency-switch rate limiting
while these tasks are running.
This potentially works against the goal of switching to the maximum OPP
when RT tasks are running. In unfortunate cases it can even happen that
an RT task completes most of its activation at a lower OPP.

This patch deliberately overrides the rate-limiting configuration
to better serve RT/DL tasks. As long as a frequency scaling operation
is not in progress, a frequency switch is always authorized when
running in "rt_mode", i.e. when the current task on a CPU belongs to the
RT/DL class.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index b98a167..44bff37 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -66,7 +66,8 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);

/************************ Governor internals ***********************/

-static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy,
+ u64 time, bool rt_mode)
{
s64 delta_ns;

@@ -83,6 +84,10 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
return true;
}

+ /* Always update if a RT/DL task is running */
+ if (rt_mode)
+ return true;
+
delta_ns = time - sg_policy->last_freq_update_time;
return delta_ns >= sg_policy->freq_update_delay_ns;
}
@@ -215,7 +220,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

- if (!sugov_should_update_freq(sg_policy, time))
+ if (!sugov_should_update_freq(sg_policy, time, rt_mode))
return;

if (rt_mode) {
@@ -324,7 +329,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

- if (sugov_should_update_freq(sg_policy, time)) {
+ if (sugov_should_update_freq(sg_policy, time, rt_mode)) {
next_f = sg_policy->policy->cpuinfo.max_freq;
if (!rt_mode)
next_f = sugov_next_freq_shared(sg_cpu, util, max, flags);
--
2.7.4

2017-03-02 15:49:12

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

The policy in use for RT/DL tasks selects the maximum frequency whenever a
task in these classes calls cpufreq_update_this_cpu(). However, the
current implementation might cause a frequency drop while an RT/DL task
is still running, just because, for example, a FAIR task wakes up and is
enqueued on the same CPU.

This issue is due to the sg_cpu's flags being overwritten on each call
to sugov_update_*. The wakeup of a FAIR task resets the flags and can
trigger a frequency update, thus affecting the currently running RT/DL
task.

This can be fixed, in shared frequency domains, by adding (instead of
overwriting) the new flags before triggering a frequency update. This
guarantees staying at least at the frequency requested by the RT/DL
class, which is the maximum one for the time being, but could also be
lower once, for example, DL is extended to provide a precise bandwidth
requirement.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index a3fe5e4..b98a167 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -196,10 +196,21 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int flags)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
+ struct task_struct *curr = cpu_curr(smp_processor_id());
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
unsigned long util, max;
unsigned int next_f;
+ bool rt_mode;
+
+ /*
+ * While RT/DL tasks are running we do not want FAIR tasks to
+ * overwrite this CPU's flags, still we can update utilization and
+ * frequency (if required/possible) to be fair with these tasks.
+ */
+ rt_mode = task_has_dl_policy(curr) ||
+ task_has_rt_policy(curr) ||
+ (flags & SCHED_CPUFREQ_RT_DL);

sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -207,7 +218,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
if (!sugov_should_update_freq(sg_policy, time))
return;

- if (flags & SCHED_CPUFREQ_RT_DL) {
+ if (rt_mode) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -278,6 +289,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
struct task_struct *curr = cpu_curr(cpu);
unsigned long util, max;
unsigned int next_f;
+ bool rt_mode;

sugov_get_util(&util, &max);

@@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
if (curr == sg_policy->thread)
goto done;

+ /*
+ * While RT/DL tasks are running we do not want FAIR tasks to
+ * overwrite this CPU's flags, still we can update utilization and
+ * frequency (if required/possible) to be fair with these tasks.
+ */
+ rt_mode = task_has_dl_policy(curr) ||
+ task_has_rt_policy(curr) ||
+ (flags & SCHED_CPUFREQ_RT_DL);
+ if (rt_mode)
+ sg_cpu->flags |= flags;
+ else
+ sg_cpu->flags = flags;
+
sg_cpu->util = util;
sg_cpu->max = max;
- sg_cpu->flags = flags;

sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

if (sugov_should_update_freq(sg_policy, time)) {
- next_f = sugov_next_freq_shared(sg_cpu, util, max, flags);
+ next_f = sg_policy->policy->cpuinfo.max_freq;
+ if (!rt_mode)
+ next_f = sugov_next_freq_shared(sg_cpu, util, max, flags);
sugov_update_commit(sg_policy, time, next_f);
}

--
2.7.4

2017-03-02 15:49:20

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

Currently, sg_cpu's flags are set to the value defined by the last call to
cpufreq_update_util()/cpufreq_update_this_cpu(); for the RT/DL classes
this means the SCHED_CPUFREQ_{RT,DL} flags are always set.

When multiple CPUs share the same frequency domain, it may happen that a
CPU which executed an RT task right before entering IDLE keeps one of the
SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.

Thus, in sugov_next_freq_shared(), where utilisation and flags are
aggregated across all the CPUs of a frequency domain, all the CPUs of
that domain end up always running at the maximum OPP, until another event
on the idle CPU eventually clears the SCHED_CPUFREQ_{RT,DL} flag.

Such a behaviour can harm the energy efficiency of systems where RT
workloads are infrequent and the other CPUs in the same frequency domain
run small-utilisation workloads, which is quite a common scenario in
mobile embedded systems.

This patch proposes a solution aligned with the current principle of
updating the flags each time a scheduling event happens. The scheduling
of the idle_task on a CPU is considered one such meaningful event.
That's why, when the idle_task is selected for execution, we poke the
schedutil policy to reset the flags for that CPU.

Moreover, no frequency transition is triggered at that point, which is
fair in case the RT workload comes back in the future, while it allows
other CPUs in the same frequency domain to scale down the frequency
should that be needed.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 1 +
kernel/sched/cpufreq_schedutil.c | 7 +++++++
kernel/sched/idle_task.c | 4 ++++
3 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e2ed46d..739b29d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3653,6 +3653,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
#define SCHED_CPUFREQ_RT (1U << 0)
#define SCHED_CPUFREQ_DL (1U << 1)
#define SCHED_CPUFREQ_IOWAIT (1U << 2)
+#define SCHED_CPUFREQ_IDLE (1U << 3)

#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd46593..084a98b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -281,6 +281,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,

raw_spin_lock(&sg_policy->update_lock);

+ /* CPU is entering IDLE, reset flags without triggering an update */
+ if (flags & SCHED_CPUFREQ_IDLE) {
+ sg_cpu->flags = 0;
+ goto done;
+ }
+
sg_cpu->util = util;
sg_cpu->max = max;
sg_cpu->flags = flags;
@@ -293,6 +299,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sugov_update_commit(sg_policy, time, next_f);
}

+done:
raw_spin_unlock(&sg_policy->update_lock);
}

diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..a844c91 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
put_prev_task(rq, prev);
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
+
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
+
return rq->idle;
}

--
2.7.4

2017-03-02 15:49:15

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH 2/6] cpufreq: schedutil: ignore the sugov kthread for frequencies selections

In systems where multiple CPUs share the same frequency domain, a small
workload on one CPU can still be subject to frequency spikes generated by
the activation of the sugov kthread.

Since the sugov kthread is a special RT task whose only goal is to
trigger a frequency transition, it does not make sense for it to bias
schedutil's frequency selection.

This patch uses the information about the current task to silently ignore
cpufreq_update_this_cpu() calls coming from the RT scheduler while the
sugov kthread is running.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 084a98b..a3fe5e4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -274,6 +274,8 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+ unsigned int cpu = smp_processor_id();
+ struct task_struct *curr = cpu_curr(cpu);
unsigned long util, max;
unsigned int next_f;

@@ -287,6 +289,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
goto done;
}

+ /* Skip updates generated by sugov kthreads */
+ if (curr == sg_policy->thread)
+ goto done;
+
sg_cpu->util = util;
sg_cpu->max = max;
sg_cpu->flags = flags;
--
2.7.4

2017-03-02 15:48:37

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH 5/6] cpufreq: schedutil: avoid utilisation update when not necessary

Under certain conditions (e.g. the CPU entering idle, or the current task
being the sugov thread) we can skip the frequency update entirely.
Thus, let's postpone collecting the FAIR utilisation until it is really
needed.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 44bff37..c8ed645 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -296,8 +296,6 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
unsigned int next_f;
bool rt_mode;

- sugov_get_util(&util, &max);
-
raw_spin_lock(&sg_policy->update_lock);

/* CPU is entering IDLE, reset flags without triggering an update */
@@ -323,6 +321,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
else
sg_cpu->flags = flags;

+ sugov_get_util(&util, &max);
sg_cpu->util = util;
sg_cpu->max = max;

--
2.7.4

2017-03-02 15:48:32

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH 6/6] sched/rt: fast switch to maximum frequency when RT tasks are scheduled

Currently, schedutil updates for the RT class are triggered from a single
call site, in update_curr_rt(), which is used in:

- dequeue_task_rt:
but it does not make sense to set schedutil's SCHED_CPUFREQ_RT flag if
the next task is not going to be an RT one

- put_prev_task_rt:
likewise, we set the SCHED_CPUFREQ_RT flag without knowing whether the
next task requires it

- pick_next_task_rt:
likewise, schedutil's SCHED_CPUFREQ_RT flag is set because the prev task
was RT, while we don't yet know whether the next one will be

- task_tick_rt:
this is the only really useful call, which can ramp up the frequency
when an RT task started executing without having had a chance to trigger
a frequency switch (e.g. because of the schedutil rate limit)

Apart from the last call in task_tick_rt, the others are useless at best.
Thus, although hooking update_curr_rt() is a simple solution, not all of
its call sites are interesting points to trigger a frequency switch, and
some of the most interesting points are not covered by that call at all.
For example, a task just switched to RT has to wait for the next tick to
get the frequency boost.

This patch fixes these issues by placing the schedutil update calls
explicitly in the only sensible places, which are:
- when an RT task wakes up and is enqueued on a CPU
- when we actually pick an RT task for execution
- at each tick
- when a task is switched to RT

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/rt.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4101f9d..df7046c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -958,9 +958,6 @@ static void update_curr_rt(struct rq *rq)
if (unlikely((s64)delta_exec <= 0))
return;

- /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
- cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
-
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));

@@ -1326,6 +1323,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)

if (!task_current(rq, p) && tsk_nr_cpus_allowed(p) > 1)
enqueue_pushable_task(rq, p);
+
+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
}

static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -1563,6 +1563,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

p = _pick_next_task_rt(rq);

+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
+
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);

@@ -2272,6 +2275,9 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
struct sched_rt_entity *rt_se = &p->rt;

+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
+
update_curr_rt(rq);

watchdog(rq, p);
@@ -2307,6 +2313,9 @@ static void set_curr_task_rt(struct rq *rq)

p->se.exec_start = rq_clock_task(rq);

+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
+
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
}
--
2.7.4

2017-03-02 16:18:15

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 0/6] cpufreq: schedutil: fixes for flags updates

On 2 March 2017 at 16:45, Patrick Bellasi <[email protected]> wrote:
> The current version of schedutil has some issues related to the management
> of update flags used by systems with frequency domains spawning multiple CPUs.
>
> Each time a CPU utilisation update is issued by the scheduler a set of flags
> are configured to define (mainly) which class is asking for a utilisation
> update. These flags are then used by the frequency selection policy to
> identify the OPP to choose.
>
> In the current implementation, CPU flags are overridden each time the
> scheduler calls schedutil for an update. Such a behaviour produces issues
> in these scenarios, where we assume CPU1 and CPU2 share the same frequency
> domain:
> a) a RT task which executed on CPU1 can keep the domain at an high frequency
> for a long period of time, even if there are no longer RT tasks on
> CPUs in that domain

Normally this is dropped after a tick.
Nevertheless, there is an issue if freq_update_delay_ns is shorter
than a tick, because of the sugov RT thread.

> b) a FAIR task co-scheduled in the same CPU of a RT task can override the
> flags configured by the RT task and potentially this can cause an
> unwanted frequency drop
>
> These misbehaviours have been verified using a set of simple rt-app based
> synthetic workloads, running on a ARM's Juno board, and results are
> available in this Notebook [1].
>
> This series proposes a set of fixes for the aforementioned issues as well
> as a small improvement to speedup the selection of the maximum frequency
> when RT tasks enter a CPU.
>
> This series is based on top of today's tip/sched/core and is public available
> from this repository:
>
> git://http://www.linux-arm.com/linux-pb eas/schedutil/flags_fixes
>
> Cheers Patrick
>
> [1] https://gist.github.com/d6a21b459a18091b2b058668a550010d
>
> Patrick Bellasi (6):
> cpufreq: schedutil: reset sg_cpus's flags at IDLE enter
> cpufreq: schedutil: ignore the sugov kthread for frequencies
> selections
> cpufreq: schedutil: ensure max frequency while running RT/DL tasks
> cpufreq: schedutil: relax rate-limiting while running RT/DL tasks
> cpufreq: schedutil: avoid utilisation update when not necessary
> sched/rt: fast switch to maximum frequency when RT tasks are scheduled
>
> include/linux/sched.h | 1 +
> kernel/sched/cpufreq_schedutil.c | 59 ++++++++++++++++++++++++++++++++++------
> kernel/sched/idle_task.c | 4 +++
> kernel/sched/rt.c | 15 ++++++++--
> 4 files changed, 68 insertions(+), 11 deletions(-)
>
> --
> 2.7.4
>

2017-03-02 17:34:23

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 0/6] cpufreq: schedutil: fixes for flags updates

On 02-Mar 17:09, Vincent Guittot wrote:
> On 2 March 2017 at 16:45, Patrick Bellasi <[email protected]> wrote:
> > The current version of schedutil has some issues related to the management
> > of update flags used by systems with frequency domains spawning multiple CPUs.
> >
> > Each time a CPU utilisation update is issued by the scheduler a set of flags
> > are configured to define (mainly) which class is asking for a utilisation
> > update. These flags are then used by the frequency selection policy to
> > identify the OPP to choose.
> >
> > In the current implementation, CPU flags are overridden each time the
> > scheduler calls schedutil for an update. Such a behaviour produces issues
> > in these scenarios, where we assume CPU1 and CPU2 share the same frequency
> > domain:
> > a) a RT task which executed on CPU1 can keep the domain at an high frequency
> > for a long period of time, even if there are no longer RT tasks on
> > CPUs in that domain
>
> Normally this is dropped after a tick.
> Nevertheless, there is an issue if the freq_update_delay_ns is shorter
> than a tick because of sugov RT thread

Indeed, I've noticed that a small FAIR task running on CPU2 is quite
likely to run at a higher OPP just because, on the other CPU of the same
frequency domain (i.e. CPU1), the sugov RT thread sometimes runs.

In this case, the RT flag on CPU1 is never cleared when that core is
always idle except for the execution of the sugov thread.

--
#include <best/regards.h>

Patrick Bellasi

2017-03-03 06:24:39

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 2/6] cpufreq: schedutil: ignore the sugov kthread for frequencies selections

On 02-03-17, 15:45, Patrick Bellasi wrote:
> In system where multiple CPUs shares the same frequency domain a small
> workload on a CPU can still be subject frequency spikes, generated by
> the activation of the sugov's kthread.
>
> Since the sugov kthread is a special RT task, which goal is just that to
> activate a frequency transition, it does not make sense for it to bias
> the schedutil's frequency selection.
>
> This patch exploits the information related to the current task to silently
> ignore cpufreq_update_this_cpu() calls, coming from the RT scheduler, while
> the sugov kthread is running.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/cpufreq_schedutil.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 084a98b..a3fe5e4 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -274,6 +274,8 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> {
> struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> + unsigned int cpu = smp_processor_id();
> + struct task_struct *curr = cpu_curr(cpu);

Maybe merge these two as you have done in the next patch.

> unsigned long util, max;
> unsigned int next_f;
>
> @@ -287,6 +289,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> goto done;
> }
>
> + /* Skip updates generated by sugov kthreads */
> + if (curr == sg_policy->thread)
> + goto done;
> +

I always wanted to avoid such hacks when I moved to the RT thread :(

I was discussing exactly the same problem with Vincent a few days back,
and one of the ideas we had was to clear the flags when any RT task is
dequeued from a rq. AFAIU, the problem we are discussing here shouldn't
normally occur while the sugov RT thread is running, as the
work_in_progress flag makes sure we don't re-evaluate the load while the
RT thread is updating the frequency. The problem perhaps occurs because
the flag for CPU X is never cleared, so on the next callback from the
scheduler (after the frequency is updated and work_in_progress is
cleared) we move to the highest frequency.

So what about clearing the flags, just like the previous patch does, when
the RT or DL task has finished?

Sorry for the noise if it was all nonsense :)

--
viresh

2017-03-03 07:38:27

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On 02-03-17, 15:45, Patrick Bellasi wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e2ed46d..739b29d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3653,6 +3653,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
> #define SCHED_CPUFREQ_RT (1U << 0)
> #define SCHED_CPUFREQ_DL (1U << 1)
> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> +#define SCHED_CPUFREQ_IDLE (1U << 3)
>
> #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index fd46593..084a98b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -281,6 +281,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>
> raw_spin_lock(&sg_policy->update_lock);
>
> + /* CPU is entering IDLE, reset flags without triggering an update */
> + if (flags & SCHED_CPUFREQ_IDLE) {

Will "flags == SCHED_CPUFREQ_IDLE" generate better assembly ?

> + sg_cpu->flags = 0;
> + goto done;
> + }
> +
> sg_cpu->util = util;
> sg_cpu->max = max;
> sg_cpu->flags = flags;
> @@ -293,6 +299,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> sugov_update_commit(sg_policy, time, next_f);
> }
>
> +done:
> raw_spin_unlock(&sg_policy->update_lock);
> }
>
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172..a844c91 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_task(rq, prev);
> update_idle_core(rq);
> schedstat_inc(rq->sched_goidle);
> +
> + /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> +
> return rq->idle;
> }
>
> --
> 2.7.4

--
viresh

2017-03-03 08:58:53

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On 02-03-17, 15:45, Patrick Bellasi wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> if (curr == sg_policy->thread)
> goto done;
>
> + /*
> + * While RT/DL tasks are running we do not want FAIR tasks to
> + * overwrite this CPU's flags, still we can update utilization and
> + * frequency (if required/possible) to be fair with these tasks.
> + */
> + rt_mode = task_has_dl_policy(curr) ||
> + task_has_rt_policy(curr) ||
> + (flags & SCHED_CPUFREQ_RT_DL);
> + if (rt_mode)
> + sg_cpu->flags |= flags;
> + else
> + sg_cpu->flags = flags;

This looks so hacked up :)

Wouldn't it be better to let the scheduler tell us what kinds of tasks it
has in the rq of a CPU and pass a mask of flags? I think it wouldn't be
difficult (or time consuming) for the scheduler to know that, but I am
not 100% sure.

IOW, the flags field in cpufreq_update_util() will represent all tasks in the
rq, instead of just the task that is getting enqueued/dequeued..

And obviously we need to get some utilization numbers for the RT and DL tasks
going forward, switching to max isn't going to work for ever :)

--
viresh

2017-03-03 12:49:25

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On 03-Mar 14:01, Viresh Kumar wrote:
> On 02-03-17, 15:45, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > if (curr == sg_policy->thread)
> > goto done;
> >
> > + /*
> > + * While RT/DL tasks are running we do not want FAIR tasks to
> > + * overwrite this CPU's flags, still we can update utilization and
> > + * frequency (if required/possible) to be fair with these tasks.
> > + */
> > + rt_mode = task_has_dl_policy(curr) ||
> > + task_has_rt_policy(curr) ||
> > + (flags & SCHED_CPUFREQ_RT_DL);
> > + if (rt_mode)
> > + sg_cpu->flags |= flags;
> > + else
> > + sg_cpu->flags = flags;
>
> This looks so hacked up :)

It is... a bit... :)

> Wouldn't it be better to let the scheduler tell us what all kind of tasks it has
> in the rq of a CPU and pass a mask of flags?

That would definitely give a more consistent view of what's going on
on each CPU.

> I think it wouldn't be difficult (or time consuming) for the
> scheduler to know that, but I am not 100% sure.

The main issue perhaps is that cpufreq_update_{util,this_cpu} are
currently called from the scheduling classes' code and not from the core
scheduler. However, I agree that it should be possible to build up such
information and make it available to the scheduling classes' code.

I'll have a look at that.

> IOW, the flags field in cpufreq_update_util() will represent all tasks in the
> rq, instead of just the task that is getting enqueued/dequeued..
>
> And obviously we need to get some utilization numbers for the RT and DL tasks
> going forward, switching to max isn't going to work for ever :)

Regarding this last point, there are WIP patches Juri is working on to
feed DL demands to schedutil; his presentation at the last ELC partially
covers these developments:
https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR

RT tasks, instead, are currently covered by the rt_avg metric, which we
already know is not fit for most purposes.
It seems that the main goal is twofold: move people to DL whenever
possible, otherwise live with the go-to-max policy, which is the only
sensible way to satisfy the RT class's main goal, i.e. latency
reduction.

Of course, we already know that such a go-to-max policy for all RT
tasks is going to destroy energy in many different mobile scenarios.

As a possible mitigation for that, while still being compliant with
the main RT's class goal, we recently posted the SchedTune v3
proposal:
https://lkml.org/lkml/2017/2/28/355

In that proposal, the simple usage of CGroups and the new capacity_max
attribute of the (existing) CPU controller should allow defining the
"max" value which is just enough to match the latency constraints of a
mobile application without sacrificing too much energy.

> --
> viresh

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-03-03 13:19:29

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 2/6] cpufreq: schedutil: ignore the sugov kthread for frequencies selections

On 03-Mar 10:49, Viresh Kumar wrote:
> On 02-03-17, 15:45, Patrick Bellasi wrote:
> > In system where multiple CPUs shares the same frequency domain a small
> > workload on a CPU can still be subject frequency spikes, generated by
> > the activation of the sugov's kthread.
> >
> > Since the sugov kthread is a special RT task, which goal is just that to
> > activate a frequency transition, it does not make sense for it to bias
> > the schedutil's frequency selection.
> >
> > This patch exploits the information related to the current task to silently
> > ignore cpufreq_update_this_cpu() calls, coming from the RT scheduler, while
> > the sugov kthread is running.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Viresh Kumar <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > kernel/sched/cpufreq_schedutil.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 084a98b..a3fe5e4 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -274,6 +274,8 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > {
> > struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
> > struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > + unsigned int cpu = smp_processor_id();
> > + struct task_struct *curr = cpu_curr(cpu);
>
> Maybe merge these two as you have done in the next patch.

Yes, better.

> > unsigned long util, max;
> > unsigned int next_f;
> >
> > @@ -287,6 +289,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > goto done;
> > }
> >
> > + /* Skip updates generated by sugov kthreads */
> > + if (curr == sg_policy->thread)
> > + goto done;
> > +
>
> I always wanted to avoid such hacks when I moved to the RT thread :(

Indeed, it is a bit of a hack... but it's still true that this is a
"special" RT thread which must not bias OPP selection.

> I was discussing the exact same problem with Vincent few days back and
> one of the ideas we had was to clear the flags when any RT task is
> dequeued from a rq.

That's a possible solution; however, at dequeue time we don't know
what's going to happen right after. We can end up picking another RT
task. Thus we could reset the flags when it's not really required, and
this opens the door to possible races...

> AFAIU, the problem we are discussing here
> shouldn't normally occur while the sugov RT thread is running as the
> work_in_progress flag makes sure we don't reevaluate the load while
> the RT thread is updating the frequency.

True...

> The problem perhaps occurs as the flag for CPU X is never cleared,
> and so on the next callback from the scheduler (after the frequency
> is updated and work_in_progress is cleared) we move to the highest
> frequency.

... right.

> So what about clearing the flags, just like the previous patch, when
> the RT or DL task has finished?

As a general goal, I think it would be useful to feed input only when
the scheduler knows something is going to happen. That's why, in the
last patch of this series, I'm proposing to remove the updates from
update_curr_rt() and replace them with calls at the most interesting
events, like enqueue/pick-up instead of dequeue.

> Sorry for the noise if it was all nonsense :)

That's absolutely not nonsense, thanks for the feedback.

I agree with you that the solution proposed by this patch sounds a bit
like a hack, but then again one could argue that using an RT task to
change an OPP is itself a sort of hack.

The main downside I see is the condition check on each and every
update. I don't completely like it, but I'm also not completely
convinced by the "always reset" policy at RT task dequeue.


> --
> viresh

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-03-06 05:17:04

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 2/6] cpufreq: schedutil: ignore the sugov kthread for frequencies selections

On 03-03-17, 12:12, Patrick Bellasi wrote:
> On 03-Mar 10:49, Viresh Kumar wrote:
> > I always wanted to avoid such hacks when I moved to the RT thread :(
>
> Indeed, it is a bit of an hack... but still it's true that this is a
> "special" RT thread which must not bias OPP selection.

I agree. We have a problem in hand and we need to fix it somehow.

> > I was discussing the exact same problem with Vincent few days back and
> > one of the ideas we had was to clear the flags when any RT task is
> > dequeued from a rq.
>
> That's a possible solution, however at dequeue time we don't know
> what's going on right after. We can end up picking up another RT task.
> Thus we can reset the flag when it's not really required, and this
> open for possible races...

I don't have much knowledge of the scheduler right now and so I will not be able
to tell the exact places where we can clear the flag. So, dequeue was just one
of the places I could think of logically.

My idea was to make sure that we clear the RT and DL flags once we know
that we don't have any more RT/DL work to do. Dequeue, or any other
place that suits better, but we should be clearing them.

And once that is done, we wouldn't be required to have hacks like what this
patch is doing.

> > AFAIU, the problem we are discussing here
> > shouldn't normally occur while the sugov RT thread is running as the
> > work_in_progress flag makes sure we don't reevaluate the load while
> > the RT thread is updating the frequency.
>
> True...
>
> > The problem perhaps occurs as the flag for CPU X is never cleared,
> > and so on the next callback from the scheduler (after the frequency
> > is updated and work_in_progress is cleared) we move to the highest
> > frequency.
>
> ... right.
>
> > So what about clearing the flags, just like the previous patch, when
> > the RT or DL task has finished?
>
> As a general goal I think it would be useful to feed input only when
> the scheduler knows something is going to happen. That's why in the
> last patch of these series I'm proposing to remove updates from
> update_curr_rt() and replace it with calls in most interesting events,
> like enqueue/pickup instead of dequeue.

Sure, I liked that. I am just wondering if we can have such a location to clear
the RT/DL flags.

> > Sorry for the noise if it was all nonsense :)
>
> That's absolutely not nonsense, thanks for the feedback.
>
> I agree with you that the solution proposed by this patch sound a bit
> of an hack, but still we can argue that using an RT task to change an
> OPP is by itself a sort-of an hack.

Just for the sake of argument, why do you think so? :)

That work needs to get done via some sort of task, wq or RT thread,
etc. We chose RT to make sure we do it in time and never delay it. The
ugliness happened because we jump to the max frequency for RT tasks,
which will be optimized later on.

--
viresh

2017-03-06 14:32:10

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On Fri, 3 Mar 2017 09:11:25 +0530
Viresh Kumar <[email protected]> wrote:

> On 02-03-17, 15:45, Patrick Bellasi wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e2ed46d..739b29d 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -3653,6 +3653,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
> > #define SCHED_CPUFREQ_RT (1U << 0)
> > #define SCHED_CPUFREQ_DL (1U << 1)
> > #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> > +#define SCHED_CPUFREQ_IDLE (1U << 3)
> >
> > #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index fd46593..084a98b 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -281,6 +281,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >
> > raw_spin_lock(&sg_policy->update_lock);
> >
> > + /* CPU is entering IDLE, reset flags without triggering an update */
> > + if (flags & SCHED_CPUFREQ_IDLE) {
>
> Will "flags == SCHED_CPUFREQ_IDLE" generate better assembly ?
>

Even if it does, the difference between a bit check and an equality
check is pretty negligible wrt execution time. I would choose whichever
is the most readable to humans.

flags == SCHED_CPUFREQ_IDLE

will tell me (as a reviewer) that we expect no other flag to be set.

flags & SCHED_CPUFREQ_IDLE

will tell me that we only care about the IDLE flag.

Whichever is the more meaningful is what should be used.

-- Steve


> > + sg_cpu->flags = 0;
> > + goto done;
> > + }
> > +
> > sg_cpu->util = util;
> > sg_cpu->max = max;
> > sg_cpu->flags = flags;
> > @@ -293,6 +299,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > sugov_update_commit(sg_policy, time, next_f);
> > }
> >
> > +done:
> > raw_spin_unlock(&sg_policy->update_lock);
> > }
> >
> > diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> > index 0c00172..a844c91 100644
> > --- a/kernel/sched/idle_task.c
> > +++ b/kernel/sched/idle_task.c
> > @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > put_prev_task(rq, prev);
> > update_idle_core(rq);
> > schedstat_inc(rq->sched_goidle);
> > +
> > + /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> > +
> > return rq->idle;
> > }
> >
> > --
> > 2.7.4
>

2017-03-06 14:36:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/6] cpufreq: schedutil: ignore the sugov kthread for frequencies selections

On Thu, 2 Mar 2017 15:45:03 +0000
Patrick Bellasi <[email protected]> wrote:

> @@ -287,6 +289,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> goto done;
> }
>
> + /* Skip updates generated by sugov kthreads */
> + if (curr == sg_policy->thread)

I think you want to put an "unlikely()" around that statement. I'm
assuming you don't care about the performance of scheduling in the
sugov thread. At least tell gcc to optimize for the false path.

-- Steve

> + goto done;
> +
> sg_cpu->util = util;
> sg_cpu->max = max;
> sg_cpu->flags = flags;

2017-03-15 11:58:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On Friday, March 03, 2017 12:38:30 PM Patrick Bellasi wrote:
> On 03-Mar 14:01, Viresh Kumar wrote:
> > On 02-03-17, 15:45, Patrick Bellasi wrote:
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > > if (curr == sg_policy->thread)
> > > goto done;
> > >
> > > + /*
> > > + * While RT/DL tasks are running we do not want FAIR tasks to
> > > + * overwrite this CPU's flags, still we can update utilization and
> > > + * frequency (if required/possible) to be fair with these tasks.
> > > + */
> > > + rt_mode = task_has_dl_policy(curr) ||
> > > + task_has_rt_policy(curr) ||
> > > + (flags & SCHED_CPUFREQ_RT_DL);
> > > + if (rt_mode)
> > > + sg_cpu->flags |= flags;
> > > + else
> > > + sg_cpu->flags = flags;
> >
> > This looks so hacked up :)
>
> It is... a bit... :)
>
> > Wouldn't it be better to let the scheduler tell us what all kind of tasks it has
> > in the rq of a CPU and pass a mask of flags?
>
> That would definitively report a more consistent view of what's going
> on on each CPU.
>
> > I think it wouldn't be difficult (or time consuming) for the
> > scheduler to know that, but I am not 100% sure.
>
> Main issue perhaps is that cpufreq_update_{util,this_cpu} are
> currently called by the scheduling classes codes and not from the core
> scheduler. However I agree that it should be possible to build up such
> information and make it available to the scheduling classes code.
>
> I'll have a look at that.
>
> > IOW, the flags field in cpufreq_update_util() will represent all tasks in the
> > rq, instead of just the task that is getting enqueued/dequeued..
> >
> > And obviously we need to get some utilization numbers for the RT and DL tasks
> > going forward, switching to max isn't going to work for ever :)
>
> Regarding this last point, there are WIP patches Juri is working on to
> feed DL demands to schedutil, his presentation at last ELC partially
> covers these developments:
> https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR
>
> Instead, RT tasks are currently covered by an rt_avg metric which we
> already know is not fitting for most purposes.
> It seems that the main goal is twofold: move people to DL whenever
> possible otherwise live with the go-to-max policy which is the only
> sensible solution to satisfy the RT's class main goal, i.e. latency
> reduction.
>
> Of course such a go-to-max policy for all RT tasks we already know
> that is going to destroy energy on many different mobile scenarios.
>
> As a possible mitigation for that, while still being compliant with
> the main RT's class goal, we recently posted the SchedTune v3
> proposal:
> https://lkml.org/lkml/2017/2/28/355
>
> In that proposal, the simple usage of CGroups and the new capacity_max
> attribute of the (existing) CPU controller should allow to define what
> is the "max" value which is just enough to match the latency
> constraints of a mobile application without sacrificing too much
> energy.

And who's going to figure out what "max" value is most suitable? And how?

Thanks,
Rafael

2017-03-15 14:42:15

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On 15-Mar 12:52, Rafael J. Wysocki wrote:
> On Friday, March 03, 2017 12:38:30 PM Patrick Bellasi wrote:
> > On 03-Mar 14:01, Viresh Kumar wrote:
> > > On 02-03-17, 15:45, Patrick Bellasi wrote:
> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > > > if (curr == sg_policy->thread)
> > > > goto done;
> > > >
> > > > + /*
> > > > + * While RT/DL tasks are running we do not want FAIR tasks to
> > > > + * overwrite this CPU's flags, still we can update utilization and
> > > > + * frequency (if required/possible) to be fair with these tasks.
> > > > + */
> > > > + rt_mode = task_has_dl_policy(curr) ||
> > > > + task_has_rt_policy(curr) ||
> > > > + (flags & SCHED_CPUFREQ_RT_DL);
> > > > + if (rt_mode)
> > > > + sg_cpu->flags |= flags;
> > > > + else
> > > > + sg_cpu->flags = flags;
> > >
> > > This looks so hacked up :)
> >
> > It is... a bit... :)
> >
> > > Wouldn't it be better to let the scheduler tell us what all kind of tasks it has
> > > in the rq of a CPU and pass a mask of flags?
> >
> > That would definitively report a more consistent view of what's going
> > on on each CPU.
> >
> > > I think it wouldn't be difficult (or time consuming) for the
> > > scheduler to know that, but I am not 100% sure.
> >
> > Main issue perhaps is that cpufreq_update_{util,this_cpu} are
> > currently called by the scheduling classes codes and not from the core
> > scheduler. However I agree that it should be possible to build up such
> > information and make it available to the scheduling classes code.
> >
> > I'll have a look at that.
> >
> > > IOW, the flags field in cpufreq_update_util() will represent all tasks in the
> > > rq, instead of just the task that is getting enqueued/dequeued..
> > >
> > > And obviously we need to get some utilization numbers for the RT and DL tasks
> > > going forward, switching to max isn't going to work for ever :)
> >
> > Regarding this last point, there are WIP patches Juri is working on to
> > feed DL demands to schedutil, his presentation at last ELC partially
> > covers these developments:
> > https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR
> >
> > Instead, RT tasks are currently covered by an rt_avg metric which we
> > already know is not fitting for most purposes.
> > It seems that the main goal is twofold: move people to DL whenever
> > possible otherwise live with the go-to-max policy which is the only
> > sensible solution to satisfy the RT's class main goal, i.e. latency
> > reduction.
> >
> > Of course such a go-to-max policy for all RT tasks we already know
> > that is going to destroy energy on many different mobile scenarios.
> >
> > As a possible mitigation for that, while still being compliant with
> > the main RT's class goal, we recently posted the SchedTune v3
> > proposal:
> > https://lkml.org/lkml/2017/2/28/355
> >
> > In that proposal, the simple usage of CGroups and the new capacity_max
> > attribute of the (existing) CPU controller should allow to define what
> > is the "max" value which is just enough to match the latency
> > constraints of a mobile application without sacrificing too much
> > energy.

Given the following interesting question, let's add Thomas Gleixner to
the discussion, as he may be interested in sharing his viewpoint.

> And who's going to figure out what "max" value is most suitable? And how?

That's usually up to system profiling, which is part of the platform
optimizations and tunings.
For example, it's possible to run experiments to measure the bandwidth
and (completion) latency requirements of the RT workloads.

It's something which people developing embedded/mobile systems are
quite aware of. I'm also quite confident in saying that most of them
would agree that just going to the max OPP, each and every time an RT
task becomes RUNNABLE, is something more likely to hurt than to help.

AFAIK the current policy (i.e. "go to max") has been adopted for the
following main reasons, which I'm reporting along with some
observations.


.:: Lack of a suitable utilization metric for RT tasks

There is actually a utilization signal (rq->rt_avg), but it has been
verified to be "too slow" for the practical purpose of driving OPP
selection.
Other possibilities are perhaps under exploration, but they are not
there yet.


.:: Promote the migration from RT to DEADLINE

Which makes a lot of sense for many existing use-cases, starting from
Android as well. However, it's also true that we cannot (at least yet)
split the world into DEADLINE vs FAIR.
There is still, and there will be, a fair amount of RT tasks which it
just makes sense to serve as well as possible, both from the
performance and the power/energy standpoint.


.:: Because RT is all about "reducing latencies"

Running at the maximum OPP is certainly the best way to aim for
minimum latencies, but... RT is about doing things "in time", which
does not imply "as fast as possible".
There can be many different workloads where a lower OPP is just good
enough to meet the expected soft RT behavior provided by the Linux RT
scheduler.


All that considered, the modifications proposed in this series,
combined with other bits up for discussion in this other posting [1],
can work together to provide a better and more tunable OPP selection
policy for RT tasks.

> Thanks,
> Rafael

Cheers Patrick

[1] https://lkml.org/lkml/2017/2/28/355

--
#include <best/regards.h>

Patrick Bellasi

2017-03-15 18:02:59

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 2/6] cpufreq: schedutil: ignore the sugov kthread for frequencies selections

On 06-Mar 09:35, Steven Rostedt wrote:
> On Thu, 2 Mar 2017 15:45:03 +0000
> Patrick Bellasi <[email protected]> wrote:
>
> > @@ -287,6 +289,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > goto done;
> > }
> >
> > + /* Skip updates generated by sugov kthreads */
> > + if (curr == sg_policy->thread)
>
> I think you want to put in an "unlikely()" around that statement. I'm
> assuming you don't care about he performance of scheduling in the sugov
> thread. At least tell gcc to optimize for the false path.

Right, good point! Will add it in the next posting! +1

Thanks!

--
#include <best/regards.h>

Patrick Bellasi

2017-03-15 18:15:27

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On 06-Mar 09:29, Steven Rostedt wrote:
> On Fri, 3 Mar 2017 09:11:25 +0530
> Viresh Kumar <[email protected]> wrote:
>
> > On 02-03-17, 15:45, Patrick Bellasi wrote:
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index e2ed46d..739b29d 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -3653,6 +3653,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
> > > #define SCHED_CPUFREQ_RT (1U << 0)
> > > #define SCHED_CPUFREQ_DL (1U << 1)
> > > #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> > > +#define SCHED_CPUFREQ_IDLE (1U << 3)
> > >
> > > #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index fd46593..084a98b 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -281,6 +281,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > >
> > > raw_spin_lock(&sg_policy->update_lock);
> > >
> > > + /* CPU is entering IDLE, reset flags without triggering an update */
> > > + if (flags & SCHED_CPUFREQ_IDLE) {
> >
> > Will "flags == SCHED_CPUFREQ_IDLE" generate better assembly ?
> >
>
> Even if it does, a bit check and an equal check are pretty negligible
> in difference wrt execution time. I would choose whatever is the most
> readable to humans.
>
> flags == SCHED_CPUFREQ_IDLE
>
> will tell me (as a reviewer) that we expect no other flag to be set.
>
> flags & SCHED_CPUFREQ_IDLE
>
> will tell me that we only care about the IDLE flag.
>
> Which ever is the more meaningful is what should be used.

Agreed on the approach: whenever it's not silly to do so, code should
be written to be easy for other humans to understand.

Here the intent is "whatever flags you set, if the IDLE one is set", we
assume we are entering idle. Thus, to me, the current version is easier
to understand without being "overkill" in its semantics.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-03-15 23:32:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On Wed, Mar 15, 2017 at 3:40 PM, Patrick Bellasi
<[email protected]> wrote:
> On 15-Mar 12:52, Rafael J. Wysocki wrote:
>> On Friday, March 03, 2017 12:38:30 PM Patrick Bellasi wrote:
>> > On 03-Mar 14:01, Viresh Kumar wrote:
>> > > On 02-03-17, 15:45, Patrick Bellasi wrote:
>> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> > > > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>> > > > if (curr == sg_policy->thread)
>> > > > goto done;
>> > > >
>> > > > + /*
>> > > > + * While RT/DL tasks are running we do not want FAIR tasks to
>> > > > + * overwrite this CPU's flags, still we can update utilization and
>> > > > + * frequency (if required/possible) to be fair with these tasks.
>> > > > + */
>> > > > + rt_mode = task_has_dl_policy(curr) ||
>> > > > + task_has_rt_policy(curr) ||
>> > > > + (flags & SCHED_CPUFREQ_RT_DL);
>> > > > + if (rt_mode)
>> > > > + sg_cpu->flags |= flags;
>> > > > + else
>> > > > + sg_cpu->flags = flags;
>> > >
>> > > This looks so hacked up :)
>> >
>> > It is... a bit... :)
>> >
>> > > Wouldn't it be better to let the scheduler tell us what all kind of tasks it has
>> > > in the rq of a CPU and pass a mask of flags?
>> >
>> > That would definitively report a more consistent view of what's going
>> > on on each CPU.
>> >
>> > > I think it wouldn't be difficult (or time consuming) for the
>> > > scheduler to know that, but I am not 100% sure.
>> >
>> > Main issue perhaps is that cpufreq_update_{util,this_cpu} are
>> > currently called by the scheduling classes codes and not from the core
>> > scheduler. However I agree that it should be possible to build up such
>> > information and make it available to the scheduling classes code.
>> >
>> > I'll have a look at that.
>> >
>> > > IOW, the flags field in cpufreq_update_util() will represent all tasks in the
>> > > rq, instead of just the task that is getting enqueued/dequeued..
>> > >
>> > > And obviously we need to get some utilization numbers for the RT and DL tasks
>> > > going forward, switching to max isn't going to work for ever :)
>> >
>> > Regarding this last point, there are WIP patches Juri is working on to
>> > feed DL demands to schedutil, his presentation at last ELC partially
>> > covers these developments:
>> > https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR
>> >
>> > Instead, RT tasks are currently covered by an rt_avg metric which we
>> > already know is not fitting for most purposes.
>> > It seems that the main goal is twofold: move people to DL whenever
>> > possible otherwise live with the go-to-max policy which is the only
>> > sensible solution to satisfy the RT's class main goal, i.e. latency
>> > reduction.
>> >
>> > Of course such a go-to-max policy for all RT tasks we already know
>> > that is going to destroy energy on many different mobile scenarios.
>> >
>> > As a possible mitigation for that, while still being compliant with
>> > the main RT's class goal, we recently posted the SchedTune v3
>> > proposal:
>> > https://lkml.org/lkml/2017/2/28/355
>> >
>> > In that proposal, the simple usage of CGroups and the new capacity_max
>> > attribute of the (existing) CPU controller should allow to define what
>> > is the "max" value which is just enough to match the latency
>> > constraints of a mobile application without sacrificing too much
>> > energy.
>
> Given the following interesting question, let's add Thomas Gleixner to
> the discussion, which can be interested in sharing his viewpoint.
>
>> And who's going to figure out what "max" value is most suitable? And how?
>
> That's usually up to the system profiling which is part of the
> platform optimizations and tunings.
> For example it's possible to run experiments to measure the bandwidth
> and (completion) latency requirements from the RT workloads.
>
> It's something which people developing embedded/mobile systems are
> quite aware of.

Well, I was expecting an answer like this to be honest and let me say
that it is not too convincing from my perspective.

At least throwing embedded and mobile into one bag seems to be a
stretch, because the usage patterns of those two groups are quite
different, even though they may be similar from the hardware POV.

Mobile are mostly used as general-purpose computers nowadays (and I
guess we're essentially talking about anything running Android, not
just phones, aren't we?) with applications installed by users rather
than by system integrators, so I'm doubtful about the viability of the
"system integrators should take care of it" assumption in this
particular case.

> I'm also quite confident on saying that most of
> them can agree that just going to the max OPP, each and every time a
> RT task becomes RUNNABLE, it is something which is more likely to hurt
> than to give benefits.

It would be good to be able to estimate that likelihood somehow ...

> AFAIK the current policy (i.e. "go to max") has been adopted for the
> following main reasons, which I'm reporting with some observations.
>
>
> .:: Missing of a suitable utilization metric for RT tasks
>
> There is actually a utilization signal (rq->rt_avg) but it has been
> verified to be "too slow" for the practical usage of driving OPP
> selection.
> Other possibilities are perhaps under exploration but they are not
> yet there.
>
>
> .:: Promote the migration from RT to DEADLINE
>
> Which makes a lot of sens for many existing use-cases, starting from
> Android as well. However, it's also true that we cannot (at least yet)
> split the world in DEALINE vs FAIR.
> There is still, and there will be, a fair amount of RT tasks which
> just it makes sense to serve at best both from the performance as
> well as the power/energy standpoint.
>
>
> .:: Because RT is all about "reducing latencies"
>
> Running at the maximum OPP is certainly the best way to aim for the
> minimum latencies but... RT is about doing things "in time", which
> does not imply "as fast as possible".
> There can be many different workloads where a lower OPP is just good
> enough to meet the expected soft RT behaviors provided by the Linux
> RT scheduler.

AFAICS, RT is about determinism and meeting deadlines is one aspect of that.

Essentially, the problem here is to know which OPP is sufficient to do
things "on time" and there simply is no good way to figure that out
for RT tasks, again, AFAICS.

And since we can't say which OPP is sufficient, we pretty much have no
choice but to use the top-most one.

> All that considered, the modifications proposed in this series,
> combined with other bits which are for discussion in this [1] other
> posting, can work together to provide a better and more tunable OPP
> selection policy for RT tasks.

OK, but "more tunable" need not imply "better".

Thanks,
Rafael

2017-03-17 11:36:41

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On 16-Mar 00:32, Rafael J. Wysocki wrote:
> On Wed, Mar 15, 2017 at 3:40 PM, Patrick Bellasi
> <[email protected]> wrote:
> > On 15-Mar 12:52, Rafael J. Wysocki wrote:
> >> On Friday, March 03, 2017 12:38:30 PM Patrick Bellasi wrote:
> >> > On 03-Mar 14:01, Viresh Kumar wrote:
> >> > > On 02-03-17, 15:45, Patrick Bellasi wrote:
> >> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >> > > > @@ -293,15 +305,29 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >> > > > if (curr == sg_policy->thread)
> >> > > > goto done;
> >> > > >
> >> > > > + /*
> >> > > > + * While RT/DL tasks are running we do not want FAIR tasks to
> >> > > > + * overwrite this CPU's flags, still we can update utilization and
> >> > > > + * frequency (if required/possible) to be fair with these tasks.
> >> > > > + */
> >> > > > + rt_mode = task_has_dl_policy(curr) ||
> >> > > > + task_has_rt_policy(curr) ||
> >> > > > + (flags & SCHED_CPUFREQ_RT_DL);
> >> > > > + if (rt_mode)
> >> > > > + sg_cpu->flags |= flags;
> >> > > > + else
> >> > > > + sg_cpu->flags = flags;
> >> > >
> >> > > This looks so hacked up :)
> >> >
> >> > It is... a bit... :)
> >> >
> >> > > Wouldn't it be better to let the scheduler tell us what all kind of tasks it has
> >> > > in the rq of a CPU and pass a mask of flags?
> >> >
> >> > That would definitively report a more consistent view of what's going
> >> > on on each CPU.
> >> >
> >> > > I think it wouldn't be difficult (or time consuming) for the
> >> > > scheduler to know that, but I am not 100% sure.
> >> >
> >> > Main issue perhaps is that cpufreq_update_{util,this_cpu} are
> >> > currently called by the scheduling classes' code and not from the core
> >> > scheduler. However I agree that it should be possible to build up such
> >> > information and make it available to the scheduling classes code.
> >> >
> >> > I'll have a look at that.
> >> >
> >> > > IOW, the flags field in cpufreq_update_util() will represent all tasks in the
> >> > > rq, instead of just the task that is getting enqueued/dequeued..
> >> > >
> >> > > And obviously we need to get some utilization numbers for the RT and DL tasks
> >> > > going forward, switching to max isn't going to work for ever :)
> >> >
> >> > Regarding this last point, there are WIP patches Juri is working on to
> >> > feed DL demands to schedutil, his presentation at last ELC partially
> >> > covers these developments:
> >> > https://www.youtube.com/watch?v=wzrcWNIneWY&index=37&list=PLbzoR-pLrL6pSlkQDW7RpnNLuxPq6WVUR
> >> >
> >> > Instead, RT tasks are currently covered by an rt_avg metric which we
> >> > already know is not fitting for most purposes.
> >> > It seems that the main goal is twofold: move people to DL whenever
> >> > possible otherwise live with the go-to-max policy which is the only
> >> > sensible solution to satisfy the RT's class main goal, i.e. latency
> >> > reduction.
> >> >
> >> > Of course, we already know that such a go-to-max policy for all RT
> >> > tasks is going to destroy energy on many different mobile scenarios.
> >> >
> >> > As a possible mitigation for that, while still being compliant with
> >> > the main RT's class goal, we recently posted the SchedTune v3
> >> > proposal:
> >> > https://lkml.org/lkml/2017/2/28/355
> >> >
> >> > In that proposal, the simple usage of CGroups and the new capacity_max
> >> > attribute of the (existing) CPU controller should allow to define what
> >> > is the "max" value which is just enough to match the latency
> >> > constraints of a mobile application without sacrificing too much
> >> > energy.
> >
> > Given the following interesting question, let's add Thomas Gleixner to
> > the discussion, which can be interested in sharing his viewpoint.
> >
> >> And who's going to figure out what "max" value is most suitable? And how?
> >
> > That's usually up to the system profiling which is part of the
> > platform optimizations and tunings.
> > For example it's possible to run experiments to measure the bandwidth
> > and (completion) latency requirements from the RT workloads.
> >
> > It's something which people developing embedded/mobile systems are
> > quite aware of.
>
> Well, I was expecting an answer like this to be honest and let me say
> that it is not too convincing from my perspective.
>
> At least throwing embedded and mobile into one bag seems to be a
> stretch, because the usage patterns of those two groups are quite
> different, even though they may be similar from the hardware POV.
>
> Mobile are mostly used as general-purpose computers nowadays (and I
> guess we're essentially talking about anything running Android, not
> just phones, aren't we?) with applications installed by users rather
> than by system integrators, so I'm doubtful about the viability of the
> "system integrators should take care of it" assumption in this
> particular case.

I would say that it's also a stretch to consider Android systems just
"general-purpose computers".

The main difference with respect to desktop/laptop systems is that
Android (or ChromeOS) is quite a "complex" run-time sitting between the
Linux kernel and the "general purpose" applications.
If you miss this point, it's actually difficult to see how some of the
things we are proposing can make any sense.

Being a run-time, Android has the power and knowledge to act as an
orchestrator of the resources assigned to applications.
Android applications are not completely free to do whatever they want,
which is instead the case for most apps running on a desktop.

Android provides a lot of fundamental and critical services to
applications. These services are ultimately under control of the
"system integrator" which has the power and knowledge to optimize them
as much as possible for a given HW platform.

From this perspective Android is much more similar to an embedded
system than to a "general-purpose computer". Not only from an HW
standpoint but also, and mainly, from the system tuning and
optimization standpoints.

Thus, the system integrator, whether it's Google (for the main
framework components) or a smartphone producer (for the platform
dependent components), is more than willing to take care of optimizing
its platform for whatever app will run on it.
In other words, the bulk of the possible optimizations are
most of the time in the application-agnostic SW stacks and do not
depend on any specific app.

> > I'm also quite confident in saying that most of
> > them can agree that just going to the max OPP, each and every time a
> > RT task becomes RUNNABLE, is something which is more likely to hurt
> > than to give benefits.
>
> It would be good to be able to estimate that likelihood somehow ...

True, even if I don't remember a similar likelihood estimation about
"not hurting" when we decided on the "go-to-max" policy.

We never really even considered using schedutil with the current
"go-to-max" behavior. However, I've spent some time doing some
experiments. Hereafter are some numbers we have got while running
with the schedutil governor on a Pixel phone. The schedutil governor
code is the mainline version backported to the Android Common Kernel
(ACK) based on a v3.18 Linux kernel.

The following table compares these configurations:

A) pelt_gtm: PELT for FAIRs and GoToMax for RTs
This is essentially what mainline schedutil does nowadays.

B) pelt_rtavg: PELT for FAIRs and rq->rt_avg for RTs
The main difference here is that we use the rt_avg metric to drive
OPP selection for RT tasks.

C) walt: WALT for both FAIRs and RTs
Here we use WALT to estimate RT utilization.
This is reported as a reference just because it's the solution
currently used on premium, production-grade Pixel devices.

No other system tuning parameters change between cases A and B.
The only change in code is the replacement of the go-to-max policy for
RT tasks with the usage of rt_avg to estimate the utilization of RT
tasks.
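As a rough illustration of what configuration B does, here is a minimal
sketch of deriving a capacity-scaled RT utilization from an
rt_avg-style runtime average; the function name, the fixed averaging
period and the linear scaling are illustrative assumptions, not the ACK
or mainline implementation:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of configuration B: derive a capacity-scaled
 * utilization for RT tasks from an rt_avg-style runtime average,
 * instead of forcing the maximum OPP. Names and scaling are
 * assumptions made for this example.
 */
static uint64_t rt_util_est(uint64_t rt_avg_ns, uint64_t period_ns,
			    uint64_t cpu_capacity)
{
	/* RT runtime saturating the period maps to full capacity */
	if (rt_avg_ns >= period_ns)
		return cpu_capacity;
	/* otherwise scale linearly: util = capacity * busy / period */
	return (rt_avg_ns * cpu_capacity) / period_ns;
}
```

The governor can then feed this value into the same frequency formula
it already uses for FAIR utilization, instead of pinning the domain to
the max OPP.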

          |  A) pelt_gtm   | B) pelt_rtavg  | C) walt
          |  mJ   wrt_walt |  mJ   wrt_walt |  mJ
----------+----------------+----------------+--------
Jankbench |  54170  24.31% |  43520  -0.13% |  43577
YouTube   |  90484  39.53% |  64965   0.18% |  64851
UiBench   |  24869  54.57% |  16370   1.75% |  16089
          |                |                |
Vellamo   | 520564   7.81% | 482332  -0.11% | 482860
PCMark    | 761806  27.55% | 596807  -0.08% | 597282

The first three are workloads mainly stressing the UI, which are used
to evaluate the impact on user experience. These workloads mimic the
most common usage and user interaction patterns on an Android device.
For all of them the goal is to run at MINIMUM energy consumption
while not generating jank frames. IOW, as much power as required,
nothing more. In these cases the system is expected to run on
low-medium OPPs most of the time.

The last two are more standard system stressing benchmarks where the
system is quite likely to run at higher OPPs for most of the time.

I'm not sharing performance scores, because it's not the goal of this
comparison. However, in all cases the performance metrics are just
good with respect to the expectations.

As additional info, consider that on an Android Pixel phone there
are around 100 RT FIFO tasks. One of these RT tasks is a principal
component of the GFX rendering pipeline. That's why we have such a big
regression in energy consumption using the go-to-max policy without
any sensible performance improvement.

Energy measures for the B and C cases are based on averages across
multiple executions collected from our CI system. For A I've run a set
of experiments and the numbers reported in this table are the _most
favorable_ ones, i.e. the runs where I've got the lowest energy
consumption.

I would say that, as a first and simple proof-of-concept, there is
quite good evidence that a (forced) go-to-max policy is more likely to
hurt than to give benefits.

> > AFAIK the current policy (i.e. "go to max") has been adopted for the
> > following main reasons, which I'm reporting with some observations.
> >
> >
> > .:: Missing of a suitable utilization metric for RT tasks
> >
> > There is actually a utilization signal (rq->rt_avg) but it has been
> > verified to be "too slow" for the practical usage of driving OPP
> > selection.
> > Other possibilities are perhaps under exploration but they are not
> > yet there.
> >
> >
> > .:: Promote the migration from RT to DEADLINE
> >
> > Which makes a lot of sense for many existing use-cases, starting from
> > Android as well. However, it's also true that we cannot (at least yet)
> > split the world into DEADLINE vs FAIR.
> > There is still, and there will be, a fair amount of RT tasks which it
> > just makes sense to serve at best, both from the performance and the
> > power/energy standpoint.
> >
> >
> > .:: Because RT is all about "reducing latencies"
> >
> > Running at the maximum OPP is certainly the best way to aim for the
> > minimum latencies but... RT is about doing things "in time", which
> > does not imply "as fast as possible".
> > There can be many different workloads where a lower OPP is just good
> > enough to meet the expected soft RT behaviors provided by the Linux
> > RT scheduler.
>
> AFAICS, RT is about determinism and meeting deadlines is one aspect of that.

Determinism, first of all, does not mean you have to go fast, and
deadlines cannot be guaranteed by the Linux RT scheduling class.
What you are speaking about is what DEADLINE provides, which is an RT
scheduler... but not the RT scheduling class.

> Essentially, the problem here is to know which OPP is sufficient to do
> things "on time" and there simply is no good way to figure that out
> for RT tasks, again, AFAICS.

I do not agree on that point. If you know the workload composition and
have control over task composition and priority, a suitable set of
profiling experiments can be sufficient to verify how much CPU
bandwidth is required by your tasks to meet their timing requirements,
both in terms of activation and completion latencies.
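Once such a bandwidth figure is known, mapping it onto an OPP is
straightforward. A minimal sketch, where the OPP table, its capacity
scale and the function name are all invented for the example (this is
not a kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only: given a CPU bandwidth requirement obtained
 * by profiling (expressed on the same capacity scale as the OPPs),
 * select the lowest OPP whose capacity still covers it.
 */
static size_t min_sufficient_opp(const uint32_t *opp_capacity,
				 size_t nr_opp, uint32_t required)
{
	size_t i;

	/* OPPs are assumed sorted by increasing capacity */
	for (i = 0; i < nr_opp; i++) {
		if (opp_capacity[i] >= required)
			return i;
	}
	/* requirement exceeds the platform: fall back to the max OPP */
	return nr_opp - 1;
}
```

The point being that "sufficient" is a per-platform number obtained by
experiment, not something the kernel can derive from a utilization
signal alone.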

If that were not possible, then DEADLINE would be almost useless.
And if that is possible for DEADLINE, then it's possible for RT as
well.

I agree that kernel space cannot know these requirements by just
looking at a utilization tracking signal (e.g. rq->rt_avg).
From user-space, instead, this information is more likely to be known
(or obtained by experiment). What we are currently missing is a proper
API to specify these values for RT, if we cannot or don't want to use
DEADLINE of course.

> And since we can't say which OPP is sufficient, we pretty much have no
> choice but to use the top-most one.

This is a big change wrt what ondemand used to do, and still that
governor is the default one in many systems where RT tasks are used.

IMHO, if you really care about deadlines then we have a scheduling
class for these tasks. The DEADLINE class is going to be extended to
provide support for specifying the exact CPU bandwidth required by
these tasks.
If we instead accept to run using the RT class, then we are already in
the domain of "reduced guarantees".

I fully agree that one simple solution is to go to max, but at the
same time I'm also convinced that, with a proper profiling activity,
it's possible to identify the bandwidth requirements of
latency-sensitive tasks.
This would make it possible to get the expected RT behaviors without
sacrificing energy also when RT tasks are legitimately used.

> > All that considered, the modifications proposed in this series,
> > combined with other bits which are for discussion in this [1] other
> > posting, can work together to provide a better and more tunable OPP
> > selection policy for RT tasks.
>
> OK, but "more tunable" need not imply "better".

Agree, that's absolutely true. However, for the specific proposal in
[1], it should be noted that the proposed additional tunables:

1. do not affect the default behavior, which is currently a go-to-max
policy for RT tasks

2. just add a possible interface to tune the default behavior at
run-time, depending on the "context awareness", by setting a
capacity_max value to use instead of the max OPP.
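The second point can be sketched very simply; the function name and the
0..1024 capacity scale here are assumptions based on the SchedTune v3
description, not its actual code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: instead of jumping straight to the absolute
 * maximum frequency when an RT task runs, clamp the target to a
 * fraction of it given by a tunable capacity_max on a 0..1024
 * capacity scale (1024 keeps today's go-to-max default).
 */
static uint32_t clamp_rt_freq(uint32_t max_freq, uint32_t capacity_max)
{
	const uint32_t capacity_scale = 1024;

	if (capacity_max >= capacity_scale)
		return max_freq;	/* default behavior unchanged */
	return (uint32_t)(((uint64_t)max_freq * capacity_max) /
			  capacity_scale);
}
```

With capacity_max left at the scale's top value nothing changes, which
is what makes the tunable opt-in.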

Thus, I would say that's a possible way to improve the current
situation. Maybe not the best, but in that case we should still talk
about a possible alternative approach.

> Thanks,
> Rafael

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-03-28 22:23:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On Thursday, March 02, 2017 03:45:02 PM Patrick Bellasi wrote:
> Currently, sg_cpu's flags are set to the value defined by the last call of
> the cpufreq_update_util()/cpufreq_update_this_cpu(); for RT/DL classes
> this corresponds to the SCHED_CPUFREQ_{RT/DL} flags always being set.
>
> When multiple CPUs share the same frequency domain it might happen that a
> CPU which executed an RT task, right before entering IDLE, has one of the
> SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.

But if it is idle, it won't be taken into account by sugov_next_freq_shared(), will it?

Thanks,
Rafael

2017-04-07 14:59:22

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On 29-Mar 00:18, Rafael J. Wysocki wrote:
> On Thursday, March 02, 2017 03:45:02 PM Patrick Bellasi wrote:
> > Currently, sg_cpu's flags are set to the value defined by the last call of
> > the cpufreq_update_util()/cpufreq_update_this_cpu(); for RT/DL classes
> > this corresponds to the SCHED_CPUFREQ_{RT/DL} flags always being set.
> >
> > When multiple CPUs share the same frequency domain it might happen that a
> > CPU which executed an RT task, right before entering IDLE, has one of the
> > SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.
>
> But if it is idle, it won't be taken into account by sugov_next_freq_shared(), will it?

Yes, but:

1) These kinds of "useless RT requests" are ignored only if more than
TICK_NSEC has elapsed since the last update.

2) The proposed patch, apart from resetting the flags, also bails
out without potentially triggering an already too late switch to MAX,
which would also start a new throttling interval.

3) By resetting the flags we keep the internal state machine more
consistent with what the scheduler knows, i.e. we are now idle.
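As a minimal model of point 1 above (the tick length, the names and the
comparison are illustrative; the real TICK_NSEC depends on CONFIG_HZ
and the real check lives in sugov_next_freq_shared()):

```c
#include <assert.h>
#include <stdint.h>

/* ~4ms tick; illustrative value only */
#define STALE_NS 4000000ULL

/*
 * Simplified model of the per-sibling staleness check: a CPU's RT/DL
 * flags keep forcing the max OPP until its last utilization update is
 * older than one tick. So a CPU that went idle right after running an
 * RT task can drag the whole frequency domain to max for up to a full
 * tick, which is what the SCHED_CPUFREQ_IDLE reset avoids.
 */
static int flags_still_considered(uint64_t now_ns, uint64_t last_update_ns)
{
	return (now_ns - last_update_ns) <= STALE_NS;
}
```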

Considering these three points overall, IMO this small patch makes the
schedutil machinery a bit more predictable.

> Thanks,
> Rafael

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2017-04-07 15:31:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On Thu, Mar 02, 2017 at 03:45:04PM +0000, Patrick Bellasi wrote:
> + struct task_struct *curr = cpu_curr(smp_processor_id());

Isn't that a weird way of writing 'current' ?

2017-04-07 16:59:23

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 3/6] cpufreq: schedutil: ensure max frequency while running RT/DL tasks

On 07-Apr 17:30, Peter Zijlstra wrote:
> On Thu, Mar 02, 2017 at 03:45:04PM +0000, Patrick Bellasi wrote:
> > + struct task_struct *curr = cpu_curr(smp_processor_id());
>
> Isn't that a weird way of writing 'current' ?

Right... (cough)... it's a new-fangled way. :-/

Will cleanup before reposting the series.

--
#include <best/regards.h>

Patrick Bellasi

2017-06-06 09:26:58

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On 02-03-17, 15:45, Patrick Bellasi wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e2ed46d..739b29d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3653,6 +3653,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
> #define SCHED_CPUFREQ_RT (1U << 0)
> #define SCHED_CPUFREQ_DL (1U << 1)
> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> +#define SCHED_CPUFREQ_IDLE (1U << 3)
>
> #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index fd46593..084a98b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -281,6 +281,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>
> raw_spin_lock(&sg_policy->update_lock);
>
> + /* CPU is entering IDLE, reset flags without triggering an update */
> + if (flags & SCHED_CPUFREQ_IDLE) {
> + sg_cpu->flags = 0;
> + goto done;
> + }
> +
> sg_cpu->util = util;
> sg_cpu->max = max;
> sg_cpu->flags = flags;
> @@ -293,6 +299,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> sugov_update_commit(sg_policy, time, next_f);
> }
>
> +done:
> raw_spin_unlock(&sg_policy->update_lock);
> }
>
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172..a844c91 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_task(rq, prev);
> update_idle_core(rq);
> schedstat_inc(rq->sched_goidle);
> +
> + /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> +
> return rq->idle;
> }

I was discussing almost the same problem with Vincent today and we were
convinced to write exactly the same patch to solve it. And then I saw this old
thread again :)

Why did this thread die completely ?

Can we at least get the patches which don't have any objections merged
separately first ?

--
viresh

2017-06-06 15:56:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On Tue, Jun 6, 2017 at 11:26 AM, Viresh Kumar <[email protected]> wrote:
> On 02-03-17, 15:45, Patrick Bellasi wrote:
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index e2ed46d..739b29d 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -3653,6 +3653,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
>> #define SCHED_CPUFREQ_RT (1U << 0)
>> #define SCHED_CPUFREQ_DL (1U << 1)
>> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
>> +#define SCHED_CPUFREQ_IDLE (1U << 3)
>>
>> #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index fd46593..084a98b 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -281,6 +281,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>>
>> raw_spin_lock(&sg_policy->update_lock);
>>
>> + /* CPU is entering IDLE, reset flags without triggering an update */
>> + if (flags & SCHED_CPUFREQ_IDLE) {
>> + sg_cpu->flags = 0;
>> + goto done;
>> + }
>> +
>> sg_cpu->util = util;
>> sg_cpu->max = max;
>> sg_cpu->flags = flags;
>> @@ -293,6 +299,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>> sugov_update_commit(sg_policy, time, next_f);
>> }
>>
>> +done:
>> raw_spin_unlock(&sg_policy->update_lock);
>> }
>>
>> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
>> index 0c00172..a844c91 100644
>> --- a/kernel/sched/idle_task.c
>> +++ b/kernel/sched/idle_task.c
>> @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>> put_prev_task(rq, prev);
>> update_idle_core(rq);
>> schedstat_inc(rq->sched_goidle);
>> +
>> + /* kick cpufreq (see the comment in kernel/sched/sched.h). */
>> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
>> +
>> return rq->idle;
>> }
>
> I was discussing about almost the same problem with Vincent today and we were
> convinced to write exactly the same patch to solve that. And then I saw this old
> thread again :)
>
> Why did this thread die completely ?

Because nobody followed up? :-)

> Can we at least get the patches which don't have any objections merged
> separately first ?

Yes, we can in general, but someone needs to "shepherd" them and I've
been traveling lately.

So, if there's anything that appears non-controversial and looks like
it could be applied, the best way to make that happen would be to
resend it.

Thanks,
Rafael

2017-06-06 18:03:53

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 1/6] cpufreq: schedutil: reset sg_cpus's flags at IDLE enter

On 06-Jun 17:56, Rafael J. Wysocki wrote:
> On Tue, Jun 6, 2017 at 11:26 AM, Viresh Kumar <[email protected]> wrote:
> > On 02-03-17, 15:45, Patrick Bellasi wrote:
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index e2ed46d..739b29d 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -3653,6 +3653,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
> >> #define SCHED_CPUFREQ_RT (1U << 0)
> >> #define SCHED_CPUFREQ_DL (1U << 1)
> >> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> >> +#define SCHED_CPUFREQ_IDLE (1U << 3)
> >>
> >> #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> >>
> >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >> index fd46593..084a98b 100644
> >> --- a/kernel/sched/cpufreq_schedutil.c
> >> +++ b/kernel/sched/cpufreq_schedutil.c
> >> @@ -281,6 +281,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >>
> >> raw_spin_lock(&sg_policy->update_lock);
> >>
> >> + /* CPU is entering IDLE, reset flags without triggering an update */
> >> + if (flags & SCHED_CPUFREQ_IDLE) {
> >> + sg_cpu->flags = 0;
> >> + goto done;
> >> + }
> >> +
> >> sg_cpu->util = util;
> >> sg_cpu->max = max;
> >> sg_cpu->flags = flags;
> >> @@ -293,6 +299,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >> sugov_update_commit(sg_policy, time, next_f);
> >> }
> >>
> >> +done:
> >> raw_spin_unlock(&sg_policy->update_lock);
> >> }
> >>
> >> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> >> index 0c00172..a844c91 100644
> >> --- a/kernel/sched/idle_task.c
> >> +++ b/kernel/sched/idle_task.c
> >> @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> >> put_prev_task(rq, prev);
> >> update_idle_core(rq);
> >> schedstat_inc(rq->sched_goidle);
> >> +
> >> + /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> >> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> >> +
> >> return rq->idle;
> >> }
> >
> > I was discussing about almost the same problem with Vincent today and we were
> > convinced to write exactly the same patch to solve that. And then I saw this old
> > thread again :)
> >
> > Why did this thread die completely ?
>
> Because nobody followed up? :-)
>
> > Can we at least get the patches which don't have any objections merged
> > separately first ?
>
> Yes, we can in general, but someone needs to "shepherd" them and I've
> been traveling lately.
>
> So, if there's anything that appears non-controversial and looks like
> it could be applied, the best way to make that happen would be to
> resend it.

You're right guys, I should have done it some time ago.
Will do my best to resend it in a few days.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi