Each time the scheduler issues a CPU utilisation update, a flag, which
mainly identifies the scheduling class requesting the update, is used by the
frequency selection policy to select the most appropriate OPP.
In the current implementation, CPU flags are overwritten each time the
scheduler calls schedutil for an update. Such a behavior is sub-optimal,
especially on systems where frequency domains span multiple CPUs.
Indeed, assuming CPU1 and CPU2 share the same frequency domain, the
following issues can arise:
A) Small FAIR task running at MAX OPP.
An RT task which just executed on CPU1 can keep the domain at the
max frequency for a prolonged period of time after its completion,
even if there are no longer RT tasks running on the CPUs of its domain.
B) FAIR wakeup reducing the OPP of the current RT task.
A FAIR task enqueued on a CPU where an RT task is running overrides the
flag configured by the RT task, thus potentially causing an unwanted
frequency drop.
C) RT wakeup not running at max OPP.
An RT task waking up on a CPU which has recently updated its OPP can
be forced to run at a lower frequency because of the throttling
enforced by schedutil, even if no OPP transition is currently in
progress.
.:: Patches organization
========================
This series proposes a set of fixes for the aforementioned issues; it's an
update addressing all the main comments collected on the previous posting
[1].
Patches have been re-ordered to put the "less controversial" bits at the
beginning and also to better match the order of the three main issues
described above. These are the corresponding patches:
A) Fix small FAIR task running at MAX OPP:
cpufreq: schedutil: ignore the sugov kthread for frequencies selections
cpufreq: schedutil: reset sg_cpus's flags at IDLE enter
B) Fix FAIR wakeup reducing the OPP of the current RT task:
cpufreq: schedutil: ensure max frequency while running RT/DL tasks
C) Fix RT wakeup not running at max OPP:
sched/rt: fast switch to maximum frequency when RT tasks are scheduled
cpufreq: schedutil: relax rate-limiting while running RT/DL tasks
cpufreq: schedutil: avoid utilisation update when not necessary
.:: Experimental Results
========================
The misbehaviors have been verified using a set of simple rt-app based
synthetic workloads, running on an ARM Juno R2 board where the CPUs of the
big cluster (CPU1 and CPU2) have been reserved to run the workload tasks in
isolation from other system tasks.
A detailed description of the experiments executed, and the corresponding
collected results, is available online [2].
Short highlights for these experiments are:
- Patches in group A reduce energy consumption by ~50% by ensuring that
a small task is always running at the minimum OPP even when the
sugov's RT kthread is used to change frequencies in the same cluster.
- Patches in group B increase the chances for an RT task to complete its
activations while running at the max OPP from 4% to 98%.
- Patches in group C do not show measurable differences, mainly because of the
slow OPP switching support available on the Juno board used for testing.
However, a trace inspection shows that the sequence of traced events is much
more deterministic and better matches the expected system behavior.
For example, as soon as an RT task wakes up, the scheduler asks for an OPP
switch to the max frequency.
Cheers,
Patrick
.:: References
==============
[1] https://lkml.org/lkml/2017/3/2/385
[2] https://gist.github.com/derkling/0cd7210e4fa6f2ec3558073006e5ad70
Patrick Bellasi (6):
cpufreq: schedutil: ignore sugov kthreads
cpufreq: schedutil: reset sg_cpus's flags at IDLE enter
cpufreq: schedutil: ensure max frequency while running RT/DL tasks
cpufreq: schedutil: update CFS util only if used
sched/rt: fast switch to maximum frequency when RT tasks are scheduled
cpufreq: schedutil: relax rate-limiting while running RT/DL tasks
include/linux/sched/cpufreq.h | 1 +
kernel/sched/cpufreq_schedutil.c | 61 ++++++++++++++++++++++++++++++++--------
kernel/sched/idle_task.c | 4 +++
kernel/sched/rt.c | 15 ++++++++--
4 files changed, 67 insertions(+), 14 deletions(-)
--
2.7.4
In systems where multiple CPUs share the same frequency domain, a small
workload on a CPU can still be subject to frequency spikes generated by
the activation of the sugov kthread.
Since the sugov kthread is a special RT task, whose only goal is to
activate a frequency transition, it does not make sense for it to bias
schedutil's frequency selection policy.
This patch exploits the information related to the current task to silently
ignore cpufreq_update_this_cpu() calls coming from the RT scheduler while
the sugov kthread is running.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v1:
- move check before policy spinlock (JuriL)
---
kernel/sched/cpufreq_schedutil.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c982dd0..eaba6d6 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -218,6 +218,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int next_f;
bool busy;
+ /* Skip updates generated by sugov kthreads */
+ if (unlikely(current == sg_policy->thread))
+ return;
+
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -290,6 +294,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
unsigned long util, max;
unsigned int next_f;
+ /* Skip updates generated by sugov kthreads */
+ if (unlikely(current == sg_policy->thread))
+ return;
+
sugov_get_util(&util, &max);
raw_spin_lock(&sg_policy->update_lock);
--
2.7.4
The policy in use for RT/DL tasks sets the maximum frequency when a task
in these classes calls cpufreq_update_this_cpu(). However, the
current implementation might cause a frequency drop while an RT/DL task
is still running, just because, for example, a FAIR task wakes up and is
enqueued on the same CPU.
This issue is due to the sg_cpu's flags being overwritten at each call
of sugov_update_*. Thus, the wakeup of a FAIR task resets the flags and
can trigger a frequency update, thereby affecting the currently running
RT/DL task.
This can be fixed, in shared frequency domains, by ORing (instead of
overwriting) the new flags before triggering a frequency update. This
guarantees that the domain stays at least at the frequency requested by
the RT/DL class, which is the maximum one for the time being.
This patch does the flags aggregation in the schedutil governor, where
it's easy to verify whether we currently have RT/DL workload on a CPU.
This approach is aligned with the current schedutil API design, where the
core scheduler does not interact directly with schedutil; instead, the
scheduling classes call directly into the governor via
cpufreq_update_{util,this_cpu}. Thus, it makes more sense to do the flags
aggregation in the schedutil code instead of the core scheduler.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v1:
- use "current" to check for RT/DL tasks (PeterZ)
---
kernel/sched/cpufreq_schedutil.c | 34 +++++++++++++++++++++++++++-------
1 file changed, 27 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 004ae18..98704d8 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -216,6 +216,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
struct cpufreq_policy *policy = sg_policy->policy;
unsigned long util, max;
unsigned int next_f;
+ bool rt_mode;
bool busy;
/* Skip updates generated by sugov kthreads */
@@ -230,7 +231,15 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
busy = sugov_cpu_is_busy(sg_cpu);
- if (flags & SCHED_CPUFREQ_RT_DL) {
+ /*
+ * While RT/DL tasks are running we do not want FAIR tasks to
+ * overwrite this CPU's flags, still we can update utilization and
+ * frequency (if required/possible) to be fair with these tasks.
+ */
+ rt_mode = task_has_dl_policy(current) ||
+ task_has_rt_policy(current) ||
+ (flags & SCHED_CPUFREQ_RT_DL);
+ if (rt_mode) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -293,6 +302,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned long util, max;
unsigned int next_f;
+ bool rt_mode;
/* Skip updates generated by sugov kthreads */
if (unlikely(current == sg_policy->thread))
@@ -310,17 +320,27 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sg_cpu->flags = 0;
goto done;
}
- sg_cpu->flags = flags;
+
+ /*
+ * While RT/DL tasks are running we do not want FAIR tasks to
+ * overwrite this CPU's flags, still we can update utilization and
+ * frequency (if required/possible) to be fair with these tasks.
+ */
+ rt_mode = task_has_dl_policy(current) ||
+ task_has_rt_policy(current) ||
+ (flags & SCHED_CPUFREQ_RT_DL);
+ if (rt_mode)
+ sg_cpu->flags |= flags;
+ else
+ sg_cpu->flags = flags;
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
if (sugov_should_update_freq(sg_policy, time)) {
- if (flags & SCHED_CPUFREQ_RT_DL)
- next_f = sg_policy->policy->cpuinfo.max_freq;
- else
- next_f = sugov_next_freq_shared(sg_cpu, time);
-
+ next_f = rt_mode
+ ? sg_policy->policy->cpuinfo.max_freq
+ : sugov_next_freq_shared(sg_cpu, time);
sugov_update_commit(sg_policy, time, next_f);
}
--
2.7.4
Currently, sg_cpu's flags are set to the value defined by the last call of
cpufreq_update_util()/cpufreq_update_this_cpu(); for the RT/DL classes
this corresponds to the SCHED_CPUFREQ_{RT,DL} flags always being set.
When multiple CPUs share the same frequency domain it might happen that a
CPU which executed an RT task, right before entering IDLE, has one of the
SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.
Although such an idle CPU is _going to be_ ignored by
sugov_next_freq_shared():
1. this kind of "useless RT request" is ignored only if more than
TICK_NSEC have elapsed since the last update
2. we can still potentially trigger an already too late switch to
MAX, which also starts a new throttling interval
3. the internal state machine is not consistent with what the
scheduler knows, i.e. the CPU is now actually idle
Thus, in sugov_next_freq_shared(), where utilisation and flags are
aggregated across all the CPUs of a frequency domain, it can turn out
that all the CPUs of that domain run unnecessarily at the maximum OPP
until another event happens on the idle CPU, which eventually clears the
SCHED_CPUFREQ_{RT,DL} flag, or until the idle CPU gets ignored once
TICK_NSEC have elapsed since it entered IDLE.
Such a behaviour can harm the energy efficiency of systems where RT
workloads are infrequent and other CPUs in the same frequency domain
run small-utilisation workloads, which is quite a common scenario in
mobile embedded systems.
This patch proposes a solution aligned with the current principle of
updating the flags each time a scheduling event happens. The scheduling
of the idle_task on a CPU is considered one such meaningful event.
That's why, when the idle_task is selected for execution, we poke the
schedutil policy to reset the flags for that CPU.
No frequency transition is activated at that point, which is fair in
case the RT workload should come back in the future. However, this still
allows other CPUs in the same frequency domain to scale down the
frequency in case that should be possible.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes from v1:
- added "unlikely()" around the statement (SteveR)
---
include/linux/sched/cpufreq.h | 1 +
kernel/sched/cpufreq_schedutil.c | 7 +++++++
kernel/sched/idle_task.c | 4 ++++
3 files changed, 12 insertions(+)
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d2be2cc..36ac8d2 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -10,6 +10,7 @@
#define SCHED_CPUFREQ_RT (1U << 0)
#define SCHED_CPUFREQ_DL (1U << 1)
#define SCHED_CPUFREQ_IOWAIT (1U << 2)
+#define SCHED_CPUFREQ_IDLE (1U << 3)
#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eaba6d6..004ae18 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sg_cpu->util = util;
sg_cpu->max = max;
+
+ /* CPU is entering IDLE, reset flags without triggering an update */
+ if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
+ sg_cpu->flags = 0;
+ goto done;
+ }
sg_cpu->flags = flags;
sugov_set_iowait_boost(sg_cpu, time, flags);
@@ -318,6 +324,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sugov_update_commit(sg_policy, time, next_f);
}
+done:
raw_spin_unlock(&sg_policy->update_lock);
}
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 0c00172..a844c91 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
put_prev_task(rq, prev);
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
+
+ /* kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
+
return rq->idle;
}
--
2.7.4
The policy in use for RT/DL tasks sets the maximum frequency when a task
in these classes calls cpufreq_update_this_cpu(). However, the
current implementation still enforces frequency switch rate limiting
while these tasks are running.
This potentially works against the goal of switching to the maximum OPP
when RT tasks are running. In certain unfortunate cases it can even happen
that an RT task almost completes its activation at a lower OPP.
This patch deliberately overrides the rate-limiting configuration
to better serve RT/DL tasks. As long as a frequency scaling operation
is not in progress, a frequency switch is always authorized when
running in "rt_mode", i.e. when the current task on a CPU belongs to the
RT/DL class.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index df433f1..7b1dc7e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -72,7 +72,8 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
/************************ Governor internals ***********************/
-static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy,
+ u64 time, bool rt_mode)
{
s64 delta_ns;
@@ -89,6 +90,10 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
return true;
}
+ /* Always update if a RT/DL task is running */
+ if (rt_mode)
+ return true;
+
delta_ns = time - sg_policy->last_freq_update_time;
return delta_ns >= sg_policy->freq_update_delay_ns;
}
@@ -226,11 +231,6 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
- if (!sugov_should_update_freq(sg_policy, time))
- return;
-
- busy = sugov_cpu_is_busy(sg_cpu);
-
/*
* While RT/DL tasks are running we do not want FAIR tasks to
* overwrite this CPU's flags, still we can update utilization and
@@ -239,6 +239,11 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
rt_mode = task_has_dl_policy(current) ||
task_has_rt_policy(current) ||
(flags & SCHED_CPUFREQ_RT_DL);
+ if (!sugov_should_update_freq(sg_policy, time, rt_mode))
+ return;
+
+ busy = sugov_cpu_is_busy(sg_cpu);
+
if (rt_mode) {
next_f = policy->cpuinfo.max_freq;
} else {
@@ -336,7 +341,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
- if (sugov_should_update_freq(sg_policy, time)) {
+ if (sugov_should_update_freq(sg_policy, time, rt_mode)) {
next_f = rt_mode
? sg_policy->policy->cpuinfo.max_freq
: sugov_next_freq_shared(sg_cpu, time);
--
2.7.4
Currently schedutil updates are triggered for the RT class from a single
call site, in update_curr_rt(), which is used in:
- dequeue_task_rt:
but it does not make sense to set schedutil's SCHED_CPUFREQ_RT in
case the next task will not be an RT one
- put_prev_task_rt:
likewise, we set the SCHED_CPUFREQ_RT flag without knowing whether it is
required by the next task
- pick_next_task_rt:
likewise, schedutil's SCHED_CPUFREQ_RT is set in case the prev task
was RT, while we don't yet know if the next one will be RT
- task_tick_rt:
that's the only really useful call, which can ramp up the frequency in
case an RT task started its execution without a chance to request a
frequency switch (e.g. because of the schedutil rate limit)
Apart from the last call in task_tick_rt, the others are at best useless.
Thus, although calling from update_curr_rt() is a simple solution, not
all of its call sites are interesting for triggering a frequency switch,
while some of the most interesting points are not covered by that call.
For example, a task switched to RT has to wait for the next tick to get
the frequency boost.
This patch fixes these issues by explicitly placing the schedutil
update calls at the only sensible places, which are:
- when an RT task wakes up and is enqueued on a CPU
- when we actually pick an RT task for execution
- at each tick time
- when a task is switched to the RT class
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/rt.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 45caf93..8c25e95 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -969,9 +969,6 @@ static void update_curr_rt(struct rq *rq)
if (unlikely((s64)delta_exec <= 0))
return;
- /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
- cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
-
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));
@@ -1337,6 +1334,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
enqueue_pushable_task(rq, p);
+
+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
}
static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -1574,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = _pick_next_task_rt(rq);
+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
+
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
@@ -2367,6 +2370,9 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
struct sched_rt_entity *rt_se = &p->rt;
+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
+
update_curr_rt(rq);
watchdog(rq, p);
@@ -2402,6 +2408,9 @@ static void set_curr_task_rt(struct rq *rq)
p->se.exec_start = rq_clock_task(rq);
+ /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
+
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
}
--
2.7.4
Currently the utilization of the FAIR class is collected before taking
the policy lock. Although that should not be a big issue in most cases,
we don't really know how much latency there can be between the
utilization reading and its usage.
Let's read the FAIR utilization right before its usage, to be better in
sync with the current status of a CPU.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 98704d8..df433f1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -308,10 +308,9 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
if (unlikely(current == sg_policy->thread))
return;
- sugov_get_util(&util, &max);
-
raw_spin_lock(&sg_policy->update_lock);
+ sugov_get_util(&util, &max);
sg_cpu->util = util;
sg_cpu->max = max;
--
2.7.4
On 04-07-17, 18:34, Patrick Bellasi wrote:
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d2be2cc..36ac8d2 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -10,6 +10,7 @@
> #define SCHED_CPUFREQ_RT (1U << 0)
> #define SCHED_CPUFREQ_DL (1U << 1)
> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> +#define SCHED_CPUFREQ_IDLE (1U << 3)
>
> #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eaba6d6..004ae18 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>
> sg_cpu->util = util;
> sg_cpu->max = max;
> +
> + /* CPU is entering IDLE, reset flags without triggering an update */
> + if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> + sg_cpu->flags = 0;
> + goto done;
> + }
Why is it important to have the above diff at all? For example, we aren't doing
similar stuff in sugov_update_single(), and that will go on and try to change
the frequency if rate_limit_us time has passed since the last update.
And also, why is it important to write 0 to sg_cpu->flags? What wouldn't work
if we set sg_cpu->flags to SCHED_CPUFREQ_IDLE in this case? i.e. just the below
statement should be good enough for us.
> sg_cpu->flags = flags;
>
> sugov_set_iowait_boost(sg_cpu, time, flags);
> @@ -318,6 +324,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> sugov_update_commit(sg_policy, time, next_f);
> }
>
> +done:
> raw_spin_unlock(&sg_policy->update_lock);
> }
>
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index 0c00172..a844c91 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> put_prev_task(rq, prev);
> update_idle_core(rq);
> schedstat_inc(rq->sched_goidle);
> +
> + /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> +
This looks correct.
Can we completely avoid the utilization contribution of the CPUs which have
gone idle? Right now we avoid them with the help of (delta_ns > TICK_NSEC).
Can we instead check this SCHED_CPUFREQ_IDLE flag?
--
viresh
On 04-07-17, 18:34, Patrick Bellasi wrote:
> In system where multiple CPUs shares the same frequency domain a small
> workload on a CPU can still be subject to frequency spikes, generated by
> the activation of the sugov's kthread.
>
> Since the sugov kthread is a special RT task, which goal is just that to
> activate a frequency transition, it does not make sense for it to bias
> the schedutil's frequency selection policy.
>
> This patch exploits the information related to the current task to silently
> ignore cpufreq_update_this_cpu() calls, coming from the RT scheduler, while
> the sugov kthread is running.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes from v1:
> - move check before policy spinlock (JuriL)
> ---
> kernel/sched/cpufreq_schedutil.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index c982dd0..eaba6d6 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -218,6 +218,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> unsigned int next_f;
> bool busy;
>
> + /* Skip updates generated by sugov kthreads */
> + if (unlikely(current == sg_policy->thread))
> + return;
> +
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
>
> @@ -290,6 +294,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> unsigned long util, max;
> unsigned int next_f;
>
> + /* Skip updates generated by sugov kthreads */
> + if (unlikely(current == sg_policy->thread))
> + return;
> +
> sugov_get_util(&util, &max);
Yes, we discussed this last time as well (I looked again at those discussions
and am still confused a bit), but wanted to clarify one more time.
After the 2nd patch of this series is applied, why will we still have this
problem? As we concluded last time, the problem wouldn't happen until the
time the sugov RT thread is running (Hint: work_in_progress). And once the
sugov RT thread is gone, one of the other scheduling classes will take over
and should update the flag pretty quickly.
Are we worried about the time between when the sugov RT thread finishes and
when the CFS or IDLE sched class calls the util handler again? If yes, then
we will still have that problem for any normal RT/DL task. Isn't it?
--
viresh
On 04-07-17, 18:34, Patrick Bellasi wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 004ae18..98704d8 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -216,6 +216,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> struct cpufreq_policy *policy = sg_policy->policy;
> unsigned long util, max;
> unsigned int next_f;
> + bool rt_mode;
> bool busy;
>
> /* Skip updates generated by sugov kthreads */
> @@ -230,7 +231,15 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>
> busy = sugov_cpu_is_busy(sg_cpu);
>
> - if (flags & SCHED_CPUFREQ_RT_DL) {
> + /*
> + * While RT/DL tasks are running we do not want FAIR tasks to
> + * overvrite this CPU's flags, still we can update utilization and
> + * frequency (if required/possible) to be fair with these tasks.
> + */
> + rt_mode = task_has_dl_policy(current) ||
> + task_has_rt_policy(current) ||
> + (flags & SCHED_CPUFREQ_RT_DL);
We may want to create a separate inline function for the above, as it is
already used twice in this patch.
But I was wondering if we can get some help from the scheduler to avoid such
code here. I understand that we don't want to do the aggregation in the
scheduler, to keep it clean and keep such governor-specific things here.
But what about clearing the sched class's flag from the .pick_next_task()
callback when it returns NULL?
What about something like this instead (completely untested), with which we
don't need the 2/3 patch as well:
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index d2be2ccbb372..e81a6b5591f5 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -11,6 +11,10 @@
#define SCHED_CPUFREQ_DL (1U << 1)
#define SCHED_CPUFREQ_IOWAIT (1U << 2)
+#define SCHED_CPUFREQ_CLEAR (1U << 31)
+#define SCHED_CPUFREQ_CLEAR_RT (SCHED_CPUFREQ_CLEAR | SCHED_CPUFREQ_RT)
+#define SCHED_CPUFREQ_CLEAR_DL (SCHED_CPUFREQ_CLEAR | SCHED_CPUFREQ_DL)
+
#define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
#ifdef CONFIG_CPU_FREQ
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 076a2e31951c..f32e15d59d62 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -218,6 +218,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int next_f;
bool busy;
+ if (flags & SCHED_CPUFREQ_CLEAR)
+ return;
+
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -296,7 +299,13 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
sg_cpu->util = util;
sg_cpu->max = max;
- sg_cpu->flags = flags;
+
+ if (unlikely(flags & SCHED_CPUFREQ_CLEAR)) {
+ sg_cpu->flags &= ~(flags & ~SCHED_CPUFREQ_CLEAR);
+ return;
+ }
+
+ sg_cpu->flags |= flags;
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a2ce59015642..441d6153d654 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1203,8 +1203,10 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (prev->sched_class == &dl_sched_class)
update_curr_dl(rq);
- if (unlikely(!dl_rq->dl_nr_running))
+ if (unlikely(!dl_rq->dl_nr_running)) {
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_CLEAR_DL);
return NULL;
+ }
put_prev_task(rq, prev);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 979b7341008a..bca9e4bb7ec4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1556,8 +1556,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (prev->sched_class == &rt_sched_class)
update_curr_rt(rq);
- if (!rt_rq->rt_queued)
+ if (!rt_rq->rt_queued) {
+ cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_CLEAR_RT);
return NULL;
+ }
put_prev_task(rq, prev);
--
viresh
On 05-Jul 10:30, Viresh Kumar wrote:
> On 04-07-17, 18:34, Patrick Bellasi wrote:
> > In system where multiple CPUs shares the same frequency domain a small
> > workload on a CPU can still be subject to frequency spikes, generated by
> > the activation of the sugov's kthread.
> >
> > Since the sugov kthread is a special RT task, which goal is just that to
> > activate a frequency transition, it does not make sense for it to bias
> > the schedutil's frequency selection policy.
> >
> > This patch exploits the information related to the current task to silently
> > ignore cpufreq_update_this_cpu() calls, coming from the RT scheduler, while
> > the sugov kthread is running.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Viresh Kumar <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> >
> > ---
> > Changes from v1:
> > - move check before policy spinlock (JuriL)
> > ---
> > kernel/sched/cpufreq_schedutil.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index c982dd0..eaba6d6 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -218,6 +218,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> > unsigned int next_f;
> > bool busy;
> >
> > + /* Skip updates generated by sugov kthreads */
> > + if (unlikely(current == sg_policy->thread))
> > + return;
> > +
> > sugov_set_iowait_boost(sg_cpu, time, flags);
> > sg_cpu->last_update = time;
> >
> > @@ -290,6 +294,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > unsigned long util, max;
> > unsigned int next_f;
> >
> > + /* Skip updates generated by sugov kthreads */
> > + if (unlikely(current == sg_policy->thread))
> > + return;
> > +
> > sugov_get_util(&util, &max);
>
> Yes we discussed this last time as well (I looked again at those discussions and
> am still confused a bit), but wanted to clarify one more time.
>
> After the 2nd patch of this series is applied, why will we still have this
> problem? As we concluded it last time, the problem wouldn't happen until the
> time the sugov RT thread is running (Hint: work_in_progress). And once the sugov
> RT thread is gone, one of the other scheduling classes will take over and should
> update the flag pretty quickly.
>
> Are we worried about the time between the sugov RT thread finishes and when the
> CFS or IDLE sched class call the util handler again? If yes, then we will still
> have that problem for any normal RT/DL task. Isn't it ?
Yes, we are worried about that time: without this, we can generate
spikes to the max OPP even when only relatively small FAIR tasks are
running.
The same problem is not there for the other "normal RT/DL" tasks, just
because for those tasks this is the expected behavior: we wanna go to
max.
To the contrary, the sugov kthread, although it is an RT task, exists
only to make the "machinery" work: it's an actuator. Thus, IMO it
makes no sense from a design standpoint for it to interfere in any way
with what the "machinery" is doing.
Finally, the second patch of this series fixes a kind-of symmetrical
issue: while this one avoids going to the max OPP, the next one avoids
staying at the max OPP once it is no longer needed.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 05-Jul 10:20, Viresh Kumar wrote:
> On 04-07-17, 18:34, Patrick Bellasi wrote:
> > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> > index d2be2cc..36ac8d2 100644
> > --- a/include/linux/sched/cpufreq.h
> > +++ b/include/linux/sched/cpufreq.h
> > @@ -10,6 +10,7 @@
> > #define SCHED_CPUFREQ_RT (1U << 0)
> > #define SCHED_CPUFREQ_DL (1U << 1)
> > #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> > +#define SCHED_CPUFREQ_IDLE (1U << 3)
> >
> > #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index eaba6d6..004ae18 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >
> > sg_cpu->util = util;
> > sg_cpu->max = max;
> > +
> > + /* CPU is entering IDLE, reset flags without triggering an update */
> > + if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > + sg_cpu->flags = 0;
> > + goto done;
> > + }
>
> Why is it important to have the above diff at all ? For example we aren't doing
> similar stuff in sugov_update_single() and that will go on and try to change the
> frequency if rate_limit_us time is over since last update.
The purpose here is mainly to avoid interference from IDLE CPUs with
other CPUs in the same frequency domain, by simply resetting their
"requests".
In the single-CPU case it's completely up to the policy to decide what
to do when we enter IDLE, without risking affecting other CPUs.
But perhaps you are right, maybe we should use the same heuristics in
both cases: entering idle just resets the flags and does not enforce,
for example, a frequency drop.
> And also why is it important to write 0 to sg_cpu->flags ? What wouldn't work if
> we set sg_cpu->flags to SCHED_CPUFREQ_IDLE in this case ? i.e. Just the below
> statement should be good for us.
Let's say the flags have the RT/DL bit set when the RT task goes to
sleep: is there any specific reason to keep this flag set while the
CPU is IDLE?
IOW, why should we care about information related to an event which is
now over?
The proposal of this patch is just meant to make sure that the flags,
being a state variable, always describe the current status of the
sugov "state machine".
If a CPU is IDLE there are no relevant events going on, and thus the
flags had better be reset.
>
> > sg_cpu->flags = flags;
> >
> > sugov_set_iowait_boost(sg_cpu, time, flags);
> > @@ -318,6 +324,7 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > sugov_update_commit(sg_policy, time, next_f);
> > }
> >
> > +done:
> > raw_spin_unlock(&sg_policy->update_lock);
> > }
> >
> > diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> > index 0c00172..a844c91 100644
> > --- a/kernel/sched/idle_task.c
> > +++ b/kernel/sched/idle_task.c
> > @@ -29,6 +29,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > put_prev_task(rq, prev);
> > update_idle_core(rq);
> > schedstat_inc(rq->sched_goidle);
> > +
> > + /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IDLE);
> > +
>
> This looks correct.
>
> Can we completely avoid the utilization contribution of the CPUs which have gone
> idle? Right now we avoid them with help of (delta_ns > TICK_NSEC). Can we
> instead check this SCHED_CPUFREQ_IDLE flag ?
I would say that the blocked utilization of an IDLE CPU is still worth
considering, at least for a limited amount of time, for a few main
reasons:
1. it represents CPU bandwidth that is likely to be required by a task
which can wake up in a short while. Consider for example an 80% task
activated every 16ms: even if it's not running right now, it's
likely to wake up in the next ~3ms to run for the following ~13ms.
Thus, we should probably still consider that CPU's utilization.
2. we already have policies to gracefully reduce the current OPP if
its utilization decreases. This means that we are interested in a
sort of policy which favors higher OPPs, to avoid impacting the
performance of tasks which suddenly wake up.
3. a CPU entering IDLE is not a great source of new information
for OPP selection; I would not strictly bind an OPP change to this
event. That's also why this patch proposes to clear the flags
without actually triggering an OPP change.
Moreover, maybe the issue you are trying to solve is more related to
having stale utilization for IDLE CPUs?
In that case we should fix the real source of the issue, which is the
utilization of an IDLE CPU not being updated over time. But that's
outside the scope of this series.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
[+Brendan]
On 05-Jul 11:31, Viresh Kumar wrote:
> On 04-07-17, 18:34, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 004ae18..98704d8 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -216,6 +216,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> > struct cpufreq_policy *policy = sg_policy->policy;
> > unsigned long util, max;
> > unsigned int next_f;
> > + bool rt_mode;
> > bool busy;
> >
> > /* Skip updates generated by sugov kthreads */
> > @@ -230,7 +231,15 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> >
> > busy = sugov_cpu_is_busy(sg_cpu);
> >
> > - if (flags & SCHED_CPUFREQ_RT_DL) {
> > + /*
> > + * While RT/DL tasks are running we do not want FAIR tasks to
> > + * overwrite this CPU's flags, still we can update utilization and
> > + * frequency (if required/possible) to be fair with these tasks.
> > + */
> > + rt_mode = task_has_dl_policy(current) ||
> > + task_has_rt_policy(current) ||
> > + (flags & SCHED_CPUFREQ_RT_DL);
>
> We may want to create a separate inline function for above, as it is already
> used twice in this patch.
Good idea.
> But I was wondering if we can get some help from the scheduler to avoid such
> code here. I understand that we don't want to do the aggregation in the
> scheduler to keep it clean and keep such governor specific thing here.
Indeed.
> But what about clearing the sched-class's flag from .pick_next_task() callback
> when they return NULL ?
>
> What about something like this instead (completely untested), with which we
> don't need the 2/3 patch as well:
Just had a fast check but I think something like that can work.
We had an internal discussion with Brendan (in CC now) which had a
similar proposal.
Main counter arguments for me was:
1. we want to reduce the pollution of scheduling classes' code with
schedutil-specific code, unless strictly necessary
2. we have never had "clear bit" semantics for flags updates
Thus this proposal seemed to me less of a discontinuity wrt the
current interface. However, something similar to what you propose
below should also work. Let's collect some more feedback...
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d2be2ccbb372..e81a6b5591f5 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -11,6 +11,10 @@
> #define SCHED_CPUFREQ_DL (1U << 1)
> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
>
> +#define SCHED_CPUFREQ_CLEAR (1U << 31)
> +#define SCHED_CPUFREQ_CLEAR_RT (SCHED_CPUFREQ_CLEAR | SCHED_CPUFREQ_RT)
> +#define SCHED_CPUFREQ_CLEAR_DL (SCHED_CPUFREQ_CLEAR | SCHED_CPUFREQ_DL)
> +
> #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>
> #ifdef CONFIG_CPU_FREQ
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 076a2e31951c..f32e15d59d62 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -218,6 +218,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> unsigned int next_f;
> bool busy;
>
> + if (flags & SCHED_CPUFREQ_CLEAR)
> + return;
Here we should still clear the flags, like we do for the shared
case... just to keep the internal state consistent with the
notifications we have received from the scheduling classes.
> +
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
>
> @@ -296,7 +299,13 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>
> sg_cpu->util = util;
> sg_cpu->max = max;
> - sg_cpu->flags = flags;
> +
> + if (unlikely(flags & SCHED_CPUFREQ_CLEAR)) {
> + sg_cpu->flags &= ~(flags & ~SCHED_CPUFREQ_CLEAR);
> + return;
> + }
> +
> + sg_cpu->flags |= flags;
>
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a2ce59015642..441d6153d654 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1203,8 +1203,10 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> if (prev->sched_class == &dl_sched_class)
> update_curr_dl(rq);
>
> - if (unlikely(!dl_rq->dl_nr_running))
> + if (unlikely(!dl_rq->dl_nr_running)) {
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_CLEAR_DL);
> return NULL;
> + }
>
> put_prev_task(rq, prev);
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 979b7341008a..bca9e4bb7ec4 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1556,8 +1556,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> if (prev->sched_class == &rt_sched_class)
> update_curr_rt(rq);
>
> - if (!rt_rq->rt_queued)
> + if (!rt_rq->rt_queued) {
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_CLEAR_RT);
> return NULL;
> + }
>
> put_prev_task(rq, prev);
>
> --
> viresh
--
#include <best/regards.h>
Patrick Bellasi
On 05-07-17, 12:38, Patrick Bellasi wrote:
> On 05-Jul 10:30, Viresh Kumar wrote:
> > Yes we discussed this last time as well (I looked again at those discussions and
> > am still confused a bit), but wanted to clarify one more time.
> >
> > After the 2nd patch of this series is applied, why will we still have this
> > problem? As we concluded it last time, the problem wouldn't happen until the
> > time the sugov RT thread is running (Hint: work_in_progress). And once the sugov
> > RT thread is gone, one of the other scheduling classes will take over and should
> > update the flag pretty quickly.
> >
> > Are we worried about the time between the sugov RT thread finishes and when the
> > CFS or IDLE sched class call the util handler again? If yes, then we will still
> > have that problem for any normal RT/DL task. Isn't it ?
>
> Yes, we are worried about that time,
But isn't that a very very small amount of time? i.e. As soon as the RT thread
is finished, we will select the next task from CFS or go to IDLE class (of
course if there is nothing left in DL/RT). And this should happen very quickly.
Are we sure we really see problems in that short time? Sure it can happen, but
it looks to be an extreme corner case and just wanted to check if it really
happened for you after the 2nd patch.
> without this we can generate
> spikes to the max OPP even when only relatively small FAIR tasks are
> running.
>
> The same problem is not there for the other "normal RT/DL" tasks, just
> because for those tasks this is the expected behavior: we wanna go to
> max.
By the same problem I meant that after the last RT task is finished and
before pick_next_task of the IDLE_CLASS (or CFS) is called, we can still
get a callback into schedutil and that may raise the frequency to MAX.
It's a similar kind of problem, but yes, we never wanted the freq to go
to max for the sugov thread.
> To the contrary the sugov kthread, although being a RT task, is just
> functional to the "machinery" to work, it's an actuator. Thus, IMO it
> makes no sense from a design standpoint for it to interfere whatsoever
> with what the "machinery" is doing.
I think everyone agrees on this. I was just exploring if that can be achieved
without any special code like what this patch proposes.
I was wondering about what will happen for a case where we have two RT tasks
(one of them is sugov thread) and when we land into schedutil the current task
is sugov. With this patch we will not set the flag, but actually we have another
task which is RT.
--
viresh
On 05-07-17, 14:04, Patrick Bellasi wrote:
> On 05-Jul 10:20, Viresh Kumar wrote:
> > And also why is it important to write 0 to sg_cpu->flags ? What wouldn't work if
> > we set sg_cpu->flags to SCHED_CPUFREQ_IDLE in this case ? i.e. Just the below
> > statement should be good for us.
>
> Let's say the flags have the RT/DL bit set when the RT task goes to
> sleep: is there any specific reason to keep this flag set while the
> CPU is IDLE?
> IOW, why should we care about information related to an event which
> is now over?
Maybe I wasn't able to communicate what I wanted to say, but I am not asking you
to keep RT/DL flags as is, but rather set the flags variable to
SCHED_CPUFREQ_IDLE (1 << 3). My concerns were about adding an additional
conditional statement here, while we can live without one.
> The proposal of this patch is just meant to make sure that the flags,
> being a state variable, always describe the current status of the
> sugov "state machine".
> If a CPU is IDLE there are no relevant events going on, and thus the
> flags had better be reset.
or set to SCHED_CPUFREQ_IDLE.
> > This looks correct.
> >
> > Can we completely avoid the utilization contribution of the CPUs which have gone
> > idle? Right now we avoid them with help of (delta_ns > TICK_NSEC). Can we
> > instead check this SCHED_CPUFREQ_IDLE flag ?
>
> I would say that the blocked utilization of an IDLE CPU is still worth
> to be considered, at least for a limited amount of time, for few main
> reasons:
>
> 1. it represents CPU bandwidth that is likely to be required by a task
> which can wakeup in a short while. Consider for example an 80% task
> activated every 16ms: even if it's not running right now it's
> likely to wakeup in the next ~3ms to run for the following ~13ms.
> Thus, we should probably better consider that CPU utilization.
>
> 2. we already have policies to gracefully reduce the current OPP if
> its utilization decrease. This means that we are interested in a
> sort of policy which favors higher OPPs to avoid impacting
> performance of tasks which suddenly wakeup.
>
> 3. A CPU entering IDLE is not a great source of new information
> for OPP selection, I would not strictly bind an OPP change to this
> event. That's also why this patch proposes to clear the flags
> without actually triggering an OPP change.
>
> Moreover, maybe the issue you are trying to solve it's more related to
> having a stale utilization for an IDLE CPUs?
I wasn't trying to solve any issue here, just discussing what we should
do. Yeah, it seems fair to keep the utilization of the idle CPU for
another TICK, after which we are ignoring it anyway.
--
viresh
On 05-07-17, 14:41, Patrick Bellasi wrote:
> On 05-Jul 11:31, Viresh Kumar wrote:
> Just had a fast check but I think something like that can work.
> We had an internal discussion with Brendan (in CC now) which had a
> similar proposal.
>
> Main counter arguments for me was:
> 1. we want to reduce the pollution of scheduling classes code with
> schedutil specific code, unless strictly necessary
s/schedutil/cpufreq, as the util hooks are getting called for some other stuff
as well.
> 2. we never had a "clear bit" semantics for flags updates
>
> Thus this proposal seemed to me less of a discontinuity wrt the
> current interface. However, something similar to what you propose
> below should also work.
With the kind of problems we have in hand now, it seems that it would be good
for the governors to know what kind of stuff is queued on the CPU (i.e. the
aggregation of all the flags) and the only sane way of doing that is by clearing
the flag once a class is done with it.
Else we would be required to have code that tries to find the same information
in an indirect way, like what this patch does with the current task.
> Let's collect some more feedback...
Sure.
> > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> > index d2be2ccbb372..e81a6b5591f5 100644
> > --- a/include/linux/sched/cpufreq.h
> > +++ b/include/linux/sched/cpufreq.h
> > @@ -11,6 +11,10 @@
> > #define SCHED_CPUFREQ_DL (1U << 1)
> > #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> >
> > +#define SCHED_CPUFREQ_CLEAR (1U << 31)
> > +#define SCHED_CPUFREQ_CLEAR_RT (SCHED_CPUFREQ_CLEAR | SCHED_CPUFREQ_RT)
> > +#define SCHED_CPUFREQ_CLEAR_DL (SCHED_CPUFREQ_CLEAR | SCHED_CPUFREQ_DL)
> > +
> > #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> >
> > #ifdef CONFIG_CPU_FREQ
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 076a2e31951c..f32e15d59d62 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -218,6 +218,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> > unsigned int next_f;
> > bool busy;
> >
> > + if (flags & SCHED_CPUFREQ_CLEAR)
> > + return;
>
> Here we should still clear the flags, like we do for the shared
> case... just to keep the internal state consistent with the
> notifications we have received from the scheduling classes.
The sg_cpu->flags field isn't used currently for the single CPU per policy case,
but only for shared policies. But yes, we need to maintain that here now as
well to know what all is queued on a CPU.
--
viresh
On Wednesday, July 05, 2017 12:38:34 PM Patrick Bellasi wrote:
> On 05-Jul 10:30, Viresh Kumar wrote:
> > On 04-07-17, 18:34, Patrick Bellasi wrote:
> > > In systems where multiple CPUs share the same frequency domain, a small
> > > workload on a CPU can still be subject to frequency spikes, generated by
> > > the activation of the sugov's kthread.
> > >
> > > Since the sugov kthread is a special RT task whose only goal is to
> > > activate a frequency transition, it does not make sense for it to bias
> > > the schedutil's frequency selection policy.
> > >
> > > This patch exploits the information related to the current task to silently
> > > ignore cpufreq_update_this_cpu() calls, coming from the RT scheduler, while
> > > the sugov kthread is running.
> > >
> > > Signed-off-by: Patrick Bellasi <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Viresh Kumar <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > >
> > > ---
> > > Changes from v1:
> > > - move check before policy spinlock (JuriL)
> > > ---
> > > kernel/sched/cpufreq_schedutil.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index c982dd0..eaba6d6 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -218,6 +218,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> > > unsigned int next_f;
> > > bool busy;
> > >
> > > + /* Skip updates generated by sugov kthreads */
> > > + if (unlikely(current == sg_policy->thread))
> > > + return;
> > > +
> > > sugov_set_iowait_boost(sg_cpu, time, flags);
> > > sg_cpu->last_update = time;
> > >
> > > @@ -290,6 +294,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> > > unsigned long util, max;
> > > unsigned int next_f;
> > >
> > > + /* Skip updates generated by sugov kthreads */
> > > + if (unlikely(current == sg_policy->thread))
> > > + return;
> > > +
> > > sugov_get_util(&util, &max);
> >
> > Yes we discussed this last time as well (I looked again at those discussions and
> > am still confused a bit), but wanted to clarify one more time.
> >
> > After the 2nd patch of this series is applied, why will we still have this
> > problem? As we concluded it last time, the problem wouldn't happen until the
> > time the sugov RT thread is running (Hint: work_in_progress). And once the sugov
> > RT thread is gone, one of the other scheduling classes will take over and should
> > update the flag pretty quickly.
> >
> > Are we worried about the time between the sugov RT thread finishes and when the
> > CFS or IDLE sched class call the util handler again? If yes, then we will still
> > have that problem for any normal RT/DL task. Isn't it ?
>
> Yes, we are worried about that time, without this we can generate
> spikes to the max OPP even when only relatively small FAIR tasks are
> running.
>
> The same problem is not there for the other "normal RT/DL" tasks, just
> because for those tasks this is the expected behavior: we wanna go to
> max.
>
> To the contrary the sugov kthread, although being a RT task, is just
> functional to the "machinery" to work, it's an actuator. Thus, IMO it
> makes no sense from a design standpoint for it to interfere whatsoever
> with what the "machinery" is doing.
How is this related to Juri's series?
Thanks,
Rafael
On Tuesday, July 04, 2017 06:34:05 PM Patrick Bellasi wrote:
> Each time a CPU utilisation update is issued by the scheduler a flag, which
> mainly defines which scheduling class is asking for the update, is used by the
> frequency selection policy to support the selection of the most appropriate
> OPP.
>
> In the current implementation, CPU flags are overridden each time the scheduler
> calls schedutil for an update. Such a behavior seems to be sub-optimal,
> especially on systems where frequency domains span across multiple CPUs.
>
> Indeed, assuming CPU1 and CPU2 share the same frequency domain, there can be
> the following issues:
>
> A) Small FAIR task running at MAX OPP.
> A RT task, which just executed on CPU1, can keep the domain at the
> max frequency for a prolonged period of time after its completion,
> even if there are no longer RT tasks running on CPUs of its domain.
>
> B) FAIR wakeup reducing the OPP of the current RT task.
> A FAIR task enqueued in a CPU where a RT task is running overrides the flag
> configured by the RT task thus potentially causing an unwanted frequency
> drop.
>
> C) RT wakeup not running at max OPP.
> An RT task waking up on a CPU which has recently updated its OPP can
> be forced to run at a lower frequency because of the throttling
> enforced by schedutil, even if there are not OPP transitions
> currently in progress.
>
> .:: Patches organization
> ========================
>
> This series proposes a set of fixes for the aforementioned issues and it's an
> update addressing all the main comments collected from the previous posting
> [1].
It seems to me that there is a nonzero overlap between this and Juri's work.
If that's correct, I'd like this series to go on top of Juri's.
Thanks,
Rafael
On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi
<[email protected]> wrote:
> Currently, sg_cpu's flags are set to the value defined by the last call of
> the cpufreq_update_util()/cpufreq_update_this_cpu(); for RT/DL classes
> this corresponds to the SCHED_CPUFREQ_{RT/DL} flags always being set.
>
> When multiple CPUs share the same frequency domain, it might happen that a
> CPU which executed a RT task, right before entering IDLE, has one of the
> SCHED_CPUFREQ_RT_DL flags set, permanently, until it exits IDLE.
>
> Although such an idle CPU is _going to be_ ignored by the
> sugov_next_freq_shared():
> 1. this kind of "useless RT request" is ignored only if more than
> TICK_NSEC has elapsed since the last update
> 2. we can still potentially trigger an already too late switch to
> MAX, which starts also a new throttling interval
> 3. the internal state machine is not consistent with what the
> scheduler knows, i.e. the CPU is now actually idle
>
> Thus, in sugov_next_freq_shared(), where utilisation and flags are
> aggregated across all the CPUs of a frequency domain, it can turn out
> that all the CPUs of that domain can run unnecessarily at the maximum OPP
> until another event happens in the idle CPU, which eventually clears the
> SCHED_CPUFREQ_{RT/DL} flag, or the IDLE CPU gets ignored once TICK_NSEC
> has elapsed since it entered IDLE.
>
> Such a behaviour can harm the energy efficiency of systems where RT
> workloads are not so frequent and other CPUs in the same frequency
> domain are running small utilisation workloads, which is a quite common
> scenario in mobile embedded systems.
>
> This patch proposes a solution which is aligned with the current principle
> to update the flags each time a scheduling event happens. The scheduling
> of the idle_task on a CPU is considered one of such meaningful events.
> That's why when the idle_task is selected for execution we poke the
> schedutil policy to reset the flags for that CPU.
>
> No frequency transitions are activated at that point, which is fair in
> case the RT workload should come back in the future. However, this still
> allows other CPUs in the same frequency domain to scale down the
> frequency in case that should be possible.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes from v1:
> - added "unlikely()" around the statement (SteveR)
> ---
> include/linux/sched/cpufreq.h | 1 +
> kernel/sched/cpufreq_schedutil.c | 7 +++++++
> kernel/sched/idle_task.c | 4 ++++
> 3 files changed, 12 insertions(+)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d2be2cc..36ac8d2 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -10,6 +10,7 @@
> #define SCHED_CPUFREQ_RT (1U << 0)
> #define SCHED_CPUFREQ_DL (1U << 1)
> #define SCHED_CPUFREQ_IOWAIT (1U << 2)
> +#define SCHED_CPUFREQ_IDLE (1U << 3)
>
> #define SCHED_CPUFREQ_RT_DL (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eaba6d6..004ae18 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>
> sg_cpu->util = util;
> sg_cpu->max = max;
> +
> + /* CPU is entering IDLE, reset flags without triggering an update */
> + if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> + sg_cpu->flags = 0;
> + goto done;
> + }
Instead of defining a new flag for idle, wouldn't another way be to
just clear the flag from the RT scheduling class with an extra call to
cpufreq_update_util with flags = 0 during dequeue_rt_entity? That
seems to me to be also the right place to clear the flag since the
flag is set in the corresponding class to begin with.
thanks,
-Joel
Hi,
On Wed, Jul 5, 2017 at 6:41 AM, Patrick Bellasi <[email protected]> wrote:
[..]
>
>> But what about clearing the sched-class's flag from .pick_next_task() callback
>> when they return NULL ?
>>
>> What about something like this instead (completely untested), with which we
>> don't need the 2/3 patch as well:
>
> Just had a fast check but I think something like that can work.
> We had an internal discussion with Brendan (in CC now) which had a
> similar proposal.
>
> Main counter arguments for me was:
> 1. we want to reduce the pollution of scheduling classes code with
> schedutil specific code, unless strictly necessary
> 2. we never had a "clear bit" semantics for flags updates
>
> Thus this proposal seemed to me less of a discontinuity wrt the
> current interface. However, something similar to what you propose
> below should also work. Let's collect some more feedback...
>
I was going to say something similar. I feel it's much cleaner if the
scheduler clears the flags that it set, thus taking "ownership" of the
flag. I feel that will avoid complications like this, where the governor
has to peek into what's currently running and such (and also help with
the suggestion I made for patch 2/3).
Maybe the interface needs an extension for "clear flag" semantics?
thanks,
-Joel
On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi
<[email protected]> wrote:
> Currently the utilization of the FAIR class is collected before locking
> the policy. Although that should not be a big issue for most cases, we
> also don't really know how much latency there can be between the
> utilization reading and its usage.
>
> Let's get the FAIR utilization right before its usage to be better in
> sync with the current status of a CPU.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/cpufreq_schedutil.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 98704d8..df433f1 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -308,10 +308,9 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> if (unlikely(current == sg_policy->thread))
> return;
>
> - sugov_get_util(&util, &max);
> -
> raw_spin_lock(&sg_policy->update_lock);
>
> + sugov_get_util(&util, &max);
> sg_cpu->util = util;
> sg_cpu->max = max;
Is your concern that there will be spinlock contention before calling
sugov_get_util?
If that's the case, with your patch it seems to me such contention
(and hence spinning) itself could cause the utilization to be inflated
- thus calling sugov_get_util after acquiring the lock will not be as
accurate as before. In that case it seems to me it's better to let
sugov_get_util be called before acquiring the lock (as before).
thanks,
-Joel
On 2017-07-04 10:34, Patrick Bellasi wrote:
> Currently the utilization of the FAIR class is collected before locking
> the policy. Although that should not be a big issue for most cases, we
> also don't really know how much latency there can be between the
> utilization reading and its usage.
>
> Let's get the FAIR utilization right before its usage to be better in
> sync with the current status of a CPU.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/cpufreq_schedutil.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c
> b/kernel/sched/cpufreq_schedutil.c
> index 98704d8..df433f1 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -308,10 +308,9 @@ static void sugov_update_shared(struct
> update_util_data *hook, u64 time,
> if (unlikely(current == sg_policy->thread))
> return;
>
> - sugov_get_util(&util, &max);
> -
> raw_spin_lock(&sg_policy->update_lock);
>
> + sugov_get_util(&util, &max);
> sg_cpu->util = util;
> sg_cpu->max = max;
Given that the utilization update hooks are called with the per-cpu rq
lock held (for all classes), I don't think PELT utilization can change
throughout the lifetime of the cpufreq_update_{util,this_cpu} call. Even
with Viresh's remote cpu callback series we'd still have to hold the rq
lock across cpufreq_update_util. What can change today is 'max'
(arch_scale_cpu_capacity) when a cpufreq policy is shared, so the patch
is still needed for that reason, I think.
Thanks,
Vikram
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
On 06/07/17 21:43, Joel Fernandes wrote:
> On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi
> <[email protected]> wrote:
[...]
> > @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> >
> > sg_cpu->util = util;
> > sg_cpu->max = max;
> > +
> > + /* CPU is entering IDLE, reset flags without triggering an update */
> > + if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
> > + sg_cpu->flags = 0;
> > + goto done;
> > + }
>
> Instead of defining a new flag for idle, wouldn't another way be to
> just clear the flag from the RT scheduling class with an extra call to
> cpufreq_update_util with flags = 0 during dequeue_rt_entity? That
> seems to me to be also the right place to clear the flag since the
> flag is set in the corresponding class to begin with.
>
Makes sense to me too. Also considering that for DL (with my patches) we
don't generally want to clear the flag at dequeue time, but only when
the 0-lag timer fires.
Best,
- Juri
Hi Vikram,
On Thu, Jul 6, 2017 at 11:44 PM, Vikram Mulukutla
<[email protected]> wrote:
> On 2017-07-04 10:34, Patrick Bellasi wrote:
>>
>> Currently the utilization of the FAIR class is collected before locking
>> the policy. Although that should not be a big issue for most cases, we
>> also don't really know how much latency there can be between the
>> utilization reading and its usage.
>>
>> Let's get the FAIR utilization right before its usage to be better in
>> sync with the current status of a CPU.
>>
>> Signed-off-by: Patrick Bellasi <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Rafael J. Wysocki <[email protected]>
>> Cc: Viresh Kumar <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> kernel/sched/cpufreq_schedutil.c | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c
>> b/kernel/sched/cpufreq_schedutil.c
>> index 98704d8..df433f1 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -308,10 +308,9 @@ static void sugov_update_shared(struct
>> update_util_data *hook, u64 time,
>> if (unlikely(current == sg_policy->thread))
>> return;
>>
>> - sugov_get_util(&util, &max);
>> -
>> raw_spin_lock(&sg_policy->update_lock);
>>
>> + sugov_get_util(&util, &max);
>> sg_cpu->util = util;
>> sg_cpu->max = max;
>
>
> Given that the utilization update hooks are called with the per-cpu rq lock
> held (for all classes), I don't think PELT utilization can change throughout
> the lifetime of the cpufreq_update_{util,this_cpu} call? Even with Viresh's
> remote cpu callback series we'd still have to hold the rq lock across
> cpufreq_update_util.. what can change today is 'max'
> (arch_scale_cpu_capacity) when a cpufreq policy is shared, so the patch is
> still needed for that reason I think?
>
I didn't follow. Could you elaborate on why you think the patch
helps with the case where max changes while the per-cpu rq lock is held?
thanks,
-Joel
On 2017-07-07 23:14, Joel Fernandes wrote:
> Hi Vikram,
>
> On Thu, Jul 6, 2017 at 11:44 PM, Vikram Mulukutla
> <[email protected]> wrote:
>> On 2017-07-04 10:34, Patrick Bellasi wrote:
>>>
>>> Currently the utilization of the FAIR class is collected before
>>> locking
>>> the policy. Although that should not be a big issue for most cases,
>>> we
>>> also don't really know how much latency there can be between the
>>> utilization reading and its usage.
>>>
>>> Let's get the FAIR utilization right before its usage to be better in
>>> sync with the current status of a CPU.
>>>
>>> Signed-off-by: Patrick Bellasi <[email protected]>
<snip>
>> Given that the utilization update hooks are called with the per-cpu rq
>> lock
>> held (for all classes), I don't think PELT utilization can change
>> throughout
>> the lifetime of the cpufreq_update_{util,this_cpu} call? Even with
>> Viresh's
>> remote cpu callback series we'd still have to hold the rq lock across
>> cpufreq_update_util.. what can change today is 'max'
>> (arch_scale_cpu_capacity) when a cpufreq policy is shared, so the
>> patch is
>> still needed for that reason I think?
>>
>
> I didn't follow, Could you elaborate more why you think the patch
> helps with the case where max changes while the per-cpu rq lock held?
>
So going by Patrick's commit text, the concern was a TOCTOU
(time-of-check to time-of-use) problem, but since we agree that CFS
utilization can't change within an rq-locked critical section, the only
thing that can change is 'max'. So you might be the 8th cpu in line
waiting for the sg-policy lock, and after you get the lock, the max may
be outdated.
But come to think of it, max changes should be triggering schedutil
updates, and those shouldn't be rate-throttled, so maybe we don't
need this at all? It's still somewhat future-proof in case there
is some stat that we read in sugov_get_util that can be updated
asynchronously. However, we can put it in when we need it...
> thanks,
>
> -Joel
On Mon, Jul 10, 2017 at 10:49 AM, Vikram Mulukutla
<[email protected]> wrote:
[..]
>
>>> Given that the utilization update hooks are called with the per-cpu rq
>>> lock
>>> held (for all classes), I don't think PELT utilization can change
>>> throughout
>>> the lifetime of the cpufreq_update_{util,this_cpu} call? Even with
>>> Viresh's
>>> remote cpu callback series we'd still have to hold the rq lock across
>>> cpufreq_update_util.. what can change today is 'max'
>>> (arch_scale_cpu_capacity) when a cpufreq policy is shared, so the patch
>>> is
>>> still needed for that reason I think?
>>>
>>
>> I didn't follow, Could you elaborate more why you think the patch
>> helps with the case where max changes while the per-cpu rq lock held?
>>
>
> So going by Patrick's commit text, the concern was a TOC/TOU
> problem, but since we agree that CFS utilization can't change
> within an rq-locked critical section, the only thing that can
> change is 'max'. So you might be the 8th cpu in line waiting
> for the sg-policy lock, and after you get the lock, the max may
> be outdated.
>
> But come to think of it max changes should be triggering schedutil
> updates and those shouldn't be rate-throttled, so maybe we don't
> need this at all? It's still somewhat future-proof in case there
> is some stat that we read in sugov_get_util that can be updated
> asynchronously. However we can put it in when we need it...
It looks like Juri's patch [1] to split signals already cleaned it up
so we should be all set ;-)
Thanks,
-Joel
[1] https://patchwork.kernel.org/patch/9826201/
On 07/04/2017 10:34 AM, Patrick Bellasi wrote:
> In systems where multiple CPUs share the same frequency domain, a small
> workload on a CPU can still be subject to frequency spikes generated by
> the activation of the sugov kthread.
>
> Since the sugov kthread is a special RT task, whose only goal is to
> activate a frequency transition, it does not make sense for it to bias
> schedutil's frequency selection policy.
>
> This patch exploits the information related to the current task to
> silently ignore cpufreq_update_this_cpu() calls coming from the RT
> scheduler while the sugov kthread is running.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes from v1:
> - move check before policy spinlock (JuriL)
> ---
> kernel/sched/cpufreq_schedutil.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index c982dd0..eaba6d6 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -218,6 +218,10 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
> unsigned int next_f;
> bool busy;
>
> + /* Skip updates generated by sugov kthreads */
> + if (unlikely(current == sg_policy->thread))
> + return;
> +
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
>
> @@ -290,6 +294,10 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
> unsigned long util, max;
> unsigned int next_f;
>
> + /* Skip updates generated by sugov kthreads */
> + if (unlikely(current == sg_policy->thread))
> + return;
> +
This seems super racy, especially when combined with rate_limit_us.
Deciding not to update the frequency for a policy just because the
callback happened in the context of the kthread is not right, especially
when it's combined with the remote CPU callback patches Viresh is
putting out (which I think is a well-intended patch series).
-Saravana
On 07/07/2017 03:17 AM, Juri Lelli wrote:
> On 06/07/17 21:43, Joel Fernandes wrote:
>> On Tue, Jul 4, 2017 at 10:34 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>>> @@ -304,6 +304,12 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
>>>
>>> sg_cpu->util = util;
>>> sg_cpu->max = max;
>>> +
>>> + /* CPU is entering IDLE, reset flags without triggering an update */
>>> + if (unlikely(flags & SCHED_CPUFREQ_IDLE)) {
>>> + sg_cpu->flags = 0;
>>> + goto done;
>>> + }
>>
>> Instead of defining a new flag for idle, wouldn't another way be to
>> just clear the flag from the RT scheduling class with an extra call to
>> cpufreq_update_util with flags = 0 during dequeue_rt_entity? That
>> seems to me to be also the right place to clear the flag since the
>> flag is set in the corresponding class to begin with.
>>
>
> Make sense to me too. Also considering that for DL (with my patches) we
> don't generally want to clear the flag at dequeue time, but only when
> the 0-lag timer fires.
>
Makes sense to me too.
-Saravana