2017-03-19 13:41:20

by Rafael J. Wysocki

Subject: [PATCH 0/2] cpufreq: schedutil: Fix and optimization

Hi,

Patch [1/2] in this series fixes an initialization issue in schedutil.

Patch [2/2] is an optimization, but it might also be regarded as a fix, or at
least a workaround for a problem.

Namely, while playing with the passive mode in intel_pstate lately, I noticed
that schedutil evidently underestimated CPU utilization at high loads. I guess
this has been known for a while, but it is the first time I've seen it so
clearly (by looking at the values actually written to the P-state request
register rather than at frequencies).

After some investigation I can quite confidently attribute that to a PELT
metric issue, as everything else has been ruled out with high confidence,
and hence the patch (which also works around the problem, adding even more
confidence to the observation that PELT underestimates CPU utilization at
least sometimes).

Thanks,
Rafael


2017-03-19 13:41:22

by Rafael J. Wysocki

Subject: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

From: Rafael J. Wysocki <[email protected]>

The PELT metric used by the schedutil governor underestimates the
CPU utilization in some cases. The reason for that may be time spent
in interrupt handlers and similar which is not accounted for by PELT.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

To work around this issue use the observation that, from the
schedutil governor's perspective, CPUs that are never idle should
always run at the maximum frequency and make that happen.

To that end, add a counter of idle calls to struct sugov_cpu and
modify cpuidle_idle_call() to increment that counter every time it
is about to put the given CPU into an idle state. Next, make the
schedutil governor look at that counter for the current CPU every
time before it is about to start heavy computations. If the counter
has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
the CPU has not been idle for at least that long and the governor
will choose the maximum frequency for it without looking at the PELT
metric at all.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
include/linux/sched/cpufreq.h | 6 ++++++
kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++--
kernel/sched/idle.c | 3 +++
3 files changed, 45 insertions(+), 2 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -20,6 +20,7 @@
#include "sched.h"

#define SUGOV_KTHREAD_PRIORITY 50
+#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)

struct sugov_tunables {
struct gov_attr_set attr_set;
@@ -55,6 +56,9 @@ struct sugov_cpu {

unsigned long iowait_boost;
unsigned long iowait_boost_max;
+ unsigned long idle_calls;
+ unsigned long saved_idle_calls;
+ u64 busy_start;
u64 last_update;

/* The fields below are only needed when sharing a policy. */
@@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
sg_cpu->iowait_boost >>= 1;
}

+void cpufreq_schedutil_idle_call(void)
+{
+ struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
+
+ sg_cpu->idle_calls++;
+}
+
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+ if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
+ sg_cpu->busy_start = 0;
+ return false;
+ }
+
+ if (!sg_cpu->busy_start) {
+ sg_cpu->busy_start = sg_cpu->last_update;
+ return false;
+ }
+
+ return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
+}
+
+static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
+{
+ if (!sg_cpu->busy_start)
+ sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
+}
+
static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int flags)
{
@@ -207,7 +239,7 @@ static void sugov_update_single(struct u
if (!sugov_should_update_freq(sg_policy, time))
return;

- if (flags & SCHED_CPUFREQ_RT_DL) {
+ if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -215,6 +247,7 @@ static void sugov_update_single(struct u
next_f = get_next_freq(sg_policy, util, max);
}
sugov_update_commit(sg_policy, time, next_f);
+ sugov_save_idle_calls(sg_cpu);
}

static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
@@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
sg_cpu->last_update = time;

if (sugov_should_update_freq(sg_policy, time)) {
- if (flags & SCHED_CPUFREQ_RT_DL)
+ if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
next_f = sg_policy->policy->cpuinfo.max_freq;
else
next_f = sugov_next_freq_shared(sg_cpu);

sugov_update_commit(sg_policy, time, next_f);
+ sugov_save_idle_calls(sg_cpu);
}

raw_spin_unlock(&sg_policy->update_lock);
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -151,6 +151,9 @@ static void cpuidle_idle_call(void)
*/
rcu_idle_enter();

+ /* Notify the schedutil cpufreq governor that we are entering idle. */
+ cpufreq_schedutil_idle_call();
+
if (cpuidle_not_available(drv, dev)) {
default_idle_call();
goto exit_idle;
Index: linux-pm/include/linux/sched/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/sched/cpufreq.h
+++ linux-pm/include/linux/sched/cpufreq.h
@@ -24,4 +24,10 @@ void cpufreq_add_update_util_hook(int cp
void cpufreq_remove_update_util_hook(int cpu);
#endif /* CONFIG_CPU_FREQ */

+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+extern void cpufreq_schedutil_idle_call(void);
+#else
+static inline void cpufreq_schedutil_idle_call(void) {}
+#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
#endif /* _LINUX_SCHED_CPUFREQ_H */

2017-03-19 13:41:18

by Rafael J. Wysocki

Subject: [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()

From: Rafael J. Wysocki <[email protected]>

sugov_start() only initializes struct sugov_cpu per-CPU structures
for shared policies, but it should do that for single-CPU policies too.

That in particular makes the IO-wait boost mechanism work in the
cases when cpufreq policies correspond to individual CPUs.

Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -577,20 +577,14 @@ static int sugov_start(struct cpufreq_po
for_each_cpu(cpu, policy->cpus) {
struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

+ memset(sg_cpu, 0, sizeof(*sg_cpu));
sg_cpu->sg_policy = sg_policy;
- if (policy_is_shared(policy)) {
- sg_cpu->util = 0;
- sg_cpu->max = 0;
- sg_cpu->flags = SCHED_CPUFREQ_RT;
- sg_cpu->last_update = 0;
- sg_cpu->iowait_boost = 0;
- sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
- cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
- sugov_update_shared);
- } else {
- cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
- sugov_update_single);
- }
+ sg_cpu->flags = SCHED_CPUFREQ_RT;
+ sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
+ cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
+ policy_is_shared(policy) ?
+ sugov_update_shared :
+ sugov_update_single);
}
return 0;
}

2017-03-19 21:30:21

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases. The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register. Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again. That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.

In case you are wondering about the actual numbers, attached are two turbostat
log files from two runs of the same workload, without (before.txt) and with (after.txt)
the patch applied.

The workload is essentially "make -j 5" in the kernel source tree and the
machine has SSD storage and a quad-core Intel Sandy Bridge processor.
The P-states available for each core are between 8 and 31 (0x1f) corresponding
to 800 MHz and 3.1 GHz, respectively. All cores can run sustainably at 2.9 GHz
at the same time, although that is not a guaranteed sustainable frequency
(it may be dropped occasionally for thermal reasons, for example).

The interesting columns are Bzy_MHz (and specifically the rows with "-" under
CPU that correspond to the entire processor), which is the average frequency
between iterations based on the numbers read from the feedback registers, and
the rightmost one, which contains the values written to the P-state request
register (the 3rd and 4th hex digits from the right represent the requested
P-state).
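
For reference, here is a minimal decoding sketch (my illustration only, not
part of the patch; it assumes the target P-state occupies bits 15:8 of
MSR_IA32_PERF_CTL and uses the linear 100 MHz-per-bin mapping of this
machine, P-state 8 = 800 MHz up to 0x1f = 3.1 GHz):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Example raw value as logged by turbostat. */
	uint64_t perf_ctl = 0x1f00;
	/* Bits 15:8 are the 3rd and 4th hex digits from the right. */
	unsigned int pstate = (perf_ctl >> 8) & 0xff;

	printf("requested P-state 0x%x -> %u MHz\n", pstate, pstate * 100);
	return 0;
}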

The turbostat data collection ran every 2 seconds and I looked at the last 30
iterations in each case corresponding to about 1 minute of the workload run
during which all of the cores were around 100% busy.

Now, if you look at after.txt (the run with the patch applied), you'll notice that
during those last 30 iterations P-state 31 (0x1f) had been requested on all
cores pretty much 100% of the time (meaning: as expected in that case) and
the average processor frequency (computed by taking the average from
all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to
drop it from 2.9 GHz occasionally).

In the before.txt case (without the patch) the average frequency over the last
30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch
applied (on the average). That already is quite a measurable difference, but it
would have been much worse if the processor had not coordinated P-states in
hardware (such that if any core requested 31, the processor would pick that
one or close to it for the entire package regardless of the requests from the
other cores).

Namely, if you look at the P-states requested for the different cores (during
the last 30 iterations of the before.txt run), which essentially reflect what
the governor asked for, their average is 27.25 (almost 4 bins below the
maximum) and the standard deviation is 6, so it is not like they are a little
off occasionally. At least some of them are way off most of the time.
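
For scale, a back-of-the-envelope estimate based only on the numbers above
(and assuming the linear 100 MHz-per-bin mapping):

  27.25 / 31 ~= 0.88

so if each core's request had been honored literally, the average request
would have meant running roughly 12% below the maximum.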

Honestly, if the processor had been capable of doing per-core P-states, that
would have been a disaster and there are customers who wouldn't look at
schedutil again after being confronted with these numbers.

So this is rather serious.

BTW, both intel_pstate in the active mode and ondemand request 0x1f on
all cores for that workload, like in the after.txt case.

Thanks,
Rafael


Attachments:
before.txt (15.54 kB)
after.txt (18.03 kB)

2017-03-19 21:48:03

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Sunday, March 19, 2017 10:24:24 PM Rafael J. Wysocki wrote:
> On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases. The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> >
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register. Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again. That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
>
> In case you are wondering about the actual numbers, attached are two turbostat
> log files from two runs of the same workload, without (before.txt) and with (after.txt)
> the patch applied.
>
> The workload is essentially "make -j 5" in the kernel source tree and the
> machine has SSD storage and a quad-core Intel Sandy Bridge processor.
> The P-states available for each core are between 8 and 31 (0x1f) corresponding
> to 800 MHz and 3.1 GHz, respectively. All cores can run sustainably at 2.9 GHz
> at the same time, although that is not a guaranteed sustainable frequency
> (it may be dropped occasionally for thermal reasons, for example).
>
> The interesting columns are Bzy_MHz (and specifically the rows with "-" under
> CPU that correspond to the entire processor), which is the average frequency
> between iterations based on the numbers read from the feedback registers, and
> the rightmost one, which contains the values written to the P-state request
> register (the 3rd and 4th hex digits from the right represent the requested
> P-state).
>
> The turbostat data collection ran every 2 seconds and I looked at the last 30
> iterations in each case corresponding to about 1 minute of the workload run
> during which all of the cores were around 100% busy.
>
> Now, if you look at after.txt (the run with the patch applied), you'll notice that
> during those last 30 iterations P-state 31 (0x1f) had been requested on all
> cores pretty much 100% of the time (meaning: as expected in that case) and
> the average processor frequency (computed by taking the average from
> all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to
> drop it from 2.9 GHz occasionally).
>
> In the before.txt case (without the patch) the average frequency over the last
> 30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch
> applied (on the average).

0.08% of course, sorry. Still visible, though. :-)
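
That follows directly from the averages quoted above:

  (2899.33 - 2896.90) / 2899.33 ~= 0.00084

which is indeed about 0.08%.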

Thanks,
Rafael

2017-03-20 03:28:52

by Viresh Kumar

Subject: Re: [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()

On 19-03-17, 14:30, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> sugov_start() only initializes struct sugov_cpu per-CPU structures
> for shared policies, but it should do that for single-CPU policies too.
>
> That in particular makes the IO-wait boost mechanism work in the
> cases when cpufreq policies correspond to individual CPUs.
>
> Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Cc: 4.9+ <[email protected]> # 4.9+

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2017-03-20 03:57:52

by Viresh Kumar

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases. The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register. Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again. That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> To work around this issue use the observation that, from the
> schedutil governor's perspective, CPUs that are never idle should
> always run at the maximum frequency and make that happen.
>
> To that end, add a counter of idle calls to struct sugov_cpu and
> modify cpuidle_idle_call() to increment that counter every time it
> is about to put the given CPU into an idle state. Next, make the
> schedutil governor look at that counter for the current CPU every
> time before it is about to start heavy computations. If the counter
> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> the CPU has not been idle for at least that long and the governor
> will choose the maximum frequency for it without looking at the PELT
> metric at all.

Looks like we are fixing a PELT problem with a schedutil Hack :)

> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> include/linux/sched/cpufreq.h | 6 ++++++
> kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++--
> kernel/sched/idle.c | 3 +++
> 3 files changed, 45 insertions(+), 2 deletions(-)
>
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -20,6 +20,7 @@
> #include "sched.h"
>
> #define SUGOV_KTHREAD_PRIORITY 50
> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
>
> struct sugov_tunables {
> struct gov_attr_set attr_set;
> @@ -55,6 +56,9 @@ struct sugov_cpu {
>
> unsigned long iowait_boost;
> unsigned long iowait_boost_max;
> + unsigned long idle_calls;
> + unsigned long saved_idle_calls;
> + u64 busy_start;
> u64 last_update;
>
> /* The fields below are only needed when sharing a policy. */
> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
> sg_cpu->iowait_boost >>= 1;
> }
>
> +void cpufreq_schedutil_idle_call(void)
> +{
> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> +
> + sg_cpu->idle_calls++;
> +}
> +
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> + sg_cpu->busy_start = 0;
> + return false;
> + }
> +
> + if (!sg_cpu->busy_start) {
> + sg_cpu->busy_start = sg_cpu->last_update;
> + return false;
> + }
> +
> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> +}
> +
> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> +{
> + if (!sg_cpu->busy_start)
> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls;

Why aren't we doing this in sugov_cpu_is_busy() itself? And isn't it possible
for idle_calls to get incremented by this time?

> +}
> +
> static void sugov_update_single(struct update_util_data *hook, u64 time,
> unsigned int flags)
> {
> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
> if (!sugov_should_update_freq(sg_policy, time))
> return;
>
> - if (flags & SCHED_CPUFREQ_RT_DL) {
> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
> next_f = policy->cpuinfo.max_freq;
> } else {
> sugov_get_util(&util, &max);
> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
> next_f = get_next_freq(sg_policy, util, max);
> }
> sugov_update_commit(sg_policy, time, next_f);
> + sugov_save_idle_calls(sg_cpu);
> }
>
> static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
> sg_cpu->last_update = time;
>
> if (sugov_should_update_freq(sg_policy, time)) {
> - if (flags & SCHED_CPUFREQ_RT_DL)
> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))

What about the other CPUs in this policy?

> next_f = sg_policy->policy->cpuinfo.max_freq;
> else
> next_f = sugov_next_freq_shared(sg_cpu);
>
> sugov_update_commit(sg_policy, time, next_f);
> + sugov_save_idle_calls(sg_cpu);
> }
>
> raw_spin_unlock(&sg_policy->update_lock);

--
viresh

2017-03-20 08:27:40

by Vincent Guittot

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 20 March 2017 at 04:57, Viresh Kumar <[email protected]> wrote:
> On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> The PELT metric used by the schedutil governor underestimates the
>> CPU utilization in some cases. The reason for that may be time spent
>> in interrupt handlers and similar which is not accounted for by PELT.

Are you sure of the root cause described above (time stolen by the irq
handler), or is it just a hypothesis? It would be good to be sure of
the root cause.
Furthermore, IIRC the time spent in irq context is also accounted as
run time for the running CFS task, but not as RT and deadline task
running time.
So I'm not really aligned with the description of your problem, namely
that the PELT metric underestimates the load of the CPU. PELT is just
about tracking CFS task utilization, not whole CPU utilization, and
according to your description of the problem (time stolen by irq),
your problem doesn't come from an underestimation of CFS tasks but from
time spent in something else that is not accounted for in the value
used by schedutil.

It would be good to be sure of what is running during this unaccounted
time and to find a way to account for it.


>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register. Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again. That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> To work around this issue use the observation that, from the
>> schedutil governor's perspective, CPUs that are never idle should
>> always run at the maximum frequency and make that happen.
>>
>> To that end, add a counter of idle calls to struct sugov_cpu and
>> modify cpuidle_idle_call() to increment that counter every time it
>> is about to put the given CPU into an idle state. Next, make the
>> schedutil governor look at that counter for the current CPU every
>> time before it is about to start heavy computations. If the counter
>> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
>> the CPU has not been idle for at least that long and the governor
>> will choose the maximum frequency for it without looking at the PELT
>> metric at all.
>
> Looks like we are fixing a PELT problem with a schedutil Hack :)

I would not say that it's a PELT problem (at least based on the current
description) but more that we don't track all activities of the CPU.

>
>> Signed-off-by: Rafael J. Wysocki <[email protected]>
>> ---
>> include/linux/sched/cpufreq.h | 6 ++++++
>> kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++--
>> kernel/sched/idle.c | 3 +++
>> 3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
>> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
>> @@ -20,6 +20,7 @@
>> #include "sched.h"
>>
>> #define SUGOV_KTHREAD_PRIORITY 50
>> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
>>
>> struct sugov_tunables {
>> struct gov_attr_set attr_set;
>> @@ -55,6 +56,9 @@ struct sugov_cpu {
>>
>> unsigned long iowait_boost;
>> unsigned long iowait_boost_max;
>> + unsigned long idle_calls;
>> + unsigned long saved_idle_calls;
>> + u64 busy_start;
>> u64 last_update;
>>
>> /* The fields below are only needed when sharing a policy. */
>> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
>> sg_cpu->iowait_boost >>= 1;
>> }
>>
>> +void cpufreq_schedutil_idle_call(void)
>> +{
>> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
>> +
>> + sg_cpu->idle_calls++;
>> +}
>> +
>> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
>> +{
>> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
>> + sg_cpu->busy_start = 0;
>> + return false;
>> + }
>> +
>> + if (!sg_cpu->busy_start) {
>> + sg_cpu->busy_start = sg_cpu->last_update;
>> + return false;
>> + }
>> +
>> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
>> +}
>> +
>> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
>> +{
>> + if (!sg_cpu->busy_start)
>> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
>
> Why aren't we doing this in sugov_cpu_is_busy() itself? And isn't it possible
> for idle_calls to get incremented by this time?
>
>> +}
>> +
>> static void sugov_update_single(struct update_util_data *hook, u64 time,
>> unsigned int flags)
>> {
>> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
>> if (!sugov_should_update_freq(sg_policy, time))
>> return;
>>
>> - if (flags & SCHED_CPUFREQ_RT_DL) {
>> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
>> next_f = policy->cpuinfo.max_freq;
>> } else {
>> sugov_get_util(&util, &max);
>> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
>> next_f = get_next_freq(sg_policy, util, max);
>> }
>> sugov_update_commit(sg_policy, time, next_f);
>> + sugov_save_idle_calls(sg_cpu);
>> }
>>
>> static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
>> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
>> sg_cpu->last_update = time;
>>
>> if (sugov_should_update_freq(sg_policy, time)) {
>> - if (flags & SCHED_CPUFREQ_RT_DL)
>> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
>
> What about the other CPUs in this policy?
>
>> next_f = sg_policy->policy->cpuinfo.max_freq;
>> else
>> next_f = sugov_next_freq_shared(sg_cpu);
>>
>> sugov_update_commit(sg_policy, time, next_f);
>> + sugov_save_idle_calls(sg_cpu);
>> }
>>
>> raw_spin_unlock(&sg_policy->update_lock);
>
> --
> viresh

2017-03-20 11:00:53

by Peter Zijlstra

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote:

> Honestly, if the processor had been capable of doing per-core P-states, that
> would have been a disaster and there are customers who wouldn't look at
> schedutil again after being confronted with these numbers.

This, I feel, is a bit overstated. We have bugs, we fix them.

2017-03-20 11:45:27

by Peter Zijlstra

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases. The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register. Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again. That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> To work around this issue use the observation that, from the
> schedutil governor's perspective, CPUs that are never idle should
> always run at the maximum frequency and make that happen.
>
> To that end, add a counter of idle calls to struct sugov_cpu and
> modify cpuidle_idle_call() to increment that counter every time it
> is about to put the given CPU into an idle state. Next, make the
> schedutil governor look at that counter for the current CPU every
> time before it is about to start heavy computations. If the counter
> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> the CPU has not been idle for at least that long and the governor
> will choose the maximum frequency for it without looking at the PELT
> metric at all.

Why the time limit?

2017-03-20 12:35:14

by Patrick Bellasi

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 20-Mar 09:26, Vincent Guittot wrote:
> On 20 March 2017 at 04:57, Viresh Kumar <[email protected]> wrote:
> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <[email protected]>
> >>
> >> The PELT metric used by the schedutil governor underestimates the
> >> CPU utilization in some cases. The reason for that may be time spent
> >> in interrupt handlers and similar which is not accounted for by PELT.
>
> Are you sure of the root cause described above (time stolen by the irq
> handler), or is it just a hypothesis? It would be good to be sure of
> the root cause.
> Furthermore, IIRC the time spent in irq context is also accounted as
> run time for the running CFS task, but not as RT and deadline task
> running time.

As long as the IRQ processing does not generate a context switch,
which happens (eventually) if the top half schedules some deferred
work to be executed by a bottom half.

Thus, I too would say that all of the top-half time is accounted in
PELT, since the current task is still RUNNABLE/RUNNING.

> So I'm not really aligned with the description of your problem, namely
> that the PELT metric underestimates the load of the CPU. PELT is just
> about tracking CFS task utilization, not whole CPU utilization, and
> according to your description of the problem (time stolen by irq),
> your problem doesn't come from an underestimation of CFS tasks but from
> time spent in something else that is not accounted for in the value
> used by schedutil.

Quite likely. Indeed, it can really be that the CFS task is preempted
because of some RT activity generated by the IRQ handler.

More generally, I've also noticed many suboptimal freq switches when
RT tasks interleave with CFS ones, because of:
- relatively long down _and up_ throttling times
- the way schedutil's flags are tracked and updated
- the call sites from which we trigger schedutil updates

For example, it can really happen that we are running at the highest
OPP because of some RT activity, then we switch back to a relatively
low-utilization CFS workload and then:
1. a tick happens, which produces a frequency drop
2. right after that, a new IRQ starts an RT task

Well, because of schedutil's throttling mechanism we can end up
running at the reduced OPP for quite a long time before (eventually)
going back to the higher OPP... and only if we are lucky enough to get
a new RT task starting outside of a throttling window.

I guess that if we have quite a lot of IRQ->RT activations on a CPU
with a relatively low CFS utilization, this unwanted frequency hopping
is quite likely.

> It would be good to be sure of what is running during this unaccounted
> time and to find a way to account for it.

Do we have any numbers to characterize the frequency of IRQ/RT
activations?

We should also notice that, while CFS tasks are preempted by RT ones,
the PELT signal decays. This can contribute to lowering the utilization
demand from the CFS class even further.
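
To put a rough number on that decay (assuming the default 32 ms PELT
half-life, i.e. the signal scales by y per ms with y^32 = 0.5):

  y = 0.5^(1/32) ~= 0.9786
  utilization after 50 ms of preemption: scaled by 0.5^(50/32) ~= 0.34

so a CFS task kept off the CPU for 50 ms would see its tracked utilization
drop to roughly a third of its previous value.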

> >> That can be easily demonstrated by running kernel compilation on
> >> a Sandy Bridge Intel processor, running turbostat in parallel with
> >> it and looking at the values written to the MSR_IA32_PERF_CTL
> >> register. Namely, the expected result would be that when all CPUs
> >> were 100% busy, all of them would be requested to run in the maximum
> >> P-state, but observation shows that this clearly isn't the case.

Can you share a trace with at least sched_switch events enabled?
This can help identify which workloads are there and how they
interact with schedutil.

> >> The CPUs run in the maximum P-state for a while and then are
> >> requested to run slower and go back to the maximum P-state after
> >> a while again. That causes the actual frequency of the processor to

That's possibly exactly the pattern I've described above.

> >> visibly oscillate below the sustainable maximum in a jittery fashion
> >> which clearly is not desirable.
> >>
> >> To work around this issue use the observation that, from the
> >> schedutil governor's perspective, CPUs that are never idle should
> >> always run at the maximum frequency and make that happen.

Perhaps some of the patches I've posted in this series:
https://lkml.org/lkml/2017/3/2/385
can help on that side.

The pattern you report seems to me to match the case "b"
described in the cover letter. I've also shared a notebook:
https://gist.github.com/derkling/d6a21b459a18091b2b058668a550010d
which clearly shows that behavior happening in the first plot,
cell named Out[15].

Did you have a chance to have a look at them?

It would be nice to know if they provide some benefit for your
use case.

> >> To that end, add a counter of idle calls to struct sugov_cpu and
> >> modify cpuidle_idle_call() to increment that counter every time it
> >> is about to put the given CPU into an idle state. Next, make the
> >> schedutil governor look at that counter for the current CPU every
> >> time before it is about to start heavy computations. If the counter
> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> >> the CPU has not been idle for at least that long and the governor
> >> will choose the maximum frequency for it without looking at the PELT
> >> metric at all.
> >
> > Looks like we are fixing a PELT problem with a schedutil Hack :)
>
> I would not say that it's a PELT problem (at least based on the current
> description) but more that we don't track all activities of the CPU.

Agree... with both comments.


> >> Signed-off-by: Rafael J. Wysocki <[email protected]>
> >> ---
> >> include/linux/sched/cpufreq.h | 6 ++++++
> >> kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++--
> >> kernel/sched/idle.c | 3 +++
> >> 3 files changed, 45 insertions(+), 2 deletions(-)
> >>
> >> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> >> ===================================================================
> >> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> >> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> >> @@ -20,6 +20,7 @@
> >> #include "sched.h"
> >>
> >> #define SUGOV_KTHREAD_PRIORITY 50
> >> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
> >>
> >> struct sugov_tunables {
> >> struct gov_attr_set attr_set;
> >> @@ -55,6 +56,9 @@ struct sugov_cpu {
> >>
> >> unsigned long iowait_boost;
> >> unsigned long iowait_boost_max;
> >> + unsigned long idle_calls;
> >> + unsigned long saved_idle_calls;
> >> + u64 busy_start;
> >> u64 last_update;
> >>
> >> /* The fields below are only needed when sharing a policy. */
> >> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
> >> sg_cpu->iowait_boost >>= 1;
> >> }
> >>
> >> +void cpufreq_schedutil_idle_call(void)
> >> +{
> >> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> >> +
> >> + sg_cpu->idle_calls++;
> >> +}
> >> +
> >> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> >> +{
> >> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> >> + sg_cpu->busy_start = 0;
> >> + return false;
> >> + }
> >> +
> >> + if (!sg_cpu->busy_start) {
> >> + sg_cpu->busy_start = sg_cpu->last_update;
> >> + return false;
> >> + }
> >> +
> >> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> >> +}
> >> +
> >> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> >> +{
> >> + if (!sg_cpu->busy_start)
> >> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
> >
> > Why aren't we doing this in sugov_cpu_is_busy() itself? And isn't it possible
> > for idle_calls to get incremented by this time?
> >
> >> +}
> >> +
> >> static void sugov_update_single(struct update_util_data *hook, u64 time,
> >> unsigned int flags)
> >> {
> >> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
> >> if (!sugov_should_update_freq(sg_policy, time))
> >> return;
> >>
> >> - if (flags & SCHED_CPUFREQ_RT_DL) {
> >> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
> >> next_f = policy->cpuinfo.max_freq;
> >> } else {
> >> sugov_get_util(&util, &max);
> >> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
> >> next_f = get_next_freq(sg_policy, util, max);
> >> }
> >> sugov_update_commit(sg_policy, time, next_f);
> >> + sugov_save_idle_calls(sg_cpu);
> >> }
> >>
> >> static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> >> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
> >> sg_cpu->last_update = time;
> >>
> >> if (sugov_should_update_freq(sg_policy, time)) {
> >> - if (flags & SCHED_CPUFREQ_RT_DL)
> >> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
> >
> > What about the other CPUs in this policy?
> >
> >> next_f = sg_policy->policy->cpuinfo.max_freq;
> >> else
> >> next_f = sugov_next_freq_shared(sg_cpu);
> >>
> >> sugov_update_commit(sg_policy, time, next_f);
> >> + sugov_save_idle_calls(sg_cpu);
> >> }
> >>
> >> raw_spin_unlock(&sg_policy->update_lock);
> >
> > --
> > viresh

--
#include <best/regards.h>

Patrick Bellasi

2017-03-20 12:40:52

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases. The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> >
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register. Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again. That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> >
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, CPUs that are never idle should
> > always run at the maximum frequency and make that happen.
> >
> > To that end, add a counter of idle calls to struct sugov_cpu and
> > modify cpuidle_idle_call() to increment that counter every time it
> > is about to put the given CPU into an idle state. Next, make the
> > schedutil governor look at that counter for the current CPU every
> > time before it is about to start heavy computations. If the counter
> > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > the CPU has not been idle for at least that long and the governor
> > will choose the maximum frequency for it without looking at the PELT
> > metric at all.
>
> Why the time limit?

One iteration appeared to be a bit too aggressive, but honestly I think
I need to check again if this thing is regarded as viable at all.

2017-03-20 12:43:02

by Rafael J. Wysocki

Subject: Re: [PATCH 1/2] cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()

On Monday, March 20, 2017 08:58:40 AM Viresh Kumar wrote:
> On 19-03-17, 14:30, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > sugov_start() only initializes struct sugov_cpu per-CPU structures
> > for shared policies, but it should do that for single-CPU policies too.
> >
> > That in particular makes the IO-wait boost mechanism work in the
> > cases when cpufreq policies correspond to individual CPUs.
> >
> > Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
>
> Cc: 4.9+ <[email protected]> # 4.9+

"stable" checks Fixes: tags too, but yes.

> Acked-by: Viresh Kumar <[email protected]>

Thanks,
Rafael

2017-03-20 12:43:44

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Monday, March 20, 2017 11:38:15 AM Peter Zijlstra wrote:
> On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote:
>
> > Honestly, if the processor had been capable of doing per-core P-states, that
> > would have been a disaster and there are customers who wouldn't look at
> > schedutil again after being confronted with these numbers.
>
> This, I feel, is a bit overstated.

Maybe a bit, but I'm not really sure ...

> We have bugs, we fix them.

Of course. :-)

2017-03-20 12:51:46

by Peter Zijlstra

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > >
> > > The PELT metric used by the schedutil governor underestimates the
> > > CPU utilization in some cases. The reason for that may be time spent
> > > in interrupt handlers and similar which is not accounted for by PELT.
> > >
> > > That can be easily demonstrated by running kernel compilation on
> > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > register. Namely, the expected result would be that when all CPUs
> > > were 100% busy, all of them would be requested to run in the maximum
> > > P-state, but observation shows that this clearly isn't the case.
> > > The CPUs run in the maximum P-state for a while and then are
> > > requested to run slower and go back to the maximum P-state after
> > > a while again. That causes the actual frequency of the processor to
> > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > which clearly is not desirable.
> > >
> > > To work around this issue use the observation that, from the
> > > schedutil governor's perspective, CPUs that are never idle should
> > > always run at the maximum frequency and make that happen.
> > >
> > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > modify cpuidle_idle_call() to increment that counter every time it
> > > is about to put the given CPU into an idle state. Next, make the
> > > schedutil governor look at that counter for the current CPU every
> > > time before it is about to start heavy computations. If the counter
> > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > the CPU has not been idle for at least that long and the governor
> > > will choose the maximum frequency for it without looking at the PELT
> > > metric at all.
> >
> > Why the time limit?
>
> One iteration appeared to be a bit too aggressive, but honestly I think
> I need to check again if this thing is regarded as viable at all.
>

I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
RCU come to mind as things that might already use something like that.

2017-03-20 13:07:09

by Patrick Bellasi

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 20-Mar 13:50, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <[email protected]>
> > > >
> > > > The PELT metric used by the schedutil governor underestimates the
> > > > CPU utilization in some cases. The reason for that may be time spent
> > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > >
> > > > That can be easily demonstrated by running kernel compilation on
> > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > register. Namely, the expected result would be that when all CPUs
> > > > were 100% busy, all of them would be requested to run in the maximum
> > > > P-state, but observation shows that this clearly isn't the case.
> > > > The CPUs run in the maximum P-state for a while and then are
> > > > requested to run slower and go back to the maximum P-state after
> > > > a while again. That causes the actual frequency of the processor to
> > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > which clearly is not desirable.
> > > >
> > > > To work around this issue use the observation that, from the
> > > > schedutil governor's perspective, CPUs that are never idle should
> > > > always run at the maximum frequency and make that happen.
> > > >
> > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > is about to put the given CPU into an idle state. Next, make the
> > > > schedutil governor look at that counter for the current CPU every
> > > > time before it is about to start heavy computations. If the counter
> > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > the CPU has not been idle for at least that long and the governor
> > > > will choose the maximum frequency for it without looking at the PELT
> > > > metric at all.
> > >
> > > Why the time limit?
> >
> > One iteration appeared to be a bit too aggressive, but honestly I think
> > I need to check again if this thing is regarded as viable at all.
> >
>
> I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
> just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> RCU come to mind as things that might already use something like that.

Maybe the problem is not going down (e.g. when there are only small
CFS tasks it makes perfect sense) but instead not being fast enough
on ramping up when a new RT task is activated.

And this boils down to two main points:
1) throttling for up transitions is perhaps only harmful
2) the call sites for schedutil updates are not properly positioned
at specific scheduler decision points.

The proposed patch is adding yet another throttling mechanism, perhaps
on top of one which already needs to be improved.

--
#include <best/regards.h>

Patrick Bellasi

2017-03-20 13:06:16

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote:
> On 20 March 2017 at 04:57, Viresh Kumar <[email protected]> wrote:
> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <[email protected]>
> >>
> >> The PELT metric used by the schedutil governor underestimates the
> >> CPU utilization in some cases. The reason for that may be time spent
> >> in interrupt handlers and similar which is not accounted for by PELT.
>
> Are you sure of the root cause described above (time stolen by the irq
> handler), or is it just a hypothesis? It would be good to be sure of
> the root cause.

No, I'm not sure. That's why I said "may be". :-)

> Furthermore, IIRC the time spent in irq context is also accounted as
> run time for the running CFS task, but not as RT and deadline task
> running time.

OK

Anyway, the problem is that we have a 100% busy CPU which quite evidently
is not reported as 100% busy by the metric we use.

Now, if I was sure about the root cause, I would fix it rather than suggest
workarounds.

> So I'm not really aligned with the description of your problem, namely
> that the PELT metric underestimates the load of the CPU. PELT is just
> about tracking CFS task utilization, not whole CPU utilization, and
> according to your description of the problem (time stolen by irq),
> your problem doesn't come from an underestimation of CFS tasks but from
> time spent in something else that is not accounted for in the value
> used by schedutil.

That's fair enough, but the assumption was that PELT would be sufficient for
that. To me, it clearly isn't, so the assumption was incorrect.

> It would be good to be sure of what is running during this unaccounted
> time and to find a way to account for it.

Yes, I agree.

I'm not sure if I can carry out a full investigation of that any time soon, however.

I sent this mostly to let everybody know that there was a problem and how it
could be worked around. That's why it is an RFC.

>
> >>
> >> That can be easily demonstrated by running kernel compilation on
> >> a Sandy Bridge Intel processor, running turbostat in parallel with
> >> it and looking at the values written to the MSR_IA32_PERF_CTL
> >> register. Namely, the expected result would be that when all CPUs
> >> were 100% busy, all of them would be requested to run in the maximum
> >> P-state, but observation shows that this clearly isn't the case.
> >> The CPUs run in the maximum P-state for a while and then are
> >> requested to run slower and go back to the maximum P-state after
> >> a while again. That causes the actual frequency of the processor to
> >> visibly oscillate below the sustainable maximum in a jittery fashion
> >> which clearly is not desirable.
> >>
> >> To work around this issue use the observation that, from the
> >> schedutil governor's perspective, CPUs that are never idle should
> >> always run at the maximum frequency and make that happen.
> >>
> >> To that end, add a counter of idle calls to struct sugov_cpu and
> >> modify cpuidle_idle_call() to increment that counter every time it
> >> is about to put the given CPU into an idle state. Next, make the
> >> schedutil governor look at that counter for the current CPU every
> >> time before it is about to start heavy computations. If the counter
> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> >> the CPU has not been idle for at least that long and the governor
> >> will choose the maximum frequency for it without looking at the PELT
> >> metric at all.
> >
> > Looks like we are fixing a PELT problem with a schedutil Hack :)
>
> I would not say that it's a PELT problem (at least based on the current
> description) but more that we don't track all activities of the CPU.

I generally agree. PELT does what it does.

However, using PELT the way we do in schedutil turns out to be problematic.

Thanks,
Rafael

2017-03-20 13:16:39

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Monday, March 20, 2017 01:50:09 PM Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <[email protected]>
> > > >
> > > > The PELT metric used by the schedutil governor underestimates the
> > > > CPU utilization in some cases. The reason for that may be time spent
> > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > >
> > > > That can be easily demonstrated by running kernel compilation on
> > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > register. Namely, the expected result would be that when all CPUs
> > > > were 100% busy, all of them would be requested to run in the maximum
> > > > P-state, but observation shows that this clearly isn't the case.
> > > > The CPUs run in the maximum P-state for a while and then are
> > > > requested to run slower and go back to the maximum P-state after
> > > > a while again. That causes the actual frequency of the processor to
> > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > which clearly is not desirable.
> > > >
> > > > To work around this issue use the observation that, from the
> > > > schedutil governor's perspective, CPUs that are never idle should
> > > > always run at the maximum frequency and make that happen.
> > > >
> > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > is about to put the given CPU into an idle state. Next, make the
> > > > schedutil governor look at that counter for the current CPU every
> > > > time before it is about to start heavy computations. If the counter
> > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > the CPU has not been idle for at least that long and the governor
> > > > will choose the maximum frequency for it without looking at the PELT
> > > > metric at all.
> > >
> > > Why the time limit?
> >
> > One iteration appeared to be a bit too aggressive, but honestly I think
> > I need to check again if this thing is regarded as viable at all.
> >
>
> I don't hate the idea; if we don't hit idle, we shouldn't shift down.

OK

> I just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> RCU come to mind as things that might already use something like that.

NOHZ does that, but I didn't want this to artificially depend on NOHZ. That
said, yes, we can use that one too.
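
A minimal sketch of what reusing the NOHZ counter might look like (this is
an assumption on my part: struct tick_sched already maintains an idle_calls
counter, but the accessor below is hypothetical and would have to live in
kernel/time/tick-sched.c, where tick_cpu_sched is defined):

/*
 * Sketch only: return the number of idle entries on this CPU as counted
 * by the NOHZ code, instead of adding a new counter to struct sugov_cpu.
 */
unsigned long tick_nohz_get_idle_calls(void)
{
	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

	return ts->idle_calls;
}

schedutil could then compare the value returned by this helper with a saved
copy, exactly as sugov_cpu_is_busy() does with sg_cpu->idle_calls, without
touching the idle loop at all.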

2017-03-20 13:22:43

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Monday, March 20, 2017 09:27:45 AM Viresh Kumar wrote:
> On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases. The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> >
> > [...]
> >
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, CPUs that are never idle should
> > always run at the maximum frequency and make that happen.
> >
> > To that end, add a counter of idle calls to struct sugov_cpu and
> > modify cpuidle_idle_call() to increment that counter every time it
> > is about to put the given CPU into an idle state. Next, make the
> > schedutil governor look at that counter for the current CPU every
> > time before it is about to start heavy computations. If the counter
> > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > the CPU has not been idle for at least that long and the governor
> > will choose the maximum frequency for it without looking at the PELT
> > metric at all.
>
> Looks like we are fixing a PELT problem with a schedutil Hack :)

Did you notice the "To work around this issue" phrase above? This is a
workaround, not a fix.

But it is an optimization too, because avoiding heavy computations if they
are not necessary should be a win in any case.

> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> > include/linux/sched/cpufreq.h | 6 ++++++
> > kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++++++--
> > kernel/sched/idle.c | 3 +++
> > 3 files changed, 45 insertions(+), 2 deletions(-)
> >
> > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > @@ -20,6 +20,7 @@
> > #include "sched.h"
> >
> > #define SUGOV_KTHREAD_PRIORITY 50
> > +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
> >
> > struct sugov_tunables {
> > struct gov_attr_set attr_set;
> > @@ -55,6 +56,9 @@ struct sugov_cpu {
> >
> > unsigned long iowait_boost;
> > unsigned long iowait_boost_max;
> > + unsigned long idle_calls;
> > + unsigned long saved_idle_calls;

The above two could be unsigned int, BTW.

> > + u64 busy_start;
> > u64 last_update;
> >
> > /* The fields below are only needed when sharing a policy. */
> > @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
> > sg_cpu->iowait_boost >>= 1;
> > }
> >
> > +void cpufreq_schedutil_idle_call(void)
> > +{
> > + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> > +
> > + sg_cpu->idle_calls++;
> > +}
> > +
> > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +{
> > + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> > + sg_cpu->busy_start = 0;
> > + return false;
> > + }
> > +
> > + if (!sg_cpu->busy_start) {
> > + sg_cpu->busy_start = sg_cpu->last_update;
> > + return false;
> > + }
> > +
> > + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> > +}
> > +
> > +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> > +{
> > + if (!sg_cpu->busy_start)
> > + sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
>
> Why aren't we doing this in sugov_cpu_is_busy() itself?

Because sugov_cpu_is_busy() may not be called at all sometimes.

> And isn't it possible for idle_calls to get incremented by this time?

No, it isn't.

The idle loop does that, and only after it has disabled interrupts. If this code
is running, we surely are not in there on the same CPU, are we?
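
For context, the kernel/sched/idle.c change from the diffstat is not
quoted in this exchange; it amounts to something like the sketch below
(the exact placement inside cpuidle_idle_call() is an assumption):

	/* On the way into an idle state in cpuidle_idle_call().  Local
	 * interrupts are already disabled at this point, which is what
	 * makes the plain, unsynchronized increment in
	 * cpufreq_schedutil_idle_call() safe against the governor
	 * running on the same CPU. */
	cpufreq_schedutil_idle_call();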

> > +}
> > +
> > static void sugov_update_single(struct update_util_data *hook, u64 time,
> > unsigned int flags)
> > {
> > @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
> > if (!sugov_should_update_freq(sg_policy, time))
> > return;
> >
> > - if (flags & SCHED_CPUFREQ_RT_DL) {
> > + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
> > next_f = policy->cpuinfo.max_freq;
> > } else {
> > sugov_get_util(&util, &max);
> > @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
> > next_f = get_next_freq(sg_policy, util, max);
> > }
> > sugov_update_commit(sg_policy, time, next_f);
> > + sugov_save_idle_calls(sg_cpu);
> > }
> >
> > static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> > @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
> > sg_cpu->last_update = time;
> >
> > if (sugov_should_update_freq(sg_policy, time)) {
> > - if (flags & SCHED_CPUFREQ_RT_DL)
> > + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
>
> What about others CPUs in this policy?

We'll check this for them when they get updated.

The point is that we can just bail out earlier here and avoid the heavy stuff,
but we don't have to. This is opportunistic.

> > next_f = sg_policy->policy->cpuinfo.max_freq;
> > else
> > next_f = sugov_next_freq_shared(sg_cpu);
> >
> > sugov_update_commit(sg_policy, time, next_f);
> > + sugov_save_idle_calls(sg_cpu);
> > }
> >
> > raw_spin_unlock(&sg_policy->update_lock);
>
>

Thanks,
Rafael

2017-03-20 13:29:33

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 20 March 2017 at 13:59, Rafael J. Wysocki <[email protected]> wrote:
> On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote:
>> On 20 March 2017 at 04:57, Viresh Kumar <[email protected]> wrote:
>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> >> From: Rafael J. Wysocki <[email protected]>
>> >>
>> >> The PELT metric used by the schedutil governor underestimates the
>> >> CPU utilization in some cases. The reason for that may be time spent
>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>
>> Are you sure of the root cause described above (time stolen by the irq
>> handler), or is it just a hypothesis? It would be good to be sure of
>> the root cause
>
> No, I'm not sure. That's why I said "may be". :-)
>
>> Furthermore, IIRC the time spent in irq context is also accounted as
>> run time for the running CFS task, but not as RT or deadline task
>> running time
>
> OK
>
> Anyway, the problem is that we have a 100% busy CPU which quite evidently
> is not reported as 100% busy by the metric we use.


>
> Now, if I was sure about the root cause, I would fix it rather than suggest
> workarounds.
>
>> So I'm not really aligned with the description of your problem ("the
>> PELT metric underestimates the load of the CPU"). PELT is just about
>> tracking CFS task utilization, not whole-CPU utilization, and
>> according to your description of the problem (time stolen by irqs),
>> it doesn't come from an underestimation of CFS tasks but from time
>> spent in something else that is not accounted for in the value used
>> by schedutil
>
> That's fair enough, but the assumption was that PELT would be sufficient for
> that. To me, it clearly isn't, so the assumption was incorrect.

So PELT is just about CFS utilization, and we need metrics for the other
sched classes as well. I know that Juri and other people are working on
the deadline part.

In the current kernel, the only available metric is rt_avg (and
scale_rt_capacity), which gives you the available compute capacity for
CFS and, in some way, the amount of capacity used by RT and deadline.
It would be interesting to see whether the amount of capacity used by RT
tasks (I assume that you don't have deadline tasks in your use case)
is similar to the amount of underestimation of CFS
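
For reference, a minimal sketch of the rt_avg-based estimate Vincent
refers to, modeled loosely on the scale_rt_capacity() logic in
kernel/sched/fair.c of that era (the helper name and the fixed 'total'
accounting window are simplifications, not the mainline code):

	/* Rough estimate of the capacity left over for CFS once the RT/DL
	 * (and irq) activity tracked in rq->rt_avg over the window 'total'
	 * is subtracted; scaled so SCHED_CAPACITY_SCALE == a fully free CPU. */
	static unsigned long cfs_capacity_left(struct rq *rq, u64 total)
	{
		u64 used = READ_ONCE(rq->rt_avg);

		if (used >= total)
			return 1;	/* almost nothing left for CFS */

		return div64_u64((total - used) * SCHED_CAPACITY_SCALE, total);
	}

Comparing that value with the CFS utilization fed to schedutil would show
whether the RT share accounts for the gap.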

>
>> That would be good to be sure of what is running during this not
>> accounted time and find a way to account them
>
> Yes, I agree.
>
> I'm not sure if I can carry out full investigation of that any time soon, however.

If you can share traces, as Patrick proposed, we could estimate how
much time/capacity is used by RT tasks and whether this can explain the
underestimation.

>
> I sent this mostly to let everybody know that there was a problem and how it
> could be worked around. That's why it is an RFC.

Ok

>
>>
>> >>
>> >> [...]
>> >
>> > Looks like we are fixing a PELT problem with a schedutil Hack :)
>>
>> I would not say that it's a PELT problem (at least based on current
>> description) but more that we don't track all
>> activities of CPU
>
> I generally agree. PELT does what it does.
>
> However, using PELT the way we do that in schedutil turns out to be problematic.

Yes. PELT is not enough, but just part of the equation.

>
> Thanks,
> Rafael
>

2017-03-20 13:39:29

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote:
> On 20-Mar 13:50, Peter Zijlstra wrote:
> > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <[email protected]>
> > > > >
> > > > > The PELT metric used by the schedutil governor underestimates the
> > > > > CPU utilization in some cases. The reason for that may be time spent
> > > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > >
> > > > > [...]
> > > >
> > > > Why the time limit?
> > >
> > > One iteration appeared to be a bit too aggressive, but honestly I think
> > > I need to check again if this thing is regarded as viable at all.
> > >
> >
> > I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
> > just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> > RCU come to mind as things that might already use something like that.
>
> Maybe the problem is not going down (e.g. when there are only small
> CFS tasks it makes perfect sense) but instead not being fast enough
> at ramping up when a new RT task is activated.
>
> And this boils down to two main points:
> 1) throttling for up transitions perhaps is only harmful
> 2) the call sites for schedutil updates are not properly positioned
> at specific scheduler decision points.
>
> The proposed patch is adding yet another throttling mechanism, perhaps
> on top of one which already needs to be improved.

It is not throttling anything.

Thanks,
Rafael

2017-03-20 14:13:50

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 20-Mar 14:05, Rafael J. Wysocki wrote:
> On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote:
> > On 20-Mar 13:50, Peter Zijlstra wrote:
> > > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > > > From: Rafael J. Wysocki <[email protected]>
> > > > > >
> > > > > > [...]
> > > > >
> > > > > Why the time limit?
> > > >
> > > > One iteration appeared to be a bit too aggressive, but honestly I think
> > > > I need to check again if this thing is regarded as viable at all.
> > > >
> > >
> > > I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
> > > just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> > > RCU come to mind as things that might already use something like that.
> >
> > Maybe the problem is not going down (e.g. when there are only small
> > CFS tasks it makes perfect sense) but instead not being fast enough
> > at ramping up when a new RT task is activated.
> >
> > And this boils down to two main points:
> > 1) throttling for up transitions perhaps is only harmful
> > 2) the call sites for schedutil updates are not properly positioned
> > at specific scheduler decision points.
> >
> > The proposed patch is adding yet another throttling mechanism, perhaps
> > on top of one which already needs to be improved.
>
> It is not throttling anything.

It's a kind-of...

- if (flags & SCHED_CPUFREQ_RT_DL) {
+ if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
next_f = policy->cpuinfo.max_freq;

This check disregards any signal the scheduler can provide.

A 60% CFS task with a 100ms period will, under such a policy, end up
running at the highest OPP for just 10% of its entire activation: with
the 50ms threshold, only the last ~10ms of each 60ms busy interval are
affected. Moreover, when it completes, we are likely to enter idle while
still remaining at the highest OPP.

IMHO the ultimate goal of schedutil should be to be driven by the
scheduler, which has (or can have) all the information required to
support OPP selection.

If something is not working, well, then we should properly fix the
signals and/or provide (at least) a per-task tunable interface.

Adding a hardcoded threshold is an easy fix, but it will ultimately
increase the complexity of the governor.

--
#include <best/regards.h>

Patrick Bellasi

2017-03-20 21:52:02

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

From: Rafael J. Wysocki <[email protected]>

The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

To work around this issue use the observation that, from the
schedutil governor's perspective, it does not make sense to decrease
the frequency of a CPU that doesn't enter idle and avoid decreasing
the frequency of busy CPUs.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before it is about to set a new frequency
for that CPU's policy. If the counter has not changed since the
previous iteration, the CPU has been busy for all that time and
its frequency should not be decreased, so if the new frequency would
be lower than the one set previously, the governor will skip the
frequency update.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

This is a slightly different approach (avoid decreasing the frequency for busy CPUs
instead of bumping it to the max for them upfront) and it works around the
original problem too.

I tried to address a few of Peter's comments here and the result doesn't seem to
be too heavy-weight.

Thanks,
Rafael

---
include/linux/tick.h | 1 +
kernel/sched/cpufreq_schedutil.c | 28 ++++++++++++++++++++++++----
kernel/time/tick-sched.c | 12 ++++++++++++
3 files changed, 37 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,9 @@ struct sugov_cpu {
unsigned long iowait_boost;
unsigned long iowait_boost_max;
u64 last_update;
+#ifdef CONFIG_NO_HZ_COMMON
+ unsigned long saved_idle_calls;
+#endif

/* The fields below are only needed when sharing a policy. */
unsigned long util;
@@ -88,11 +91,28 @@ static bool sugov_should_update_freq(str
return delta_ns >= sg_policy->freq_update_delay_ns;
}

-static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
- unsigned int next_freq)
+#ifdef CONFIG_NO_HZ_COMMON
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+ unsigned long idle_calls = tick_nohz_get_idle_calls();
+ bool ret = idle_calls == sg_cpu->saved_idle_calls;
+
+ sg_cpu->saved_idle_calls = idle_calls;
+ return ret;
+}
+#else
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+#endif /* CONFIG_NO_HZ_COMMON */
+
+static void sugov_update_commit(struct sugov_cpu *sg_cpu,
+ struct sugov_policy *sg_policy,
+ u64 time, unsigned int next_freq)
{
struct cpufreq_policy *policy = sg_policy->policy;

+ if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
+ next_freq = sg_policy->next_freq;
+
if (policy->fast_switch_enabled) {
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, smp_processor_id());
@@ -214,7 +234,7 @@ static void sugov_update_single(struct u
sugov_iowait_boost(sg_cpu, &util, &max);
next_f = get_next_freq(sg_policy, util, max);
}
- sugov_update_commit(sg_policy, time, next_f);
+ sugov_update_commit(sg_cpu, sg_policy, time, next_f);
}

static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
@@ -283,7 +303,7 @@ static void sugov_update_shared(struct u
else
next_f = sugov_next_freq_shared(sg_cpu);

- sugov_update_commit(sg_policy, time, next_f);
+ sugov_update_commit(sg_cpu, sg_policy, time, next_f);
}

raw_spin_unlock(&sg_policy->update_lock);
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
extern ktime_t tick_nohz_get_sleep_length(void);
+extern unsigned long tick_nohz_get_idle_calls(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#else /* !CONFIG_NO_HZ_COMMON */
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
return ts->sleep_length;
}

+/**
+ * tick_nohz_get_idle_calls - return the current idle calls counter value
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls(void)
+{
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+ return ts->idle_calls;
+}
+
static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
{
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE

2017-03-21 06:40:52

by Viresh Kumar

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 20-03-17, 22:46, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c

> +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
> + struct sugov_policy *sg_policy,
> + u64 time, unsigned int next_freq)
> {
> struct cpufreq_policy *policy = sg_policy->policy;
>
> + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
> + next_freq = sg_policy->next_freq;
> +

In the earlier version you said that we want to be opportunistic and
don't want to do heavy computation, and so check only the current CPU.

But in this version, all those computations are already done by now.
Why shouldn't we check all CPUs in the policy now? I am asking as we
will still have the same problem we are trying to work around if the
current CPU isn't busy but others sharing the policy are.

Also, why not return directly from within the if block? To run
trace_cpu_frequency()?

I don't remember exactly, but why don't we run that for the !fast-switch
case? We can simplify the code a bit if we check for no freq change at
the top of the routine.

--
viresh

2017-03-21 08:59:16

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 20 March 2017 at 22:46, Rafael J. Wysocki <[email protected]> wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> [...]
>
> To work around this issue use the observation that, from the
> schedutil governor's perspective, it does not make sense to decrease
> the frequency of a CPU that doesn't enter idle and avoid decreasing
> the frequency of busy CPUs.

I don't fully agree with that statement.
If there are 2 runnable tasks on CPU A and the scheduler migrates the
waiting task to another CPU B, so that CPU A is less loaded now, it makes
sense to reduce the OPP. It's precisely for that purpose that we
decided to use scheduler metrics in the cpufreq governor, so we can
adjust the OPP immediately when tasks migrate.
That being said, I probably know why you see such OPP switches in your
use case. When we migrate a task, we also migrate/remove its
utilization from the CPU.
If the CPU is not overloaded, the runnable tasks get all the
computation they need and have no reason to use more when
a task migrates to another CPU, so decreasing the OPP makes sense
because the utilization is decreasing.
If the CPU is overloaded, the runnable tasks have to share
CPU time and probably don't get all the computation they would like,
so when a task migrates, the remaining tasks on the CPU will increase
their utilization and fill the space left by the task that has just
migrated. So the CPU's utilization (and as a result the OPP) will
decrease when a task migrates, but then its utilization will increase
again as the remaining tasks run more of the time, and the OPP along
with it.

So you need to distinguish between these 2 cases: is the CPU
overloaded or not? You can't really rely on the utilization to detect
that, but you could take advantage of the load, which takes into account
the waiting time of tasks.
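
To make the overloaded case concrete, with invented numbers: two tasks
that each want 60% of a CPU share one CPU, so each runs about 50% of the
time and the CPU shows no idle time. When one of them migrates away, the
CPU's tracked utilization first drops towards ~50% (the migrating task's
contribution leaves with it) and only then climbs back towards ~60% as
the remaining task expands into the freed time. A governor reacting to
the initial drop lowers the OPP exactly when the CPU still has work
queued up.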

Vincent
> [...]

2017-03-21 11:50:56

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 20-Mar 22:46, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> [...]
>
> To work around this issue use the observation that, from the
> schedutil governor's perspective, it does not make sense to decrease
> the frequency of a CPU that doesn't enter idle and avoid decreasing
> the frequency of busy CPUs.
>
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before it is about to set a new frequency
> for that CPU's policy. If the counter has not changed since the
> previous iteration, the CPU has been busy for all that time and
> its frequency should not be decreased, so if the new frequency would
> be lower than the one set previously, the governor will skip the
> frequency update.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> This is a slightly different approach (avoid decreasing the frequency for busy CPUs
> instead of bumping it to the max for them upfront) and it works around the
> original problem too.

I like this version much better: we do not enforce the max frequency,
and we removed the hardcoded time threshold. ;-)

It also makes sense to me to avoid downscaling until we hit idle.

However, I also agree with Vincent's observation: this constraint should be
there only for "overutilized" CPUs... but unfortunately, in mainline we are
still missing that flag and thus we should probably use the
"overloaded" one for the time being.

> I tried to address a few of Peter's comments here and the result doesn't seem to
> be too heavy-weight.

Nice!

> [...]

--
#include <best/regards.h>

Patrick Bellasi

2017-03-21 11:56:32

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21-Mar 09:50, Vincent Guittot wrote:
> On 20 March 2017 at 22:46, Rafael J. Wysocki <[email protected]> wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > The way the schedutil governor uses the PELT metric causes it to
> > underestimate the CPU utilization in some cases.
> >
> > [...]
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, it does not make sense to decrease
> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > the frequency of busy CPUs.
>
> [...]
>
> So you need to distinguish between these 2 cases: is the CPU
> overloaded or not? You can't really rely on the utilization to detect
> that, but you could take advantage of the load, which takes into account
> the waiting time of tasks.

Right, we can use "overloaded" for the time being until we push the
"overutilized" bits.

[...]

> > +#ifdef CONFIG_NO_HZ_COMMON
> > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +{
> > + unsigned long idle_calls = tick_nohz_get_idle_calls();
> > + bool ret = idle_calls == sg_cpu->saved_idle_calls;

Vincent: are you proposing something like this?

+ if (this_rq()->rd->overload)
+ return false;
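
One way to read that, folded into the v2 helper (a sketch only: per
Vincent's argument the "don't decrease" constraint would be kept only
when the CPU *is* overloaded, and whether rd->overload can safely be
read from this context is an assumption):

	static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
	{
		unsigned long idle_calls = tick_nohz_get_idle_calls();
		bool ret = idle_calls == sg_cpu->saved_idle_calls;

		/* A CPU that is not overloaded may legitimately ramp down
		 * without having been idle: utilization may simply have
		 * migrated away with a task. */
		if (!this_rq()->rd->overload)
			ret = false;

		sg_cpu->saved_idle_calls = idle_calls;
		return ret;
	}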

> > +
> > + sg_cpu->saved_idle_calls = idle_calls;
> > + return ret;
> > +}
> > +#else
> > +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> > +#endif /* CONFIG_NO_HZ_COMMON */
> > +
> > +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
> > + struct sugov_policy *sg_policy,
> > + u64 time, unsigned int next_freq)
> > {
> > struct cpufreq_policy *policy = sg_policy->policy;
> >
> > + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
> > + next_freq = sg_policy->next_freq;
> > +
> > if (policy->fast_switch_enabled) {
> > if (sg_policy->next_freq == next_freq) {
> > trace_cpu_frequency(policy->cur, smp_processor_id());
> > @@ -214,7 +234,7 @@ static void sugov_update_single(struct u
> > sugov_iowait_boost(sg_cpu, &util, &max);
> > next_f = get_next_freq(sg_policy, util, max);
> > }
> > - sugov_update_commit(sg_policy, time, next_f);
> > + sugov_update_commit(sg_cpu, sg_policy, time, next_f);
> > }
> >

[...]

--
#include <best/regards.h>

Patrick Bellasi

2017-03-21 12:31:08

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tue, Mar 21, 2017 at 7:40 AM, Viresh Kumar <[email protected]> wrote:
> On 20-03-17, 22:46, Rafael J. Wysocki wrote:
>> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
>
>> +static void sugov_update_commit(struct sugov_cpu *sg_cpu,
>> + struct sugov_policy *sg_policy,
>> + u64 time, unsigned int next_freq)
>> {
>> struct cpufreq_policy *policy = sg_policy->policy;
>>
>> + if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
>> + next_freq = sg_policy->next_freq;
>> +
>
> In the earlier version you said that we want to be opportunistic and
> don't want to do heavy computation, and so check only the current CPU.
>
> But in this version, all those computations are already done by now.
> Why shouldn't we check all CPUs in the policy now? I am asking as we
> will still have the same problem we are trying to work around if the
> current CPU isn't busy but others sharing the policy are.

This isn't the way I'm looking at that.

This is an easy (and relatively cheap) check to make for the *current*
*CPU* and our frequency selection algorithm turns out to have
problems, so it would be kind of unreasonable to not use the
opportunity to fix up the value coming from it - if we can do that
easily enough.

For the other CPUs in the policy that would require extra
synchronization etc., so not that easy any more.
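
For illustration, a hypothetical policy-wide variant could look like the
sketch below; sugov_policy_is_busy() and the per-CPU accessor
tick_nohz_get_idle_calls_cpu() are both invented here (the v2 patch only
adds a this-CPU getter), and the cross-CPU reads of saved_idle_calls
would need exactly the synchronization mentioned above:

	static bool sugov_policy_is_busy(struct sugov_policy *sg_policy)
	{
		unsigned int cpu;

		for_each_cpu(cpu, sg_policy->policy->cpus) {
			struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, cpu);

			/* An unchanged counter means no idle entry since
			 * the last update on that CPU. */
			if (tick_nohz_get_idle_calls_cpu(cpu) ==
			    j_sg_cpu->saved_idle_calls)
				return true;
		}

		return false;
	}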

> Also, why not return directly from within the if block? To run
> trace_cpu_frequency()?

Yes.

> I don't remember exactly, but why don't we run that for the !fast-switch
> case?

That's an interesting question.

We do that in the fast switch case, because otherwise utilities get
confused if the frequency is not updated for a long enough time.

I'm not really sure why they don't get confused in the other case,
though. [In that case the core calls trace_cpu_frequency() for us,
but only if we actually run the async work.]

It looks like it wouldn't hurt to always run trace_cpu_frequency()
when we want to bail out early for next_freq == sg_policy->next_freq.

Let me prepare a patch for that. :-)

> We can simplify the code a bit if we check for no freq change at
> the top of the routine.

Right.
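
A guess at what such a follow-up could look like, combining both points
(one "no change" check at the top for both switch modes, with the
tracepoint always emitted on the early bail-out); this is a sketch based
on the v2 code above, not the actual patch:

	static void sugov_update_commit(struct sugov_cpu *sg_cpu,
					struct sugov_policy *sg_policy,
					u64 time, unsigned int next_freq)
	{
		struct cpufreq_policy *policy = sg_policy->policy;

		if (sugov_cpu_is_busy(sg_cpu) && next_freq < sg_policy->next_freq)
			next_freq = sg_policy->next_freq;

		sg_policy->last_freq_update_time = time;

		/* Bail out once, for both switch modes, and keep utilities
		 * happy by reporting the (unchanged) current frequency. */
		if (sg_policy->next_freq == next_freq) {
			trace_cpu_frequency(policy->cur, smp_processor_id());
			return;
		}

		sg_policy->next_freq = next_freq;
		if (policy->fast_switch_enabled) {
			next_freq = cpufreq_driver_fast_switch(policy, next_freq);
			if (next_freq == CPUFREQ_ENTRY_INVALID)
				return;

			policy->cur = next_freq;
			trace_cpu_frequency(next_freq, smp_processor_id());
		} else {
			sg_policy->work_in_progress = true;
			irq_work_queue(&sg_policy->irq_work);
		}
	}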

Thanks,
Rafael

2017-03-21 13:23:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> On 20 March 2017 at 22:46, Rafael J. Wysocki <[email protected]> wrote:

> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, it does not make sense to decrease
> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > the frequency of busy CPUs.
>
> I don't fully agree with that statement.
> If there are 2 runnable tasks on CPU A and the scheduler migrates the
> waiting task to another CPU B, so that CPU A is less loaded now, it makes
> sense to reduce the OPP. It's precisely for that purpose that we
> decided to use scheduler metrics in the cpufreq governor, so we can
> adjust the OPP immediately when tasks migrate.
> That being said, I probably know why you see such OPP switches in your
> use case. When we migrate a task, we also migrate/remove its
> utilization from the CPU.
> If the CPU is not overloaded, the runnable tasks get all the
> computation they need and have no reason to use more when
> a task migrates to another CPU, so decreasing the OPP makes sense
> because the utilization is decreasing.
> If the CPU is overloaded, the runnable tasks have to share
> CPU time and probably don't get all the computation they would like,
> so when a task migrates, the remaining tasks on the CPU will increase
> their utilization and fill the space left by the task that has just
> migrated. So the CPU's utilization (and as a result the OPP) will
> decrease when a task migrates, but then its utilization will increase
> again as the remaining tasks run more of the time, and the OPP along
> with it.
>
> So you need to distinguish between these 2 cases: is the CPU
> overloaded or not? You can't really rely on the utilization to detect
> that, but you could take advantage of the load, which takes into account
> the waiting time of tasks.

I'm confused. What two cases? You only list the overloaded case, but he
does exactly that. Note that the lack of idle time is an exact
equivalent of 100% utilized.

So even while we cannot currently detect the 100% utilized state through
the running-state tracking (because of averages etc.), we can detect the
lack of idle time.

2017-03-21 13:37:44

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21 March 2017 at 14:22, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
>> On 20 March 2017 at 22:46, Rafael J. Wysocki <[email protected]> wrote:
>
>> > To work around this issue use the observation that, from the
>> > schedutil governor's perspective, it does not make sense to decrease
>> > the frequency of a CPU that doesn't enter idle and avoid decreasing
>> > the frequency of busy CPUs.
>>
>> [...]
>
> I'm confused. What two cases? You only list the overloaded case, but he

The overloaded vs. the not-overloaded case.
For the not-overloaded case, it makes sense to immediately update the
OPP to be aligned with the new utilization of the CPU, even if it was
not idle in the past couple of ticks.

> does exactly that. Note that the lack of idle time is an exact
> equivalent of 100% utilized.
>
> So even while we cannot currently detect the 100% utilized state through
> the running state tracking; because averages etc.. we can detect the
> lack of idle time.

But after how much lack of idle time do we consider that we are overloaded?

>

2017-03-21 14:26:58

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21 March 2017 at 15:03, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
>> On 21 March 2017 at 14:22, Peter Zijlstra <[email protected]> wrote:
>
>> For the not overloaded case, it makes sense to immediately update to
>> OPP to be aligned with the new utilization of the CPU even if it was
>> not idle in the past couple of ticks
>
> Yeah, but we cannot know. Also, who cares?

An embedded system that doesn't want to stay at the highest OPP if a
significant part of the utilization has moved away, as an example.
AFAICT, schedutil tries to select the best OPP according to the
current utilization of the CPU, so if the utilization decreases, the
OPP should also decrease.

>
>> > does exactly that. Note that the lack of idle time is an exact
>> > equivalent of 100% utilized.
>> >
>> > So even while we cannot currently detect the 100% utilized state through
>> > the running state tracking; because averages etc.. we can detect the
>> > lack of idle time.
>>
>> But after how much lack of idle time do we consider that we are overloaded ?
>
> 0 :-)
>
> Note that utilization is an absolute metric, not a windowed one. That
> is, there is no actual time associated with it. Now, for practical
> purposes we end up using windowed things in many places,
>

2017-03-21 14:31:54

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote:
> On 21 March 2017 at 14:22, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> >> On 20 March 2017 at 22:46, Rafael J. Wysocki <[email protected]> wrote:
> >
> >> > To work around this issue use the observation that, from the
> >> > schedutil governor's perspective, it does not make sense to decrease
> >> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> >> > the frequency of busy CPUs.
> >>
> >> [...]
> >
> > I'm confused. What two cases? You only list the overloaded case, but he
>
> The overloaded vs. the not-overloaded case.
> For the not-overloaded case, it makes sense to immediately update the
> OPP to be aligned with the new utilization of the CPU, even if it was
> not idle in the past couple of ticks.

Yes, if the OPP (or P-state if you will) can be changed immediately. If it can't,
conditions may change by the time we actually update it, and in that case it'd
be better to wait and see, IMO.

In any case, the theory about migrating tasks made sense to me, so below is
what I tested. It works, and besides, it has the nice feature that I don't need
to fetch the timekeeping data. :-)

I only wonder whether we want to do this, or only prevent the frequency from
decreasing in the overloaded case.

---
kernel/sched/cpufreq_schedutil.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -61,6 +61,7 @@ struct sugov_cpu {
unsigned long util;
unsigned long max;
unsigned int flags;
+ bool overload;
};

static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -207,7 +208,7 @@ static void sugov_update_single(struct u
if (!sugov_should_update_freq(sg_policy, time))
return;

- if (flags & SCHED_CPUFREQ_RT_DL) {
+ if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -242,7 +243,7 @@ static unsigned int sugov_next_freq_shar
j_sg_cpu->iowait_boost = 0;
continue;
}
- if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+ if ((j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL) || j_sg_cpu->overload)
return policy->cpuinfo.max_freq;

j_util = j_sg_cpu->util;
@@ -273,12 +274,13 @@ static void sugov_update_shared(struct u
sg_cpu->util = util;
sg_cpu->max = max;
sg_cpu->flags = flags;
+ sg_cpu->overload = this_rq()->rd->overload;

sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

if (sugov_should_update_freq(sg_policy, time)) {
- if (flags & SCHED_CPUFREQ_RT_DL)
+ if ((flags & SCHED_CPUFREQ_RT_DL) || sg_cpu->overload)
next_f = sg_policy->policy->cpuinfo.max_freq;
else
next_f = sugov_next_freq_shared(sg_cpu);

2017-03-21 14:35:47

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21-Mar 15:03, Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
> > On 21 March 2017 at 14:22, Peter Zijlstra <[email protected]> wrote:
>
> > For the not overloaded case, it makes sense to immediately update to
> > OPP to be aligned with the new utilization of the CPU even if it was
> > not idle in the past couple of ticks
>
> Yeah, but we cannot know. Also, who cares?
>
> > > does exactly that. Note that the lack of idle time is an exact
> > > equivalent of 100% utilized.
> > >
> > > So even while we cannot currently detect the 100% utilized state through
> > > the running state tracking (because averages etc.), we can detect the
> > > lack of idle time.
> >
> > But after how much lack of idle time do we consider that we are overloaded?
>
> 0 :-)

If we are to use "utilization", this time can be non-zero, and it depends
for example on how long PELT takes to build up a utilization value
which marks the CPU as "overutilized"... thus we already have a
suitable time, at least for CFS tasks.
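
To put a rough number on it, here is a back-of-the-envelope sketch
(standalone math, not kernel code; the 32 ms PELT half-life is the
usual one, the 80%-of-capacity threshold is illustrative):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double half_life_ms = 32.0;	/* PELT half-life */
	const double threshold = 0.8;		/* fraction of max capacity */

	/* An always-running task builds up u(t) = 1 - 0.5^(t / half_life);
	 * solve for the time t at which u(t) crosses the threshold. */
	double t = half_life_ms * log2(1.0 / (1.0 - threshold));

	printf("~%.0f ms of continuous running\n", t);	/* ~74 ms */
	return 0;
}

So for CFS we are talking about a few tens of milliseconds before the
signal alone could flag the CPU.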

> Note that utilization is an absolute metric, not a windowed one. That
> is, there is no actual time associated with it. Now, for practical
> purposes we end up using windowed things in many places,
>

--
#include <best/regards.h>

Patrick Bellasi

2017-03-21 14:45:46

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21-Mar 15:26, Rafael J. Wysocki wrote:
<snip>
> @@ -207,7 +208,7 @@ static void sugov_update_single(struct u
> if (!sugov_should_update_freq(sg_policy, time))
> return;
>
> - if (flags & SCHED_CPUFREQ_RT_DL) {
> + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
> next_f = policy->cpuinfo.max_freq;

Isn't this going to max OPP every time we have more than 1 task on
that CPU?

If so, it will not fit the case where we have two 10% tasks on that CPU.

The previous solution was better IMO, apart from using overloaded instead
of overutilized (which is not yet there) :-/

<snip>

--
#include <best/regards.h>

Patrick Bellasi

2017-03-21 14:50:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
> On 21 March 2017 at 14:22, Peter Zijlstra <[email protected]> wrote:

> For the not overloaded case, it makes sense to immediately update the
> OPP to be aligned with the new utilization of the CPU even if it was
> not idle in the past couple of ticks

Yeah, but we cannot know. Also, who cares?

> > does exactly that. Note that the lack of idle time is an exact
> > equivalent of 100% utilized.
> >
> > So even while we cannot currently detect the 100% utilized state through
> > the running state tracking (because averages etc.), we can detect the
> > lack of idle time.
>
> But after how much lack of idle time do we consider that we are overloaded?

0 :-)

Note that utilization is an absolute metric, not a windowed one. That
is, there is no actual time associated with it. Now, for practical
purposes we end up using windowed things in many places,

2017-03-21 14:51:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote:
<snip>
> > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u
> > if (!sugov_should_update_freq(sg_policy, time))
> > return;
> >
> > - if (flags & SCHED_CPUFREQ_RT_DL) {
> > + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
> > next_f = policy->cpuinfo.max_freq;
>
> Isn't this going to max OPP every time we have more than 1 task on
> that CPU?
>
> If so, it will not fit the case where we have two 10% tasks on that CPU.

Good point.

> The previous solution was better IMO, apart from using overloaded instead
> of overutilized (which is not yet there) :-/

OK, so the one below works too.

---
kernel/sched/cpufreq_schedutil.c | 11 +++++++++++
1 file changed, 11 insertions(+)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -37,6 +37,7 @@ struct sugov_policy {
s64 freq_update_delay_ns;
unsigned int next_freq;
unsigned int cached_raw_freq;
+ bool overload;

/* The next fields are only needed if fast switch cannot be used. */
struct irq_work irq_work;
@@ -61,6 +62,7 @@ struct sugov_cpu {
unsigned long util;
unsigned long max;
unsigned int flags;
+ bool overload;
};

static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -93,6 +95,9 @@ static void sugov_update_commit(struct s
{
struct cpufreq_policy *policy = sg_policy->policy;

+ if (sg_policy->overload && next_freq < sg_policy->next_freq)
+ next_freq = sg_policy->next_freq;
+
if (policy->fast_switch_enabled) {
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, smp_processor_id());
@@ -207,6 +212,8 @@ static void sugov_update_single(struct u
if (!sugov_should_update_freq(sg_policy, time))
return;

+ sg_policy->overload = this_rq()->rd->overload;
+
if (flags & SCHED_CPUFREQ_RT_DL) {
next_f = policy->cpuinfo.max_freq;
} else {
@@ -225,6 +232,8 @@ static unsigned int sugov_next_freq_shar
unsigned long util = 0, max = 1;
unsigned int j;

+ sg_policy->overload = false;
+
for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
unsigned long j_util, j_max;
@@ -253,6 +262,7 @@ static unsigned int sugov_next_freq_shar
}

sugov_iowait_boost(j_sg_cpu, &util, &max);
+ sg_policy->overload = sg_policy->overload || j_sg_cpu->overload;
}

return get_next_freq(sg_policy, util, max);
@@ -273,6 +283,7 @@ static void sugov_update_shared(struct u
sg_cpu->util = util;
sg_cpu->max = max;
sg_cpu->flags = flags;
+ sg_cpu->overload = this_rq()->rd->overload;

sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

2017-03-21 14:59:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tue, Mar 21, 2017 at 03:16:19PM +0100, Vincent Guittot wrote:
> On 21 March 2017 at 15:03, Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Mar 21, 2017 at 02:37:08PM +0100, Vincent Guittot wrote:
> > > On 21 March 2017 at 14:22, Peter Zijlstra <[email protected]> wrote:
> >
> > > For the not overloaded case, it makes sense to immediately update the
> > > OPP to be aligned with the new utilization of the CPU even if it was
> > > not idle in the past couple of ticks
> >
> > Yeah, but we cannot know. Also, who cares?
> >
>
> an embedded system that doesn't want to stay at the highest OPP if a significant
> part of the utilization has moved away, as an example.
> AFAICT, schedutil tries to select the best OPP according to the current
> utilization of the CPU, so if the utilization decreases, the OPP should also
> decrease.

Sure I get that; but given the lack of crystal ball instructions we
cannot know if this is the case or not.

And if we really dropped below 100% utilization, we should hit idle
fairly soon.

2017-03-21 15:02:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tuesday, March 21, 2017 03:46:07 PM Rafael J. Wysocki wrote:
<snip>
> > The previous solution was better IMO, apart from using overloaded instead
> > of overutilized (which is not yet there) :-/
>
> OK, so the one below works too.

Admittedly, we could check the idle condition and the overload flag at the same
time, though.

Let me try that too.

Thanks,
Rafael

2017-03-21 15:02:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tue, Mar 21, 2017 at 03:26:06PM +0100, Rafael J. Wysocki wrote:
> + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
> next_f = policy->cpuinfo.max_freq;

So this I think is wrong; rd->overload is set if _any_ of the CPUs in the
root domain is overloaded. And given the root domain is typically the
_entire_ machine, this would have a tendency to run at max_freq far too
often.
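
To illustrate the aggregation semantics (a standalone toy program, not
the actual update_sd_lb_stats() code in kernel/sched/fair.c where the
flag is computed):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

int main(void)
{
	/* One CPU with two runnable tasks is enough to set the flag. */
	int nr_running[NR_CPUS] = { 1, 1, 2, 0 };
	bool rd_overload = false;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (nr_running[cpu] > 1)
			rd_overload = true;

	/* Every CPU consulting the root domain now sees "overloaded",
	 * even the ones running a single task or nothing at all. */
	printf("rd->overload = %d\n", rd_overload);
	return 0;
}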

2017-03-21 15:04:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tue, Mar 21, 2017 at 03:46:07PM +0100, Rafael J. Wysocki wrote:
> @@ -207,6 +212,8 @@ static void sugov_update_single(struct u
> if (!sugov_should_update_freq(sg_policy, time))
> return;
>
> + sg_policy->overload = this_rq()->rd->overload;
> +

Same problem as before; rd->overload is set if _any_ CPU in the root
domain has more than 1 runnable task at a random point in history (when
we ran the load balance tick -- and since that is the same tick used for
timers, there's a bias to over-account there).

2017-03-21 15:08:28

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21-Mar 15:46, Rafael J. Wysocki wrote:
> On Tuesday, March 21, 2017 02:38:42 PM Patrick Bellasi wrote:
> > On 21-Mar 15:26, Rafael J. Wysocki wrote:
> > > On Tuesday, March 21, 2017 02:37:08 PM Vincent Guittot wrote:
> > > > On 21 March 2017 at 14:22, Peter Zijlstra <[email protected]> wrote:
> > > > > On Tue, Mar 21, 2017 at 09:50:28AM +0100, Vincent Guittot wrote:
> > > > >> On 20 March 2017 at 22:46, Rafael J. Wysocki <[email protected]> wrote:
> > > > >
> > > > >> > To work around this issue use the observation that, from the
> > > > >> > schedutil governor's perspective, it does not make sense to decrease
> > > > >> > the frequency of a CPU that doesn't enter idle and avoid decreasing
> > > > >> > the frequency of busy CPUs.
> > > > >>
> > > > >> I don't fully agree with that statement.
> > > > >> If there are 2 runnable tasks on CPU A and scheduler migrates the
> > > > >> waiting task to another CPU B so CPU A is less loaded now, it makes
> > > > >> sense to reduce the OPP. That's even for that purpose that we have
> > > > >> decided to use scheduler metrics in cpufreq governor so we can adjust
> > > > >> OPP immediately when tasks migrate.
> > > > >> That being said, i probably know why you see such OPP switches in your
> > > > >> use case. When we migrate a task, we also migrate/remove its
> > > > >> utilization from CPU.
> > > > >> If the CPU is not overloaded, it means that runnable tasks have all
> > > > >> computation that they need and don't have any reason to use more when
> > > > >> a task migrates to another CPU. so decreasing the OPP makes sense
> > > > >> because the utilzation is decreasing
> > > > >> If the CPU is overloaded, it means that runnable tasks have to share
> > > > >> CPU time and probably don't have all computations that they would like
> > > > >> so when a task migrate, the remaining tasks on the CPU will increase
> > > > >> their utilization and fill space left by the task that has just
> > > > >> migrated. So the CPU's utilization will decrease when a task migrates
> > > > >> (and as a result the OPP) but then its utilization will increase with
> > > > >> remaining tasks running more time as well as the OPP
> > > > >>
> > > > >> So you need to make the difference between this 2 cases: Is a CPU
> > > > >> overloaded or not. You can't really rely on the utilization to detect
> > > > >> that but you could take advantage of the load which take into account
> > > > >> the waiting time of tasks
> > > > >
> > > > > I'm confused. What two cases? You only list the overloaded case, but he
> > > >
> > > > overloaded vs not overloaded use case.
> > > > For the not overloaded case, it makes sense to immediately update to
> > > > OPP to be aligned with the new utilization of the CPU even if it was
> > > > not idle in the past couple of ticks
> > >
> > > Yes, if the OPP (or P-state if you will) can be changed immediately. If it can't,
> > > conditions may change by the time we actually update it and in that case It'd
> > > be better to wait and see IMO.
> > >
> > > In any case, the theory about migrating tasks made sense to me, so below is
> > > what I tested. It works, and besides it has a nice feature that I don't need
> > > to fetch for the timekeeping data. :-)
> > >
> > > I only wonder if we want to do this or only prevent the frequency from
> > > decreasing in the overloaded case?
> > >
> > > ---
> > > kernel/sched/cpufreq_schedutil.c | 8 +++++---
> > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> > > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > > @@ -61,6 +61,7 @@ struct sugov_cpu {
> > > unsigned long util;
> > > unsigned long max;
> > > unsigned int flags;
> > > + bool overload;
> > > };
> > >
> > > static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> > > @@ -207,7 +208,7 @@ static void sugov_update_single(struct u
> > > if (!sugov_should_update_freq(sg_policy, time))
> > > return;
> > >
> > > - if (flags & SCHED_CPUFREQ_RT_DL) {
> > > + if ((flags & SCHED_CPUFREQ_RT_DL) || this_rq()->rd->overload) {
> > > next_f = policy->cpuinfo.max_freq;
> >
> > Isn't this going to max OPP every time we have more than 1 task in
> > that CPU?
> >
> > In that case it will not fit the case: we have two 10% tasks on that CPU.
>
> Good point.
>
> > Previous solution was better IMO, apart from using overloaded instead
> > of overutilized (which is not yet there) :-/
>
> OK, so the one below works too.

Better... just one minor comment.


> ---
> kernel/sched/cpufreq_schedutil.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -37,6 +37,7 @@ struct sugov_policy {
> s64 freq_update_delay_ns;
> unsigned int next_freq;
> unsigned int cached_raw_freq;
> + bool overload;

Can we avoid using "overloaded" in favor of a more generic,
schedutil-specific name? Mainly because in the future we would
probably like to switch from "overloaded" to "overutilized".

What about something like "busy"?

>
> /* The next fields are only needed if fast switch cannot be used. */
> struct irq_work irq_work;
> @@ -61,6 +62,7 @@ struct sugov_cpu {
> unsigned long util;
> unsigned long max;
> unsigned int flags;
> + bool overload;
> };
>
> static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -93,6 +95,9 @@ static void sugov_update_commit(struct s
> {
> struct cpufreq_policy *policy = sg_policy->policy;
>
> + if (sg_policy->overload && next_freq < sg_policy->next_freq)
> + next_freq = sg_policy->next_freq;
> +
> if (policy->fast_switch_enabled) {
> if (sg_policy->next_freq == next_freq) {
> trace_cpu_frequency(policy->cur, smp_processor_id());
> @@ -207,6 +212,8 @@ static void sugov_update_single(struct u
> if (!sugov_should_update_freq(sg_policy, time))
> return;
>
> + sg_policy->overload = this_rq()->rd->overload;

And then we can move this bit into an inline function, something like e.g.:

static inline bool sugov_this_cpu_is_busy(void)
{
	return this_rq()->rd->overload;
}

Where in the future we can easily switch from using "overloaded" to
using "utilization".

> +
> if (flags & SCHED_CPUFREQ_RT_DL) {
> next_f = policy->cpuinfo.max_freq;
> } else {
> @@ -225,6 +232,8 @@ static unsigned int sugov_next_freq_shar
> unsigned long util = 0, max = 1;
> unsigned int j;
>
> + sg_policy->overload = false;
> +
> for_each_cpu(j, policy->cpus) {
> struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> unsigned long j_util, j_max;
> @@ -253,6 +262,7 @@ static unsigned int sugov_next_freq_shar
> }
>
> sugov_iowait_boost(j_sg_cpu, &util, &max);
> + sg_policy->overload = sg_policy->overload || j_sg_cpu->overload;
> }
>
> return get_next_freq(sg_policy, util, max);
> @@ -273,6 +283,7 @@ static void sugov_update_shared(struct u
> sg_cpu->util = util;
> sg_cpu->max = max;
> sg_cpu->flags = flags;
> + sg_cpu->overload = this_rq()->rd->overload;
>
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
>

--
#include <best/regards.h>

Patrick Bellasi

2017-03-21 15:18:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs


Seriously people, trim your replies.

On Tue, Mar 21, 2017 at 03:08:20PM +0000, Patrick Bellasi wrote:

> And then we can move this bit into an inline function, something like e.g.:
>
> static inline bool sugov_this_cpu_is_busy(void)
> {
> 	return this_rq()->rd->overload;
> }

No, that's just entirely and utterly wrong. It being in rd means it's
very much not about _this_ CPU in any way.

2017-03-21 15:24:28

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tuesday, March 21, 2017 04:04:03 PM Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 03:46:07PM +0100, Rafael J. Wysocki wrote:
> > @@ -207,6 +212,8 @@ static void sugov_update_single(struct u
> > if (!sugov_should_update_freq(sg_policy, time))
> > return;
> >
> > + sg_policy->overload = this_rq()->rd->overload;
> > +
>
> Same problem as before; rd->overload is set if _any_ CPU in the root
> domain has more than 1 runnable task at a random point in history (when
> we ran the load balance tick -- and since that is the same tick used for
> timers, there's a bias to over-account there).

OK

What about the one below then?

It checks both the idle-calls count and overload, and only then will it prevent
the frequency from being decreased.

It is sufficient for the case at hand.

I guess if rd->overload is not set, this means that none of the CPUs is
oversubscribed and we just saturate the capacity in a one-task-per-CPU kind
of fashion. Right?

---
include/linux/tick.h | 1 +
kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++
kernel/time/tick-sched.c | 12 ++++++++++++
3 files changed, 40 insertions(+)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -37,6 +37,7 @@ struct sugov_policy {
s64 freq_update_delay_ns;
unsigned int next_freq;
unsigned int cached_raw_freq;
+ bool busy;

/* The next fields are only needed if fast switch cannot be used. */
struct irq_work irq_work;
@@ -56,11 +57,15 @@ struct sugov_cpu {
unsigned long iowait_boost;
unsigned long iowait_boost_max;
u64 last_update;
+#ifdef CONFIG_NO_HZ_COMMON
+ unsigned long saved_idle_calls;
+#endif

/* The fields below are only needed when sharing a policy. */
unsigned long util;
unsigned long max;
unsigned int flags;
+ bool busy;
};

static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -93,6 +98,9 @@ static void sugov_update_commit(struct s
{
struct cpufreq_policy *policy = sg_policy->policy;

+ if (sg_policy->busy && next_freq < sg_policy->next_freq)
+ next_freq = sg_policy->next_freq;
+
if (policy->fast_switch_enabled) {
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, smp_processor_id());
@@ -192,6 +200,19 @@ static void sugov_iowait_boost(struct su
sg_cpu->iowait_boost >>= 1;
}

+#ifdef CONFIG_NO_HZ_COMMON
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+ unsigned long idle_calls = tick_nohz_get_idle_calls();
+ bool not_idle = idle_calls == sg_cpu->saved_idle_calls;
+
+ sg_cpu->saved_idle_calls = idle_calls;
+ return not_idle && this_rq()->rd->overload;
+}
+#else
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+#endif /* CONFIG_NO_HZ_COMMON */
+
static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int flags)
{
@@ -207,6 +228,8 @@ static void sugov_update_single(struct u
if (!sugov_should_update_freq(sg_policy, time))
return;

+ sg_policy->busy = sugov_cpu_is_busy(sg_cpu);
+
if (flags & SCHED_CPUFREQ_RT_DL) {
next_f = policy->cpuinfo.max_freq;
} else {
@@ -225,6 +248,8 @@ static unsigned int sugov_next_freq_shar
unsigned long util = 0, max = 1;
unsigned int j;

+ sg_policy->busy = false;
+
for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
unsigned long j_util, j_max;
@@ -253,6 +278,7 @@ static unsigned int sugov_next_freq_shar
}

sugov_iowait_boost(j_sg_cpu, &util, &max);
+ sg_policy->busy = sg_policy->busy || j_sg_cpu->busy;
}

return get_next_freq(sg_policy, util, max);
@@ -273,6 +299,7 @@ static void sugov_update_shared(struct u
sg_cpu->util = util;
sg_cpu->max = max;
sg_cpu->flags = flags;
+ sg_cpu->busy = sugov_cpu_is_busy(sg_cpu);

sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
extern ktime_t tick_nohz_get_sleep_length(void);
+extern unsigned long tick_nohz_get_idle_calls(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#else /* !CONFIG_NO_HZ_COMMON */
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
return ts->sleep_length;
}

+/**
+ * tick_nohz_get_idle_calls - return the current idle calls counter value
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls(void)
+{
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+ return ts->idle_calls;
+}
+
static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
{
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE

2017-03-21 17:00:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tue, Mar 21, 2017 at 04:18:52PM +0100, Rafael J. Wysocki wrote:
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> + unsigned long idle_calls = tick_nohz_get_idle_calls();
> + bool not_idle = idle_calls == sg_cpu->saved_idle_calls;
> +
> + sg_cpu->saved_idle_calls = idle_calls;
> + return not_idle && this_rq()->rd->overload;
> +}

So I really don't understand the rd->overload thing. What is it supposed
to do here?

2017-03-21 17:03:07

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21 March 2017 at 18:00, Vincent Guittot <[email protected]> wrote:
<snip>
>>
>> Sure I get that; but given the lack of crystal ball instructions we
>> cannot know if this is the case or not.
>
> cfs_rq->avg.load_avg accounts the waiting time of CPU (in addition to

Sorry, I wanted to say: the waiting time of tasks on the CPU.

> the weight of tasks), so I was wondering if we can't use it to detect
> whether we are in the overloaded case or not, even if utilization is not at
> max capacity because we have just migrated a task (and its utilization)
> out.
>
>
>
>>
>> And if we really dropped below 100% utilization, we should hit idle
>> fairly soon.

2017-03-21 17:03:44

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21 March 2017 at 15:58, Peter Zijlstra <[email protected]> wrote:
>
<snip>
>
> Sure I get that; but given the lack of crystal ball instructions we
> cannot know if this is the case or not.

cfs_rq->avg.load_avg accounts the waiting time of CPU (in addition to
the weight of tasks), so I was wondering if we can't use it to detect
whether we are in the overloaded case or not, even if utilization is not at
max capacity because we have just migrated a task (and its utilization)
out.
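
Something along these lines maybe (a purely hypothetical sketch, not a
tested patch; the field names come from this era's struct sched_avg,
the 25% margin is arbitrary, and it glosses over load_avg being
weight-scaled):

static bool cpu_looks_overloaded(struct cfs_rq *cfs_rq)
{
	unsigned long util = cfs_rq->avg.util_avg;
	unsigned long load = cfs_rq->avg.load_avg;

	/* util only grows while tasks run; load also grows while they
	 * wait on the runqueue, so load well above util hints that
	 * tasks were queued behind each other. */
	return load > util + (util >> 2);
}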



>
> And if we really dropped below 100% utilization, we should hit idle
> fairly soon.

2017-03-21 17:22:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On Tuesday, March 21, 2017 06:00:17 PM Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 04:18:52PM +0100, Rafael J. Wysocki wrote:
> > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +{
> > + unsigned long idle_calls = tick_nohz_get_idle_calls();
> > + bool not_idle = idle_calls == sg_cpu->saved_idle_calls;
> > +
> > + sg_cpu->saved_idle_calls = idle_calls;
> > + return not_idle && this_rq()->rd->overload;
> > +}
>
> So I really don't understand the rd->overload thing. What is it supposed
> to do here?

The idea was that if the CPU was running one task saturating the capacity which
then was migrated out of it, the frequency should still be reduced.

And since rd->overload covers all CPUs (in general) it kind of tells us whether
or not there are other tasks to replace the migrated one any time soon.

However, if there are no tasks to replace the migrated one, the CPU will go
idle quickly (as there are no tasks to run on it), in which case keeping the
current frequency on it shouldn't matter.

In all of the other cases keeping the current frequency is the right thing to
do IMO.

So, it looks like checking this_rq()->rd->overload doesn't really help after all. :-)

Thanks,
Rafael

2017-03-21 19:28:33

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 2/2] cpufreq: schedutil: Avoid decreasing frequency of busy CPUs

On 21-Mar 16:18, Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 03:08:20PM +0000, Patrick Bellasi wrote:
>
> > And then we can move this bit into an inline function, something like e.g.:
> >
> > static inline bool sugov_this_cpu_is_busy(void)
> > {
> > 	return this_rq()->rd->overload;
> > }
>
> No, that's just entirely and utterly wrong. It being in rd means it's
> very much not about _this_ CPU in any way.

You're right (of course): we cannot really use "this_" in the name of
a function with such code.

The suggestion here was at least to factor out whatever code we want
to use to check if the current CPU has to be subject to a down-scaling
constraint.

However, using rd->overload is not the best option, for the many reasons
you explained in your previous comment. Thus, we should probably stay
with the idle time tracking solution initially proposed by Rafael.

Sorry for the noise :-(

--
#include <best/regards.h>

Patrick Bellasi

2017-03-21 23:14:27

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

From: Rafael J. Wysocki <[email protected]>

The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task. If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away. That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway. On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
include/linux/tick.h | 1 +
kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++
kernel/time/tick-sched.c | 12 ++++++++++++
3 files changed, 40 insertions(+)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -61,6 +61,11 @@ struct sugov_cpu {
unsigned long util;
unsigned long max;
unsigned int flags;
+
+ /* The field below is for single-CPU policies only. */
+#ifdef CONFIG_NO_HZ_COMMON
+ unsigned long saved_idle_calls;
+#endif
};

static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su
sg_cpu->iowait_boost >>= 1;
}

+#ifdef CONFIG_NO_HZ_COMMON
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+ unsigned long idle_calls = tick_nohz_get_idle_calls();
+ bool ret = idle_calls == sg_cpu->saved_idle_calls;
+
+ sg_cpu->saved_idle_calls = idle_calls;
+ return ret;
+}
+#else
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+#endif /* CONFIG_NO_HZ_COMMON */
+
static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int flags)
{
@@ -200,6 +218,7 @@ static void sugov_update_single(struct u
struct cpufreq_policy *policy = sg_policy->policy;
unsigned long util, max;
unsigned int next_f;
+ bool busy;

sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -207,12 +226,20 @@ static void sugov_update_single(struct u
if (!sugov_should_update_freq(sg_policy, time))
return;

+ busy = sugov_cpu_is_busy(sg_cpu);
+
if (flags & SCHED_CPUFREQ_RT_DL) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
sugov_iowait_boost(sg_cpu, &util, &max);
next_f = get_next_freq(sg_policy, util, max);
+ /*
+ * Do not reduce the frequency if the CPU has not been idle
+ * recently, as the reduction is likely to be premature then.
+ */
+ if (busy && next_f < sg_policy->next_freq)
+ next_f = sg_policy->next_freq;
}
sugov_update_commit(sg_policy, time, next_f);
}
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
extern ktime_t tick_nohz_get_sleep_length(void);
+extern unsigned long tick_nohz_get_idle_calls(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
#else /* !CONFIG_NO_HZ_COMMON */
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
return ts->sleep_length;
}

+/**
+ * tick_nohz_get_idle_calls - return the current idle calls counter value
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls(void)
+{
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+ return ts->idle_calls;
+}
+
static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
{
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE

2017-03-22 09:27:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On Wed, Mar 22, 2017 at 12:08:50AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
<snip>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Right; this makes sense to me. Of course it would be good to have some
more measurements on this, but in principle:

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2017-03-22 10:05:16

by Viresh Kumar

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On 22-03-17, 00:08, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
<snip>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> include/linux/tick.h | 1 +
> kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++
> kernel/time/tick-sched.c | 12 ++++++++++++
> 3 files changed, 40 insertions(+)

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2017-03-22 23:56:52

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi
<[email protected]> wrote:
> On 20-Mar 09:26, Vincent Guittot wrote:
>> On 20 March 2017 at 04:57, Viresh Kumar <[email protected]> wrote:
>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> >> From: Rafael J. Wysocki <[email protected]>
>> >>
>> >> The PELT metric used by the schedutil governor underestimates the
>> >> CPU utilization in some cases. The reason for that may be time spent
>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>
>> Are you sure of the root cause described above (time stolen by the irq
>> handler) or is it just a hypothesis? It would be good to be sure of
>> the root cause.
>> Furthermore, IIRC the time spent in irq context is also accounted as
>> run time for the running cfs task, but not as RT and deadline task running
>> time.
>
> As long as the IRQ processing does not generate a context switch,
> which happens (eventually) if the top half schedules some deferred
> work to be executed by a bottom half.
>
> Thus, I too would say that all the top-half time is accounted in
> PELT, since the current task is still RUNNABLE/RUNNING.

Sorry if I'm missing something, but doesn't this depend on whether you
have CONFIG_IRQ_TIME_ACCOUNTING enabled?

__update_load_avg uses rq->clock_task for deltas, which I think
shouldn't account IRQ time with that config option. So it should be
quite possible for the IRQ time spent to reduce the PELT signal, right?
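
For reference, this is roughly what update_rq_clock_task() in
kernel/sched/core.c does with that option enabled (a trimmed paraphrase,
not a literal copy):

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	s64 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

	/* Clip the time spent in hardirq/softirq context out of the
	 * task clock... */
	if (irq_delta > delta)
		irq_delta = delta;
	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;
#endif
	/* ...so PELT, which accumulates rq->clock_task deltas, never
	 * sees that time. */
	rq->clock_task += delta;
}

That would support the reading that IRQ time is invisible to the PELT
signal with this option set.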

>
>> So I'm not really aligned with the description of your problem: the PELT
>> metric underestimates the load of the CPU. PELT is just about
>> tracking CFS task utilization, not whole-CPU utilization, and
>> according to your description of the problem (time stolen by irq),
>> your problem doesn't come from an underestimation of CFS tasks but from
>> time spent in something else that is not accounted for in the value used
>> by schedutil.
>
> Quite likely. Indeed, it can really be that the CFS task is preempted
> because of some RT activity generated by the IRQ handler.
>
> More generally, I've also noticed many suboptimal freq switches when
> RT tasks interleave with CFS ones, because of:
> - relatively long down _and up_ throttling times
> - the way schedutil's flags are tracked and updated
> - the callsites from where we call schedutil updates
>
> For example, it can really happen that we are running at the highest
> OPP because of some RT activity. Then we switch back to a relatively
> low-utilization CFS workload, and then:
> 1. a tick happens which produces a frequency drop

Any idea why this frequency drop would happen? Say a running CFS task
gets preempted by an RT task; the PELT signal shouldn't drop for the
duration the CFS task is preempted, because the task is runnable, so
once the CFS task gets the CPU back, schedutil should still maintain the
capacity, right?

Regards,
Joel

2017-03-23 01:04:36

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On Tue, Mar 21, 2017 at 4:08 PM, Rafael J. Wysocki <[email protected]> wrote:
> From: Rafael J. Wysocki <[email protected]>
<snip>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Makes sense,

Reviewed-by: Joel Fernandes <[email protected]>

Thanks,
Joel

2017-03-23 19:29:10

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

Hi Rafael,

On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>

<snip>

>
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task. If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away. That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
>
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway. On

I have observed this issue even in the shared-policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:

1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
2. Do the wakeup on dst_cpu, add the load to dst_cpu

Now if src_cpu manages to do a PELT update before 2. happens (for example, because a small periodic task woke up on src_cpu), it'll end up subtracting the removed_load from its utilization and issuing a frequency update before 2. happens.

This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu frequency update from step 2. above gets rate-limited in this scenario.
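
To make the ordering concrete, here is a hedged sketch of that two-step sequence; the types and helper names are simplified illustrations, not the actual fair.c code:

struct cpu_util_sketch {
	unsigned long util_avg;		/* tracked utilization of this CPU */
	unsigned long removed_util;	/* utilization of tasks migrated away */
};

/* Step 1: on migrate, the source only records the removal. */
static void migrate_from(struct cpu_util_sketch *src, unsigned long task_util)
{
	src->removed_util += task_util;
}

/* Any source-side PELT update that runs before step 2... */
static void pelt_update(struct cpu_util_sketch *cpu)
{
	if (cpu->removed_util > cpu->util_avg)
		cpu->removed_util = cpu->util_avg;
	cpu->util_avg -= cpu->removed_util;	/* the premature dip happens here */
	cpu->removed_util = 0;
	/*
	 * A frequency update issued at this point sees the reduced value,
	 * while the corrective update from step 2 may then be rate-limited.
	 */
}

/* Step 2: the destination gains the utilization only at wakeup time. */
static void wakeup_on(struct cpu_util_sketch *dst, unsigned long task_util)
{
	dst->util_avg += task_util;
}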


> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
>
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
>
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> include/linux/tick.h | 1 +
> kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++++++++++++++++
> kernel/time/tick-sched.c | 12 ++++++++++++
> 3 files changed, 40 insertions(+)
>
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -61,6 +61,11 @@ struct sugov_cpu {
> unsigned long util;
> unsigned long max;
> unsigned int flags;
> +
> + /* The field below is for single-CPU policies only. */
> +#ifdef CONFIG_NO_HZ_COMMON
> + unsigned long saved_idle_calls;
> +#endif
> };
>
> static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su
> sg_cpu->iowait_boost >>= 1;
> }
>
> +#ifdef CONFIG_NO_HZ_COMMON
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> + unsigned long idle_calls = tick_nohz_get_idle_calls();
> + bool ret = idle_calls == sg_cpu->saved_idle_calls;
> +
> + sg_cpu->saved_idle_calls = idle_calls;
> + return ret;
> +}

Hm, sorry, I am a bit confused; perhaps you could help me understand the problem/solution better :)

Say we have this simple case of only a single periodic task running on one CPU: wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping the frequency, right?

If I am reading this correctly, the PELT update on the dequeue for the periodic task (in the scenario above) happens _before_ the idle_calls++ which is in tick_nohz_idle_enter.

Thanks!
-Sai

2017-03-23 20:50:36

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely



On 03/23/2017 12:26 PM, Sai Gurrappadi wrote:

>
> Hm, sorry, I am a bit confused; perhaps you could help me understand the problem/solution better :)
>
> Say we have this simple case of only a single periodic task running on one CPU: wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping the frequency, right?

Apologies, this example here is flawed because on task dequeue, its utilization isn't removed. There is no problem in this case...


-Sai

2017-03-23 22:08:40

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 23 March 2017 at 00:56, Joel Fernandes <[email protected]> wrote:
> On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi
> <[email protected]> wrote:
>> On 20-Mar 09:26, Vincent Guittot wrote:
>>> On 20 March 2017 at 04:57, Viresh Kumar <[email protected]> wrote:
>>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>>> >> From: Rafael J. Wysocki <[email protected]>
>>> >>
>>> >> The PELT metric used by the schedutil governor underestimates the
>>> >> CPU utilization in some cases. The reason for that may be time spent
>>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>>
>>> Are you sure of the root cause described above (time stolen by the
>>> irq handler) or is it just a hypothesis? It would be good to be sure
>>> of the root cause.
>>> Furthermore, IIRC the time spent in irq context is also accounted as
>>> run time for the running cfs task, but not as RT and deadline task
>>> running time.
>>
>> As long as the IRQ processing does not generate a context switch,
>> which happens (eventually) if the top half schedules some deferred
>> work to be executed by a bottom half.
>>
>> Thus, I too would say that all the top-half time is accounted in
>> PELT, since the current task is still RUNNABLE/RUNNING.
>
> Sorry if I'm missing something, but doesn't this depend on whether you
> have CONFIG_IRQ_TIME_ACCOUNTING enabled?
>
> __update_load_avg() uses rq->clock_task for its deltas, which I think
> shouldn't account IRQ time with that config option. So it should be
> quite possible for time spent in IRQs to reduce the PELT signal, right?
>
>>
>>> So I'm not really aligned with the description of your problem: the
>>> PELT metric underestimates the load of the CPU. PELT is just about
>>> tracking CFS task utilization, not whole-CPU utilization, and
>>> according to your description of the problem (time stolen by irq),
>>> your problem doesn't come from an underestimation of CFS tasks but
>>> from time spent in something else that is not accounted for in the
>>> value used by schedutil.
>>
>> Quite likely. Indeed, it can really be that the CFS task is preempted
>> because of some RT activity generated by the IRQ handler.
>>
>> More generally, I've also noticed many suboptimal freq switches when
>> RT tasks interleave with CFS ones, because of:
>> - relatively long down _and up_ throttling times
>> - the way schedutil's flags are tracked and updated
>> - the callsites from where we call schedutil updates
>>
>> For example it can really happen that we are running at the highest
>> OPP because of some RT activity. Then we switch back to a relatively
>> low utilization CFS workload and then:
>> 1. a tick happens which produces a frequency drop
>
> Any idea why this frequency drop would happen? Say a running CFS task
> gets preempted by an RT task: the PELT signal shouldn't drop for the
> duration the CFS task is preempted because the task is runnable, so

Utilization only tracks the running state, not the runnable state.
The runnable state is tracked in load_avg.
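
As a minimal illustration of that distinction, here is a sketch of the accumulation rule; it is heavily simplified (decay, weights and scaling are elided) and the names are illustrative rather than the real __update_load_avg() internals:

struct pelt_sketch {
	unsigned long load_sum;	/* accumulates while the entity is runnable */
	unsigned long util_sum;	/* accumulates only while it is running */
};

static void accumulate_segment(struct pelt_sketch *sa, unsigned long delta,
			       int runnable, int running)
{
	/* geometric decay of the old sums happens first (elided here) */
	if (runnable)
		sa->load_sum += delta;	/* running or waiting on the rq */
	if (running)
		sa->util_sum += delta;	/* actually on the CPU */
	/*
	 * A CFS task preempted by an RT task is runnable but not running:
	 * load_sum keeps growing while util_sum decays, which is why the
	 * utilization seen by schedutil can drop during RT activity.
	 */
}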

> once the CFS task gets the CPU back, schedutil should still maintain
> the capacity, right?
>
> Regards,
> Joel

2017-03-24 01:39:07

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <[email protected]> wrote:
> Hi Rafael,

Hi,

> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>
> <snip>
>
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task. If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away. That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>> This is unlikely to be an issue on systems where cpufreq policies are
>> shared between multiple CPUs, because in those cases the policy
>> utilization is computed as the maximum of the CPU utilization values
>> over the whole policy and if that turns out to be low, reducing the
>> frequency for the policy most likely is a good idea anyway. On
>
> I have observed this issue even in the shared-policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:
>
> 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
> 2. Do the wakeup on dst_cpu, add the load to dst_cpu
>
> Now if src_cpu manages to do a PELT update before 2. happens (for example, because a small periodic task woke up on src_cpu), it'll end up subtracting the removed_load from its utilization and issuing a frequency update before 2. happens.
>
> This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu frequency update from step 2. above gets rate-limited in this scenario.

Interesting, and this seems to be related to last_freq_update_time
being per-policy (which it has to be, because frequency updates are
per-policy too and that's what we need to rate-limit).

Does this happen often enough to be a real concern in practice on
those configurations, though?

The other CPUs in the policy need to be either idle (so schedutil
doesn't take them into account at all) or lightly utilized for that to
happen, so that would affect workloads with one CPU hog type of task
that is migrated from one CPU to another within a policy and that
doesn't happen too often AFAICS.

Thanks,
Rafael

2017-03-24 19:10:55

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On 03/23/2017 06:39 PM, Rafael J. Wysocki wrote:
> On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <[email protected]> wrote:
>> Hi Rafael,
>
> Hi,
>
>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <[email protected]>
>>
>> <snip>
>>
>>>
>>> That has been attributed to CPU utilization metric updates on task
>>> migration that cause the total utilization value for the CPU to be
>>> reduced by the utilization of the migrated task. If that happens,
>>> the schedutil governor may see a CPU utilization reduction and will
>>> attempt to reduce the CPU frequency accordingly right away. That
>>> may be premature, though, for example if the system is generally
>>> busy and there are other runnable tasks waiting to be run on that
>>> CPU already.
>>>
>>> This is unlikely to be an issue on systems where cpufreq policies are
>>> shared between multiple CPUs, because in those cases the policy
>>> utilization is computed as the maximum of the CPU utilization values
>>> over the whole policy and if that turns out to be low, reducing the
>>> frequency for the policy most likely is a good idea anyway. On
>>
>> I have observed this issue even in the shared-policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:
>>
>> 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
>> 2. Do the wakeup on dst_cpu, add the load to dst_cpu
>>
>> Now if src_cpu manages to do a PELT update before 2. happens (for example, because a small periodic task woke up on src_cpu), it'll end up subtracting the removed_load from its utilization and issuing a frequency update before 2. happens.
>>
>> This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu frequency update from step 2. above gets rate-limited in this scenario.
>
> Interesting, and this seems to be related to last_freq_update_time
> being per-policy (which it has to be, because frequency updates are
> per-policy too and that's what we need to rate-limit).
>

Correct.

> Does this happen often enough to be a real concern in practice on
> those configurations, though?
>
> The other CPUs in the policy need to be either idle (so schedutil
> doesn't take them into account at all) or lightly utilized for that to
> happen, so that would affect workloads with one CPU hog type of task
> that is migrated from one CPU to another within a policy and that
> doesn't happen too often AFAICS.

So it is possible, even likely in some cases, for a heavy CPU task to migrate on wakeup between the policy->cpus via select_idle_sibling() if the prev_cpu it was on was !idle on wakeup.

This style of heavy thread + lots of light work is a common pattern on Android (games, browsing, etc.) given how Android does its threading for ipc (Binder stuff) + its rendering/audio pipelines.

I unfortunately don't have any numbers atm though.

-Sai

2017-03-25 01:17:08

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

Hi Rafael,

On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register. Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again. That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task. If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away. That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
>

Thinking out loud a bit, I wonder if what you really want to do is basically:

schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);

Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT-like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but decays/accumulates just like the per-entity util_avgs.

Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the smaller the amount of 'overlap'/overloading.

Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:

schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
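
For concreteness, here is a minimal sketch of such a total_cpu_util_avg, assuming a PELT-like geometric decay with a half-life of 32 periods; the names and constants are illustrative, not existing kernel code:

#define SCALE		1024	/* fixed-point unit, as in PELT */
#define DECAY_NUM	1002	/* y in 1/1024ths, chosen so y^32 ~= 0.5 */
#define DECAY_DEN	1024

struct cpu_busy_sketch {
	unsigned long total_util_avg;	/* ranges over 0..SCALE */
};

/* Called once per accounting period with busy = 1 if the CPU ran anything. */
static void update_total_cpu_util(struct cpu_busy_sketch *c, int busy)
{
	c->total_util_avg = c->total_util_avg * DECAY_NUM / DECAY_DEN;
	if (busy)	/* converges to SCALE when permanently busy */
		c->total_util_avg += SCALE - SCALE * DECAY_NUM / DECAY_DEN;
}

static unsigned long schedutil_cpu_util_sketch(const struct cpu_busy_sketch *c,
					       unsigned long cfs_util_avg,
					       int cpu_is_idle)
{
	unsigned long total = cpu_is_idle ? 0 : c->total_util_avg;

	return cfs_util_avg > total ? cfs_util_avg : total;
}

Unlike the per-entity signal, this value ramps and decays smoothly instead of stepping when a task migrates in or out.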

Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.

Thoughts?

Thanks,
-Sai

2017-03-25 01:39:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On Sat, Mar 25, 2017 at 2:14 AM, Sai Gurrappadi <[email protected]> wrote:
> Hi Rafael,
>
> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> The way the schedutil governor uses the PELT metric causes it to
>> underestimate the CPU utilization in some cases.
>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register. Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again. That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task. If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away. That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>
> Thinking out loud a bit, I wonder if what you really want to do is basically:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>
> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT-like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but decays/accumulates just like the per-entity util_avgs.
>
> Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the smaller the amount of 'overlap'/overloading.
>
> Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
>
> Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.
>
> Thoughts?

Well, is the total_cpu_util_avg metric readily available?

Thanks,
Rafael

2017-03-25 03:48:32

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

Hi Vincent,

On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot
<[email protected]> wrote:
[..]
>>>
>>>> So I'm not really aligned with the description of your problem: the
>>>> PELT metric underestimates the load of the CPU. PELT is just about
>>>> tracking CFS task utilization, not whole-CPU utilization, and
>>>> according to your description of the problem (time stolen by irq),
>>>> your problem doesn't come from an underestimation of CFS tasks but
>>>> from time spent in something else that is not accounted for in the
>>>> value used by schedutil.
>>>
>>> Quite likely. Indeed, it can really be that the CFS task is preempted
>>> because of some RT activity generated by the IRQ handler.
>>>
>>> More generally, I've also noticed many suboptimal freq switches when
>>> RT tasks interleave with CFS ones, because of:
>>> - relatively long down _and up_ throttling times
>>> - the way schedutil's flags are tracked and updated
>>> - the callsites from where we call schedutil updates
>>>
>>> For example it can really happen that we are running at the highest
>>> OPP because of some RT activity. Then we switch back to a relatively
>>> low utilization CFS workload and then:
>>> 1. a tick happens which produces a frequency drop
>>
>> Any idea why this frequency drop would happen? Say a running CFS task
>> gets preempted by an RT task: the PELT signal shouldn't drop for the
>> duration the CFS task is preempted because the task is runnable, so
>
> Utilization only tracks the running state, not the runnable state.
> The runnable state is tracked in load_avg.

Thanks. I got it now.

Correct me if I'm wrong, but strictly speaking the utilization of a
cfs_rq (which drives the frequency for CFS) still tracks the
blocked/runnable time of tasks, although it's decayed as time moves
forward. Only when a cfs task migrates to another rq is its util_avg
contribution removed from the rq. But I can see now why running RT can
decay this load-tracking signal.

Regards,
Joel

2017-03-27 06:59:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

On 25 March 2017 at 04:48, Joel Fernandes <[email protected]> wrote:
> Hi Vincent,
>
> On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot
> <[email protected]> wrote:
> [..]
>>>>
>>>>> So I'm not really aligned with the description of your problem: the
>>>>> PELT metric underestimates the load of the CPU. PELT is just about
>>>>> tracking CFS task utilization, not whole-CPU utilization, and
>>>>> according to your description of the problem (time stolen by irq),
>>>>> your problem doesn't come from an underestimation of CFS tasks but
>>>>> from time spent in something else that is not accounted for in the
>>>>> value used by schedutil.
>>>>
>>>> Quite likely. Indeed, it can really be that the CFS task is preempted
>>>> because of some RT activity generated by the IRQ handler.
>>>>
>>>> More generally, I've also noticed many suboptimal freq switches when
>>>> RT tasks interleave with CFS ones, because of:
>>>> - relatively long down _and up_ throttling times
>>>> - the way schedutil's flags are tracked and updated
>>>> - the callsites from where we call schedutil updates
>>>>
>>>> For example it can really happen that we are running at the highest
>>>> OPP because of some RT activity. Then we switch back to a relatively
>>>> low utilization CFS workload and then:
>>>> 1. a tick happens which produces a frequency drop
>>>
>>> Any idea why this frequency drop would happen? Say a running CFS task
>>> gets preempted by an RT task: the PELT signal shouldn't drop for the
>>> duration the CFS task is preempted because the task is runnable, so
>>
>> Utilization only tracks the running state, not the runnable state.
>> The runnable state is tracked in load_avg.
>
> Thanks. I got it now.
>
> Correct me if I'm wrong, but strictly speaking the utilization of a
> cfs_rq (which drives the frequency for CFS) still tracks the
> blocked/runnable time of tasks, although it's decayed as time moves
> forward. Only when a cfs task migrates to another rq is its util_avg
> contribution removed from the rq. But I can see now why running RT can
> decay this load-tracking signal.

Yes, you're right.


>
> Regards,
> Joel

2017-03-27 07:14:28

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On 25 March 2017 at 02:14, Sai Gurrappadi <[email protected]> wrote:
> Hi Rafael,
>
> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <[email protected]>
>>
>> The way the schedutil governor uses the PELT metric causes it to
>> underestimate the CPU utilization in some cases.
>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register. Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again. That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task. If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away. That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>
> Thinking out loud a bit, I wonder if what you really want to do is basically:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>
> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT-like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but decays/accumulates just like the per-entity util_avgs.

But then we lose the benefit of the immediate decrease when tasks
migrate. Instead of total_cpu_util_avg we should rather track RT
utilization in the same manner, so with the ongoing work for deadline
we will have:
total_utilization = cfs.util_avg + rt's util_avg + deadline's util_avg
and we still take advantage of the task migration effect.
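
For illustration, here is a sketch of the interface that suggestion points towards; the rt/dl signals are assumptions (that per-class tracking did not exist at the time), not a description of existing code:

struct rq_util_sketch {
	unsigned long cfs_util_avg;	/* existing CFS PELT signal */
	unsigned long rt_util_avg;	/* hypothetical RT utilization tracking */
	unsigned long dl_util_avg;	/* hypothetical deadline utilization tracking */
};

/* What schedutil could consume instead of cfs.util_avg alone. */
static unsigned long total_utilization(const struct rq_util_sketch *rq,
				       unsigned long max_capacity)
{
	unsigned long sum = rq->cfs_util_avg + rq->rt_util_avg +
			    rq->dl_util_avg;

	/* Each signal still drops immediately when its tasks migrate away. */
	return sum < max_capacity ? sum : max_capacity;
}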


>
> Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the smaller the amount of 'overlap'/overloading.
>
> Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
>
> Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.
>
> Thoughts?
>
> Thanks,
> -Sai

2017-03-27 21:04:36

by Sai Gurrappadi

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

Hi Vincent,

On 03/27/2017 12:04 AM, Vincent Guittot wrote:
> On 25 March 2017 at 02:14, Sai Gurrappadi <[email protected]> wrote:
>> Hi Rafael,
>>
>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <[email protected]>
>>>
>>> The way the schedutil governor uses the PELT metric causes it to
>>> underestimate the CPU utilization in some cases.
>>>
>>> That can be easily demonstrated by running kernel compilation on
>>> a Sandy Bridge Intel processor, running turbostat in parallel with
>>> it and looking at the values written to the MSR_IA32_PERF_CTL
>>> register. Namely, the expected result would be that when all CPUs
>>> were 100% busy, all of them would be requested to run in the maximum
>>> P-state, but observation shows that this clearly isn't the case.
>>> The CPUs run in the maximum P-state for a while and then are
>>> requested to run slower and go back to the maximum P-state after
>>> a while again. That causes the actual frequency of the processor to
>>> visibly oscillate below the sustainable maximum in a jittery fashion
>>> which clearly is not desirable.
>>>
>>> That has been attributed to CPU utilization metric updates on task
>>> migration that cause the total utilization value for the CPU to be
>>> reduced by the utilization of the migrated task. If that happens,
>>> the schedutil governor may see a CPU utilization reduction and will
>>> attempt to reduce the CPU frequency accordingly right away. That
>>> may be premature, though, for example if the system is generally
>>> busy and there are other runnable tasks waiting to be run on that
>>> CPU already.
>>>
>>
>> Thinking out loud a bit, I wonder if what you really want to do is basically:
>>
>> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>>
>> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT-like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but decays/accumulates just like the per-entity util_avgs.
>
> But then we lose the benefit of the immediate decrease when tasks migrate.

Indeed, this is not ideal.

> Instead of total_cpu_util_avg we should rather track RT utilization in
> the same manner, so with the ongoing work for deadline we will have:
> total_utilization = cfs.util_avg + rt's util_avg + deadline's util_avg
> and we still take advantage of the task migration effect.

I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU.

In the overloaded case, we care not just about the instant when the migration happens but also about subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself.

Maybe there are better ways to solve that though :-)

Thanks,
-Sai

2017-03-27 21:11:56

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH v3 2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

On Mon, Mar 27, 2017 at 11:01 PM, Sai Gurrappadi <[email protected]> wrote:
> Hi Vincent,
>
> On 03/27/2017 12:04 AM, Vincent Guittot wrote:
>> On 25 March 2017 at 02:14, Sai Gurrappadi <[email protected]> wrote:
>>> Hi Rafael,
>>>
>>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>>> From: Rafael J. Wysocki <[email protected]>
>>>>
>>>> The way the schedutil governor uses the PELT metric causes it to
>>>> underestimate the CPU utilization in some cases.
>>>>
>>>> That can be easily demonstrated by running kernel compilation on
>>>> a Sandy Bridge Intel processor, running turbostat in parallel with
>>>> it and looking at the values written to the MSR_IA32_PERF_CTL
>>>> register. Namely, the expected result would be that when all CPUs
>>>> were 100% busy, all of them would be requested to run in the maximum
>>>> P-state, but observation shows that this clearly isn't the case.
>>>> The CPUs run in the maximum P-state for a while and then are
>>>> requested to run slower and go back to the maximum P-state after
>>>> a while again. That causes the actual frequency of the processor to
>>>> visibly oscillate below the sustainable maximum in a jittery fashion
>>>> which clearly is not desirable.
>>>>
>>>> That has been attributed to CPU utilization metric updates on task
>>>> migration that cause the total utilization value for the CPU to be
>>>> reduced by the utilization of the migrated task. If that happens,
>>>> the schedutil governor may see a CPU utilization reduction and will
>>>> attempt to reduce the CPU frequency accordingly right away. That
>>>> may be premature, though, for example if the system is generally
>>>> busy and there are other runnable tasks waiting to be run on that
>>>> CPU already.
>>>>
>>>
>>> Thinking out loud a bit, I wonder if what you really want to do is basically:
>>>
>>> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>>>
>>> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT-like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but decays/accumulates just like the per-entity util_avgs.
>>
>> But then we lose the benefit of the immediate decrease when tasks migrate.
>
> Indeed, this is not ideal.
>
>> Instead of total_cpu_util_avg we should rather track RT utilization in
>> the same manner, so with the ongoing work for deadline we will have:
>> total_utilization = cfs.util_avg + rt's util_avg + deadline's util_avg
>> and we still take advantage of the task migration effect.
>
> I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU.
>
> In the overloaded case, we care not just about the instant when the migration happens but also about subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself.
>
> Maybe there are better ways to solve that though :-)

I wonder if it's viable to postpone the utilization update on both the
source and target runqueues until the task has been fully migrated?

That would make the artificial utilization reductions go away.

Thanks,
Rafael