2019-10-25 20:42:41

by Doug Smythies

Subject: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
which, in turn, was a re-apply of
commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
after it was reverted via
commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")

For an idle system, the values checked by the cfs_rq_is_decayed() function can
underflow to 0, causing it to incorrectly return TRUE when the item should not
be deleted from the list.

Signed-off-by: Doug Smythies <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Srinivas Pandruvada <[email protected]>

---
Note 1: Both this reversion and an alternative that only deletes the
cfs_rq_is_decayed function and its call, keeping the other changes, have been
tested. I do not know which solution is better. (i.e. for the
"list_for_each_entry_rcu" part of it.)

Note 2: Previous controversy over this patch involved heavy workloads,
but this report is based on minimal or no workload, i.e. "idle".
Note that "idle" on my test server, which has no GUI and many services disabled,
tends to mean more "idle" than on most systems.

Note 3: While this supporting data only involves the intel_pstate CPU
frequency scaling driver as a casualty, it is beyond my capabilities
to determine what other tasks that should run might be omitted.

Use case example 1:
System Idle: The intel_pstate CPU frequency scaling driver:
Mode: Active, non-hwp, powersave governor.
Expected behaviour: There is never a duration (time between calls to
the driver, per CPU) longer than 4 seconds (the watchdog time, I think).
Actual behaviour: There are very long gaps between calls to the driver:

Kernel: 5.4-rc2 CPU:7
duration: 327.17 Seconds. (this is one of many hundreds of examples.)
mperf: 44023326
aperf: 20716861
tsc: 1.11604E+12
load: 0
CPU frequency: 1.6053 GHz (average over this 327 second sample period).
old pstate: 16 (the lowest for my processor)
new pstate: 16

Kernel: 5.4-rc2 + reversion (either method)
After several hours of testing, maximum durations were never more
than 4 seconds (well plus some jitter).
reversion method: max=4.07908 seconds
CPU:7
mperf: 492578
aperf: 231813 (56,829 per second average is consistent with other tests)
tsc: 13914264074
load: 0
CPU frequency: 1.6052 GHz
old pstate: 16 (the lowest for my processor)
new pstate: 16

On average, the non-reverted kernel executes the driver 25% less often
than the reverted kernel during idle.

O.K., so who cares if the requested pstate doesn't change?
First, one wonders if the math could overflow
(although commit 7180ddd suggests maybe it won't).
Second, the sample is largely dominated by obsolete information.
Third, this can be problematic, and potentially wastes energy,
during the busy to idle transition.

Use case example 2:
The busy to idle transition:

Typically, the pstate request response to a busy to idle transition
is very slow because the duration suddenly goes from, typically,
10 milliseconds to much longer, up to 4 seconds. The transition
time to the system being fully idle, with all requested pstates
at minimum, is around 8 seconds with this reversion,
and, potentially, a very long time (over 100 seconds has been
measured) without it.

Again, so who cares, if the processor is in a deep idle state anyway,
not consuming much energy? O.K., but what if it is in an idle state
where energy consumption is a function of the requested pstate?
For example, with my processor (i7-2600K) in idle state 1, the processor
package energy can be over double what it should be for many tens of
seconds.

Experiment method:

enable only idle state 1
Dountil stopped
apply a 100% load (all CPUs)
after a while (about 50 seconds) remove the load.
allow a short transient delay (1 second).
measure the processor package joules used over the next 149 seconds.
Enduntil
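
For reference, a minimal sketch (not the actual measurement tool used for
the numbers below) of one way the "measure the processor package joules"
step can be taken, assuming the powercap RAPL sysfs interface is present;
the path and the 149 second window are assumptions matching the method
above:

#include <stdio.h>
#include <unistd.h>

/* Read the cumulative package energy counter (microjoules). */
static long long read_energy_uj(const char *path)
{
    long long uj = -1;
    FILE *f = fopen(path, "r");

    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    /* assumed powercap RAPL path; counter wrap is ignored for brevity */
    const char *pkg = "/sys/class/powercap/intel-rapl:0/energy_uj";
    long long before, after;

    before = read_energy_uj(pkg);
    sleep(149);                  /* measurement window from the method above */
    after = read_energy_uj(pkg);

    if (before < 0 || after < 0 || after < before) {
        fprintf(stderr, "could not read %s\n", pkg);
        return 1;
    }

    printf("package: %.3f joules, %.3f watts average\n",
           (after - before) / 1e6, (after - before) / 1e6 / 149.0);
    return 0;
}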

Kernel 5.4-rc2 + reversion (this method)
Average processor package power: 9.148 watts (128 samples, > 7 hours)
Minimum: 9.02 watts
Maximum: 9.29 watts
Note: outlier data point group removed, as it was assumed the computer
had something to do and wasn't actually "idle".

Kernel 5.4-rc2:
Average processor package power: 9.969 watts (150 samples, > 8 hours)
That is 9% more energy for the idle phases of the workload.
Minimum: 9.15 watts
Maximum: 13.79 watts (51% more power)

---
kernel/sched/fair.c | 43 +++++++++----------------------------------
1 file changed, 9 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83ab35e..51625b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -381,10 +381,9 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
}

-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
- list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
- leaf_cfs_rq_list)
+/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)

/* Do the two (enqueued) entities belong to the same group ? */
static inline struct cfs_rq *
@@ -481,8 +480,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
{
}

-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
- for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)

static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
@@ -7502,27 +7501,10 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {

#ifdef CONFIG_FAIR_GROUP_SCHED

-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
- if (cfs_rq->load.weight)
- return false;
-
- if (cfs_rq->avg.load_sum)
- return false;
-
- if (cfs_rq->avg.util_sum)
- return false;
-
- if (cfs_rq->avg.runnable_load_sum)
- return false;
-
- return true;
-}
-
static void update_blocked_averages(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- struct cfs_rq *cfs_rq, *pos;
+ struct cfs_rq *cfs_rq;
const struct sched_class *curr_class;
struct rq_flags rf;
bool done = true;
@@ -7534,7 +7516,7 @@ static void update_blocked_averages(int cpu)
* Iterates the task_group tree in a bottom up fashion, see
* list_add_leaf_cfs_rq() for details.
*/
- for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
struct sched_entity *se;

if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
@@ -7545,13 +7527,6 @@ static void update_blocked_averages(int cpu)
if (se && !skip_blocked_update(se))
update_load_avg(cfs_rq_of(se), se, 0);

- /*
- * There can be a lot of idle CPU cgroups. Don't let fully
- * decayed cfs_rqs linger on the list.
- */
- if (cfs_rq_is_decayed(cfs_rq))
- list_del_leaf_cfs_rq(cfs_rq);
-
/* Don't need periodic decay once load/util_avg are null */
if (cfs_rq_has_blocked(cfs_rq))
done = false;
@@ -10444,10 +10419,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SCHED_DEBUG
void print_cfs_stats(struct seq_file *m, int cpu)
{
- struct cfs_rq *cfs_rq, *pos;
+ struct cfs_rq *cfs_rq;

rcu_read_lock();
- for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+ for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
rcu_read_unlock();
}
--
2.7.4


2019-10-25 20:45:54

by Vincent Guittot

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

Hi Doug,


On Fri, 25 Oct 2019 at 17:55, Doug Smythies <[email protected]> wrote:
>
> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
> which, in turn, was a re-apply of
> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
> after it was reverted via
> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
>
> For an idle system, the cfs_rq_is_decayed function components can underflow to 0 and
> incorrectly return TRUE, when the item should not be deleted from the list.

The patch from Rik solves the problem of cfs_rq_is_decayed() wrongly returning true:
https://lore.kernel.org/lkml/[email protected]/

>
> Signed-off-by: Doug Smythies <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Peter Zijlstra (Intel) <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Srinivas Pandruvada <[email protected]>
>
> ---
> Note 1: Both this reversion and just deleting the cfs_rq_is_decayed function
> and it's call and leaving the other changes have been tested. I do not know
> which solution is better. (ie for the "list_for_each_entry_rcu" part of it.)
>
> Note 2: Previous controversy over this patch was based on heavy workloads,
> but this is based on minimal or no workload, or "idle".
> Where "idle" on my test server, with no gui and many services disabled,
> tends to mean more "idle" than most systems.
>
> Note 3: While this supporting data only involves the intel_pstate CPU
> frequency scaling driver as a casualty, it is beyond my capabilities
> to determine what other tasks that should be running might be omitted.
>
> Use case example 1:
> System Idle: The intel pstate CPU frequency scaling driver:
> Mode: Active, non-hwp, powersave governor.
> Expected behaviour: There is never ever a duration (time between calls to
> the driver / per CPU) longer than 4 seconds (the watchdog time, I think).
> Actual behaviour: There are long long gaps between calls to the driver:
>
> Kernel: 5.4-rc2 CPU:7
> duration: 327.17 Seconds. (this is one of many hundreds of examples.)
> mpref: 44023326
> apref: 20716861
> tsc: 1.11604E+12
> load: 0
> CPU frequency: 1.6053 GHz (average over this 327 second sample period).
> old pstate: 16 (the lowest for my processor)
> new pstate: 16
>
> Kernel: 5.4-rc2 + reversion (either method)
> After several hours of testing, maximum durations were never more
> than 4 seconds (well plus some jitter).
> reversion method: max=4.07908 seconds
> CPU:7
> mperf: 492578
> apref: 231813 (56,829 per second average is consistent with other tests)
> tsc: 13914264074
> load: 0
> CPU frequency: 1.6052 GHz
> old pstate: 16 (the lowest for my precessor)
> new pstate: 16
>
> On average, the non-reverted kernel executes the driver 25% less
> than the reverted kernel during idle.
>
> O.K. so who cares, the requested pstate doesn't change?
> First, one wonders if the math could overflow.
> (although 7180ddd suggests maybe it won't)
> Second, the sample is largely dominated by obsolete information.
> Third, this can be problematic, and potentially wastes energy,
> for the busy to idle transition.
>
> Use case example 2:
> The busy to idle transition:
>
> Typically, the pstate request response to a busy to idle transition
> is very slow because the duration suddenly goes from, typically,
> 10 milliseconds to much much longer, up to 4 seconds. Transition
> times to the system being fully idle, with all requested pstates
> being at minimum, takes around 8 seconds with this reversion,
> and, potentially, a very very long time (over 100 seconds has been
> measured) without.
>
> Again, so who cares, if the processor is in a deep idle state anyway,
> not consuming much energy? O.K. but what if it is in an idle state
> where energy consumption is a function of the requested pstate?
> For example, for my processor (i7-2600K), idle state 1, then processor
> package energy can be over double what it should be for many 10s of
> seconds.
>
> Experiment method:
>
> enable only idle state 1
> Dountil stopped
> apply a 100% load (all CPUs)
> after awhile (about 50 seconds) remove the load.
> allow a short transient delay (1 second).
> measure the processor package joules used over the next 149 seconds.
> Enduntil
>
> Kernel k5.4-rc2 + reversion (this method)
> Average processor package power: 9.148 watts (128 samples, > 7 hours)
> Minimum: 9.02 watts
> Maximum: 9.29 watts
> Note: outlyer data point group removed, as it was assumed the computer
> had something to do and wasn't actually "idle".
>
> Kernel 5.4-rc2:
> Average processor package power: 9.969 watts (150 samples, > 8 hours)
> Or 9% more energy for the idle phases of the work load.
> Minimum: 9.15 watts
> Maximum: 13.79 watts (51% more power)
>
> ---
> kernel/sched/fair.c | 43 +++++++++----------------------------------
> 1 file changed, 9 insertions(+), 34 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e..51625b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -381,10 +381,9 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
> }
>
> -/* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> - list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \
> - leaf_cfs_rq_list)
> +/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
> +#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> + list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
>
> /* Do the two (enqueued) entities belong to the same group ? */
> static inline struct cfs_rq *
> @@ -481,8 +480,8 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> {
> }
>
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> - for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
> +#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> + for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
> static inline struct sched_entity *parent_entity(struct sched_entity *se)
> {
> @@ -7502,27 +7501,10 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> -{
> - if (cfs_rq->load.weight)
> - return false;
> -
> - if (cfs_rq->avg.load_sum)
> - return false;
> -
> - if (cfs_rq->avg.util_sum)
> - return false;
> -
> - if (cfs_rq->avg.runnable_load_sum)
> - return false;
> -
> - return true;
> -}
> -
> static void update_blocked_averages(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> - struct cfs_rq *cfs_rq, *pos;
> + struct cfs_rq *cfs_rq;
> const struct sched_class *curr_class;
> struct rq_flags rf;
> bool done = true;
> @@ -7534,7 +7516,7 @@ static void update_blocked_averages(int cpu)
> * Iterates the task_group tree in a bottom up fashion, see
> * list_add_leaf_cfs_rq() for details.
> */
> - for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> + for_each_leaf_cfs_rq(rq, cfs_rq) {
> struct sched_entity *se;
>
> if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> @@ -7545,13 +7527,6 @@ static void update_blocked_averages(int cpu)
> if (se && !skip_blocked_update(se))
> update_load_avg(cfs_rq_of(se), se, 0);
>
> - /*
> - * There can be a lot of idle CPU cgroups. Don't let fully
> - * decayed cfs_rqs linger on the list.
> - */
> - if (cfs_rq_is_decayed(cfs_rq))
> - list_del_leaf_cfs_rq(cfs_rq);
> -
> /* Don't need periodic decay once load/util_avg are null */
> if (cfs_rq_has_blocked(cfs_rq))
> done = false;
> @@ -10444,10 +10419,10 @@ const struct sched_class fair_sched_class = {
> #ifdef CONFIG_SCHED_DEBUG
> void print_cfs_stats(struct seq_file *m, int cpu)
> {
> - struct cfs_rq *cfs_rq, *pos;
> + struct cfs_rq *cfs_rq;
>
> rcu_read_lock();
> - for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
> + for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> print_cfs_rq(m, cpu, cfs_rq);
> rcu_read_unlock();
> }
> --
> 2.7.4
>

2019-10-26 07:01:55

by Doug Smythies

Subject: RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

Hi Vincent,

Thank you for your quick reply.

On 2019.10.25 09:51 Vincent Guittot wrote:
> On Fri, 25 Oct 2019 at 17:55, Doug Smythies <[email protected]> wrote:
>>
>> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
>> which, in turn, was a re-apply of
>> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
>> after it was reverted via
>> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
>>
>> For an idle system, the cfs_rq_is_decayed function components can underflow to 0 and
>> incorrectly return TRUE, when the item should not be deleted from the list.
>
> The patch from Rik solves the problem of cfs_rq_is_decayed wrongly returns true
> https://lore.kernel.org/lkml/[email protected]/

Not for my use case.

I applied Rik's patch to kernel 5.4-rc2 (since all my other reference
test data had been acquired against a base of 5.4-rc2). Test results
are similar to the non-reverted kernel (results added in-line
below).

>>
>> Signed-off-by: Doug Smythies <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Cc: Peter Zijlstra (Intel) <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Srinivas Pandruvada <[email protected]>
>>
>> ---
>> Note 1: Both this reversion and just deleting the cfs_rq_is_decayed function
>> and it's call and leaving the other changes have been tested. I do not know
>> which solution is better. (ie for the "list_for_each_entry_rcu" part of it.)
>>
>> Note 2: Previous controversy over this patch was based on heavy workloads,
>> but this is based on minimal or no workload, or "idle".
>> Where "idle" on my test server, with no gui and many services disabled,
>> tends to mean more "idle" than most systems.
>>
>> Note 3: While this supporting data only involves the intel_pstate CPU
>> frequency scaling driver as a casualty, it is beyond my capabilities
>> to determine what other tasks that should be running might be omitted.
>>
>> Use case example 1:
>> System Idle: The intel pstate CPU frequency scaling driver:
>> Mode: Active, non-hwp, powersave governor.
>> Expected behaviour: There is never ever a duration (time between calls to
>> the driver / per CPU) longer than 4 seconds (the watchdog time, I think).
>> Actual behaviour: There are long long gaps between calls to the driver:
>>
>> Kernel: 5.4-rc2 CPU:7
>> duration: 327.17 Seconds. (this is one of many hundreds of examples.)
>> mpref: 44023326
>> apref: 20716861
>> tsc: 1.11604E+12
>> load: 0
>> CPU frequency: 1.6053 GHz (average over this 327 second sample period).
>> old pstate: 16 (the lowest for my processor)
>> new pstate: 16
>>
>> Kernel: 5.4-rc2 + reversion (either method)
>> After several hours of testing, maximum durations were never more
>> than 4 seconds (well plus some jitter).
>> reversion method: max=4.07908 seconds
>> CPU:7
>> mperf: 492578
>> apref: 231813 (56,829 per second average is consistent with other tests)
>> tsc: 13914264074
>> load: 0
>> CPU frequency: 1.6052 GHz
>> old pstate: 16 (the lowest for my precessor)
>> new pstate: 16
>>
>> On average, the non-reverted kernel executes the driver 25% less
>> than the reverted kernel during idle.

On a (shorter) average, the Rik-patched kernel executes the driver
14% less often than the reverted kernel during idle.

Longer and repeated testing would be required to determine if
this is a trend or simply non-repeatable noise.

>> O.K. so who cares, the requested pstate doesn't change?
>> First, one wonders if the math could overflow.
>> (although 7180ddd suggests maybe it won't)
>> Second, the sample is largely dominated by obsolete information.
>> Third, this can be problematic, and potentially wastes energy,
>> for the busy to idle transition.
>>
>> Use case example 2:
>> The busy to idle transition:
>>
>> Typically, the pstate request response to a busy to idle transition
>> is very slow because the duration suddenly goes from, typically,
>> 10 milliseconds to much much longer, up to 4 seconds. Transition
>> times to the system being fully idle, with all requested pstates
>> being at minimum, takes around 8 seconds with this reversion,
>> and, potentially, a very very long time (over 100 seconds has been
>> measured) without.
>>
>> Again, so who cares, if the processor is in a deep idle state anyway,
>> not consuming much energy? O.K. but what if it is in an idle state
>> where energy consumption is a function of the requested pstate?
>> For example, for my processor (i7-2600K), idle state 1, then processor
>> package energy can be over double what it should be for many 10s of
>> seconds.
>>
>> Experiment method:
>>
>> enable only idle state 1
>> Dountil stopped
>> apply a 100% load (all CPUs)
>> after awhile (about 50 seconds) remove the load.
>> allow a short transient delay (1 second).
>> measure the processor package joules used over the next 149 seconds.
>> Enduntil
>>
>> Kernel k5.4-rc2 + reversion (this method)
>> Average processor package power: 9.148 watts (128 samples, > 7 hours)
>> Minimum: 9.02 watts
>> Maximum: 9.29 watts
>> Note: outlyer data point group removed, as it was assumed the computer
>> had something to do and wasn't actually "idle".
>>
>> Kernel 5.4-rc2:
>> Average processor package power: 9.969 watts (150 samples, > 8 hours)
>> Or 9% more energy for the idle phases of the work load.
>> Minimum: 9.15 watts
>> Maximum: 13.79 watts (51% more power)

Kernel 5.4-rc2 + Rik-patch:
Average processor package power: 9.85 watts (53 samples, < 3 hours)
That is 7.7% more energy for the idle phases of the workload.
Minimum: 9.23 watts
Maximum: 12.79 watts (40% more power)



2019-10-28 18:25:24

by Vincent Guittot

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Sat, 26 Oct 2019 at 08:59, Doug Smythies <[email protected]> wrote:
>
> Hi Vincent,
>
> Thank you for your quick reply.
>
> On 2010.10.25 09:51 Vincent Guittot wrote:
> > On Fri, 25 Oct 2019 at 17:55, Doug Smythies <[email protected]> wrote:
> >>
> >> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
> >> which, in turn, was a re-apply of
> >> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
> >> after it was reverted via
> >> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
> >>
> >> For an idle system, the cfs_rq_is_decayed function components can underflow to 0 and
> >> incorrectly return TRUE, when the item should not be deleted from the list.
> >
> > The patch from Rik solves the problem of cfs_rq_is_decayed wrongly returns true
> > https://lore.kernel.org/lkml/[email protected]/
>
> Not for my use case.
>
> I applied Rik's patch to kernel 5.4-rc2 (since all my other reference
> test data had been acquired against a base of 5.4-rc2). Tests results
> are similar to the non-reverted kernel (results added in-line
> below).

Ok.

I have studied your results below a bit more, and IIUC your problem,
some periodic wake-ups (every 4 sec) are missing with kernel 5.4-rc2,
wake-ups that help cpuidle enter a deeper idle state after each new
wake-up until reaching the deepest state, is that it?
My 1st point is that this code doesn't use a timer or hrtimer to wake up
the system, but only takes advantage of the wake-up of something else to
update the blocked load. So I don't see how this patch could remove
the 4 sec periodic wake-up of the watchdog timer that you are
mentioning.
Then, when a system is idle and not used, the load should obviously be
null most of the time and the update of decayed load should not happen
anyway. It looks like you take advantage of some spurious and
unnecessary wake-ups to help cpuidle reach a deeper idle state. Is
this understanding correct?

Then, with or without removing the cfs_rq from the list, we should
end up with rq->has_blocked_load == 0 and nohz.has_blocked == 0 too.
The main impact will be the duration of the loop, which can be
significantly shorter when you have a lot of rqs and cgroups.

>
> >>
> >> Signed-off-by: Doug Smythies <[email protected]>
> >> Cc: Vincent Guittot <[email protected]>
> >> Cc: Peter Zijlstra (Intel) <[email protected]>
> >> Cc: Ingo Molnar <[email protected]>
> >> Cc: Linus Torvalds <[email protected]>
> >> Cc: Peter Zijlstra <[email protected]>
> >> Cc: Thomas Gleixner <[email protected]>
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: Srinivas Pandruvada <[email protected]>
> >>
> >> ---
> >> Note 1: Both this reversion and just deleting the cfs_rq_is_decayed function
> >> and it's call and leaving the other changes have been tested. I do not know
> >> which solution is better. (ie for the "list_for_each_entry_rcu" part of it.)
> >>
> >> Note 2: Previous controversy over this patch was based on heavy workloads,
> >> but this is based on minimal or no workload, or "idle".
> >> Where "idle" on my test server, with no gui and many services disabled,
> >> tends to mean more "idle" than most systems.
> >>
> >> Note 3: While this supporting data only involves the intel_pstate CPU
> >> frequency scaling driver as a casualty, it is beyond my capabilities
> >> to determine what other tasks that should be running might be omitted.
> >>
> >> Use case example 1:
> >> System Idle: The intel pstate CPU frequency scaling driver:
> >> Mode: Active, non-hwp, powersave governor.
> >> Expected behaviour: There is never ever a duration (time between calls to
> >> the driver / per CPU) longer than 4 seconds (the watchdog time, I think).
> >> Actual behaviour: There are long long gaps between calls to the driver:
> >>
> >> Kernel: 5.4-rc2 CPU:7
> >> duration: 327.17 Seconds. (this is one of many hundreds of examples.)
> >> mpref: 44023326
> >> apref: 20716861
> >> tsc: 1.11604E+12
> >> load: 0
> >> CPU frequency: 1.6053 GHz (average over this 327 second sample period).
> >> old pstate: 16 (the lowest for my processor)
> >> new pstate: 16
> >>
> >> Kernel: 5.4-rc2 + reversion (either method)
> >> After several hours of testing, maximum durations were never more
> >> than 4 seconds (well plus some jitter).
> >> reversion method: max=4.07908 seconds
> >> CPU:7
> >> mperf: 492578
> >> apref: 231813 (56,829 per second average is consistent with other tests)
> >> tsc: 13914264074
> >> load: 0
> >> CPU frequency: 1.6052 GHz
> >> old pstate: 16 (the lowest for my precessor)
> >> new pstate: 16
> >>
> >> On average, the non-reverted kernel executes the driver 25% less
> >> than the reverted kernel during idle.
>
> On (shorter)average, the Rik patched kernel executes the driver
> 14% less than the reverted kernel during idle.
>
> Longer and repeated testing would be required to determine if
> this is a trend or simply non-repeatable noise.
>
> >> O.K. so who cares, the requested pstate doesn't change?
> >> First, one wonders if the math could overflow.
> >> (although 7180ddd suggests maybe it won't)
> >> Second, the sample is largely dominated by obsolete information.
> >> Third, this can be problematic, and potentially wastes energy,
> >> for the busy to idle transition.
> >>
> >> Use case example 2:
> >> The busy to idle transition:
> >>
> >> Typically, the pstate request response to a busy to idle transition
> >> is very slow because the duration suddenly goes from, typically,
> >> 10 milliseconds to much much longer, up to 4 seconds. Transition
> >> times to the system being fully idle, with all requested pstates
> >> being at minimum, takes around 8 seconds with this reversion,
> >> and, potentially, a very very long time (over 100 seconds has been
> >> measured) without.
> >>
> >> Again, so who cares, if the processor is in a deep idle state anyway,
> >> not consuming much energy? O.K. but what if it is in an idle state
> >> where energy consumption is a function of the requested pstate?
> >> For example, for my processor (i7-2600K), idle state 1, then processor
> >> package energy can be over double what it should be for many 10s of
> >> seconds.
> >>
> >> Experiment method:
> >>
> >> enable only idle state 1
> >> Dountil stopped
> >> apply a 100% load (all CPUs)
> >> after awhile (about 50 seconds) remove the load.
> >> allow a short transient delay (1 second).
> >> measure the processor package joules used over the next 149 seconds.
> >> Enduntil
> >>
> >> Kernel k5.4-rc2 + reversion (this method)
> >> Average processor package power: 9.148 watts (128 samples, > 7 hours)
> >> Minimum: 9.02 watts
> >> Maximum: 9.29 watts
> >> Note: outlyer data point group removed, as it was assumed the computer
> >> had something to do and wasn't actually "idle".
> >>
> >> Kernel 5.4-rc2:
> >> Average processor package power: 9.969 watts (150 samples, > 8 hours)
> >> Or 9% more energy for the idle phases of the work load.
> >> Minimum: 9.15 watts
> >> Maximum: 13.79 watts (51% more power)
>
> Kernel 5.4-rc2 + Rik-patch:
> Average processor package power: 9.85 watts (53 samples, < 3 hours)
> Or 7.7% more energy for the idle phases of the work load.
> Minimum: 9.23 watts
> Maximum: 12.79 watts (40% more power)
>
>
>

2019-10-29 15:41:08

by Peter Zijlstra

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Tue, Oct 29, 2019 at 07:55:26AM -0700, Doug Smythies wrote:

> I only know that the call to the intel_pstate driver doesn't
> happen, and that it is because cfs_rq_is_decayed returns TRUE.
> So, I am asserting that the request is not actually decayed, and
> should not have been deleted.

So what cfs_rq_is_decayed() does is allow a cgroup's cfs_rq to be
removed from the list.

Once it is removed, that cfs_rq will no longer be checked in the
update_blocked_averages() loop. Which means done has less chance of
getting false. Which in turn means that it's more likely
rq->has_blocked_load becomes 0.

Which all sounds good.
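
For reference, a stand-alone toy model of that chain (not kernel code,
just the list/flag logic): cfs_rqs removed from the leaf list are never
visited again, so only the remaining entries can keep "done" false and
hence keep rq->has_blocked_load set.

#include <stdbool.h>
#include <stdio.h>

/* toy stand-ins for the relevant cfs_rq state */
struct toy_cfs_rq {
    bool on_list;      /* still on the leaf_cfs_rq list? */
    bool has_blocked;  /* still carrying blocked load? */
};

int main(void)
{
    struct toy_cfs_rq rqs[] = {
        { .on_list = false, .has_blocked = false },  /* decayed and removed */
        { .on_list = true,  .has_blocked = true  },  /* still decaying */
        { .on_list = true,  .has_blocked = false },  /* fully decayed, still listed */
    };
    bool done = true;

    for (unsigned int i = 0; i < sizeof(rqs) / sizeof(rqs[0]); i++) {
        if (!rqs[i].on_list)
            continue;          /* removed: never checked again */
        if (rqs[i].has_blocked)
            done = false;      /* keeps periodic updates going */
    }

    /* in the kernel, "done" being true is what lets has_blocked_load clear */
    printf("rq->has_blocked_load ends up %d\n", done ? 0 : 1);
    return 0;
}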

Can you please trace what keeps the CPU awake?

> Now, if we also look back at the comments for the original commit:
>
> "In an edge case where temporary cgroups were leaking, this
> caused the kernel to consume good several tens of percents of
> CPU cycles running update_blocked_averages(), each run taking
> multiple millisecs."
>
> To my way of thinking: Fix the leak, don't program around it; The
> commit breaks something else, so revert it.

The leak was fixed, but it still doesn't make sense to keep idle cgroups
on that list. Some people have a stupid amount of cgroups, most of which
are pointless and unused, so being able to remove them is good.

Which is why it got added back, once list management issues were sorted.

2019-10-29 16:08:10

by Vincent Guittot

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Tuesday 29 Oct 2019 at 07:55:26 (-0700), Doug Smythies wrote:
> On 2019.10.28 01:22 Vincent Guittot wrote:
> > On Sat, 26 Oct 2019 at 08:59, Doug Smythies <[email protected]> wrote:
> >> On 2010.10.25 09:51 Vincent Guittot wrote:
> >>> On Fri, 25 Oct 2019 at 17:55, Doug Smythies <[email protected]> wrote:
> >>>>
> >>>> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
> >>>> which, in turn, was a re-apply of
> >>>> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
> >>>> after it was reverted via
> >>>> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
> >>>>
> >>>> For an idle system, the cfs_rq_is_decayed function components can underflow to 0 and
> >>>> incorrectly return TRUE, when the item should not be deleted from the list.
> >>>
> >>> The patch from Rik solves the problem of cfs_rq_is_decayed wrongly returns true
> >>> https://lore.kernel.org/lkml/[email protected]/
> >>
> >> Not for my use case.
> >>
> >> I applied Rik's patch to kernel 5.4-rc2 (since all my other reference
> >> test data had been acquired against a base of 5.4-rc2). Tests results
> >> are similar to the non-reverted kernel (results added in-line
> >> below).
> >
> > Ok.
> >
> > I have studied a bit more your results below and IIUC your problem,
> > some periodic wakes up (every 4sec) are missing with kernel 5.4-rc2
>
> Actually, I don't know that the periodic wake ups are missing, I only
> know that the intel_pstate CPU scaling driver is not being called.
> This has been since kernel 5.1-rc1.
> I bisected the kernel and found the patch that is this subject.
> Then I used kernel 5.4-rc2 as my baseline for the data submitted.
>
> Looking at the number of clocks cycles that have being used since the last
> call to the driver suggests that the CPU has been doing something between
> the long time between calls:
> For the example given, there were 20,716,861 active clocks in 327.175
> seconds. At 4 seconds per doing something that's 253,286 clocks each,
> which is consistent (o.k. the variability is high) with actual data
> (and, for example, see the aperf number of 231,813 clocks for the
> 4 second example given below).
>
> > that helps cpuidle to enters deeper idle state after each new wake up
> > until reaching the deepest state, isn't it ?
>
> Well, it is the delay in the intel_pstate driver calls that is the
> root issue. Dragging idle into it was just the use case example that
> started this investigation.

Ok, I misunderstood your explanation.
Your point is that cfs_rq_util_change() is not called anymore and, as a
result, neither is the intel_pstate driver.

Could you try the patch below? It ensures that at least the root cfs_rq stays
in the list, so each time update_blocked_averages() is called we will call
update_cfs_rq_load_avg() for the root cfs_rq, even if everything has already
reached zero.
This will ensure that cfs_rq_util_change() is called even if nothing has
changed.


---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 151c0b7..ac0a549 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7552,6 +7552,8 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
+ struct rq *rq = rq_of(cfs_rq);
+
if (cfs_rq->load.weight)
return false;

@@ -7564,6 +7566,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
if (cfs_rq->avg.runnable_load_sum)
return false;

+ if (cfs_rq == &rq->cfs)
+ return false;
+
return true;
}

--
2.7.4



>
> > My 1st point is that this code doesn't use timer or hrtimer to wake up
> > the system but only take advantage of the wake up of something else to
> > update the blocked load. So I don't see how this patch could remove
> > the 4sec periodic wakeup of the watchdog timer that you are
> > mentioning.
>
> I don't know that it is, as mentioned above.
>
> > Then, when a system is idle and not used, the load should obviously be
> > null most of the time and the update of decayed load should not happen
> > anyway. It looks like you take advantage of some spurious and
> > un-necessary wake up to help cpuidle to reach deeper idle state. Is
> > this understanding correct ?
>
> I don't know.
>
> I only know that the call to the intel_pstate driver doesn't
> happen, and that it is because cfs_rq_is_decayed returns TRUE.
> So, I am asserting that the request is not actually decayed, and
> should not have been deleted. Furthermore, I am wondering if other
> tasks that should be run are suffering the same fate.
>
> Now, if we also look back at the comments for the original commit:
>
> "In an edge case where temporary cgroups were leaking, this
> caused the kernel to consume good several tens of percents of
> CPU cycles running update_blocked_averages(), each run taking
> multiple millisecs."
>
> To my way of thinking: Fix the leak, don't program around it; The
> commit breaks something else, so revert it.
>
> >
> > Then, without or without removing the cfs_rq from the list, we should
> > end up with rq->has_blocked_load == 0 and nohz.has_blocked == 0 too.
> > The only main impact will be the duration of the loop that can be
> > significantly shorter when you have a lot of rqs and cgroups.
>
> I'm not following.
>
> >
> >>>>
> >>>> Signed-off-by: Doug Smythies <[email protected]>
> >>>> Cc: Vincent Guittot <[email protected]>
> >>>> Cc: Peter Zijlstra (Intel) <[email protected]>
> >>>> Cc: Ingo Molnar <[email protected]>
> >>>> Cc: Linus Torvalds <[email protected]>
> >>>> Cc: Peter Zijlstra <[email protected]>
> >>>> Cc: Thomas Gleixner <[email protected]>
> >>> Cc: [email protected]
> >>>> Cc: [email protected]
> >>>> Cc: [email protected]
> >>>> Cc: [email protected]
> >>>> Cc: Srinivas Pandruvada <[email protected]>
> >>>>
> >>>> ---
> >>>> Note 1: Both this reversion and just deleting the cfs_rq_is_decayed function
> >>>> and it's call and leaving the other changes have been tested. I do not know
> >>>> which solution is better. (ie for the "list_for_each_entry_rcu" part of it.)
> >>>>
> >>>> Note 2: Previous controversy over this patch was based on heavy workloads,
> >>>> but this is based on minimal or no workload, or "idle".
> >>>> Where "idle" on my test server, with no gui and many services disabled,
> >>>> tends to mean more "idle" than most systems.
> >>>>
> >>>> Note 3: While this supporting data only involves the intel_pstate CPU
> >>>> frequency scaling driver as a casualty, it is beyond my capabilities
> >>>> to determine what other tasks that should be running might be omitted.
> >>>>
> >>>> Use case example 1:
> >>>> System Idle: The intel pstate CPU frequency scaling driver:
> >>>> Mode: Active, non-hwp, powersave governor.
> >>>> Expected behaviour: There is never ever a duration (time between calls to
> >>>> the driver / per CPU) longer than 4 seconds (the watchdog time, I think).
> >>>> Actual behaviour: There are long long gaps between calls to the driver:
> >>>>
> >>>> Kernel: 5.4-rc2 CPU:7
> >>>> duration: 327.17 Seconds. (this is one of many hundreds of examples.)
> >>>> mpref: 44023326
> >>>> apref: 20716861
> >>>> tsc: 1.11604E+12
> >>>> load: 0
> >>>> CPU frequency: 1.6053 GHz (average over this 327 second sample period).
> >>>> old pstate: 16 (the lowest for my processor)
> >>>> new pstate: 16
> >>>>
> >>>> Kernel: 5.4-rc2 + reversion (either method)
> >>>> After several hours of testing, maximum durations were never more
> >>>> than 4 seconds (well plus some jitter).
> >>>> reversion method: max=4.07908 seconds
> >>>> CPU:7
> >>>> mperf: 492578
> >>>> apref: 231813 (56,829 per second average is consistent with other tests)
> >>>> tsc: 13914264074
> >>>> load: 0
> >>>> CPU frequency: 1.6052 GHz
> >>>> old pstate: 16 (the lowest for my precessor)
> >>>> new pstate: 16
> >>>>
> >>>> On average, the non-reverted kernel executes the driver 25% less
> >>>> than the reverted kernel during idle.
> >>
> >> On (shorter)average, the Rik patched kernel executes the driver
> >> 14% less than the reverted kernel during idle.
> >>
> >> Longer and repeated testing would be required to determine if
> >> this is a trend or simply non-repeatable noise.
>
> The difference in probabilities of the issue occurring does appear to be
> somewhat consistent. Not sure what it means, if anything.
>
> >>>> O.K. so who cares, the requested pstate doesn't change?
> >>>> First, one wonders if the math could overflow.
> >>>> (although 7180ddd suggests maybe it won't)
> >>>> Second, the sample is largely dominated by obsolete information.
> >>>> Third, this can be problematic, and potentially wastes energy,
> >>>> for the busy to idle transition.
> >>>>
> >>>> Use case example 2:
> >>>> The busy to idle transition:
> >>>>
> >>>> Typically, the pstate request response to a busy to idle transition
> >>>> is very slow because the duration suddenly goes from, typically,
> >>>> 10 milliseconds to much much longer, up to 4 seconds. Transition
> >>>> times to the system being fully idle, with all requested pstates
> >>>> being at minimum, takes around 8 seconds with this reversion,
> >>>> and, potentially, a very very long time (over 100 seconds has been
> >>>> measured) without.
> >>>>
> >>>> Again, so who cares, if the processor is in a deep idle state anyway,
> >>>> not consuming much energy? O.K. but what if it is in an idle state
> >>>> where energy consumption is a function of the requested pstate?
> >>>> For example, for my processor (i7-2600K), idle state 1, then processor
> >>>> package energy can be over double what it should be for many 10s of
> >>>> seconds.
> >>>>
> >>>> Experiment method:
> >>>>
> >>>> enable only idle state 1
> >>>> Dountil stopped
> >>>> apply a 100% load (all CPUs)
> >>>> after awhile (about 50 seconds) remove the load.
> >>>> allow a short transient delay (1 second).
> >>>> measure the processor package joules used over the next 149 seconds.
> >>>> Enduntil
> >>>>
> >>>> Kernel k5.4-rc2 + reversion (this method)
> >>>> Average processor package power: 9.148 watts (128 samples, > 7 hours)
> >>>> Minimum: 9.02 watts
> >>>> Maximum: 9.29 watts
> >>>> Note: outlyer data point group removed, as it was assumed the computer
> >>>> had something to do and wasn't actually "idle".
> >>>>
> >>>> Kernel 5.4-rc2:
> >>>> Average processor package power: 9.969 watts (150 samples, > 8 hours)
> >>>> Or 9% more energy for the idle phases of the work load.
> >>>> Minimum: 9.15 watts
> >>>> Maximum: 13.79 watts (51% more power)
> >>
> >> Kernel 5.4-rc2 + Rik-patch:
> >> Average processor package power: 9.85 watts (53 samples, < 3 hours)
> >> Or 7.7% more energy for the idle phases of the work load.
> >> Minimum: 9.23 watts
> >> Maximum: 12.79 watts (40% more power)
>
>

2019-10-29 16:25:01

by Vincent Guittot

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Tue, 29 Oct 2019 at 16:36, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Oct 29, 2019 at 07:55:26AM -0700, Doug Smythies wrote:
>
> > I only know that the call to the intel_pstate driver doesn't
> > happen, and that it is because cfs_rq_is_decayed returns TRUE.
> > So, I am asserting that the request is not actually decayed, and
> > should not have been deleted.
>
> So what cfs_rq_is_decayed() does is allow a cgroup's cfs_rq to be
> removed from the list.
>
> Once it is removed, that cfs_rq will no longer be checked in the
> update_blocked_averages() loop. Which means done has less chance of
> getting false. Which in turn means that it's more likely
> rq->has_blocked_load becomes 0.
>
> Which all sounds good.
>
> Can you please trace what keeps the CPU awake?

I think that the sequence below is what the intel_pstate driver was relying on:

An rt/dl task wakes up and runs for some time.
The rt/dl PELT signal is no longer null, so periodic decay happens.

Before the optimization, update_cfs_rq_load_avg() for the root cfs_rq was
called even if its PELT signal was null,
which calls cfs_rq_util_change(), which calls intel_pstate.

After the optimization, it is no longer called.

The patch that I just sent will check that sequence, but it's more a
hack than a clean fix because
it uses the cfs notification to cpufreq for an update that is not related to cfs.

I will look at a proper solution if the test confirms my assumption.
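
For reference, a stand-alone model of that call chain (stub types and
placeholder bodies, not the real kernel/sched/fair.c code), illustrating
that the cpufreq notification is tied to the update of the root cfs_rq:

#include <stdio.h>

/* stub types standing in for the kernel's struct rq / struct cfs_rq */
struct rq;
struct cfs_rq { struct rq *rq; };
struct rq { struct cfs_rq cfs; };

static void cpufreq_update_util(struct rq *rq, int flags)
{
    /* in the kernel this reaches the registered update_util hook,
     * e.g. intel_pstate in active mode */
    (void)rq;
    (void)flags;
    printf("cpufreq/intel_pstate callback invoked\n");
}

static void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
{
    struct rq *rq = cfs_rq->rq;

    if (cfs_rq == &rq->cfs)   /* only the root cfs_rq notifies cpufreq */
        cpufreq_update_util(rq, flags);
}

static void update_cfs_rq_load_avg(struct cfs_rq *cfs_rq)
{
    /* placeholder: the real function decays the PELT sums, then notifies */
    cfs_rq_util_change(cfs_rq, 0);
}

int main(void)
{
    struct rq rq = { .cfs = { .rq = &rq } };

    /*
     * What update_blocked_averages() used to do for the root cfs_rq on
     * every pass; once the root cfs_rq drops off the leaf list, this call
     * (and therefore the intel_pstate callback) no longer happens.
     */
    update_cfs_rq_load_avg(&rq.cfs);
    return 0;
}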

>
> > Now, if we also look back at the comments for the original commit:
> >
> > "In an edge case where temporary cgroups were leaking, this
> > caused the kernel to consume good several tens of percents of
> > CPU cycles running update_blocked_averages(), each run taking
> > multiple millisecs."
> >
> > To my way of thinking: Fix the leak, don't program around it; The
> > commit breaks something else, so revert it.
>
> The leak was fixed, but it still doesn't make sense to keep idle cgroups
> on that list. Some people have a stupid amount of cgroups, most of which
> are pointless and unused, so being able to remove them is good.
>
> Which is why it got added back, once list management issues were sorted.

2019-10-29 16:53:43

by Peter Zijlstra

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Tue, Oct 29, 2019 at 05:20:56PM +0100, Vincent Guittot wrote:
> On Tue, 29 Oct 2019 at 16:36, Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Oct 29, 2019 at 07:55:26AM -0700, Doug Smythies wrote:
> >
> > > I only know that the call to the intel_pstate driver doesn't
> > > happen, and that it is because cfs_rq_is_decayed returns TRUE.
> > > So, I am asserting that the request is not actually decayed, and
> > > should not have been deleted.
> >
> > So what cfs_rq_is_decayed() does is allow a cgroup's cfs_rq to be
> > removed from the list.
> >
> > Once it is removed, that cfs_rq will no longer be checked in the
> > update_blocked_averages() loop. Which means done has less chance of
> > getting false. Which in turn means that it's more likely
> > rq->has_blocked_load becomes 0.
> >
> > Which all sounds good.
> >
> > Can you please trace what keeps the CPU awake?
>
> I think that the sequence below is what intel pstate driver was using
>
> rt/dl task wakes up and run for some times
> rt/dl pelt signal is no more null so periodic decay happens.
>
> before optimization update_cfs_rq_load_avg() for root cfs_rq was
> called even if pelt was null,
> which calls cfs_rq_util_change, which calls intel pstate
>
> after optimization its no more called.

Not calling cfs_rq_util_change() when it doesn't change seems like the
right thing. Why would intel_pstate want it called when it doesn't
change?

2019-10-29 17:02:38

by Vincent Guittot

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Tue, 29 Oct 2019 at 17:50, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Oct 29, 2019 at 05:20:56PM +0100, Vincent Guittot wrote:
> > On Tue, 29 Oct 2019 at 16:36, Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Tue, Oct 29, 2019 at 07:55:26AM -0700, Doug Smythies wrote:
> > >
> > > > I only know that the call to the intel_pstate driver doesn't
> > > > happen, and that it is because cfs_rq_is_decayed returns TRUE.
> > > > So, I am asserting that the request is not actually decayed, and
> > > > should not have been deleted.
> > >
> > > So what cfs_rq_is_decayed() does is allow a cgroup's cfs_rq to be
> > > removed from the list.
> > >
> > > Once it is removed, that cfs_rq will no longer be checked in the
> > > update_blocked_averages() loop. Which means done has less chance of
> > > getting false. Which in turn means that it's more likely
> > > rq->has_blocked_load becomes 0.
> > >
> > > Which all sounds good.
> > >
> > > Can you please trace what keeps the CPU awake?
> >
> > I think that the sequence below is what intel pstate driver was using
> >
> > rt/dl task wakes up and run for some times
> > rt/dl pelt signal is no more null so periodic decay happens.
> >
> > before optimization update_cfs_rq_load_avg() for root cfs_rq was
> > called even if pelt was null,
> > which calls cfs_rq_util_change, which calls intel pstate
> >
> > after optimization its no more called.
>
> Not calling cfs_rq_util_change() when it doesn't change, seems like the
> right thing. Why would intel_pstate want it called when it doesn't
> change?

Yes, I agree.

My original thought was that the irq/rt or dl PELT signals were used
to set the frequency, and that it needed to be called to decrease the
frequency while those signals were decaying, but it doesn't seem to use
them; it only needs to be called from time to time.

2019-10-29 17:11:11

by Vincent Guittot

Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Tue, 29 Oct 2019 at 18:00, Vincent Guittot
<[email protected]> wrote:
>
> On Tue, 29 Oct 2019 at 17:50, Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Oct 29, 2019 at 05:20:56PM +0100, Vincent Guittot wrote:
> > > On Tue, 29 Oct 2019 at 16:36, Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > On Tue, Oct 29, 2019 at 07:55:26AM -0700, Doug Smythies wrote:
> > > >
> > > > > I only know that the call to the intel_pstate driver doesn't
> > > > > happen, and that it is because cfs_rq_is_decayed returns TRUE.
> > > > > So, I am asserting that the request is not actually decayed, and
> > > > > should not have been deleted.
> > > >
> > > > So what cfs_rq_is_decayed() does is allow a cgroup's cfs_rq to be
> > > > removed from the list.
> > > >
> > > > Once it is removed, that cfs_rq will no longer be checked in the
> > > > update_blocked_averages() loop. Which means done has less chance of
> > > > getting false. Which in turn means that it's more likely
> > > > rq->has_blocked_load becomes 0.
> > > >
> > > > Which all sounds good.
> > > >
> > > > Can you please trace what keeps the CPU awake?
> > >
> > > I think that the sequence below is what intel pstate driver was using
> > >
> > > rt/dl task wakes up and run for some times
> > > rt/dl pelt signal is no more null so periodic decay happens.
> > >
> > > before optimization update_cfs_rq_load_avg() for root cfs_rq was
> > > called even if pelt was null,
> > > which calls cfs_rq_util_change, which calls intel pstate
> > >
> > > after optimization its no more called.
> >
> > Not calling cfs_rq_util_change() when it doesn't change, seems like the
> > right thing. Why would intel_pstate want it called when it doesn't
> > change?
>
> Yes I agree
>
> My original thought was that either irq/rt ordl pelt signals was used
> to set frequency and it needs to be called to decrease this freq while
> pelt signals was decaying but it doesn't seem to use it but only needs
> to be called from time to time

Apart from Doug's problem, we have 2 possible problems with the
current update_blocked_averages():
1- irq, dl and rt are updated after cfs, but it is the cfs update that
calls schedutil to update the frequency, which means that this is done
with old irq/rt/dl values. We should change the order and start with
irq/rt and dl.
2- when cfs is null but irq/rt or dl is not, we decay the values but we
never call schedutil to update the frequency accordingly. The impact is
probably minimal because only irq and timers can really run without
calling schedutil to update the frequency, but this can happen.

I'm going to prepare some patches.

2019-10-29 19:43:54

by Doug Smythies

Subject: RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On 2019.10.28 01:22 Vincent Guittot wrote:
> On Sat, 26 Oct 2019 at 08:59, Doug Smythies <[email protected]> wrote:
>> On 2010.10.25 09:51 Vincent Guittot wrote:
>>> On Fri, 25 Oct 2019 at 17:55, Doug Smythies <[email protected]> wrote:
>>>>
>>>> This reverts commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617,
>>>> which, in turn, was a re-apply of
>>>> commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
>>>> after it was reverted via
>>>> commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
>>>>
>>>> For an idle system, the cfs_rq_is_decayed function components can underflow to 0 and
>>>> incorrectly return TRUE, when the item should not be deleted from the list.
>>>
>>> The patch from Rik solves the problem of cfs_rq_is_decayed wrongly returns true
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Not for my use case.
>>
>> I applied Rik's patch to kernel 5.4-rc2 (since all my other reference
>> test data had been acquired against a base of 5.4-rc2). Tests results
>> are similar to the non-reverted kernel (results added in-line
>> below).
>
> Ok.
>
> I have studied a bit more your results below and IIUC your problem,
> some periodic wakes up (every 4sec) are missing with kernel 5.4-rc2

Actually, I don't know that the periodic wake-ups are missing; I only
know that the intel_pstate CPU scaling driver is not being called.
This has been the case since kernel 5.1-rc1.
I bisected the kernel and found the patch that is the subject of this thread.
Then I used kernel 5.4-rc2 as my baseline for the data submitted.

Looking at the number of clock cycles that have been used since the last
call to the driver suggests that the CPU has been doing something during
the long gaps between calls:
For the example given, there were 20,716,861 active clocks in 327.175
seconds. At one "doing something" per 4 seconds, that's 253,286 clocks each,
which is consistent (o.k., the variability is high) with actual data
(and, for example, see the aperf number of 231,813 clocks for the
4 second example given below).
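
As a quick stand-alone check of that arithmetic (numbers copied from the
example above):

#include <stdio.h>

int main(void)
{
    double aperf = 20716861.0;   /* active clocks over the long duration */
    double duration = 327.175;   /* seconds */

    /* one "doing something" per ~4 seconds -> clocks per occurrence */
    printf("~%.0f clocks per 4 second interval\n", aperf / (duration / 4.0));
    return 0;
}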

> that helps cpuidle to enters deeper idle state after each new wake up
> until reaching the deepest state, isn't it ?

Well, it is the delay in the intel_pstate driver calls that is the
root issue. Dragging idle into it was just the use case example that
started this investigation.

> My 1st point is that this code doesn't use timer or hrtimer to wake up
> the system but only take advantage of the wake up of something else to
> update the blocked load. So I don't see how this patch could remove
> the 4sec periodic wakeup of the watchdog timer that you are
> mentioning.

I don't know that it is, as mentioned above.

> Then, when a system is idle and not used, the load should obviously be
> null most of the time and the update of decayed load should not happen
> anyway. It looks like you take advantage of some spurious and
> un-necessary wake up to help cpuidle to reach deeper idle state. Is
> this understanding correct ?

I don't know.

I only know that the call to the intel_pstate driver doesn't
happen, and that it is because cfs_rq_is_decayed returns TRUE.
So, I am asserting that the request is not actually decayed, and
should not have been deleted. Furthermore, I am wondering if other
tasks that should be run are suffering the same fate.

Now, if we also look back at the comments for the original commit:

"In an edge case where temporary cgroups were leaking, this
caused the kernel to consume good several tens of percents of
CPU cycles running update_blocked_averages(), each run taking
multiple millisecs."

To my way of thinking: Fix the leak, don't program around it; The
commit breaks something else, so revert it.

>
> Then, without or without removing the cfs_rq from the list, we should
> end up with rq->has_blocked_load == 0 and nohz.has_blocked == 0 too.
> The only main impact will be the duration of the loop that can be
> significantly shorter when you have a lot of rqs and cgroups.

I'm not following.

>
>>>>
>>>> Signed-off-by: Doug Smythies <[email protected]>
>>>> Cc: Vincent Guittot <[email protected]>
>>>> Cc: Peter Zijlstra (Intel) <[email protected]>
>>>> Cc: Ingo Molnar <[email protected]>
>>>> Cc: Linus Torvalds <[email protected]>
>>>> Cc: Peter Zijlstra <[email protected]>
>>>> Cc: Thomas Gleixner <[email protected]>
>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: Srinivas Pandruvada <[email protected]>
>>>>
>>>> ---
>>>> Note 1: Both this reversion and just deleting the cfs_rq_is_decayed function
>>>> and it's call and leaving the other changes have been tested. I do not know
>>>> which solution is better. (ie for the "list_for_each_entry_rcu" part of it.)
>>>>
>>>> Note 2: Previous controversy over this patch was based on heavy workloads,
>>>> but this is based on minimal or no workload, or "idle".
>>>> Where "idle" on my test server, with no gui and many services disabled,
>>>> tends to mean more "idle" than most systems.
>>>>
>>>> Note 3: While this supporting data only involves the intel_pstate CPU
>>>> frequency scaling driver as a casualty, it is beyond my capabilities
>>>> to determine what other tasks that should be running might be omitted.
>>>>
>>>> Use case example 1:
>>>> System Idle: The intel pstate CPU frequency scaling driver:
>>>> Mode: Active, non-hwp, powersave governor.
>>>> Expected behaviour: There is never ever a duration (time between calls to
>>>> the driver / per CPU) longer than 4 seconds (the watchdog time, I think).
>>>> Actual behaviour: There are long long gaps between calls to the driver:
>>>>
>>>> Kernel: 5.4-rc2 CPU:7
>>>> duration: 327.17 Seconds. (this is one of many hundreds of examples.)
>>>> mpref: 44023326
>>>> apref: 20716861
>>>> tsc: 1.11604E+12
>>>> load: 0
>>>> CPU frequency: 1.6053 GHz (average over this 327 second sample period).
>>>> old pstate: 16 (the lowest for my processor)
>>>> new pstate: 16
>>>>
>>>> Kernel: 5.4-rc2 + reversion (either method)
>>>> After several hours of testing, maximum durations were never more
>>>> than 4 seconds (well plus some jitter).
>>>> reversion method: max=4.07908 seconds
>>>> CPU:7
>>>> mperf: 492578
>>>> apref: 231813 (56,829 per second average is consistent with other tests)
>>>> tsc: 13914264074
>>>> load: 0
>>>> CPU frequency: 1.6052 GHz
>>>> old pstate: 16 (the lowest for my precessor)
>>>> new pstate: 16
>>>>
>>>> On average, the non-reverted kernel executes the driver 25% less
>>>> than the reverted kernel during idle.
>>
>> On (shorter)average, the Rik patched kernel executes the driver
>> 14% less than the reverted kernel during idle.
>>
>> Longer and repeated testing would be required to determine if
>> this is a trend or simply non-repeatable noise.

The difference in probabilities of the issue occurring does appear to be
somewhat consistent. Not sure what it means, if anything.

>>>> O.K. so who cares, the requested pstate doesn't change?
>>>> First, one wonders if the math could overflow.
>>>> (although 7180ddd suggests maybe it won't)
>>>> Second, the sample is largely dominated by obsolete information.
>>>> Third, this can be problematic, and potentially wastes energy,
>>>> for the busy to idle transition.
>>>>
>>>> Use case example 2:
>>>> The busy to idle transition:
>>>>
>>>> Typically, the pstate request response to a busy to idle transition
>>>> is very slow because the duration suddenly goes from, typically,
>>>> 10 milliseconds to much much longer, up to 4 seconds. Transition
>>>> times to the system being fully idle, with all requested pstates
>>>> being at minimum, takes around 8 seconds with this reversion,
>>>> and, potentially, a very very long time (over 100 seconds has been
>>>> measured) without.
>>>>
>>>> Again, so who cares, if the processor is in a deep idle state anyway,
>>>> not consuming much energy? O.K. but what if it is in an idle state
>>>> where energy consumption is a function of the requested pstate?
>>>> For example, for my processor (i7-2600K), idle state 1, then processor
>>>> package energy can be over double what it should be for many 10s of
>>>> seconds.
>>>>
>>>> Experiment method:
>>>>
>>>> enable only idle state 1
>>>> Dountil stopped
>>>> apply a 100% load (all CPUs)
>>>> after awhile (about 50 seconds) remove the load.
>>>> allow a short transient delay (1 second).
>>>> measure the processor package joules used over the next 149 seconds.
>>>> Enduntil
>>>>
>>>> Kernel k5.4-rc2 + reversion (this method)
>>>> Average processor package power: 9.148 watts (128 samples, > 7 hours)
>>>> Minimum: 9.02 watts
>>>> Maximum: 9.29 watts
>>>> Note: outlyer data point group removed, as it was assumed the computer
>>>> had something to do and wasn't actually "idle".
>>>>
>>>> Kernel 5.4-rc2:
>>>> Average processor package power: 9.969 watts (150 samples, > 8 hours)
>>>> Or 9% more energy for the idle phases of the work load.
>>>> Minimum: 9.15 watts
>>>> Maximum: 13.79 watts (51% more power)
>>
>> Kernel 5.4-rc2 + Rik-patch:
>> Average processor package power: 9.85 watts (53 samples, < 3 hours)
>> Or 7.7% more energy for the idle phases of the work load.
>> Minimum: 9.23 watts
>> Maximum: 12.79 watts (40% more power)


2019-10-29 20:05:57

by srinivas pandruvada

[permalink] [raw]
Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Fri, 2019-10-25 at 08:55 -0700, Doug Smythies wrote:

[...]

> Experiment method:
>
> enable only idle state 1
> Dountil stopped
> apply a 100% load (all CPUs)
> after awhile (about 50 seconds) remove the load.
> allow a short transient delay (1 second).
> measure the processor package joules used over the next 149
> seconds.
> Enduntil
>
> Kernel k5.4-rc2 + reversion (this method)
> Average processor package power: 9.148 watts (128 samples, > 7 hours)
> Minimum: 9.02 watts
> Maximum: 9.29 watts
> Note: outlyer data point group removed, as it was assumed the
> computer
> had something to do and wasn't actually "idle".
>
> Kernel 5.4-rc2:
> Average processor package power: 9.969 watts (150 samples, > 8 hours)
> Or 9% more energy for the idle phases of the work load.
> Minimum: 9.15 watts
> Maximum: 13.79 watts (51% more power)
Hi Doug,

Do you have intel_pstate_tracer output? I guess that when the joules
measurement was started, it started at a higher P-state without the
revert.
Another way to check is by fixing the max and min scaling frequencies
to some frequency; then we shouldn't see a power difference.

Thanks,
Srinivas


>
> ---
> kernel/sched/fair.c | 43 +++++++++----------------------------------
> 1 file changed, 9 insertions(+), 34 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e..51625b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -381,10 +381,9 @@ static inline void
> assert_list_leaf_cfs_rq(struct rq *rq)
> SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
> }
>
> -/* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)
> \
> - list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,
> \
> - leaf_cfs_rq_list)
> +/* Iterate through all cfs_rq's on a runqueue in bottom-up order */
> +#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> + list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list,
> leaf_cfs_rq_list)
>
> /* Do the two (enqueued) entities belong to the same group ? */
> static inline struct cfs_rq *
> @@ -481,8 +480,8 @@ static inline void assert_list_leaf_cfs_rq(struct
> rq *rq)
> {
> }
>
> -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> - for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq =
> pos)
> +#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> + for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
> static inline struct sched_entity *parent_entity(struct sched_entity
> *se)
> {
> @@ -7502,27 +7501,10 @@ static inline void
> update_blocked_load_status(struct rq *rq, bool has_blocked) {
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> -{
> - if (cfs_rq->load.weight)
> - return false;
> -
> - if (cfs_rq->avg.load_sum)
> - return false;
> -
> - if (cfs_rq->avg.util_sum)
> - return false;
> -
> - if (cfs_rq->avg.runnable_load_sum)
> - return false;
> -
> - return true;
> -}
> -
> static void update_blocked_averages(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> - struct cfs_rq *cfs_rq, *pos;
> + struct cfs_rq *cfs_rq;
> const struct sched_class *curr_class;
> struct rq_flags rf;
> bool done = true;
> @@ -7534,7 +7516,7 @@ static void update_blocked_averages(int cpu)
> * Iterates the task_group tree in a bottom up fashion, see
> * list_add_leaf_cfs_rq() for details.
> */
> - for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> + for_each_leaf_cfs_rq(rq, cfs_rq) {
> struct sched_entity *se;
>
> if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq),
> cfs_rq))
> @@ -7545,13 +7527,6 @@ static void update_blocked_averages(int cpu)
> if (se && !skip_blocked_update(se))
> update_load_avg(cfs_rq_of(se), se, 0);
>
> - /*
> - * There can be a lot of idle CPU cgroups. Don't let
> fully
> - * decayed cfs_rqs linger on the list.
> - */
> - if (cfs_rq_is_decayed(cfs_rq))
> - list_del_leaf_cfs_rq(cfs_rq);
> -
> /* Don't need periodic decay once load/util_avg are
> null */
> if (cfs_rq_has_blocked(cfs_rq))
> done = false;
> @@ -10444,10 +10419,10 @@ const struct sched_class fair_sched_class =
> {
> #ifdef CONFIG_SCHED_DEBUG
> void print_cfs_stats(struct seq_file *m, int cpu)
> {
> - struct cfs_rq *cfs_rq, *pos;
> + struct cfs_rq *cfs_rq;
>
> rcu_read_lock();
> - for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
> + for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> print_cfs_rq(m, cpu, cfs_rq);
> rcu_read_unlock();
> }

2019-10-29 22:09:17

by srinivas pandruvada

[permalink] [raw]
Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Tue, 2019-10-29 at 12:34 -0700, Srinivas Pandruvada wrote:
> On Fri, 2019-10-25 at 08:55 -0700, Doug Smythies wrote:
>
> [...]
>
> > Experiment method:
> >
> > enable only idle state 1
> > Dountil stopped
> > apply a 100% load (all CPUs)
> > after awhile (about 50 seconds) remove the load.
> > allow a short transient delay (1 second).
> > measure the processor package joules used over the next 149
> > seconds.
> > Enduntil
> >
> > Kernel k5.4-rc2 + reversion (this method)
> > Average processor package power: 9.148 watts (128 samples, > 7
> > hours)
> > Minimum: 9.02 watts
> > Maximum: 9.29 watts
> > Note: outlyer data point group removed, as it was assumed the
> > computer
> > had something to do and wasn't actually "idle".
> >
> > Kernel 5.4-rc2:
> > Average processor package power: 9.969 watts (150 samples, > 8
> > hours)
> > Or 9% more energy for the idle phases of the work load.
> > Minimum: 9.15 watts
> > Maximum: 13.79 watts (51% more power)
>
> Hi Doug,
>
> Do you have intel_pstate_tracer output? I guess that when started
> request to measure the measure joules, it started at higher P-state
> without revert.
> Other way is check by fixing the max and min scaling frequency to
> some
> frequency, then we shouldn't see power difference.
I mean no significant power difference. Also, to get real numbers, one
needs to use a power meter measuring CPU power. If I can get your script,
I may be able to measure that.

Thanks,
Srinivas

>
> Thanks,
> Srinivas
>
>
> >
> > ---
> > kernel/sched/fair.c | 43 +++++++++------------------------------
> > ----
> > 1 file changed, 9 insertions(+), 34 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 83ab35e..51625b8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -381,10 +381,9 @@ static inline void
> > assert_list_leaf_cfs_rq(struct rq *rq)
> > SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
> > }
> >
> > -/* Iterate thr' all leaf cfs_rq's on a runqueue */
> > -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)
> > \
> > - list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,
> > \
> > - leaf_cfs_rq_list)
> > +/* Iterate through all cfs_rq's on a runqueue in bottom-up order
> > */
> > +#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> > + list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list,
> > leaf_cfs_rq_list)
> >
> > /* Do the two (enqueued) entities belong to the same group ? */
> > static inline struct cfs_rq *
> > @@ -481,8 +480,8 @@ static inline void
> > assert_list_leaf_cfs_rq(struct
> > rq *rq)
> > {
> > }
> >
> > -#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \
> > - for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq =
> > pos)
> > +#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> > + for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
> >
> > static inline struct sched_entity *parent_entity(struct
> > sched_entity
> > *se)
> > {
> > @@ -7502,27 +7501,10 @@ static inline void
> > update_blocked_load_status(struct rq *rq, bool has_blocked) {
> >
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> >
> > -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > -{
> > - if (cfs_rq->load.weight)
> > - return false;
> > -
> > - if (cfs_rq->avg.load_sum)
> > - return false;
> > -
> > - if (cfs_rq->avg.util_sum)
> > - return false;
> > -
> > - if (cfs_rq->avg.runnable_load_sum)
> > - return false;
> > -
> > - return true;
> > -}
> > -
> > static void update_blocked_averages(int cpu)
> > {
> > struct rq *rq = cpu_rq(cpu);
> > - struct cfs_rq *cfs_rq, *pos;
> > + struct cfs_rq *cfs_rq;
> > const struct sched_class *curr_class;
> > struct rq_flags rf;
> > bool done = true;
> > @@ -7534,7 +7516,7 @@ static void update_blocked_averages(int cpu)
> > * Iterates the task_group tree in a bottom up fashion, see
> > * list_add_leaf_cfs_rq() for details.
> > */
> > - for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> > + for_each_leaf_cfs_rq(rq, cfs_rq) {
> > struct sched_entity *se;
> >
> > if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq),
> > cfs_rq))
> > @@ -7545,13 +7527,6 @@ static void update_blocked_averages(int cpu)
> > if (se && !skip_blocked_update(se))
> > update_load_avg(cfs_rq_of(se), se, 0);
> >
> > - /*
> > - * There can be a lot of idle CPU cgroups. Don't let
> > fully
> > - * decayed cfs_rqs linger on the list.
> > - */
> > - if (cfs_rq_is_decayed(cfs_rq))
> > - list_del_leaf_cfs_rq(cfs_rq);
> > -
> > /* Don't need periodic decay once load/util_avg are
> > null */
> > if (cfs_rq_has_blocked(cfs_rq))
> > done = false;
> > @@ -10444,10 +10419,10 @@ const struct sched_class fair_sched_class
> > =
> > {
> > #ifdef CONFIG_SCHED_DEBUG
> > void print_cfs_stats(struct seq_file *m, int cpu)
> > {
> > - struct cfs_rq *cfs_rq, *pos;
> > + struct cfs_rq *cfs_rq;
> >
> > rcu_read_lock();
> > - for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
> > + for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> > print_cfs_rq(m, cpu, cfs_rq);
> > rcu_read_unlock();
> > }
>
>

2019-10-30 14:07:50

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On 2019.10.29 09:02 Vincent Guittot wrote:

> Could you try the patch below ? It ensures that at least the root cfs rq stays
> in the list so each time update_blocked_averages is called, we will call update_cfs_rq_load_avg()
> for the root cfs_rq at least and even if everything already reach zero.
> This will ensure that cfs_rq_util_change is called even if nothing has
> changed.
>
> ---
> kernel/sched/fair.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 151c0b7..ac0a549 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7552,6 +7552,8 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
>
> static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> {
> + struct rq *rq = rq_of(cfs_rq);
> +
> if (cfs_rq->load.weight)
> return false;
>
> @@ -7564,6 +7566,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> if (cfs_rq->avg.runnable_load_sum)
> return false;
>
> + if (cfs_rq == &rq->cfs)
> + return false;
> +
> return true;
> }
>
> --
> 2.7.4

Yes, this patch works and solves the issue of the long times
between calls to the intel_pstate CPU frequency scaling
driver.
I see you sent a formal patch a few hours ago.
I'll try it and report back.

... Doug


2019-10-30 16:03:26

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On 2019.10.29 13:00 Srinivas Pandruvada wrote:
> On Tue, 2019-10-29 at 12:34 -0700, Srinivas Pandruvada wrote:
>> On Fri, 2019-10-25 at 08:55 -0700, Doug Smythies wrote:
>>
>> [...]
>>
>>> Experiment method:
>>>
>>> enable only idle state 1
>> Dountil stopped
>>> apply a 100% load (all CPUs)
>>> after awhile (about 50 seconds) remove the load.
>>> allow a short transient delay (1 second).
>>> measure the processor package joules used over the next 149
>>> seconds.
>>> Enduntil
>>>
>>> Kernel k5.4-rc2 + reversion (this method)
>>> Average processor package power: 9.148 watts (128 samples, > 7
>>> hours)
>>> Minimum: 9.02 watts
>>> Maximum: 9.29 watts
>>> Note: outlyer data point group removed, as it was assumed the
>>> computer
>>> had something to do and wasn't actually "idle".
>>>
>>> Kernel 5.4-rc2:
>>> Average processor package power: 9.969 watts (150 samples, > 8
>>> hours)
>>> Or 9% more energy for the idle phases of the work load.
>>> Minimum: 9.15 watts
>>> Maximum: 13.79 watts (51% more power)

>> Hi Doug,

Hi Srinivas,

>>
>> Do you have intel_pstate_tracer output?

Yes, I have many many runs of intel_pstate_tracer.py
and many plots of pstate and CPU frequency lingering high
for a very very long time after the load is removed.
Here is one example (my reference: results/teo041):

The load is removed at test time 539.047 seconds,
and requested pstates do start to fall. Example,
cpu 4 at time 539.052, pstate request goes from
38 (the max for an i7-2600K) to 32. The last CPU
to reduce the pstate request is CPU 0 at time
539.714, but only to 25.

Then, CPU 4 doesn't run the driver for another
5.9 seconds, and even then only reduces its request
to pstate 21.

CPU 4 remains the defining CPU, and doesn't run the
driver again until time 577.235 seconds, at which time
its pstate request drops to 18, even with 0 load.
So, 38 seconds, and still only at pstate 18.

>> I guess that when started
>> request to measure the measure joules, it started at higher P-state
>> without revert.

No, not really. The main difference is in the time it takes to fully
drop to the lowest pstate.

>> Other way is check by fixing the max and min scaling frequency to
>> some frequency, then we shouldn't see power difference.

Yes, I did that, to learn the numbers.

> I mean not significant power difference.

For idle state 1, at least for my processor (i7-2600K), the difference
is huge. Keep in mind that (at least for my processor) a CPU in idle
state 1 does not relinquish its vote into the CPU frequency PLL, thus
the highest request dictates the CPU frequency.

Here are the idle state 1 powers (42 percent is the minimum for my
processor. For reference, with all idle states enabled, the idle
power is 3.68 watts and the processor package temperature is about
25 degrees, independent of the requested pstate):

Min-percent  watts  temp
 42           8.7    35
 50          10.0    36
 60          12.0    37
 70          14.4    38
 80          17.3    41
 90          21      43
100          21      43

Note that the 90 (pstate 35) and 100 (pstate 38)
powers are the same due to this (I assume):

cpu5: MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores

This can be verified by looking at the requested
and granted MSRs directly:

At 100% min percent:

Requested:
doug@s15:~/temp-k-git/linux$ sudo rdmsr --bitfield 15:8 -d -a 0x199
38
38
38
38
38
38
38
38

Granted:
doug@s15:~/temp-k-git/linux$ sudo rdmsr --bitfield 15:8 -d -a 0x198
35
35
35
35
35
35
35
35
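
For reference, the same two values can be read programmatically through
the msr driver instead of rdmsr. A minimal sketch (my own, assuming the
msr module is loaded and root privileges; as shown above, the granted
value can be clipped by MSR_TURBO_RATIO_LIMIT):

/* pstate_msr.c: read requested (0x199) and granted (0x198) pstates,
 * bits 15:8, via /dev/cpu/N/msr. Error handling is minimal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static uint64_t rdmsr_cpu(int cpu, uint32_t reg)
{
	char path[64];
	uint64_t val;
	int fd;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0 || pread(fd, &val, sizeof(val), reg) != sizeof(val)) {
		perror(path);
		exit(1);
	}
	close(fd);
	return val;
}

int main(void)
{
	for (int cpu = 0; cpu < 8; cpu++)	/* 8 CPUs on an i7-2600K */
		printf("cpu%d: requested %llu granted %llu\n", cpu,
		       (unsigned long long)((rdmsr_cpu(cpu, 0x199) >> 8) & 0xff),
		       (unsigned long long)((rdmsr_cpu(cpu, 0x198) >> 8) & 0xff));
	return 0;
}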

> Also to get real numbers, need
> to use some power meter measuring CPU power.

Well, I have one, but for the box AC only. It just didn't seem
worth the overhead. Yes, I used the joules MSR directly. I also
sanity-check that the processor package temperature
makes sense for the calculated processor package watts. I added a
temperature column above.

> If I can get your script,
> I may be able to measure that.

Hmmm... This was actually a saga all by itself, mainly my own
fault. I am running a bit of a mess here in order to minimize
the time between the multiple load drops from 100% to 0%, so as
to make it easier to follow in the intel_pstate_tracer data.
Load methods aside, the rest is pretty simple:

doug@s15:~/idle$ cat load-no-load-forever2
#!/bin/dash

#
# load-no-load-forever2. Smythies 2019.10.21
# Just trying to get some debug data.
# apply load (real, not via disabling all idle
# states) then no load. loop forever.
# load version 2.

echo "load-no-load-forever2. Start. Doug Smythies 2019.10.21"

while [ 1 ];
do
~/c/waiter 9 2 4 2000000000 0 1 > /dev/null
sleep 1
sudo ~/c/measure_energy 149
done

Where "measure_energy.c" just samples the
joules MSR over an interval, sometimes a longer
interval than turbostat will allow (with full
accuracy), because I know that the joules counter
did not wrap around.
In a separate e-mail I'll send you the c programs,
although I seem to recall that Intel strips out
attached c programs from e-mails.
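
In the meantime, the core of it is not much more than this (a minimal
sketch of the idea only, not the actual measure_energy.c; it assumes
the msr module is loaded and the usual RAPL MSR layout for this
processor):

/* energy_sketch.c: sample MSR_PKG_ENERGY_STATUS (0x611) over an
 * interval and report joules and average watts. The energy unit comes
 * from MSR_RAPL_POWER_UNIT (0x606), bits 12:8: joules per count is
 * 1 / 2^ESU. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static uint64_t rdmsr0(uint32_t reg)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 || pread(fd, &val, sizeof(val), reg) != sizeof(val)) {
		perror("msr");
		exit(1);
	}
	close(fd);
	return val;
}

int main(int argc, char **argv)
{
	unsigned int seconds = argc > 1 ? atoi(argv[1]) : 149;
	double unit = 1.0 / (double)(1u << ((rdmsr0(0x606) >> 8) & 0x1f));
	uint32_t before = (uint32_t)rdmsr0(0x611);

	sleep(seconds);

	/* Unsigned subtraction tolerates one wrap of the 32-bit counter. */
	uint32_t delta = (uint32_t)rdmsr0(0x611) - before;

	printf("%.3f joules, %.3f watts average\n",
	       delta * unit, delta * unit / seconds);
	return 0;
}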

Hope this helps.

... Doug


2019-10-30 17:58:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

On Wed, 30 Oct 2019 at 15:04, Doug Smythies <[email protected]> wrote:
>
> On 2019.10.29 09:02 Vincent Guittot wrote:
>
> > Could you try the patch below ? It ensures that at least the root cfs rq stays
> > in the list so each time update_blocked_averages is called, we will call update_cfs_rq_load_avg()
> > for the root cfs_rq at least and even if everything already reach zero.
> > This will ensure that cfs_rq_util_change is called even if nothing has
> > changed.
> >
> > ---
> > kernel/sched/fair.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 151c0b7..ac0a549 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7552,6 +7552,8 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
> >
> > static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > {
> > + struct rq *rq = rq_of(cfs_rq);
> > +
> > if (cfs_rq->load.weight)
> > return false;
> >
> > @@ -7564,6 +7566,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > if (cfs_rq->avg.runnable_load_sum)
> > return false;
> >
> > + if (cfs_rq == &rq->cfs)
> > + return false;
> > +
> > return true;
> > }
> >
> > --
> > 2.7.4
>
> Yes, this patch works and solves the long time
> between calls of the intel_pstate CPU frequency scaling
> driver issue.
> I see you sent a formal patch a few hours ago.
> I'll try it and report back.

The patch that I sent a few hours ago doesn't solve your problem. It
solves item 1 of my previous email.

The fact that this hack fixes your problem means that intel_pstate needs
to be called periodically even if the CFS PELT signals are null, and
this is probably linked to the internal state machine of the driver.
The current behavior for CFS makes perfect sense, because the CFS signal
is already null, so we don't need to update the frequency on account of
CFS' signal. That leaves the rt, dl and irq signals, which might not be
null yet and which don't trigger a call to cpufreq_update_util, whereas
it could be worth calling it.

I have to prepare a patch for this part, which is item 2.

Regards,
Vincent


>
> ... Doug
>
>

2019-11-08 09:19:49

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

Hi Doug,

On Wednesday 30 Oct 2019 at 16:27:08 (+0100), Vincent Guittot wrote:
> On Wed, 30 Oct 2019 at 15:04, Doug Smythies <[email protected]> wrote:
> >
> > On 2019.10.29 09:02 Vincent Guittot wrote:
> >
> > > Could you try the patch below ? It ensures that at least the root cfs rq stays
> > > in the list so each time update_blocked_averages is called, we will call update_cfs_rq_load_avg()
> > > for the root cfs_rq at least and even if everything already reach zero.
> > > This will ensure that cfs_rq_util_change is called even if nothing has
> > > changed.
> > >
> > > ---
> > > kernel/sched/fair.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 151c0b7..ac0a549 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7552,6 +7552,8 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
> > >
> > > static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > {
> > > + struct rq *rq = rq_of(cfs_rq);
> > > +
> > > if (cfs_rq->load.weight)
> > > return false;
> > >
> > > @@ -7564,6 +7566,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > if (cfs_rq->avg.runnable_load_sum)
> > > return false;
> > >
> > > + if (cfs_rq == &rq->cfs)
> > > + return false;
> > > +
> > > return true;
> > > }
> > >
> > > --
> > > 2.7.4
> >
> > Yes, this patch works and solves the long time
> > between calls of the intel_pstate CPU frequency scaling
> > driver issue.
> > I see you sent a formal patch a few hours ago.
> > I'll try it and report back.
>
> The patch that I sent a few hours ago, doesn't solve your problem. It
> solves item 1 of my previous email.
>
> The fact that this hack fix your problem means that Intel pstate needs
> to be called periodically even if the cfs pelt signals are null and
> this is probably linked to the internal state machine of the driver.
> The current behavior for CFS makes perfectly sense because cfs signal
> is already null so we don't need to update freq because of cfs' signal
> Then it remains the rt, dl and irq signals which might not be null yet
> and which doesn't trigger a call to cpufreq_update_util whereas it
> could worth calling it.
>
> I have to prepare a patch for this part which is item 2

I have finally been able to prepare the patch for item 2. Could you check
that it also fixes your problem?

Subject: [PATCH] sched/freq: move call to cpufreq_update_util

update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has rate limiting.

When a task is attached to a CPU, we have the call path:

update_blocked_averages()
  update_cfs_rq_load_avg()
    cfs_rq_util_change --> trig frequency update
  attach_entity_load_avg()
    cfs_rq_util_change --> trig frequency update

The 1st frequency update will not take into account the utilization of the
newly attached task, and the 2nd one might be discarded because of the
rate limiting of the cpufreq driver.

update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg(), so we can move the call to
{cfs_rq,cpufreq}_util_change() into these 2 functions. It is also
interesting to notice that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.

This change will also ensure that cpufreq_update_util() is called even
when there is no CFS rq left in the leaf_cfs_rq_list to update, but only
irq, rt or dl PELT signals.

Signed-off-by: Vincent Guittot <[email protected]>
---

We can still have some spurious calls to cpufreq_util_change in
update_blocked_averages() with this patch, but at least the value will be
up to date in both calls, which was not the case before. If this fixes
Doug's problem, I can prepare an additional one to fix the spurious calls,
but I wanted to make sure that this fixes the problem first.


kernel/sched/fair.c | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index adc923f..4fd324e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3509,9 +3509,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
cfs_rq->load_last_update_time_copy = sa->last_update_time;
#endif

- if (decayed)
- cfs_rq_util_change(cfs_rq, 0);
-
return decayed;
}

@@ -3621,8 +3618,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
update_tg_load_avg(cfs_rq, 0);

- } else if (decayed && (flags & UPDATE_TG))
- update_tg_load_avg(cfs_rq, 0);
+ } else if (decayed) {
+ cfs_rq_util_change(cfs_rq, 0);
+
+ if (flags & UPDATE_TG)
+ update_tg_load_avg(cfs_rq, 0);
+ }
}

#ifndef CONFIG_64BIT
@@ -7441,6 +7442,7 @@ static void update_blocked_averages(int cpu)
const struct sched_class *curr_class;
struct rq_flags rf;
bool done = true;
+ int decayed = 0;

rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
@@ -7450,9 +7452,9 @@ static void update_blocked_averages(int cpu)
* that RT, DL and IRQ signals have been updated before updating CFS.
*/
curr_class = rq->curr->sched_class;
- update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
- update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
- update_irq_load_avg(rq, 0);
+ decayed |= update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+ decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+ decayed |= update_irq_load_avg(rq, 0);

/* Don't need periodic decay once load/util_avg are null */
if (others_have_blocked(rq))
@@ -7486,6 +7488,9 @@ static void update_blocked_averages(int cpu)
}

update_blocked_load_status(rq, !done);
+
+ if (decayed)
+ cpufreq_update_util(rq, 0);
rq_unlock_irqrestore(rq, &rf);
}

@@ -7542,6 +7547,7 @@ static inline void update_blocked_averages(int cpu)
struct cfs_rq *cfs_rq = &rq->cfs;
const struct sched_class *curr_class;
struct rq_flags rf;
+ int decayed = 0;

rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
@@ -7551,13 +7557,16 @@ static inline void update_blocked_averages(int cpu)
* that RT, DL and IRQ signals have been updated before updating CFS.
*/
curr_class = rq->curr->sched_class;
- update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
- update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
- update_irq_load_avg(rq, 0);
+ decayed |= update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+ decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+ decayed |= update_irq_load_avg(rq, 0);

- update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+ decayed |= update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);

update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
+
+ if (decayed)
+ cpufreq_update_util(rq, 0);
rq_unlock_irqrestore(rq, &rf);
}

--
2.7.4


>
> Regards,
> Vincent
>
>
> >
> > ... Doug
> >
> >

2019-11-09 16:49:22

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

Hi Vincent,

Thank you for your item 2 patch.

On 2019.11.08 01:19 Vincent Guittot wrote:
...
>> I have to prepare a patch for this part which is item 2
>
> I have finally been able to prepared the patch for item 2. Could you check
> that it also fixes your problem ?
...
> We can still have some spurious call to cpufreq_util_change in
> update_blocked_average() with this patch but at least the value will be
> up to date in both calls, which was not the case before. If this fix
> Doug's problem, I can prepare an additional one to fix the spurious call
> but I wanted to make sure that this fix the problem first.

Yes, the issue is solved with this patch.
I do wonder if I am seeing the effect of the spurious calls.

Details:

Test 1: an 8000 second trace during system idle:
Maximum duration: 4.00015 seconds. Good.
Typically, there would have been about 300 durations
of over 10 seconds in 8000 seconds.
Number of calls to driver: 103168, which is about 8% more than
the previous experimental solution.
(Should be repeated a few times to verify repeatability, but
not going to.)

Test 2: one 8000 second energy sample, for high accuracy idle power:
3.703 watts which is about +0.7% idle power increase.

Test 3: The load-no-load test with only idle state 1 enabled:
There was never an excessive energy sample for the no load samples.
The test ran for about 8 hours.
Maximum: 9.49 watts
Minimum: 9.13 watts
Recall that with the issue, the max would have been about 14 watts

Kernel: 5.4-rc6 + your items 1 and item 2 patches.
Idle governor = menu, because teo fixes are still pending.
Note: some reference data is from kernel 5.4-rc2, and really
should have been re-done with 5.4-rc6 as the baseline.

... Doug


2019-11-10 15:18:10

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] Revert "sched/fair: Fix O(nr_cgroups) in the load balancing path"

Hi Doug,

On Sat, 9 Nov 2019 at 17:47, Doug Smythies <[email protected]> wrote:
>
> Hi Vincent,
>
> Thank you for your item 2 patch.
>
> On 2019.11.08 01:19 Vincent Guittot wrote:
> ...
> >> I have to prepare a patch for this part which is item 2
> >
> > I have finally been able to prepared the patch for item 2. Could you check
> > that it also fixes your problem ?
> ...
> > We can still have some spurious call to cpufreq_util_change in
> > update_blocked_average() with this patch but at least the value will be
> > up to date in both calls, which was not the case before. If this fix
> > Doug's problem, I can prepare an additional one to fix the spurious call
> > but I wanted to make sure that this fix the problem first.
>
> Yes, the issue is solved with this patch.

Thanks for your tests

> I do wonder if I am seeing the effect of the spurious calls.

I don't think so, because the spurious calls are in fact a 2nd call
during the same update_blocked_averages() and, from what I have seen,
the intel_pstate driver filters out a call when there is less than 3 or
10 ms between the 2 calls.
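
For what it is worth, the filtering being described boils down to
something like this (an illustrative sketch only; the names and the
10 ms threshold are placeholders, not taken from the actual
intel_pstate source):

/* rate_limit_toy.c: a utilization callback is ignored when it arrives
 * too soon after the previous one that was acted on. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_SAMPLE_INTERVAL_NS (10ULL * 1000 * 1000)	/* ~10 ms */

struct toy_cpu {
	uint64_t last_sample_ns;
};

static bool toy_should_sample(struct toy_cpu *cpu, uint64_t now_ns)
{
	if (now_ns - cpu->last_sample_ns < TOY_SAMPLE_INTERVAL_NS)
		return false;	/* 2nd call in the same window: filtered */
	cpu->last_sample_ns = now_ns;
	return true;
}

int main(void)
{
	struct toy_cpu cpu = { 0 };

	/* Two back-to-back calls, as in update_blocked_averages():
	 * only the first one in the window is acted on. */
	printf("call at 20.0 ms: %s\n",
	       toy_should_sample(&cpu, 20000000) ? "sampled" : "filtered");
	printf("call at 20.1 ms: %s\n",
	       toy_should_sample(&cpu, 20100000) ? "sampled" : "filtered");
	return 0;
}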

>
> Details:
>
> Test 1: an 8000 second trace during system idle:
> Maximum duration: 4.00015 seconds. Good.
> Typically, there would have been about 300 durations
> of over 10 seconds in 8000 seconds.
> Number of calls to driver: 103168, which is about 8% more than
> the previous experimental solution.
> (Should be repeated a few times to verify repeatability, but
> not going to.)
>
> Test 2: one 8000 second energy sample, for high accuracy idle power:
> 3.703 watts which is about +0.7% idle power increase.
>
> Test 3: The load-no-load test with only idle state 1 enabled:
> There was never an excessive energy sample for the no load samples.
> The test ran for about 8 hours.
> Maximum: 9.49 watts
> Minimum: 9.13 watts
> Recall that with the issue, the max would have been about 14 watts
>
> Kernel: 5.4-rc6 + your items 1 and item 2 patches.
> Idle governor = menu, because teo fixes are still pending.
> Note: some reference data is from kernel 5.4-rc2, and really
> should have been re-done with 5.4-rc6 as the baseline.
>
> ... Doug
>
>