2021-01-22 17:31:58

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
<[email protected]> wrote:
>
> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> that there are a lot of calls to update_blocked_averages(). This causes
> the schedule loop to slow down to taking upto 500 micro seconds at
> times (due to newidle load balance). I have also seen this manifest in
> the periodic balancer.
>
> Closer look shows that the problem is caused by the following
> ingredients:
> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> suggesting to inspect /proc/sched_debug for this), this can make
> __update_blocked_fair() take a long time.

Inactive cgroups are removed from the list, so they should not impact
the duration.

>
> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> shared frequency domain configuration to be slower than usual. (the load

What do you mean exactly by "it causes schedutil to be slower than usual"?

> average updates also try to update the frequency in schedutil).
>
> 3. The CPU is running at a low frequency causing the scheduler/schedutil
> code paths to take longer than when running at a high CPU frequency.

Low frequency usually means low utilization, so it should not happen that much.

>
> The fix is simply rate limit the calls to update_blocked_averages to 20
> times per second. It appears that updating the blocked average less
> often is sufficient. Currently I see about 200 calls per second

It would be good to explain why updating less often is sufficient.

> sometimes, which seems overkill.
>
> schbench shows a clear improvement with the change:

Have you got more details about your test setup?
Which platform?
Which kernel?

>
> Without patch:
> ~/schbench -t 2 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (212 total samples)
> 50.0th: 210 (106 samples)
> 75.0th: 619 (53 samples)
> 90.0th: 665 (32 samples)
> 95.0th: 703 (11 samples)
> *99.0th: 12656 (8 samples)
> 99.5th: 12784 (1 samples)
> 99.9th: 13424 (1 samples)
> min=15, max=13424
>
> With patch:
> ~/schbench -t 2 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (214 total samples)
> 50.0th: 188 (108 samples)
> 75.0th: 238 (53 samples)
> 90.0th: 623 (32 samples)
> 95.0th: 657 (12 samples)
> *99.0th: 717 (7 samples)
> 99.5th: 725 (2 samples)
> 99.9th: 725 (0 samples)
>
> Cc: Paul McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Suggested-by: Dietmar Eggeman <[email protected]>
> Co-developed-by: Qais Yousef <[email protected]>
> Signed-off-by: Qais Yousef <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..fe2dc0024db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> return false;
>
> - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))

This condition is there to make sure we update the blocked load at most
once per tick, in order to filter the newly idle case; otherwise the rate
limiting is already done by the load balance interval.
This hard-coded (HZ/20) really looks like an ugly hack.

> return true;
>
> update_blocked_averages(cpu);
> --
> 2.30.0.280.ga3ce27912f-goog
>


2021-01-22 19:13:48

by Joel Fernandes

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Hi Vincent,

Thanks for reply. Please see the replies below:

On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> <[email protected]> wrote:
> >
> > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > that there are a lot of calls to update_blocked_averages(). This causes
> > the schedule loop to slow down to taking upto 500 micro seconds at
> > times (due to newidle load balance). I have also seen this manifest in
> > the periodic balancer.
> >
> > Closer look shows that the problem is caused by the following
> > ingredients:
> > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > suggesting to inspect /proc/sched_debug for this), this can make
> > __update_blocked_fair() take a long time.
>
> Inactive cgroups are removed from the list so they should not impact
> the duration

I meant blocked CGroups. According to this code, a cfs_rq can be partially
decayed and not have any tasks running on it but its load needs to be
decayed, correct? That's what I meant by 'inactive'. I can reword it to
'blocked'.

	/*
	 * There can be a lot of idle CPU cgroups. Don't let fully
	 * decayed cfs_rqs linger on the list.
	 */
	if (cfs_rq_is_decayed(cfs_rq))
		list_del_leaf_cfs_rq(cfs_rq);

> > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > shared frequency domain configuration to be slower than usual. (the load
>
> What do you mean exactly by it causes schedutil to be slower than usual ?

sugov_next_freq_shared() is O(number of CPUs in the cluster). This
system is a 6+2 system with 6 CPUs in one cluster. A schedutil shared-policy
frequency update needs to go through the utilization of the other CPUs in the
cluster. I believe this could be adding to the problem, but it is not really
worth optimizing if we can rate limit the calls to update_blocked_averages()
to begin with.
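
For reference, the shared-policy walk on this kernel looks roughly like the
following (a simplified sketch from memory, not verbatim source):

	static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
	{
		struct sugov_policy *sg_policy = sg_cpu->sg_policy;
		struct cpufreq_policy *policy = sg_policy->policy;
		unsigned long util = 0, max = 1;
		unsigned int j;

		/* One pass over every CPU sharing the frequency domain. */
		for_each_cpu(j, policy->cpus) {
			struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
			unsigned long j_util, j_max;

			j_util = sugov_get_util(j_sg_cpu);
			j_max = j_sg_cpu->max;
			j_util = sugov_iowait_apply(j_sg_cpu, time, j_util, j_max);

			/* Keep the CPU with the highest relative utilization. */
			if (j_util * max > j_max * util) {
				util = j_util;
				max = j_max;
			}
		}

		return get_next_freq(sg_policy, util, max);
	}

So with 6 CPUs in the cluster, the loop body runs 6 times per frequency update.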

> > average updates also try to update the frequency in schedutil).
> >
> > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > code paths to take longer than when running at a high CPU frequency.
>
> Low frequency usually means low utilization so it should happen that much.

It happens a lot as can be seen with schbench. It is super easy to reproduce.

schedule() can result in a new idle balance, with the CFS pick call happening
often. Here is a function graph trace. The tracer shows
update_blocked_averages() taking a lot of time.

sugov:0-2454 [002] 2657.992570: funcgraph_entry: | load_balance() {
sugov:0-2454 [002] 2657.992577: funcgraph_entry: | update_group_capacity() {
sugov:0-2454 [002] 2657.992580: funcgraph_entry: 2.656 us | __msecs_to_jiffies();
sugov:0-2454 [002] 2657.992585: funcgraph_entry: 2.447 us | _raw_spin_lock_irqsave();
sugov:0-2454 [002] 2657.992591: funcgraph_entry: 2.552 us | _raw_spin_unlock_irqrestore();
sugov:0-2454 [002] 2657.992595: funcgraph_exit: + 17.448 us | }
sugov:0-2454 [002] 2657.992597: funcgraph_entry: 1.875 us | update_nohz_stats();
sugov:0-2454 [002] 2657.992601: funcgraph_entry: 1.667 us | idle_cpu();
sugov:0-2454 [002] 2657.992605: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992608: funcgraph_entry: + 33.333 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992643: funcgraph_exit: + 38.073 us | }
sugov:0-2454 [002] 2657.992645: funcgraph_entry: 1.770 us | idle_cpu();
sugov:0-2454 [002] 2657.992649: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992651: funcgraph_entry: + 41.823 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992694: funcgraph_exit: + 45.729 us | }
sugov:0-2454 [002] 2657.992696: funcgraph_entry: 1.823 us | idle_cpu();
sugov:0-2454 [002] 2657.992700: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992702: funcgraph_entry: + 35.312 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992740: funcgraph_exit: + 39.792 us | }
sugov:0-2454 [002] 2657.992742: funcgraph_entry: 1.771 us | idle_cpu();
sugov:0-2454 [002] 2657.992746: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992748: funcgraph_entry: + 33.438 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992783: funcgraph_exit: + 37.500 us | }
sugov:0-2454 [002] 2657.992785: funcgraph_entry: 1.771 us | idle_cpu();
sugov:0-2454 [002] 2657.992790: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992792: funcgraph_entry: + 45.521 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992839: funcgraph_exit: + 49.323 us | }
sugov:0-2454 [002] 2657.992842: funcgraph_entry: 1.823 us | idle_cpu();
sugov:0-2454 [002] 2657.992847: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992850: funcgraph_entry: + 67.187 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992919: funcgraph_exit: + 72.031 us | }
sugov:0-2454 [002] 2657.992921: funcgraph_entry: 2.760 us | idle_cpu();
sugov:0-2454 [002] 2657.992926: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992928: funcgraph_entry: + 61.146 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992992: funcgraph_exit: + 65.886 us | }
sugov:0-2454 [002] 2657.992994: funcgraph_entry: 1.771 us | idle_cpu();
sugov:0-2454 [002] 2657.992998: funcgraph_exit: ! 430.209 us | }
sugov:0-2454 [002] 2657.993006: bprint: trace_long: wtf: lb: 432916
sugov:0-2454 [002] 2657.993017: bprint: trace_long: wtf: newidle_balance: 501458


> > The fix is simply rate limit the calls to update_blocked_averages to 20
> > times per second. It appears that updating the blocked average less
> > often is sufficient. Currently I see about 200 calls per second
>
> Would be good to explain why updating less often is sufficient ?

I don't know this code that well; intuitively, it seems to me that updating
blocked averages at such a high rate is pointless. But I defer to your expertise
on that. Why do you feel an update is needed at least HZ times per second?
What about systems with HZ=1000 or 300? That seems like an insane rate of
updating (not to mention all the complexity of going through the leaf cgroup
list and doing the frequency updates).

> > sometimes, which seems overkill.
> >
> > schbench shows a clear improvement with the change:
>
> Have you got more details about your test setup ?
> which platform ?
> which kernel ?

I mentioned in the commit message it is a v5.4 kernel.

The platform is the Snapdragon 7c:
https://www.qualcomm.com/products/snapdragon-7c-compute-platform

>
> >
> > Without patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> > 50.0th: 210 (106 samples)
> > 75.0th: 619 (53 samples)
> > 90.0th: 665 (32 samples)
> > 95.0th: 703 (11 samples)
> > *99.0th: 12656 (8 samples)
> > 99.5th: 12784 (1 samples)
> > 99.9th: 13424 (1 samples)
> > min=15, max=13424
> >
> > With patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> > 50.0th: 188 (108 samples)
> > 75.0th: 238 (53 samples)
> > 90.0th: 623 (32 samples)
> > 95.0th: 657 (12 samples)
> > *99.0th: 717 (7 samples)
> > 99.5th: 725 (2 samples)
> > 99.9th: 725 (0 samples)
> >
> > Cc: Paul McKenney <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Suggested-by: Dietmar Eggeman <[email protected]>
> > Co-developed-by: Qais Yousef <[email protected]>
> > Signed-off-by: Qais Yousef <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> >
> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >
> > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
>
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval

What prevents newidle_balance from hitting this code? Are you suggesting rate
limiting that as well?

> This hard coded (HZ/20) looks really like an ugly hack

Yes, it was not a well-researched number. If you have a suggestion for a better
fix, let me know.

thanks,

- Joel


> > return true;
> >
> > update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >

2021-01-22 19:39:24

by Qais Yousef

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 01/22/21 17:56, Vincent Guittot wrote:
> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >
> > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
>
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval
> This hard coded (HZ/20) looks really like an ugly hack

This was meant as an RFC patch to discuss the problem really.

Joel is seeing update_blocked_averages() taking ~100us. Half of it seems to be
spent in __update_blocked_fair() and the other half in sugov_update_shared(),
so roughly 50us each. Note that each function calls an iterator in
turn. Correct me if my numbers are wrong, Joel.

Running on a little core at a low frequency, these numbers don't look too odd,
so I'm not seeing how we can speed these functions up.

But since update_sg_lb_stats() will end up with multiple calls to
update_blocked_averages() in one go, this latency adds up quickly.

One noticeable factor in Joel's system is the presence of a lot of cgroups,
which is essentially what makes __update_blocked_fair() expensive; it also seems
to always report that something has decayed, so we end up with a call to
sugov_update_shared() on every invocation.

I think we should limit the expensive call to update_blocked_averages() but
I honestly don't know what would be the right way to do it :-/

Or maybe there's another better alternative too..

Thanks

--
Qais Yousef

>
> > return true;
> >
> > update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >

2021-01-25 14:51:10

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
>
> Hi Vincent,
>
> Thanks for reply. Please see the replies below:
>
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > <[email protected]> wrote:
> > >
> > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > that there are a lot of calls to update_blocked_averages(). This causes
> > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > times (due to newidle load balance). I have also seen this manifest in
> > > the periodic balancer.
> > >
> > > Closer look shows that the problem is caused by the following
> > > ingredients:
> > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > suggesting to inspect /proc/sched_debug for this), this can make
> > > __update_blocked_fair() take a long time.
> >
> > Inactive cgroups are removed from the list so they should not impact
> > the duration
>
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.

How many blocked cgroups have you got ?

>
> * There can be a lot of idle CPU cgroups. Don't let fully
> * decayed cfs_rqs linger on the list.
> */
> if (cfs_rq_is_decayed(cfs_rq))
> list_del_leaf_cfs_rq(cfs_rq);
>
> > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > shared frequency domain configuration to be slower than usual. (the load
> >
> > What do you mean exactly by it causes schedutil to be slower than usual ?
>
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.

Qais mentioned half of the time being used by
sugov_next_freq_shared(). Are there any frequency changes resulting in
this call ?

>
> > > average updates also try to update the frequency in schedutil).
> > >
> > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > code paths to take longer than when running at a high CPU frequency.
> >
> > Low frequency usually means low utilization so it should happen that much.
>
> It happens a lot as can be seen with schbench. It is super easy to reproduce.

Happening a lot is not in itself a problem if there is nothing else to
do, so it's not an argument in itself.

So why is it a problem for you? You are mentioning newly idle load
balance, so I assume that your root problem is the scheduling delay
generated by the newly idle load balance, which then calls
update_blocked_averages().

Rate limiting the call to update_blocked_averages() will only reduce
the number of times it happens, but it will not prevent it from happening.

>
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace. The tracer shows
> update_blocked_averages taking a lot of time.
>
> sugov:0-2454 [002] 2657.992570: funcgraph_entry: | load_balance() {
> sugov:0-2454 [002] 2657.992577: funcgraph_entry: | update_group_capacity() {
> sugov:0-2454 [002] 2657.992580: funcgraph_entry: 2.656 us | __msecs_to_jiffies();
> sugov:0-2454 [002] 2657.992585: funcgraph_entry: 2.447 us | _raw_spin_lock_irqsave();
> sugov:0-2454 [002] 2657.992591: funcgraph_entry: 2.552 us | _raw_spin_unlock_irqrestore();
> sugov:0-2454 [002] 2657.992595: funcgraph_exit: + 17.448 us | }
> sugov:0-2454 [002] 2657.992597: funcgraph_entry: 1.875 us | update_nohz_stats();
> sugov:0-2454 [002] 2657.992601: funcgraph_entry: 1.667 us | idle_cpu();
> sugov:0-2454 [002] 2657.992605: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992608: funcgraph_entry: + 33.333 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992643: funcgraph_exit: + 38.073 us | }
> sugov:0-2454 [002] 2657.992645: funcgraph_entry: 1.770 us | idle_cpu();
> sugov:0-2454 [002] 2657.992649: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992651: funcgraph_entry: + 41.823 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992694: funcgraph_exit: + 45.729 us | }
> sugov:0-2454 [002] 2657.992696: funcgraph_entry: 1.823 us | idle_cpu();
> sugov:0-2454 [002] 2657.992700: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992702: funcgraph_entry: + 35.312 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992740: funcgraph_exit: + 39.792 us | }
> sugov:0-2454 [002] 2657.992742: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992746: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992748: funcgraph_entry: + 33.438 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992783: funcgraph_exit: + 37.500 us | }
> sugov:0-2454 [002] 2657.992785: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992790: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992792: funcgraph_entry: + 45.521 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992839: funcgraph_exit: + 49.323 us | }
> sugov:0-2454 [002] 2657.992842: funcgraph_entry: 1.823 us | idle_cpu();
> sugov:0-2454 [002] 2657.992847: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992850: funcgraph_entry: + 67.187 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992919: funcgraph_exit: + 72.031 us | }
> sugov:0-2454 [002] 2657.992921: funcgraph_entry: 2.760 us | idle_cpu();
> sugov:0-2454 [002] 2657.992926: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992928: funcgraph_entry: + 61.146 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992992: funcgraph_exit: + 65.886 us | }
> sugov:0-2454 [002] 2657.992994: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992998: funcgraph_exit: ! 430.209 us | }
> sugov:0-2454 [002] 2657.993006: bprint: trace_long: wtf: lb: 432916
> sugov:0-2454 [002] 2657.993017: bprint: trace_long: wtf: newidle_balance: 501458
>
>
> > > The fix is simply rate limit the calls to update_blocked_averages to 20
> > > times per second. It appears that updating the blocked average less
> > > often is sufficient. Currently I see about 200 calls per second
> >
> > Would be good to explain why updating less often is sufficient ?
>
> I don't know this code that well, intuitively it seems to me updating blocked
> averages at such a high rate seems pointless. But I defer to your expertise
> on that. Why do you feel an update is needed at least HZ times per second?
> What about system with HZ=1000 or 300, that seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf cgroup
> list and doing the frequency updates).

Your trace above is a full update of the blocked load, which is a normal
sequence that happens regularly, but you don't show that this happens
too often.

The call to update_blocked_averages() is rate limited by the load_balance
interval for idle and busy load balance, so those 2 cases should not be
a problem. There is no rate limit in the newly idle case because it is
assumed that the latter is normally called only if it has been
estimated that there is enough time to run it. That's why
update_blocked_averages() is called every time during newly idle
load balance.

see in update_sd_lb_stats():

	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
		env->flags |= LBF_NOHZ_STATS;

We could consider removing this condition, but I need time to recall
all the side effects of this.
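
For completeness, when LBF_NOHZ_STATS is set, update_sg_lb_stats() then does
roughly this for each CPU of the group (simplified, not verbatim):

	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
		struct rq *rq = cpu_rq(i);

		/* Newly idle path: refresh the blocked load of each idle CPU visited. */
		if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
			env->flags |= LBF_NOHZ_AGAIN;

		...
	}

which is where the repeated update_blocked_averages() calls in your trace come from.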

IIUC, your real problem is that newidle_balance() is running while a
task is about to wake up on the CPU, and we don't abort quickly enough during
this load_balance.

So we could also try to abort earlier in the newly idle load balance case.

>
> > > sometimes, which seems overkill.
> > >
> > > schbench shows a clear improvement with the change:
> >
> > Have you got more details about your test setup ?
> > which platform ?
> > which kernel ?
>
> I mentioned in the commit message it is a v5.4 kernel.

I was not sure if the test results were done with this kernel, because we
usually ask for results against mainline to make sure you are not
facing a problem that has been solved since v5.4 was released.

>
> The platform is Snapdragon 7c. The platform is:
> https://www.qualcomm.com/products/snapdragon-7c-compute-platform
>
> >
> > >
> > > Without patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> > > 50.0th: 210 (106 samples)
> > > 75.0th: 619 (53 samples)
> > > 90.0th: 665 (32 samples)
> > > 95.0th: 703 (11 samples)
> > > *99.0th: 12656 (8 samples)
> > > 99.5th: 12784 (1 samples)
> > > 99.9th: 13424 (1 samples)
> > > min=15, max=13424
> > >
> > > With patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> > > 50.0th: 188 (108 samples)
> > > 75.0th: 238 (53 samples)
> > > 90.0th: 623 (32 samples)
> > > 95.0th: 657 (12 samples)
> > > *99.0th: 717 (7 samples)
> > > 99.5th: 725 (2 samples)
> > > 99.9th: 725 (0 samples)
> > >
> > > Cc: Paul McKenney <[email protected]>
> > > Cc: Frederic Weisbecker <[email protected]>
> > > Suggested-by: Dietmar Eggeman <[email protected]>
> > > Co-developed-by: Qais Yousef <[email protected]>
> > > Signed-off-by: Qais Yousef <[email protected]>
> > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > >
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > return false;
> > >
> > > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
>
> What prevents newidle_balance from hitting this code? Are you suggesting rate
> limit that as well?
>
> > This hard coded (HZ/20) looks really like an ugly hack
>
> Yes, it was not a well researched number. If you have a suggestion for better
> fix, let me know.
>
> thanks,
>
> - Joel
>
>
> > > return true;
> > >
> > > update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >

2021-01-26 04:33:21

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 22/01/2021 20:10, Joel Fernandes wrote:
> Hi Vincent,
>
> Thanks for reply. Please see the replies below:
>
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>> <[email protected]> wrote:
>>>
>>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
>>> that there are a lot of calls to update_blocked_averages(). This causes
>>> the schedule loop to slow down to taking upto 500 micro seconds at
>>> times (due to newidle load balance). I have also seen this manifest in
>>> the periodic balancer.
>>>
>>> Closer look shows that the problem is caused by the following
>>> ingredients:
>>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
>>> suggesting to inspect /proc/sched_debug for this), this can make
>>> __update_blocked_fair() take a long time.
>>
>> Inactive cgroups are removed from the list so they should not impact
>> the duration
>
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.
>
> * There can be a lot of idle CPU cgroups. Don't let fully
> * decayed cfs_rqs linger on the list.
> */
> if (cfs_rq_is_decayed(cfs_rq))
> list_del_leaf_cfs_rq(cfs_rq);
>
>>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
>>> shared frequency domain configuration to be slower than usual. (the load
>>
>> What do you mean exactly by it causes schedutil to be slower than usual ?
>
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.
>
>>> average updates also try to update the frequency in schedutil).
>>>
>>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
>>> code paths to take longer than when running at a high CPU frequency.
>>
>> Low frequency usually means low utilization so it should happen that much.
>
> It happens a lot as can be seen with schbench. It is super easy to reproduce.
>
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace. The tracer shows
> update_blocked_averages taking a lot of time.
>
> sugov:0-2454 [002] 2657.992570: funcgraph_entry: | load_balance() {
> sugov:0-2454 [002] 2657.992577: funcgraph_entry: | update_group_capacity() {
> sugov:0-2454 [002] 2657.992580: funcgraph_entry: 2.656 us | __msecs_to_jiffies();
> sugov:0-2454 [002] 2657.992585: funcgraph_entry: 2.447 us | _raw_spin_lock_irqsave();
> sugov:0-2454 [002] 2657.992591: funcgraph_entry: 2.552 us | _raw_spin_unlock_irqrestore();
> sugov:0-2454 [002] 2657.992595: funcgraph_exit: + 17.448 us | }
> sugov:0-2454 [002] 2657.992597: funcgraph_entry: 1.875 us | update_nohz_stats();
> sugov:0-2454 [002] 2657.992601: funcgraph_entry: 1.667 us | idle_cpu();
> sugov:0-2454 [002] 2657.992605: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992608: funcgraph_entry: + 33.333 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992643: funcgraph_exit: + 38.073 us | }
> sugov:0-2454 [002] 2657.992645: funcgraph_entry: 1.770 us | idle_cpu();
> sugov:0-2454 [002] 2657.992649: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992651: funcgraph_entry: + 41.823 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992694: funcgraph_exit: + 45.729 us | }
> sugov:0-2454 [002] 2657.992696: funcgraph_entry: 1.823 us | idle_cpu();
> sugov:0-2454 [002] 2657.992700: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992702: funcgraph_entry: + 35.312 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992740: funcgraph_exit: + 39.792 us | }
> sugov:0-2454 [002] 2657.992742: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992746: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992748: funcgraph_entry: + 33.438 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992783: funcgraph_exit: + 37.500 us | }
> sugov:0-2454 [002] 2657.992785: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992790: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992792: funcgraph_entry: + 45.521 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992839: funcgraph_exit: + 49.323 us | }
> sugov:0-2454 [002] 2657.992842: funcgraph_entry: 1.823 us | idle_cpu();
> sugov:0-2454 [002] 2657.992847: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992850: funcgraph_entry: + 67.187 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992919: funcgraph_exit: + 72.031 us | }
> sugov:0-2454 [002] 2657.992921: funcgraph_entry: 2.760 us | idle_cpu();
> sugov:0-2454 [002] 2657.992926: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992928: funcgraph_entry: + 61.146 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992992: funcgraph_exit: + 65.886 us | }
> sugov:0-2454 [002] 2657.992994: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992998: funcgraph_exit: ! 430.209 us | }
> sugov:0-2454 [002] 2657.993006: bprint: trace_long: wtf: lb: 432916
> sugov:0-2454 [002] 2657.993017: bprint: trace_long: wtf: newidle_balance: 501458
>
>
>>> The fix is simply rate limit the calls to update_blocked_averages to 20
>>> times per second. It appears that updating the blocked average less
>>> often is sufficient. Currently I see about 200 calls per second
>>
>> Would be good to explain why updating less often is sufficient ?
>
> I don't know this code that well, intuitively it seems to me updating blocked
> averages at such a high rate seems pointless. But I defer to your expertise
> on that. Why do you feel an update is needed at least HZ times per second?
> What about system with HZ=1000 or 300, that seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf cgroup
> list and doing the frequency updates).

I assume this is what you're seeing on your device. This is on tip sched/core
but should be close to your kernel. I glanced over the diffs in fair.c between
chromeos-5.4 and tip sched/core and didn't spot any changes in this area.

I ran on a hikey620 w/o CONFIG_SCHED_MC to mimic the 8 CPUs (8 sched groups
(sg)) in the MC domain (the only sched domain).

Since nohz.has_blocked=1 in your newidle_balance() calls,
load_balance() -> update_sd_lb_stats() sets LBF_NOHZ_STATS and calls
update_sg_lb_stats() for each of the 8 sg's.

Since LBF_NOHZ_STATS is set, update_sg_lb_stats() calls
update_nohz_stats(..., false) per cpu in sg.

And for a lot of these 8 sg's, i.e. 8 CPUs, update_blocked_averages()
is called since none of the 3 bail-out conditions:

(1) !rq->has_blocked_load
(2) !cpumask_test_cpu(cpu, nohz.idle_cpus_mask)
(3) !force && !time_after(jiffies, rq->last_blocked_load_update_tick)

trigger.

We advance nohz.next_blocked by msecs_to_jiffies(LOAD_AVG_PERIOD) (32ms,
8 jiffies w/ HZ=250), but we advance rq->last_blocked_load_update_tick
only to 'jiffies' in update_blocked_load_status().
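
update_blocked_load_status() being roughly (from memory, simplified):

	static inline void update_blocked_load_status(struct rq *rq, bool has_blocked)
	{
		/* Record when this rq's blocked load was last updated. */
		rq->last_blocked_load_update_tick = jiffies;

		if (!has_blocked)
			rq->has_blocked_load = 0;
	}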


[005] 7370.188469: bprint: nohz_balance_enter_idle: CPU5 nohz.has_blocked=1
...
[005] 7370.210068: bprint: pick_next_task_fair: CPU5
[005] 7370.210079: bprint: update_sd_lb_stats: CPU5 nohz.has_blocked=1 -> set LBF_NOHZ_STATS
[005] 7370.210082: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=5 weight=1)
[005] 7370.210085: bprint: update_nohz_stats: CPU5 cpu=5 not in nohz.idle_cpus_mask -> bail
[005] 7370.210088: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=6 weight=1)
[005] 7370.210091: bprint: update_nohz_stats: CPU5 cpu=6 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210112: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=7 weight=1)
[005] 7370.210116: bprint: update_nohz_stats: CPU5 cpu=7 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210134: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=0 weight=1)
[005] 7370.210137: bprint: update_nohz_stats: CPU5 cpu=0 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210156: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=1 weight=1)
[005] 7370.210159: bprint: update_nohz_stats: CPU5 cpu=1 not in nohz.idle_cpus_mask -> bail
[005] 7370.210162: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=2 weight=1)
[005] 7370.210165: bprint: update_nohz_stats: CPU5 cpu=2 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210183: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=3 weight=1)
[005] 7370.210186: bprint: update_nohz_stats: CPU5 cpu=3 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
[005] 7370.210205: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=4 weight=1)
[005] 7370.210207: bprint: update_nohz_stats: CPU5 cpu=4 not in nohz.idle_cpus_mask -> bail
...
[005] 7370.444704: bprint: _nohz_idle_balance: CPU5 nohz.has_blocked=0


If I understood you correctly, you want to avoid these frequent calls
to update_blocked_averages() here, to further avoid invoking schedutil
via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
is set) very often in your setup.
Since you have up to 6 CPUs in a frequency domain, this could be more
costly than usual.
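
For reference, update_blocked_averages() itself is roughly (simplified, not
verbatim):

	static void update_blocked_averages(int cpu)
	{
		bool decayed = false, done = true;
		struct rq *rq = cpu_rq(cpu);
		struct rq_flags rf;

		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);

		decayed |= __update_blocked_others(rq, &done);
		decayed |= __update_blocked_fair(rq, &done);

		update_blocked_load_status(rq, !done);
		if (decayed)
			cpufreq_update_util(rq, 0);	/* may end up in sugov_update_shared() */
		rq_unlock_irqrestore(rq, &rf);
	}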

2021-01-26 06:22:16

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, 22 Jan 2021 at 19:39, Qais Yousef <[email protected]> wrote:
>
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > return false;
> > >
> > > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
> > This hard coded (HZ/20) looks really like an ugly hack
>
> This was meant as an RFC patch to discuss the problem really.
>
> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in sugov_update_shared().
> So roughly 50us each. Note that each function is calling an iterator in

Can I assume that a freq change happens if sugov_update_shared() takes 50us?
Which would mean that the update was useful in the end?

> return. Correct me if my numbers are wrong Joel.
>
> Running on a little core on low frequency these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.
>
> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.
>
> One noticeable factor in Joel's system is the presence of a lot of cgroups.
> Which is essentially what makes __update_blocked_fair() expensive, and it seems
> to always return something has decayed so we end up with a call to
> sugov_update_shared() in every call.
>
> I think we should limit the expensive call to update_blocked_averages() but

On the other hand, some will complain that blocked values stay stale at a
high value and prevent any useful adjustment.

Also, update_blocked_averages() is already rate limited for idle and busy
load_balance.

Seems like the problem raised by Joel is the number of newly idle load balances.

> I honestly don't know what would be the right way to do it :-/
>
> Or maybe there's another better alternative too..
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > > return true;
> > >
> > > update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >

2021-01-27 06:14:18

by Qais Yousef

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 01/25/21 14:23, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 19:39, Qais Yousef <[email protected]> wrote:
> >
> > On 01/22/21 17:56, Vincent Guittot wrote:
> > > > ---
> > > > kernel/sched/fair.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 04a3ce20da67..fe2dc0024db5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > > return false;
> > > >
> > > > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> > >
> > > This condition is there to make sure to update blocked load at most
> > > once a tick in order to filter newly idle case otherwise the rate
> > > limit is already done by load balance interval
> > > This hard coded (HZ/20) looks really like an ugly hack
> >
> > This was meant as an RFC patch to discuss the problem really.
> >
> > Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> > processing __update_blocked_fair() and the other half in sugov_update_shared().
> > So roughly 50us each. Note that each function is calling an iterator in
>
> Can I assume that a freq change happens if sugov_update_shared() takes 50us ?
> which would mean that the update was useful at the end ?

I couldn't reproduce his problem on Juno. But I think it is not actually doing
any frequency update. sugov_update_shared() is rate limited by
sugov_should_update_freq(). Joel has a 1ms rate limit for schedutil in sysfs.
The function traces showed that it is continuously doing the full scan, which
indicates that sugov_update_next_freq() is continuously bailing out at

	else if (sg_policy->next_freq == next_freq)
		return false;
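
For context, the sequence inside sugov_update_shared() is roughly (a simplified
sketch, not verbatim):

	if (sugov_should_update_freq(sg_policy, time)) {
		/* The ~1ms rate limit has passed, so the full per-CPU scan runs... */
		next_f = sugov_next_freq_shared(sg_cpu, time);
		/*
		 * ...but sugov_update_next_freq() then sees that next_f is unchanged
		 * and returns false, so no frequency change is actually issued.
		 */
	}

i.e. the expensive part is paid even when nothing changes in the end.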

>
> > return. Correct me if my numbers are wrong Joel.
> >
> > Running on a little core on low frequency these numbers don't look too odd.
> > So I'm not seeing how we can speed these functions up.
> >
> > But since update_sg_lb_stats() will end up with multiple calls to
> > update_blocked_averages() in one go, this latency adds up quickly.
> >
> > One noticeable factor in Joel's system is the presence of a lot of cgroups.
> > Which is essentially what makes __update_blocked_fair() expensive, and it seems
> > to always return something has decayed so we end up with a call to
> > sugov_update_shared() in every call.
> >
> > I think we should limit the expensive call to update_blocked_averages() but
>
> At the opposite, some will complain that block values stay stall to
> high value and prevent any useful adjustment.
>
> Also update_blocked average is already rate limited with idle and busy
> load_balance
>
> Seems like the problem raised by Joel is the number of newly idle load balance

It could be. When Joel comments out update_blocked_averages() or rate limits
it, the big schedule delay disappears, which is just an indication of where we
could do better: either by making update_blocked_averages() faster, or by
controlling how often we do it in a row. I thought the latter made more sense,
though implementation-wise I'm not sure how we should do that.

We might actually help update_blocked_averages() be a bit faster by not
doing cpufreq_update_util() in every call and doing it once in the last call to
update_blocked_averages(). Since it seemed the for_each_leaf_cfs_rq_safe() loop
in __update_blocked_fair() is expensive too, I am not sure if reducing the calls
to cpufreq_update_util() will be enough.
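
Purely as a sketch of that idea (hypothetical and untested; the extra
parameter is made up for illustration):

	static void update_blocked_averages(int cpu, bool update_freq)
	{
		...

		update_blocked_load_status(rq, !done);
		/* Only the last call of a newidle scan would pass update_freq=true. */
		if (decayed && update_freq)
			cpufreq_update_util(rq, 0);
		rq_unlock_irqrestore(rq, &rf);
	}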

Thanks

--
Qais Yousef

2021-01-28 00:13:13

by Joel Fernandes

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Hi Vincent,

On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > <[email protected]> wrote:
> > > >
> > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > times (due to newidle load balance). I have also seen this manifest in
> > > > the periodic balancer.
> > > >
> > > > Closer look shows that the problem is caused by the following
> > > > ingredients:
> > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > __update_blocked_fair() take a long time.
> > >
> > > Inactive cgroups are removed from the list so they should not impact
> > > the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
>
> How many blocked cgroups have you got ?

I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
this loop runs per new idle balance. When the problem happens I see this loop
run 35-40 times (for one single instance of newidle balance). So in total
there are at least that many cfs_rq load updates.

I also see that new idle balance can be called 200-500 times per second.

> >
> > * There can be a lot of idle CPU cgroups. Don't let fully
> > * decayed cfs_rqs linger on the list.
> > */
> > if (cfs_rq_is_decayed(cfs_rq))
> > list_del_leaf_cfs_rq(cfs_rq);
> >
> > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > shared frequency domain configuration to be slower than usual. (the load
> > >
> > > What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > frequency update needs to go through utilization of other CPUs in the
> > cluster. I believe this could be adding to the problem but is not really
> > needed to optimize if we can rate limit the calls to update_blocked_averages
> > to begin with.
>
> Qais mentioned half of the time being used by
> sugov_next_freq_shared(). Are there any frequency changes resulting in
> this call ?

I do not see a frequency update happening at the time of the problem. However,
note that sugov_iowait_boost() does run even if the frequency is not being
updated. IIRC, this function is also not that lightweight, and I am not sure
if it is a good idea to call it that often.

> > > > average updates also try to update the frequency in schedutil).
> > > >
> > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > code paths to take longer than when running at a high CPU frequency.
> > >
> > > Low frequency usually means low utilization so it should happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to reproduce.
>
> Happening a lot in itself is not a problem if there is nothing else to
> do so it's not a argument in itself

It is a problem: it shows up in the preempt-off critical section latency
tracer. Are you saying it's OK for preemption to be disabled on the system for
500 microseconds? That hurts real-time applications (audio etc.).

> So why is it a problem for you ? You are mentioning newly idle load
> balance so I assume that your root problem is the scheduling delay
> generated by the newly idle load balance which then calls
> update_blocked_averages

Yes, the new idle balance is when I see it happen quite often. I do see it
happen with other load balances as well, but not as often, since those LBs
don't run as often as the new idle balance.

>
> rate limiting the call to update_blocked_averages() will only reduce
> the number of time it happens but it will not prevent it to happen.

Sure, but a soft real-time workload can tolerate the issue if it does not happen
very often. In this case, though, it is frequent.

> IIUC, your real problem is that newidle_balance is running whereas a
> task is about to wake up on the cpu and we don't abort quickly during
> this load_balance

Yes.

> so we could also try to abort earlier in case of newly idle load balance

I think interrupts are disabled when the load balance runs, so there's no way
for, say, an audio interrupt to even run in order to wake up a task. How would
you know to abort the new idle load balance?

Could you elaborate more also on the drawback of the rate limiting patch we
posted? Do you see a side effect?

> > > > sometimes, which seems overkill.
> > > >
> > > > schbench shows a clear improvement with the change:
> > >
> > > Have you got more details about your test setup ?
> > > which platform ?
> > > which kernel ?
> >
> > I mentioned in the commit message it is a v5.4 kernel.
>
> I was not sure if the tests results done with this kernel because we
> usually ask for results against mainline to make sure you are not
> facing a problem that has solved since v5.4 has been released

Ok, yes, unfortunately I have a userspace up and running only on a 5.4 kernel. I
was hoping that is recent enough for this debugging.

thanks,

- Joel

2021-01-28 14:01:04

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Hi Joel,

On Wed, 27 Jan 2021 at 19:43, Joel Fernandes <[email protected]> wrote:
>
> Hi Vincent,
>
> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
> > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > <[email protected]> wrote:
> > > > >
> > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > the periodic balancer.
> > > > >
> > > > > Closer look shows that the problem is caused by the following
> > > > > ingredients:
> > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > __update_blocked_fair() take a long time.
> > > >
> > > > Inactive cgroups are removed from the list so they should not impact
> > > > the duration
> > >
> > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > decayed and not have any tasks running on it but its load needs to be
> > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > 'blocked'.
> >
> > How many blocked cgroups have you got ?
>
> I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> this loop runs per new idle balance. When the problem happens I see this loop
> run 35-40 times (for one single instance of newidle balance). So in total
> there are at least these many cfs_rq load updates.

Do you mean that you have 35-40 cgroups? Or does the 35-40 include all CPUs?

>
> I also see that new idle balance can be called 200-500 times per second.

This is not surprising because newidle_balance() is called every time
the CPU is about to become idle

>
> > >
> > > * There can be a lot of idle CPU cgroups. Don't let fully
> > > * decayed cfs_rqs linger on the list.
> > > */
> > > if (cfs_rq_is_decayed(cfs_rq))
> > > list_del_leaf_cfs_rq(cfs_rq);
> > >
> > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > shared frequency domain configuration to be slower than usual. (the load
> > > >
> > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > >
> > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > frequency update needs to go through utilization of other CPUs in the
> > > cluster. I believe this could be adding to the problem but is not really
> > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > to begin with.
> >
> > Qais mentioned half of the time being used by
> > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > this call ?
>
> I do not see a frequency update happening at the time of the problem. However
> note that sugov_iowait_boost() does run even if frequency is not being
> updated. IIRC, this function is also not that light weight and I am not sure
> if it is a good idea to call this that often.

The scheduler can't make any assumption about how often schedutil/cpufreq
wants to be called. Some governors are fast and straightforward and can be
called very often to adjust the frequency; others can't handle many
updates. The rate limit mechanism in schedutil and io-boost should be
there for that purpose.

>
> > > > > average updates also try to update the frequency in schedutil).
> > > > >
> > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > code paths to take longer than when running at a high CPU frequency.
> > > >
> > > > Low frequency usually means low utilization so it should happen that much.
> > >
> > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> >
> > Happening a lot in itself is not a problem if there is nothing else to
> > do so it's not a argument in itself
>
> It is a problem - it shows up in the preempt off critical section latency

But this point is not related to the point above which is about how
often it happens.

> tracer. Are you saying its Ok for preemption to be disabled on system for 500
> micro seconds? That hurts real-time applications (audio etc).

So. Is your problem related to real-time applications (audio etc) ?

>
> > So why is it a problem for you ? You are mentioning newly idle load
> > balance so I assume that your root problem is the scheduling delay
> > generated by the newly idle load balance which then calls
> > update_blocked_averages
>
> Yes, the new idle balance is when I see it happen quite often. I do see it
> happen with other load balance as well, but it not that often as those LB
> don't run as often as new idle balance.

The update of the average blocked load has been added in newidle_balance()
to take advantage of the CPU becoming idle, but it seems to create a
long preempt-off sequence. I'm going to prepare a patch to move it
out of the schedule path.

>
> >
> > rate limiting the call to update_blocked_averages() will only reduce
> > the number of time it happens but it will not prevent it to happen.
>
> Sure, but soft real-time issue can tolerate if the issue does not happen very
> often. In this case though, it is frequent.

Could you provide details of the problem that you are facing? It's
not clear to me what happens in your case in the end. Have you got an
audio glitch, for example?

"Not often" doesn't really give any clue.

Also, update_blocked_averages() was supposed to be called in newidle_balance()
when the coming idle duration is expected to be long enough.

>
> > IIUC, your real problem is that newidle_balance is running whereas a
> > task is about to wake up on the cpu and we don't abort quickly during
> > this load_balance
>
> Yes.
>
> > so we could also try to abort earlier in case of newly idle load balance
>
> I think interrupts are disabled when the load balance runs, so there's no way
> for say an audio interrupt to even run in order to wake up a task. How would
> you know to abort the new idle load balance?
>
> Could you elaborate more also on the drawback of the rate limiting patch we
> posted? Do you see a side effect?

Your patch just tries to hide your problem rather than solve the root cause.

>
> > > > > sometimes, which seems overkill.
> > > > >
> > > > > schbench shows a clear improvement with the change:
> > > >
> > > > Have you got more details about your test setup ?
> > > > which platform ?
> > > > which kernel ?
> > >
> > > I mentioned in the commit message it is a v5.4 kernel.
> >
> > I was not sure if the tests results done with this kernel because we
> > usually ask for results against mainline to make sure you are not
> > facing a problem that has solved since v5.4 has been released
>
> Ok, yes I have a userspace up and running only on 5.4 kernel unfortunately. I
> was hoping that is recent enough for this debug.

more than 14 months old is not really recent... It's always good to
have a reproducer against mainline

Regards,
Vincent
>
> thanks,
>
> - Joel
>

2021-01-28 15:13:17

by Joel Fernandes

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Hi Vincent,

On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
<[email protected]> wrote:
> > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
> > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > the periodic balancer.
> > > > > >
> > > > > > Closer look shows that the problem is caused by the following
> > > > > > ingredients:
> > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > __update_blocked_fair() take a long time.
> > > > >
> > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > the duration
> > > >
> > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > decayed and not have any tasks running on it but its load needs to be
> > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > 'blocked'.
> > >
> > > How many blocked cgroups have you got ?
> >
> > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > this loop runs per new idle balance. When the problem happens I see this loop
> > run 35-40 times (for one single instance of newidle balance). So in total
> > there are at least that many cfs_rq load updates.
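
A minimal sketch of that kind of instrumentation (illustrative only, not the
counter that was actually used; the body of the walk is elided):

static bool __update_blocked_fair(struct rq *rq, bool *done)
{
        struct cfs_rq *cfs_rq, *pos;
        bool decayed = false;
        unsigned int nr_walked = 0;

        /* Walk every leaf cfs_rq still on this rq's list. */
        for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
                nr_walked++;
                /* existing per-cfs_rq decay/update elided */
        }

        /* Report how long the walk was for this single update. */
        trace_printk("cpu%d: walked %u leaf cfs_rqs\n", cpu_of(rq), nr_walked);

        return decayed;
}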
>
> Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?

All CPUs.

> > I also see that new idle balance can be called 200-500 times per second.
>
> This is not surprising because newidle_balance() is called every time
> the CPU is about to become idle

Sure.

> > > >
> > > > * There can be a lot of idle CPU cgroups. Don't let fully
> > > > * decayed cfs_rqs linger on the list.
> > > > */
> > > > if (cfs_rq_is_decayed(cfs_rq))
> > > > list_del_leaf_cfs_rq(cfs_rq);
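
For reference, the check used there looks roughly like this (the exact set of
fields tested varies between kernel versions):

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
        if (cfs_rq->load.weight)
                return false;

        if (cfs_rq->avg.load_sum)
                return false;

        if (cfs_rq->avg.util_sum)
                return false;

        /* further per-version checks (e.g. runnable sums) omitted */

        return true;
}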
> > > >
> > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > >
> > > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > > >
> > > > sugov_next_freq_shared() is order of the number of CPUs in a cluster. This
> > > > system is a 6+2 system with 6 CPUs in a cluster. A schedutil shared-policy
> > > > frequency update needs to go through the utilization of the other CPUs in
> > > > the cluster. I believe this could be adding to the problem, but it does not
> > > > really need optimizing if we can rate limit the calls to
> > > > update_blocked_averages to begin with.
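
A simplified sketch of why that update is O(number of CPUs in the policy)
(iowait boost handling abbreviated, not a verbatim copy of the v5.4 function):

static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
{
        struct sugov_policy *sg_policy = sg_cpu->sg_policy;
        struct cpufreq_policy *policy = sg_policy->policy;
        unsigned long util = 0, max = 1;
        unsigned int j;

        /* One pass over every CPU of the shared policy: 6 on this cluster. */
        for_each_cpu(j, policy->cpus) {
                struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
                unsigned long j_util = sugov_get_util(j_sg_cpu);
                unsigned long j_max = j_sg_cpu->max;

                /* iowait boost handling elided */

                if (j_util * max > j_max * util) {
                        util = j_util;
                        max = j_max;
                }
        }

        return get_next_freq(sg_policy, util, max);
}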
> > >
> > > Qais mentioned half of the time being used by
> > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > this call ?
> >
> > I do not see a frequency update happening at the time of the problem. However
> > note that sugov_iowait_boost() does run even if frequency is not being
> > updated. IIRC, this function is also not that light weight and I am not sure
> > if it is a good idea to call this that often.
>
> The scheduler can't make any assumptions about how often schedutil/cpufreq
> wants to be called. Some cpufreq drivers are fast and straightforward and
> can be called very often to adjust the frequency; others can't handle
> frequent updates. The rate limit mechanism in schedutil and io-boost should
> be there for that purpose.
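
The rate limit gate being referred to is essentially the following check in
schedutil (roughly, as it looks in kernels of this era; minor details differ
across versions). It is also the check that the patch later in this thread
reorders within sugov_update_shared():

static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
        s64 delta_ns;

        if (!cpufreq_this_cpu_can_update(sg_policy->policy))
                return false;

        if (unlikely(sg_policy->limits_changed)) {
                sg_policy->limits_changed = false;
                sg_policy->need_freq_update = true;
                return true;
        }

        /* Drop updates arriving within rate_limit_us of the last one. */
        delta_ns = time - sg_policy->last_freq_update_time;

        return delta_ns >= sg_policy->freq_update_delay_ns;
}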

Sure, I know that's the intention.

> > > > > > average updates also try to update the frequency in schedutil).
> > > > > >
> > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > >
> > > > > Low frequency usually means low utilization so it should happen that much.
> > > >
> > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > >
> > > Happening a lot is not in itself a problem if there is nothing else to
> > > do, so it's not an argument in itself
> >
> > It is a problem - it shows up in the preempt off critical section latency
>
> But this point is not related to the point above which is about how
> often it happens.

It is related. A bad thing happening quite often is worse than a bad
thing happening infrequently. I agree it is not a root cause fix, but
it makes things "better".

> > tracer. Are you saying it's OK for preemption to be disabled on the system
> > for 500 microseconds? That hurts real-time applications (audio etc).
>
> So. Is your problem related to real-time applications (audio etc) ?

Yes it is. I don't have a reproducer, and it could be that audio
buffering will hide the problem. But that doesn't mean that there is
no problem or that we cannot improve things. There will also likely be
a power consumption improvement.

> > > So why is it a problem for you ? You are mentioning newly idle load
> > > balance so I assume that your root problem is the scheduling delay
> > > generated by the newly idle load balance which then calls
> > > update_blocked_averages
> >
> > Yes, the new idle balance is where I see it happen quite often. I do see it
> > happen with other load balances as well, but not as often, since those don't
> > run as often as the new idle balance.
>
> The update of the average blocked load has been added in newidle_balance
> to take advantage of the CPU becoming idle, but it seems to create a
> long preempt-off sequence. I'm going to prepare a patch to move it
> out of the schedule path.

Ok thanks, that would really help!

> > > rate limiting the call to update_blocked_averages() will only reduce
> > > the number of times it happens but it will not prevent it from happening.
> >
> > Sure, but a soft real-time workload can tolerate the issue if it does not
> > happen very often. In this case though, it is frequent.
>
> Could you provide details of the problem that you are facing? It's
> not clear to me what actually happens in your case in the end. Do you
> get an audio glitch, for example?
>
> "Not often" doesn't really give any clue.

I believe I have provided details from my side. I shared output from
the function graph tracer and the schbench microbenchmark clearly
showing the issue and the improvement. Sorry, I don't have a real-world
reproducer for this.

> Also, update_blocked_averages is supposed to be called from
> newidle_balance only when the coming idle duration is expected to be
> long enough.

No, we do not want the schedule loop to take half a millisecond.

> > > IIUC, your real problem is that newidle_balance is running while a
> > > task is about to wake up on the CPU, and we don't abort quickly
> > > during this load_balance
> >
> > Yes.
> >
> > > so we could also try to abort earlier in case of newly idle load balance
> >
> > I think interrupts are disabled when the load balance runs, so there's no way
> > for say an audio interrupt to even run in order to wake up a task. How would
> > you know to abort the new idle load balance?
> >
> > Could you elaborate more also on the drawback of the rate limiting patch we
> > posted? Do you see a side effect?
>
> Your patch just tries to hide the problem rather than solve the root cause.

Agreed, the solution presented is a band-aid and not a proper fix. It
was just intended to illustrate the problem and start a discussion. I
should have marked it RFC for sure.

thanks!

- Joel

2021-01-28 17:01:24

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 01/28/21 10:09, Joel Fernandes wrote:
> > > > Qais mentioned half of the time being used by
> > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > this call ?
> > >
> > > I do not see a frequency update happening at the time of the problem. However
> > > note that sugov_iowait_boost() does run even if frequency is not being
> > > updated. IIRC, this function is also not that light weight and I am not sure
> > > if it is a good idea to call this that often.
> >
> > The scheduler can't make any assumptions about how often schedutil/cpufreq
> > wants to be called. Some cpufreq drivers are fast and straightforward and
> > can be called very often to adjust the frequency; others can't handle
> > frequent updates. The rate limit mechanism in schedutil and io-boost should
> > be there for that purpose.
>
> Sure, I know that's the intention.

How about the below patch to help with reducing the impact of
sugov_update_shared()?

I basically made sure it'll bail out immediately if it gets 2 calls within the
defined rate_limit_us value.

And I tweaked the way we call cpufreq_update_util() from
update_blocked_averages() too, so that we first update the blocked load on all
CPUs and then ask for the frequency update. Combined with the above, this
should result in a single call to sugov_update_shared() for each policy. Each
call should see the final state of what the load really is.

Only compile tested!

Thanks

--
Qais Yousef

----->8-----


diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6931f0cdeb80..bd83e9343749 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -109,7 +109,6 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
return false;

sg_policy->next_freq = next_freq;
- sg_policy->last_freq_update_time = time;

return true;
}
@@ -554,20 +553,23 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)

raw_spin_lock(&sg_policy->update_lock);

+ ignore_dl_rate_limit(sg_cpu, sg_policy);
+
+ if (!sugov_should_update_freq(sg_policy, time))
+ goto out;
+
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

- ignore_dl_rate_limit(sg_cpu, sg_policy);
-
- if (sugov_should_update_freq(sg_policy, time)) {
- next_f = sugov_next_freq_shared(sg_cpu, time);
+ next_f = sugov_next_freq_shared(sg_cpu, time);

- if (sg_policy->policy->fast_switch_enabled)
- sugov_fast_switch(sg_policy, time, next_f);
- else
- sugov_deferred_update(sg_policy, time, next_f);
- }
+ if (sg_policy->policy->fast_switch_enabled)
+ sugov_fast_switch(sg_policy, time, next_f);
+ else
+ sugov_deferred_update(sg_policy, time, next_f);

+ sg_policy->last_freq_update_time = time;
+out:
raw_spin_unlock(&sg_policy->update_lock);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..47bd8be03a6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8033,7 +8033,7 @@ static unsigned long task_h_load(struct task_struct *p)
}
#endif

-static void update_blocked_averages(int cpu)
+static void update_blocked_averages(int cpu, bool update_freq)
{
bool decayed = false, done = true;
struct rq *rq = cpu_rq(cpu);
@@ -8046,7 +8046,7 @@ static void update_blocked_averages(int cpu)
decayed |= __update_blocked_fair(rq, &done);

update_blocked_load_status(rq, !done);
- if (decayed)
+ if (decayed && update_freq)
cpufreq_update_util(rq, 0);
rq_unlock_irqrestore(rq, &rf);
}
@@ -8384,7 +8384,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
return true;

- update_blocked_averages(cpu);
+ update_blocked_averages(cpu, false);

return rq->has_blocked_load;
#else
@@ -8454,6 +8454,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}

+ /*
+ * We did a bunch of blocked load average updates; let cpufreq know about
+ * them in one go. For systems with a lot of CPUs in a frequency domain,
+ * this will make sure we take all the changes that affect the policy
+ * in one go.
+ */
+ for_each_cpu_and(i, sched_group_span(group), env->cpus)
+ cpufreq_update_util(cpu_rq(i), 0);
+
/* Check if dst CPU is idle and preferred to this group */
if (env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
@@ -10464,7 +10473,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,

/* Newly idle CPU doesn't need an update */
if (idle != CPU_NEWLY_IDLE) {
- update_blocked_averages(this_cpu);
+ update_blocked_averages(this_cpu, false);
has_blocked_load |= this_rq->has_blocked_load;
}

@@ -10478,6 +10487,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
ret = true;

abort:
+ /*
+ * We did a bunch of blocked load average updates; let cpufreq know about
+ * them in one go. For systems with a lot of CPUs in a frequency domain,
+ * this will make sure we take all the changes that affect the policy
+ * in one go.
+ */
+ for_each_cpu(balance_cpu, nohz.idle_cpus_mask)
+ cpufreq_update_util(cpu_rq(balance_cpu), 0);
+
/* There is still blocked load, enable periodic update */
if (has_blocked_load)
WRITE_ONCE(nohz.has_blocked, 1);
@@ -10603,7 +10621,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

raw_spin_unlock(&this_rq->lock);

- update_blocked_averages(this_cpu);
+ update_blocked_averages(this_cpu, true);
rcu_read_lock();
for_each_domain(this_cpu, sd) {
int continue_balancing = 1;
@@ -10691,7 +10709,7 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
return;

/* normal load balance */
- update_blocked_averages(this_rq->cpu);
+ update_blocked_averages(this_rq->cpu, true);
rebalance_domains(this_rq, idle);
}